@Ollie-Boyd
Last active June 12, 2023 13:34

Python, Scrapy, ElasticSearch and OSINT learning

To do:

  • Figure out how to install Scrapy
  • Learn Scrapy basics
  • Learn full parsing rules
  • Learn to make pipeline, goal to pipe to ElasticSearch
  • Integrate Scrapy and Tor
  • Make multi-threaded Tor instances somehow, might need a Linux VM.
  • Figure out everything about ElasticSearch.
  • How to only save new/edited pages?
  • How long does it take to crawl a site?
  • Figure out how to structure Scrapy parsed data ready for ElasticSearch. Just dump JSON?
  • Figure out how to defeat CAPTCHAs.
  • How often does a CAPTCHA need to be solved? How do we maintain client-side state... cookies?
  • Can DWMs detect spiders? Do spiders look like, or cause, a DDoS? If so, we must slow the spiders down. It's vital not to affect DWMs.

"We also provide as input to the scraper a session cookie that we obtain by manually logging into the marketplace and solving a CAPTCHA; and parameters such as the maximum desired scraping rate. In addition to being careful about what to request from a marketplace, we obfuscate how we request content. For each page request, the scraper randomly selects a Tor circuit out of 20 pre-built circuits. This strategy ensures that the requests are being distributed over several rendezvous points in the Tor network. This helps prevent triggering anti-DDoS heuristics certain marketplaces use. (However some marketplaces, e.g., Agora, use session cookies to bind requests coming from different circuits, and require additional attention.) This strategy also provides redundancy in the event that one of the circuits being used becomes unreliable and speeds up the time it takes to observe the entire site." Kyle Soska and Nicolas Christin

Papers on scraping DWMs using Scrapy.
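
The circuit-rotation idea in the quote above could be sketched in Python roughly as follows. This is only an illustration, not the authors' scraper: it assumes a hypothetical pool of 20 local Tor SOCKS listeners on ports 9050 to 9069 (for example one per Tor instance or SocksPort line) and picks one at random per request. SOCKS_PORTS and pick_proxy are made-up names.

```python
import random

# Assumption: 20 local Tor SOCKS listeners, standing in for 20 pre-built circuits.
# The port range is hypothetical; match it to whatever your torrc files define.
SOCKS_PORTS = list(range(9050, 9070))


def pick_proxy(rng=random):
    """Return a SOCKS proxy URL for a randomly chosen Tor listener."""
    port = rng.choice(SOCKS_PORTS)
    # socks5h means hostnames (including .onion) are resolved by the proxy.
    return f"socks5h://127.0.0.1:{port}"


if __name__ == "__main__":
    # Show a few random picks, one per simulated page request.
    for _ in range(3):
        print(pick_proxy())
```

Passing a seeded random.Random instance as rng makes the selection reproducible, which helps when debugging a crawl.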

Notes on installing on a fresh machine. Would probably use a VM in future.

This is the new shebang we need to use to avoid the system Python path:

#!/usr/bin/env python

"Use 'python -m pip' instead of running 'pip' or 'pip3' - that way the pip and python versions always match, whichever you currently have selected as 'python'." Couldn't get this to work; plain 'pip' seems fine.

https://github.com/mrt-kousha/scrapy/wiki/Implement-Tor-middleware-for-anonymous-data-scraping

https://github.com/WiliTest/Anonymous-scrapping-Scrapy-Tor-Privoxy-UserAgent

https://2019.www.torproject.org/docs/tor-doc-osx.html.en

https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor

https://medium.com/@amine.btt/a-crawler-that-beats-bot-detection-879888f470eb

https://www.kaggle.com/getting-started/45130

  • I've decided to host this on a DO Droplet in production as it makes little sense to run the scraper on my laptop.

Install Tor on OSX

brew install tor

You should see something like:

You will find a sample torrc file in /usr/local/etc/tor. You want to take the torrc.sample and make it your own torrc file.

First navigate to your torrc file using the path that they gave you:

cd /usr/local/etc/tor

Rename the torrc.sample to torrc so that it starts working as our torrc file (note you can also make a copy here):

mv torrc.sample torrc

Now you have a torrc file! This is your Tor configuration file. That means that every time you edit the file, you have to restart tor in order for any changes to take effect.

We need to generate a hashed password to control Tor using stem. In terminal type this command. Replace the example password with your own. Make a note of the original and the generated hash as we'll need both.

tor --hash-password "dz#X2nB%LJHGF0sB9DnZWv#87^"

Add the following lines to torrc file, replace the example hash with your own:

ControlPort 9051

HashedControlPassword 16:04C7A70H876B7BS6B69EE768NV7375CA2B749341437

CookieAuthentication 1

Restart Tor again so the configuration changes are applied.

brew services restart tor

It might take a while to establish your first connection, but after the first time it goes faster.
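
With the control port configured, stem can ask Tor for a new identity, as described in the stem FAQ linked above. A minimal sketch, assuming stem is installed (pip install stem, which is not in the dependency list in these notes), Tor is listening on ControlPort 9051, and the password passed in is the original plain-text one rather than the hash. The function name renew_tor_identity is illustrative.

```python
def renew_tor_identity(password, control_port=9051):
    """Ask the local Tor process for fresh circuits via the control port."""
    # Imported lazily so the module loads even where stem is not installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        # Use the plain-text password, not the 16:... hash stored in torrc.
        controller.authenticate(password=password)
        # NEWNYM tells Tor to use new circuits for future connections.
        controller.signal(Signal.NEWNYM)


# Example (needs a running Tor with the torrc settings above):
# renew_tor_identity("your-original-password")
```

Tor rate-limits NEWNYM, so calling this before every single request will not necessarily give a new circuit each time.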

Now that you have Tor running, install the Tor Browser. It is basically Firefox with Tor. It is more user-friendly and we will use it later to visit our hidden service. Simply download it and follow the installation instructions.


Install privoxy on OSX

brew install privoxy

Now tell Privoxy to use Tor by routing all traffic through the SOCKS proxy at localhost port 9050. To do that, append the following to /usr/local/etc/privoxy/config:

forward-socks5t / 127.0.0.1:9050 . # the dot at the end is important

Restart privoxy after making the change to the configuration file.

brew services restart privoxy
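
Once Privoxy forwards to Tor, Scrapy can be pointed at Privoxy's HTTP listener. A minimal sketch, assuming Privoxy is still on its default listen address 127.0.0.1:8118 and that the class lives in a hypothetical myproject/middlewares.py. The class name is illustrative. It relies on Scrapy's built-in HttpProxyMiddleware picking up request.meta["proxy"].

```python
# Assumption: Privoxy's default listen-address (127.0.0.1:8118) is unchanged.
PRIVOXY_URL = "http://127.0.0.1:8118"


class PrivoxyProxyMiddleware:
    """Downloader middleware that routes every request through Privoxy."""

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware reads the proxy from request.meta.
        request.meta["proxy"] = PRIVOXY_URL
        # Returning None lets Scrapy continue processing the request.
        return None


# In settings.py (module path is hypothetical):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.PrivoxyProxyMiddleware": 350,
# }
```

The priority of 350 is chosen so this runs before the built-in HttpProxyMiddleware, which sits at 750 by default.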

Python dependencies

pip install pysocks

pip install scrapy-fake-useragent

pip install requests
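
With pysocks and requests installed, a plain Python script can also talk to Tor's SOCKS port directly, without going through Privoxy. A sketch under the assumption that Tor is listening on 127.0.0.1:9050; tor_proxies and fetch_via_tor are illustrative names.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a requests-style proxies dict pointing at Tor's SOCKS port."""
    # socks5h (not socks5) makes the proxy resolve DNS, needed for .onion.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def fetch_via_tor(url, timeout=60):
    """GET a URL through Tor and return the response body as text."""
    import requests  # lazy import; SOCKS support comes from pysocks

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

A generous timeout is used because Tor connections are often slow, especially for the first request.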

Captcha solver

requirements

pip install pillow

brew install tesseract

pip install tesserocr

pip install scipy

pip install pytesseract
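
These packages suggest an OCR-based approach. A rough sketch of how pillow and pytesseract might be combined, assuming the tesseract binary is on PATH and the CAPTCHA is simple dark text on a light background. The threshold of 128, the --psm 7 (single text line) setting, and the function names are guesses for illustration; real CAPTCHAs will usually need more preprocessing.

```python
import re


def clean_ocr_text(raw):
    """Keep only letters and digits from raw OCR output."""
    return re.sub(r"[^A-Za-z0-9]", "", raw)


def solve_captcha(image_path, threshold=128):
    """Binarise an image and run Tesseract OCR on it."""
    # Lazy imports so this module loads without the OCR stack installed.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path).convert("L")  # grayscale
    # Map each pixel to pure black or white to strip faint noise.
    image = image.point(lambda p: 255 if p > threshold else 0)
    # --psm 7 asks Tesseract to treat the image as a single text line.
    raw = pytesseract.image_to_string(image, config="--psm 7")
    return clean_ocr_text(raw)
```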
