- Figure out how to install Scrapy
- Learn Scrapy basics
- Learn full parsing rules
- Learn to make a pipeline; the goal is to pipe items to Elasticsearch
- Integrate Scrapy and Tor
- Make multi-threaded Tor instances somehow, might need a Linux VM.
- Figure out everything about Elasticsearch.
- How to only save new/edited pages?
- How long does it take to crawl a site?
- Figure out how to structure Scrapy parsed data ready for Elasticsearch. Just dump JSON?
- Figure out how to defeat captcha.
- pillow and pytesseract
- or api like https://anti-captcha.com/mainpage
- How often does a captcha need to be solved? How do we maintain client-side state... cookies?
- Can DWMs detect spiders? Do spiders look like, or cause, a DDoS? If so, we must slow them down. It's vital not to affect the DWMs.
"We also provide as input to the scraper a session cookie that we obtain by manually logging into the marketplace and solving a CAPTCHA; and parameters such as the maximum desired scraping rate. In addition to being careful about what to request from a marketplace, we obfuscate how we request content. For each page request, the scraper randomly selects a Tor circuit out of 20 pre-built circuits. This strategy ensures that the requests are being distributed over several rendezvous points in the Tor network. This helps prevent triggering anti-DDoS heuristics certain marketplaces use. (However some marketplaces, e.g., Agora, use session cookies to bind requests coming from different circuits, and require additional attention.) This strategy also provides redundancy in the event that one of the circuits being used becomes unreliable and speeds up the time it takes to observe the entire site." Kyle Soska and Nicolas Christin
- Look into Linear Discriminant Analysis to determine unique characteristics between forums https://sebastianraschka.com/Articles/2014_python_lda.html https://en.wikipedia.org/wiki/Linear_discriminant_analysis
- https://arxiv.org/pdf/2008.01585.pdf
- http://dline.info/fpaper/jdim/v17i2/jdimv17i2_1.pdf
- https://www.cybersecurity.ox.ac.uk/site-resources/uploads/2019/11/Cho-S-Showcase-Talk-2.pdf
- https://books.google.com/books?id=GWW-DwAAQBAJ&pg=PA112&lpg=PA112&dq=%22scrapy%22+dread+forum&source=bl&ots=cX3gXg9xXM&sig=ACfU3U1yrz7ohz-y-beeV_5PEO2YdQV4QQ&hl=en&sa=X&ved=2ahUKEwiZo_PXvLvrAhUkUt8KHWTPDacQ6AEwBXoECAoQAQ#v=onepage&q=%22scrapy%22%20dread%20forum&f=false - this one goes into detail on post-processing data ready for machine learning.
- https://www.andrew.cmu.edu/user/nicolasc/publications/SC-USENIXSec15.pdf
- Update OSX Python with this guide (check if outdated) https://opensource.com/article/19/5/python-3-default-mac
- install Scrapy https://docs.scrapy.org/en/latest/intro/install.html
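Rough sketch for two of the open questions above: how to structure parsed data for Elasticsearch, and how to only save new/edited pages. Assumptions that are not from these notes: items are dicts with a `url` key, the index name `forum-pages` is made up, and the fingerprints live in memory only, so a real crawl would need to persist them between runs. It builds newline-delimited JSON in the shape the Elasticsearch `_bulk` endpoint expects, which suggests "just dump JSON" is probably fine.

```python
import hashlib
import json


def content_fingerprint(item):
    """Stable hash of an item's content, used to spot new or edited pages."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_bulk_lines(index, doc_id, item):
    """Two NDJSON lines (action + document) for the Elasticsearch _bulk API."""
    action = {"index": {"_index": index, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(item, ensure_ascii=False) + "\n"


class ChangedPagesPipeline:
    """Scrapy-style item pipeline that keeps only new or changed pages."""

    def __init__(self, index="forum-pages"):  # index name is hypothetical
        self.index = index
        self.seen = {}       # url -> fingerprint of the last version we kept
        self.bulk_body = []  # NDJSON chunks waiting to be sent to Elasticsearch

    def process_item(self, item, spider):
        data = dict(item)
        fingerprint = content_fingerprint(data)
        # Queue the page only if it is new or its content hash changed.
        if self.seen.get(data["url"]) != fingerprint:
            self.seen[data["url"]] = fingerprint
            self.bulk_body.append(to_bulk_lines(self.index, data["url"], data))
        return item
```

The queued chunks could then be joined and POSTed to `/_bulk` with `Content-Type: application/x-ndjson`, and the class enabled through `ITEM_PIPELINES` in the Scrapy settings. Using the URL as `_id` means an edited page overwrites its older version; hashing the URL first might be safer for very long URLs.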
This is the new shebang we need to use to avoid the system Python path:
#!/usr/bin/env python
"Use 'python -m pip' instead of running 'pip' or 'pip3' - that way the pip and python versions always match, whichever you currently have selected as 'python'." Couldn't get this to work, but plain 'pip' seems fine.
- OK so now we need to install Tor on Mac and get Python to be able to control our connections. https://jarroba.com/anonymous-scraping-by-tor-network/
https://github.com/mrt-kousha/scrapy/wiki/Implement-Tor-middleware-for-anonymous-data-scraping
https://github.com/WiliTest/Anonymous-scrapping-Scrapy-Tor-Privoxy-UserAgent
https://2019.www.torproject.org/docs/tor-doc-osx.html.en
https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor
https://medium.com/@amine.btt/a-crawler-that-beats-bot-detection-879888f470eb
https://www.kaggle.com/getting-started/45130
- I've decided to host this on a DO Droplet in production as it makes little sense to run the scraper on my laptop.
brew install tor
You should see something like:
You will find a sample torrc file in /usr/local/etc/tor.
You want to take the torrc.sample and make it your own torrc file.
First navigate to your torrc file using the path that they gave you:
cd /usr/local/etc/tor
Rename the torrc.sample to torrc so that it starts working as our torrc file (note you can also make a copy here):
mv torrc.sample torrc
Now you have a torrc file! This is your Tor configuration file. That means that every time you edit the file, you have to restart tor in order for any changes to take effect.
We need to generate a hashed password to control Tor using stem. In a terminal, run the command below, replacing the example password with your own. Make a note of both the original password and the generated hash, as we'll need both.
tor --hash-password "dz#X2nB%LJHGF0sB9DnZWv#87^"
Add the following lines to torrc file, replace the example hash with your own:
ControlPort 9051
HashedControlPassword 16:04C7A70H876B7BS6B69EE768NV7375CA2B749341437
CookieAuthentication 1
Restart Tor again so the configuration changes are applied.
brew services restart tor
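With the ControlPort and password in place, requesting a new identity from Python via stem might look like the sketch below. It follows the approach in the stem FAQ linked above. The `controller_factory` parameter is my own addition so the function can be tried without a running Tor; by default it uses stem's `Controller.from_port`. Pass the original password here, not the hash.

```python
def renew_tor_identity(password, port=9051, controller_factory=None):
    """Authenticate to Tor's ControlPort and ask for fresh circuits (NEWNYM)."""
    if controller_factory is None:
        # Imported lazily so the function can be exercised without stem installed.
        from stem.control import Controller
        controller_factory = Controller.from_port

    with controller_factory(port=port) as controller:
        controller.authenticate(password=password)
        # "NEWNYM" is the string value of stem.Signal.NEWNYM.
        controller.signal("NEWNYM")
```

Tor rate-limits NEWNYM, so calling this before every single request is unlikely to give a new circuit each time; calling it every so often between batches of requests is probably the more realistic use.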
It might take a while to establish your first connection, but after the first time it goes faster.
Now that you have Tor running (installed via Homebrew above), install the Tor Browser. It is basically Firefox with Tor. It is more user-friendly and we will use it later to visit our hidden service. Simply download and follow the installation instructions.
brew install privoxy
Now, tell Privoxy to use Tor by routing all traffic through the SOCKS server at localhost port 9050. To do that, append the following to /usr/local/etc/privoxy/config:
forward-socks5t / 127.0.0.1:9050 . # the dot at the end is important
Restart privoxy after making the change to the configuration file.
brew services restart privoxy
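To get Scrapy to use the Privoxy-to-Tor chain, a small downloader middleware that sets `request.meta["proxy"]` should be enough. This is a minimal sketch in the spirit of the Tor middleware guides linked above, assuming Privoxy is still listening on its default 127.0.0.1:8118; the class name is made up.

```python
PRIVOXY_URL = "http://127.0.0.1:8118"  # Privoxy's default listen address


class TorProxyMiddleware:
    """Scrapy downloader middleware that sends every request via Privoxy."""

    def __init__(self, proxy_url=PRIVOXY_URL):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever meta["proxy"] points at.
        request.meta["proxy"] = self.proxy_url
        return None  # None tells Scrapy to keep processing the request


# In settings.py (the module path is hypothetical):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.TorProxyMiddleware": 350}
```

Because the Privoxy rule uses `forward-socks5t`, hostnames are handed to Tor for resolution, which is what .onion addresses need.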
pip install pysocks
pip install scrapy-fake-useragent
pip install requests
pip install pillow
brew install tesseract
pip install tesserocr
pip install scipy
pip install pytesseract
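With pysocks and requests installed, here is a rough sketch of the "random circuit per request, capped scraping rate" idea from the Soska and Christin quote near the top. It is not their implementation. It swaps in a simpler trick: as far as I understand, Tor keeps streams with different SOCKS usernames on separate circuits by default (IsolateSOCKSAuth), so one Tor instance can stand in for the "multi-threaded Tor instances" item. The usernames, the rate, and the class name are all made up, and this talks to Tor's SOCKS port directly rather than through Privoxy.

```python
import random
import time


def build_isolated_proxies(n_circuits=20, host="127.0.0.1", socks_port=9050):
    """Return one SOCKS proxy URL per desired circuit.

    Each URL carries a different (arbitrary) SOCKS username, which Tor's
    IsolateSOCKSAuth behaviour should keep on separate circuits.
    """
    return [
        f"socks5h://circuit{i}:x@{host}:{socks_port}"
        for i in range(n_circuits)
    ]


class CircuitPicker:
    """Pick a random proxy per request and enforce a maximum request rate."""

    def __init__(self, proxies, max_requests_per_second=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.proxies = list(proxies)
        self.min_interval = 1.0 / max_requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def next_proxy(self):
        # Wait until at least min_interval has passed since the last request.
        now = self._clock()
        if self._last_request is not None:
            wait = self.min_interval - (now - self._last_request)
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last_request = now
        return random.choice(self.proxies)
```

Usage would be something like `proxy = picker.next_proxy()` followed by `requests.get(url, proxies={"http": proxy, "https": proxy})`. The `socks5h` scheme makes Tor resolve the hostname, which .onion sites need. As the quote warns, sites that bind session cookies to a circuit would need extra care.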