#!/usr/bin/python3
# Scrape the Wikipedia article named by the command-line arguments
import sys
import requests
import bs4

RED = '\033[31m'
END = '\033[0m'
ascii_art = RED \
    + """
iiii kkkkkkkk iiii
i::::i k::::::k i::::i
iiii k::::::k iiii
k::::::k
wwwwwww wwwww wwwwwwwiiiiiii k:::::k kkkkkkkiiiiiiippppp pppppppppyyyyyyy yyyyyyy
w:::::w w:::::w w:::::w i:::::i k:::::k k:::::k i:::::ip::::ppp:::::::::py:::::y y:::::y
w:::::w w:::::::w w:::::w i::::i k:::::k k:::::k i::::ip:::::::::::::::::py:::::y y:::::y
w:::::w w:::::::::w w:::::w i::::i k:::::k k:::::k i::::ipp::::::ppppp::::::py:::::y y:::::y
w:::::w w:::::w:::::w w:::::w i::::i k::::::k:::::k i::::i p:::::p p:::::p y:::::y y:::::y
w:::::w w:::::w w:::::w w:::::w i::::i k:::::::::::k i::::i p:::::p p:::::p y:::::y y:::::y
w:::::w:::::w w:::::w:::::w i::::i k:::::::::::k i::::i p:::::p p:::::p y:::::y:::::y
w:::::::::w w:::::::::w i::::i k::::::k:::::k i::::i p:::::p p::::::p y:::::::::y
w:::::::w w:::::::w i::::::ik::::::k k:::::k i::::::ip:::::ppppp:::::::p y:::::::y
w:::::w w:::::w i::::::ik::::::k k:::::k i::::::ip::::::::::::::::p y:::::y
w:::w w:::w i::::::ik::::::k k:::::k i::::::ip::::::::::::::pp y:::::y
www www iiiiiiiikkkkkkkk kkkkkkkiiiiiiiip::::::pppppppp y:::::y
p:::::p y:::::y
p:::::p y:::::y
p:::::::p y:::::y
p:::::::p y:::::y
p:::::::p yyyyyyy
ppppppppp
[++] wikipy is a simple Wikipedia scraper [++]
Coded By: Ankit Dobhal
Let's Begin To Scrape..!
-------------------------------------------------------------------------------
wikipy version 1.0
""" \
    + END
print(ascii_art)

# Join the command-line words into the article title and request the page
res = requests.get('https://en.wikipedia.org/wiki/' + ' '.join(sys.argv[1:]))
res.raise_for_status()  # raise an exception if the HTTP request failed
wiki = bs4.BeautifulSoup(res.text, "lxml")  # the "lxml" parser needs the lxml package installed
elems = wiki.select('p')  # every <p> element of the article body
for elem in elems:
    print(elem.getText())
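The script is run with the article title as its command-line arguments, for example: python3 wikipy.py Albert Einstein; the words are joined into the page title before the request. Titles containing characters such as '&' or '?' can break the plain string concatenation used above. Below is a minimal sketch of a more defensive URL builder; build_url is a hypothetical helper, not part of the gist:

import urllib.parse

def build_url(args, base='https://en.wikipedia.org/wiki/'):
    # Join the command-line words into one title, use underscores as
    # Wikipedia does, and percent-encode anything else unsafe in a URL.
    title = ' '.join(args).strip().replace(' ', '_')
    return base + urllib.parse.quote(title)

# build_url(['Python', '(programming', 'language)'])
# -> 'https://en.wikipedia.org/wiki/Python_%28programming_language%29'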
I get the error:
Traceback (most recent call last):
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 31, in <module>
    start(fakepyfile,mainpyfile)
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 30, in start
    exec(open(mainpyfile).read(), __main__.__dict__)
  File "<string>", line 48, in <module>
  File "/data/user/0/ru.iiec.pydroid3/files/aarch64-linux-android/lib/python3.8/site-packages/bs4/__init__.py", line 242, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
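The direct fix is to install the missing parser (pip install lxml). If you cannot install lxml (for example on Pydroid), another option is to fall back to Python's built-in html.parser. A minimal sketch of that fallback; make_soup is a hypothetical helper, not part of the gist:

import bs4

def make_soup(html):
    # Prefer lxml when it is available, otherwise use the stdlib parser,
    # which avoids bs4.FeatureNotFound on systems without lxml.
    try:
        return bs4.BeautifulSoup(html, "lxml")
    except bs4.FeatureNotFound:
        return bs4.BeautifulSoup(html, "html.parser")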
@heelrayner, just change line 47 (the URL in the requests.get call). The API is common across wikis (except Wikidata). The bigger question is where to get the full list of wiki pages. See below:
Namespaces
- 0: (main)
- 1: Talk:
- 2: User:
- 3: User_talk:
Dumps & paths
- List of dumps
- /ngwiki/20200220 - manual (change the date)
- /ngwiki/latest - directory
- /ngwiki-latest-all-titles.gz
- /ngwiki-latest-all-titles-in-ns0.gz - articles only
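Since the MediaWiki API is common across wikis, here is a minimal sketch of both ways to get the full page list: paging through list=allpages via the API, and streaming titles from a downloaded *-all-titles-in-ns0.gz dump. The en.wikipedia.org host is only an example (swap in another wiki's host), and list_titles_api / iter_dump_titles are hypothetical helper names:

import gzip
import requests

def list_titles_api(api='https://en.wikipedia.org/w/api.php', limit=500):
    # Page through list=allpages in the main namespace (ns 0).
    params = {'action': 'query', 'list': 'allpages',
              'apnamespace': 0, 'aplimit': limit, 'format': 'json'}
    while True:
        data = requests.get(api, params=params).json()
        for page in data['query']['allpages']:
            yield page['title']
        if 'continue' not in data:
            break
        params.update(data['continue'])  # carry the apcontinue token forward

def iter_dump_titles(path):
    # Stream titles from a *-all-titles-in-ns0.gz dump, one title per line
    # (the dump may start with a 'page_title' header line).
    with gzip.open(path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            yield line.rstrip('\n')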
Could this be used on other wikis?