Created January 30, 2017 21:39
Return a set of the relative anchors on the given page
import requests
from bs4 import BeautifulSoup as bs4  # assumed alias; the original calls bs4(...) directly

def findAllAnchors(url):
    # Use requests to get the page source without using a driver
    r = requests.get(url)
    # Turn the page source into soup for parsing (requires the html5lib package)
    soup = bs4(r.text, "html5lib")
    # Return a set of relative anchor tags from the given page
    return {
        anchor for anchor in soup.find_all('a', href=True)
        if anchor['href'].startswith('/') and not anchor['href'].endswith(('.com', '.org', '.net'))
    }
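
The filter can be tried offline without fetching a page. The sketch below is not part of the gist: it assumes the standard-library `html.parser` in place of BeautifulSoup, returns the `href` strings rather than tag objects, and uses the hypothetical names `RelativeAnchorParser`, `find_relative_hrefs`, and `sample`.

```python
from html.parser import HTMLParser


class RelativeAnchorParser(HTMLParser):
    """Collect href values that pass the same filter as findAllAnchors."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep hrefs with a leading slash that do not end in a bare-domain suffix
        if href and href.startswith("/") and not href.endswith((".com", ".org", ".net")):
            self.hrefs.add(href)


def find_relative_hrefs(html):
    """Return the set of relative hrefs found in an HTML string."""
    parser = RelativeAnchorParser()
    parser.feed(html)
    return parser.hrefs


# Hypothetical sample markup, used only for illustration
sample = """
<a href="/about">About</a>
<a href="/about">About again</a>
<a href="https://example.com/contact">Contact</a>
<a href="//cdn.example.com">CDN</a>
<a name="top">No href</a>
"""

print(sorted(find_relative_hrefs(sample)))  # → ['/about']
```

Because the check is only `startswith('/')`, a protocol-relative link such as `//cdn.example.org/lib.js` still passes the filter: the suffix test rejects it only when the URL ends in `.com`, `.org`, or `.net`.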