Last active
April 29, 2017 20:05
Scraping Tutorial for Code@Night
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "# Web Scraping in Python" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "This tutorial accompanies my presentation on web scraping in Python. The slides are available [here](http://slides.com/casey_chow/web-scraping-with-python). For this demo, we're going to do the same scraping that [TigerMenus](http://tigermenus.herokuapp.com/) does." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "## Helpers and Constants\n", | |
| "\n", | |
| "First, some helpers and constants. Let's import our packages of the day:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "import requests\n", | |
| "from bs4 import BeautifulSoup" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "Now, let's start with a URL. This is the URL of the Butler/Wilson menu. If you want, you can look at it [here](https://campusdining.princeton.edu/dining/_foodpro/nutframe.asp?sName=Princeton+University+Campus+Dining&locationNum=02&locationName=Butler+%26+Wilson+Colleges&naFlag=1)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": { | |
| "collapsed": true, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "url = \"https://campusdining.princeton.edu/dining/_foodpro/nutframe.asp?sName=Princeton+University+Campus+Dining&locationNum=02&locationName=Butler+%26+Wilson+Colleges&naFlag=1\"" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "collapsed": true, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "## Get the Page\n", | |
| "\n", | |
| "Now, we can try running our first request." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "req = requests.get(url)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "Two things we want to check. First, whether the request turned out ok. \n", | |
| "In HTTP parlance, we want to see if we got an [HTTP 200](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#2xx_Success)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "200\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(req.status_code)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "Cool, we got the 200 we were hoping for. Now let's see if we get the menu items:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "\r\n", | |
| "\r\n", | |
| "<!-- The following is required by Aurora Information Systems, DO NOT MODIFY OR REMOVE -->\r\n", | |
| " <!-- FIELDFILT.ASP, Version 2.2.1 -->\r\n", | |
| "<!-- End of Aurora Information Systems Required Text -->\r\n", | |
| "<html>\r\n", | |
| "<head>\r\n", | |
| "<title>This Week's Menus</title>\r\n", | |
| "</head>\r\n", | |
| "<frameset rows=\"295,70%\" frameborder=\"0\">\r\n", | |
| " <noframes>\r\n", | |
| " <body>A browser that supports frames is required</body>\r\n", | |
| " </noframes> \r\n", | |
| " <frame src=\"head.asp?sName=Princeton+University+Campus+Dining&locationNum=02&locationName=Butler+%26+Wilson+Colleges&WeeksMenus=This+Week%27s+Menus\" SCROLLING=\"NO\" NAME=\"AuroraBanner\" title=\"top page banner\">\r\n", | |
| " <frameset cols=\"*,278,726,*\">\r\n", | |
| " <noframes>\r\n", | |
| " <body>A browser that supports frames is required</body>\r\n", | |
| " </noframes> \r\n", | |
| " <frame src=\"blank.asp\" name=\"leftblank\" title=\"left blank page\">\r\n", | |
| " <frame src=\"date.asp?sName=Princeton+University+Campus+Dining&locationNum=02&locationName=Butler+%26+Wilson+Colleges&naFlag=1\" NAME=\"AuroraContents\" title=\"left navigation menu\">\r\n", | |
| " <frame src=\"menuSamp.asp?locationNum=02&locationName=Butler+%26+Wilson+Colleges&sName=Princeton+University+Campus+Dining&naFlag=1\" NAME=\"AuroraMain\" title=\"main content window\">\r\n", | |
| " <frame src=\"blank.asp\" name=\"rightblank\" title=\"right blank page\">\r\n", | |
| " </frameset>\r\n", | |
| "</frameset>\r\n", | |
| "</html>\r\n", | |
| "\r\n", | |
| "<!-- The following is required by Aurora Information Systems, DO NOT MODIFY OR REMOVE -->\r\n", | |
| " <!-- NUTFRAME.ASP, Version 2.2.1 -->\r\n", | |
| "<!-- End of Aurora Information Systems Required Text -->\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(req.text)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "Something's up here--there are no menu options! To see why, notice this line:\n", | |
| "\n", | |
| " <frame src=\"menuSamp.asp?locationNum=02&locationName=Butler+%26+Wilson+Colleges&sName=Princeton+University+Campus+Dining&naFlag=1\" NAME=\"AuroraMain\" title=\"main content window\">\n", | |
| "\n", | |
| "So it would seem that the menu is being included through an HTML frame. Not every page does this, but it's common on older sites. So let's try using the frame's URL instead. If you want to have a look, this frame is located [here](https://campusdining.princeton.edu/dining/_foodpro/menuSamp.asp?locationNum=02&locationName=Butler+%26+Wilson+Colleges&sName=Princeton+University+Campus+Dining&naFlag=1)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true, | |
| "scrolled": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "200\n", | |
| " \r\n", | |
| "\r\n", | |
| " <!-- Recipe Name Is Displayed Here -->\r\n", | |
| " \r\n", | |
| " <div class=\"menusamprecipes\"><span style=\"color: #8000FF\"><a name=\"Recipe_Desc\" onMouseOver=\"javascript:openDescWin('','Omelet Bar with Pork Options')\" onMouseOut=\"javascript:closeDescWin()\">Omelet Bar with Pork Options</a></div>\r\n", | |
| " \r\n", | |
| "\r\n", | |
| " </td>\r\n", | |
| " \r\n", | |
| " <td width=\"10%\" valign=\"bottom\">\r\n", | |
| " <img src=\"LegendImages/e2logo.jpg\" alt=\"\" width=\"16\" height=\"16\" align=\"bottom\">\r\n", | |
| " </td>\r\n", | |
| " \r\n", | |
| " <td width=\"10%\" valign=\"bottom\">\r\n", | |
| " <img src=\"LegendImages/CFyellow.jpg\" alt=\"\" width=\"16\" height=\"16\" align=\"bottom\">\r\n", | |
| " </td>\r\n", | |
| " \r\n", | |
| " </tr>\r\n", | |
| " </table>\r\n", | |
| " </td>\r\n", | |
| " \r\n", | |
| " <td valign=\"top\" width=5%> \r\n", | |
| " \r\n", | |
| " </td>\r\n", | |
| " \r\n", | |
| " <td valign=\"top\" align=\"right\" width=10% colspan=\"1\">\r\n", | |
| " \r\n", | |
| " <div class=\"menusampprices\"><span style=\"color: #8000FF\"> </span></div>\r\n", | |
| " \r\n", | |
| " </td>\r\n", | |
| " <td width=5% valign=\"top\" colspan=\"1\"> \r\n", | |
| " \r\n", | |
| " </td>\r\n", | |
| " </tr>\r\n", | |
| " \r\n", | |
| " <tr>\r\n", | |
| " <td valign=\"top\" width=80%>\r\n", | |
| " <table width=80% cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\r\n", | |
| " <tr>\r\n", | |
| " <td>\r\n", | |
| " \r\n", | |
| "\r\n", | |
| " <!-- Recipe Name Is Displayed Here -->\r\n", | |
| " \r\n", | |
| " <div class=\"menusamprecipes\"><span style=\"color: #8000FF\"><a name=\"Recipe_Desc\" onMouseOver=\"javascript:openDescWin('','Pork Sausage Links')\" onMouseOut=\"javascript:closeDescWin()\">Pork Sausage Links</a></div>\r\n", | |
| " \r\n", | |
| "\r\n", | |
| " </td>\r\n", | |
| " \r\n", | |
| " <td width=\"10%\" valign=\"bottom\">\r\n", | |
| " <img src=\"LegendImages/e2logo.jpg\" alt=\"\" width=\"16\" height=\"16\" align=\"bottom\">\r\n", | |
| " </td>\r\n", | |
| " \r\n", | |
| " <td width=\"10%\" valign=\"bottom\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "url = \"https://campusdining.princeton.edu/dining/_foodpro/menuSamp.asp?locationNum=02&locationName=Butler+%26+Wilson+Colleges&sName=Princeton+University+Campus+Dining&naFlag=1\"\n", | |
| "req = requests.get(url)\n", | |
| "print(req.status_code)\n", | |
| "print(req.text[14500:17000])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "If you look inside the full thing (not included because it's really long), you'll notice some strings like \"Halved Grapefruit\" and \"Fresh Honeydew and Cantaloupe\". So chances are, we got what we needed! Let's start parsing this data." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "## Scraping" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "The first step after retrieving the right data is to parse it with, in our case, Beautiful Soup. So we'll run the following:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "content = req.text\n", | |
| "soup = BeautifulSoup(content, 'html.parser')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "Note also that sometimes the page's HTML may be really bad, and you'll need a more lenient parser than `html.parser`, the one built into Python that we used above. To pick one, check out the [section on parsers][parsers] in BS4's documentation.\n", | |
| "\n", | |
| "[parsers]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser\n", | |
| "\n", | |
| "Alright, cool. So after going into the page and inspecting with Dev Tools, we notice everything we care about is wrapped in a table. So let's narrow down our search to the first table." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "<table align=\"center\" border=\"0\" cellpadding=\"0\" cellspacing=\"0\" width=\"100%\">\n", | |
| " <tr>\n", | |
| " <td valign=\"top\" width=\"50%\">\n", | |
| " <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" height=\"100%\" width=\"100%\">\n", | |
| " <tr>\n", | |
| " <td align=\"left\" height=\"15\">\n", | |
| " <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" width=\"100%\">\n", | |
| " <tr>\n", | |
| " <td align=\"left\" valign=\"bottom\">\n", | |
| " <div id=\"menusampmeals\">\n", | |
| " Lunch\n", | |
| " <a href=\"pickMenu.asp?locationNum=02&locationName=Butler+%26+Wilson+Colleges&dtdate=04%2F29%2F2017&mealName=Lunch&sName=Princeton+University+Campus+Dining\" name=\"Lunch\" onmouseout=\"window.status= ' ';\" onmouseover=\"window.status = 'Click for Nutritive Analysis.'; return true;\" target=\"_self\">\n", | |
| " <img border=\"0\" src=\"images/nutrition.jpg\"/>\n", | |
| " </a>\n", | |
| " </div>\n", | |
| " </td>\n", | |
| " <td>\n", | |
| " </td>\n", | |
| " </tr>\n", | |
| " </table>\n", | |
| " </td>\n", | |
| " </tr>\n", | |
| " <tr height=\"5\">\n", | |
| " <td valign=\"top\">\n", | |
| " <table border=\"0\" cellpadding=\"0\" cellspacing=\"1\" width=\"100%\">\n", | |
| " <tr>\n", | |
| " <td colspan=\"4\">\n", | |
| " <div class=\"menusampcats\">\n", | |
| " <span style=\"color: \">\n", | |
| " -- Starches --\n", | |
| " </span>\n", | |
| " </div>\n", | |
| " </td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td valign=\"top\" width=\"80%\">\n", | |
| " <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" width=\"80%\">\n", | |
| " <tr>\n", | |
| " <td>\n", | |
| " <!-- Recipe Name Is Displayed Here -->\n", | |
| " <div class=\"menusamprecipes\">\n", | |
| " <span style=\"color: #0000FF\">\n", | |
| " <a name=\"Recipe_Desc\" onmouseout=\"javascript:closeDescWin()\" onmouseover=\"javascript:openDescWin('','O~Brien Potatoes')\">\n", | |
| " O'Brien Potatoes\n", | |
| " </a>\n", | |
| " </span>\n", | |
| " </div>\n", | |
| " </td>\n", | |
| " <td valign=\"bottom\" width=\"10%\">\n", | |
| " <img align=\"bottom\" alt=\"\" height=\"16\" src=\"LegendImages/e2logo.jpg\" width=\"16\">\n", | |
| " </img>\n", | |
| " </td>\n", | |
| " <td valign=\"bottom\" width=\"10%\">\n", | |
| " <img align=\"bottom\"\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "table = soup.find('table')\n", | |
| "print(table.prettify()[:2000])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "This is a lot less junk than we had before, but it still includes both lunch and dinner. Fortunately for us, we can take a quick shortcut: lunch, which is what we're looking for, is the first table in here:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true, | |
| "scrolled": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "<table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" height=\"100%\" width=\"100%\">\n", | |
| " <tr>\n", | |
| " <td align=\"left\" height=\"15\">\n", | |
| " <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" width=\"100%\">\n", | |
| " <tr>\n", | |
| " <td align=\"left\" valign=\"bottom\">\n", | |
| " <div id=\"menusampmeals\">\n", | |
| " Lunch\n", | |
| " <a href=\"pickMenu.asp?locationNum=02&locationName=Butler+%26+Wilson+Colleges&dtdate=04%2F29%2F2017&mealName=Lunch&sName=Princeton+University+Campus+Dining\" name=\"Lunch\" onmouseout=\"window.status= ' ';\" onmouseover=\"window.status = 'Click for Nutritive Analysis.'; return true;\" target=\"_self\">\n", | |
| " <img border=\"0\" src=\"images/nutrition.jpg\"/>\n", | |
| " </a>\n", | |
| " </div>\n", | |
| " </td>\n", | |
| " <td>\n", | |
| " </td>\n", | |
| " </tr>\n", | |
| " </table>\n", | |
| " </td>\n", | |
| " </tr>\n", | |
| " <tr height=\"5\">\n", | |
| " <td valign=\"top\">\n", | |
| " <table border=\"0\" cellpadding=\"0\" cellspacing=\"1\" width=\"100%\">\n", | |
| " <tr>\n", | |
| " <td colspan=\"4\">\n", | |
| " <div class=\"menusampcats\">\n", | |
| " <span style=\"color: \">\n", | |
| " -- Starches --\n", | |
| " </span>\n", | |
| " </div>\n", | |
| " </td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td valign=\"top\" width=\"80%\">\n", | |
| " <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" width=\"80%\">\n", | |
| " <tr>\n", | |
| " <td>\n", | |
| " <!-- Recipe Name Is Displayed Here -->\n", | |
| " <div class=\"menusamprecipes\">\n", | |
| " <span style=\"color: #0000FF\">\n", | |
| " <a name=\"Recipe_Desc\" onmouseout=\"javascript:closeDescWin()\" onmouseover=\"javascript:openDescWin('','O~Brien Potatoes')\">\n", | |
| " O'Brien Potatoes\n", | |
| " </a>\n", | |
| " </span>\n", | |
| " </div>\n", | |
| " </td>\n", | |
| " <td valign=\"bottom\" width=\"10%\">\n", | |
| " <img align=\"bottom\" alt=\"\" height=\"16\" src=\"LegendImages/e2logo.jpg\" width=\"16\">\n", | |
| " </img>\n", | |
| " </td>\n", | |
| " <td valign=\"bottom\" width=\"10%\">\n", | |
| " <img align=\"bottom\" alt=\"\" height=\"16\" src=\"LegendImages/CFgreen.jpg\" width=\"16\">\n", | |
| " </img>\n", | |
| " </td>\n", | |
| " </tr>\n", | |
| " </table>\n", | |
| " </td>\n", | |
| " <td valign=\"top\" width=\"5%\">\n", | |
| " </td>\n", | |
| " <td align=\"right\" colspan=\"1\" valign=\"top\" width=\"10%\">\n", | |
| " <div class=\"menusampprices\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "inner_table = table.find('table')\n", | |
| "print(inner_table.prettify()[:2000])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true, | |
| "raw_mimetype": "text/markdown" | |
| }, | |
| "source": [ | |
| "Just to speed things up, we're gonna take a quick shortcut on this. We notice that everything we care about is in the text of the item, so we just need to extract the text in the HTML! So let's do that:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "['Lunch', '-- Starches --', \"O'Brien Potatoes\", '-- Fruit --', 'Grapefruit Half', 'Sliced Melons', '-- Entrees --', 'Bacon', 'Egg Whites to Order', 'French Texas Toast', 'Omelet Bar with Pork Options', 'Pork Sausage Links', 'Scrambled Eggs', 'Turkey Sausage', '-- Breakfast Bars --', 'Lox & Bagel Bar', '-- Vegetarian & Vegan Entree --', 'Bowties & Asparagus', '-- Specialty Bars --', 'Buffalo Chicken Wings']\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "lunch_menu = list(inner_table.stripped_strings)\n", | |
| "print(lunch_menu)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "And we have it!" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Lunch\n", | |
| "-- Starches --\n", | |
| "O'Brien Potatoes\n", | |
| "-- Fruit --\n", | |
| "Grapefruit Half\n", | |
| "Sliced Melons\n", | |
| "-- Entrees --\n", | |
| "Bacon\n", | |
| "Egg Whites to Order\n", | |
| "French Texas Toast\n", | |
| "Omelet Bar with Pork Options\n", | |
| "Pork Sausage Links\n", | |
| "Scrambled Eggs\n", | |
| "Turkey Sausage\n", | |
| "-- Breakfast Bars --\n", | |
| "Lox & Bagel Bar\n", | |
| "-- Vegetarian & Vegan Entree --\n", | |
| "Bowties & Asparagus\n", | |
| "-- Specialty Bars --\n", | |
| "Buffalo Chicken Wings\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "lunch_menu_clean = '\\n'.join(lunch_menu)\n", | |
| "print(lunch_menu_clean)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "To get to the inner table, another option we had is to manually go into the page, find the element we cared about, and then right click on it in Dev Tools and choose \"Copy Selector\" to get a CSS selector that would also get you to the element. In our case, we get this for the inner table:\n", | |
| "\n", | |
| " body > table > tbody > tr > td:nth-child(1) > table\n", | |
| " \n", | |
| "There are a couple of problems with this approach, though. It's more fickle than what we did, since it makes more assumptions about the structure of the page, like that there's only one `tr` in that `tbody`, for example. The copied selector also doesn't work as-is here: the browser inserts a `tbody` that isn't in the HTML that `html.parser` sees, and older versions of BeautifulSoup don't support the `:nth-child` pseudo-class. So this approach is faster but also more error-prone than manually reasoning through the HTML on the page. A cleaned up version that works for this page is:\n", | |
| "\n", | |
| " table > tr > td:nth-of-type(1) > table\n", | |
| "\n", | |
| "Let's see this one in action:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "metadata": { | |
| "collapsed": false, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Lunch\n", | |
| "-- Starches --\n", | |
| "O'Brien Potatoes\n", | |
| "-- Fruit --\n", | |
| "Grapefruit Half\n", | |
| "Sliced Melons\n", | |
| "-- Entrees --\n", | |
| "Bacon\n", | |
| "Egg Whites to Order\n", | |
| "French Texas Toast\n", | |
| "Omelet Bar with Pork Options\n", | |
| "Pork Sausage Links\n", | |
| "Scrambled Eggs\n", | |
| "Turkey Sausage\n", | |
| "-- Breakfast Bars --\n", | |
| "Lox & Bagel Bar\n", | |
| "-- Vegetarian & Vegan Entree --\n", | |
| "Bowties & Asparagus\n", | |
| "-- Specialty Bars --\n", | |
| "Buffalo Chicken Wings\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
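| "# html.parser does not add the <tbody> that browsers insert, so the selector omits it.\n", | |
| "# select_one returns the first matching element (select returns a list).\n", | |
| "new_inner_table = soup.select_one('table > tr > td:nth-of-type(1) > table')\n", | |
| "new_lunch_menu_clean = '\\n'.join(new_inner_table.stripped_strings)\n", | |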
| "print(new_lunch_menu_clean)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "collapsed": true, | |
| "deletable": true, | |
| "editable": true | |
| }, | |
| "source": [ | |
| "Cool, same result. Your mileage may vary though, so always follow the number one rule of scrapers: only use them if you have to. And when you do, try to assume as little as possible about the structure of the page, because the owner of the page may change the structure at any time." | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "celltoolbar": "Raw Cell Format", | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.0" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 2 | |
| } |
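
The flat `lunch_menu` list printed in the notebook mixes category headers such as `-- Starches --` with the dishes under them. One possible follow-up step is to group the dishes by category. The sketch below is not part of the original notebook: `group_by_category` and `CATEGORY` are hypothetical names, the sample list is a shortened copy of the notebook's printed output, and it assumes every header keeps the `-- Name --` shape.

```python
import re

# Shortened copy of the lunch_menu list printed in the notebook.
lunch_menu = [
    "Lunch", "-- Starches --", "O'Brien Potatoes",
    "-- Fruit --", "Grapefruit Half", "Sliced Melons",
    "-- Entrees --", "Bacon", "Scrambled Eggs",
]

# Assumption: category headers always look like "-- Name --".
CATEGORY = re.compile(r"^--\s*(.+?)\s*--$")


def group_by_category(items):
    """Group a flat list of menu strings under their category headers.

    Hypothetical helper. Strings that appear before the first header
    (such as the meal name "Lunch") are skipped.
    """
    grouped = {}
    current = None
    for item in items:
        match = CATEGORY.match(item)
        if match:
            # Start a new category and collect the dishes that follow it.
            current = match.group(1)
            grouped[current] = []
        elif current is not None:
            grouped[current].append(item)
    return grouped


menu_by_category = group_by_category(lunch_menu)
print(menu_by_category["Fruit"])
```

Like the scraping itself, this leans on an assumption about the page's text, so it would need adjusting if the site changes how it marks categories.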