Created
February 22, 2020 18:56
Parse HTML data and store it in a CSV file
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<h2> Parse HTML Input and Store it in a CSV file </h2>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "html_data = \"\"\"<table class=table><tr><th>ISP</th><td>Chunghwa Telecom Co. Ltd.</td></tr><tr><th>Usage Type</th><td>\n", | |
| "<span class=text-muted>Unknown</span></td></tr><tr><th>Hostname(s)</th><td>125-227-89-141.HINET-IP.hinet.net <br>\n", | |
| "</td></tr><tr><th>Domain Name</th><td>cht.com.tw</td></tr><tr><th>Country</th><td>\n", | |
| "<img src=\"/img/blank.gif\" class=\"flag flag-tw\"/>Taiwan</td></tr><tr><th>City</th><td>Taichung, Taichung</td></tr> </table>\"\"\"" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<li>To parse HTML data, use the Python package Beautiful Soup 4 (<code>beautifulsoup4</code>)</li>\n", | |
| "<li>Install it with <code>pip install beautifulsoup4</code></li>\n", | |
| "<li>Learn more about the <b>beautifulsoup4</b> package - <a href=\"https://pypi.org/project/beautifulsoup4/\">link</a></li>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<b> Extract Table Information </b> - From the input above, parse the IP address details: ISP, Usage Type, Hostname(s), Domain Name, Country and City" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from bs4 import BeautifulSoup # import package" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "soup = BeautifulSoup(html_data, \"html.parser\")  # build a parse tree from html_data\n", | |
| "parser_dict = {}\n", | |
| "for table in soup.find_all(\"table\"):  # visit every table in the document\n", | |
| "    for tr in table.find_all(\"tr\"):  # each row holds one <th> label and one <td> value\n", | |
| "        parser_dict[tr.find(\"th\").text] = tr.find(\"td\").text  # map the label to its value\n", | |
| "parser_dict" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<b>Real-Time Process</b> - wrap the parsing and CSV steps in a reusable class" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from bs4 import BeautifulSoup  # third-party HTML parser (pip install beautifulsoup4)\n", | |
| "import csv  # standard-library module for reading and writing CSV files\n", | |
| "\n", | |
| "# =====================================\n", | |
| "# CLASS TO PARSE HTML INPUT DATA\n", | |
| "# =====================================\n", | |
| "class Parse_HTML:\n", | |
| "\n", | |
| "    # ===========================================================\n", | |
| "    # LOAD HTML DATA AND WRITE FILE PATH WHILE OBJECT CREATION\n", | |
| "    # ===========================================================\n", | |
| "    def __init__(self, html_data, write_file_path):\n", | |
| "        self._html_data = html_data\n", | |
| "        self._write_file_path = write_file_path\n", | |
| "\n", | |
| "    # =============================================\n", | |
| "    # PARSER TO CONVERT HTML DATA TO DICTIONARY\n", | |
| "    # =============================================\n", | |
| "    def parser(self):\n", | |
| "        soup = BeautifulSoup(self._html_data, \"html.parser\")  # parse the HTML stored on this instance\n", | |
| "        parser_dict = {}\n", | |
| "        for table in soup.find_all(\"table\"):\n", | |
| "            for tr in table.find_all(\"tr\"):\n", | |
| "                # remove the newlines embedded in the header and value text\n", | |
| "                parser_dict[tr.find(\"th\").text.replace(\"\\n\", '')] = tr.find(\"td\").text.replace(\"\\n\", '')\n", | |
| "        return parser_dict\n", | |
| "\n", | |
| "    # ===============================================\n", | |
| "    # WRITE PARSED DATA (DICTIONARY) TO A CSV FILE\n", | |
| "    # ===============================================\n", | |
| "    def write_csv(self, data):\n", | |
| "        # newline='' lets the csv module control line endings itself\n", | |
| "        with open(self._write_file_path, 'w', newline='') as csvfile:\n", | |
| "            data_writer = csv.DictWriter(csvfile, fieldnames=list(data.keys()), lineterminator='\\n')\n", | |
| "            data_writer.writeheader()\n", | |
| "            data_writer.writerow(data)  # one dictionary becomes one CSV row\n", | |
| "        return True\n", | |
| "\n", | |
| "# =================================\n", | |
| "# PROGRAM EXECUTION STARTS HERE\n", | |
| "# =================================\n", | |
| "ph = Parse_HTML(html_data=html_data, write_file_path='parsed_html_results.csv')\n", | |
| "parser_dict = ph.parser()\n", | |
| "confirm = ph.write_csv(data=parser_dict)\n", | |
| "if confirm:\n", | |
| "    print('File Created')\n", | |
| "else:\n", | |
| "    print('Not Able to Create a File')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<b>Learn more about real-time scenarios involving CSV files - refer to <a href=\"https://cybersecpy.in/handle-csv-files-using-python/\">cybersecpy.in</a></b>" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.8.1" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 4 | |
| } |
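
The CSV-writing step in the notebook can be checked without touching the disk. The sketch below is not part of the original notebook: it uses only the standard library, writes one dictionary to an in-memory buffer with `csv.DictWriter`, and reads it back with `csv.DictReader`. The helper names `dict_to_csv_text` and `csv_text_to_dicts` are hypothetical, and the sample dictionary is a hand-copied subset of the table values above rather than real parser output.

```python
import csv
import io


def dict_to_csv_text(row):
    """Serialize one dictionary as CSV text: a header line plus one data line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row.keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


def csv_text_to_dicts(text):
    """Parse CSV text back into a list of plain dictionaries, one per data line."""
    return [dict(record) for record in csv.DictReader(io.StringIO(text))]


# Hand-copied subset of the table in the notebook (assumed values, for illustration).
sample = {
    "ISP": "Chunghwa Telecom Co. Ltd.",
    "Domain Name": "cht.com.tw",
    "Country": "Taiwan",
    "City": "Taichung, Taichung",
}

csv_text = dict_to_csv_text(sample)
restored = csv_text_to_dicts(csv_text)

print(csv_text)
print(restored == [sample])
```

The `City` value contains a comma, so the writer wraps it in double quotes. Reading the text back should then recover the original value unchanged, which is a quick way to confirm that the file produced by `write_csv` will survive a round trip.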