@srinivas946
Created February 22, 2020 18:56
Parse HTML data and store it in a CSV file
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2> Parse HTML Input and Store it in a CSV file </h2>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"html_data = \"\"\"<table class=table><tr><th>ISP</th><td>Chunghwa Telecom Co. Ltd.</td></tr><tr><th>Usage Type</th><td>\n",
"<span class=text-muted>Unknown</span></td></tr><tr><th>Hostname(s)</th><td>125-227-89-141.HINET-IP.hinet.net <br>\n",
"</td></tr><tr><th>Domain Name</th><td>cht.com.tw</td></tr><tr><th>Country</th><td>\n",
"<img src=\"/img/blank.gif\" class=\"flag flag-tw\"/>Taiwan</td></tr><tr><th>City</th><td>Taichung, Taichung</td></tr> </table>\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<li>To parse HTML data, use the Python package <b>beautifulsoup4</b>.</li>\n",
"<li>Install it with <code>pip install beautifulsoup4</code>.</li>\n",
"<li>Learn more about the <b>beautifulsoup4</b> package - <a href=\"https://pypi.org/project/beautifulsoup4/\">link</a></li>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b> Extract Table Information </b> - From the above input, parse the data related to the IP address: ISP, Usage Type, Hostname, Domain, Country & City"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup # import package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"soup = BeautifulSoup(html_data, \"html.parser\")  # build a parse tree from html_data\n",
"parser_dict = {}\n",
"for table in soup.find_all(\"table\"):  # iterate over every table in the document\n",
"    for tr in table.find_all(\"tr\"):  # each row holds one <th>/<td> pair\n",
"        parser_dict[tr.find(\"th\").text] = tr.find(\"td\").text  # header text -> cell text\n",
"parser_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Real Time Process</b>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup # import package\n",
"import csv # load csv module to handle csv files\n",
"\n",
"# =====================================\n",
"# CLASS TO PARSE HTML INPUT DATA\n",
"# =====================================\n",
"class Parse_HTML:\n",
"\n",
"    # ===========================================================\n",
"    # LOAD HTML DATA AND WRITE FILE PATH WHILE OBJECT CREATION\n",
"    # ===========================================================\n",
"    def __init__(self, html_data, write_file_path):\n",
"        self._html_data = html_data\n",
"        self._write_file_path = write_file_path\n",
"\n",
"    # =============================================\n",
"    # PARSER TO CONVERT HTML DATA TO DICTIONARY\n",
"    # =============================================\n",
"    def parser(self):\n",
"        # use the data passed to the constructor, not the global html_data\n",
"        soup = BeautifulSoup(self._html_data, \"html.parser\")\n",
"        parser_dict = {}\n",
"        for table in soup.find_all(\"table\"):\n",
"            for tr in table.find_all(\"tr\"):\n",
"                # strip embedded newlines from both the header and the cell text\n",
"                parser_dict[tr.find(\"th\").text.replace(\"\\n\", '')] = tr.find(\"td\").text.replace(\"\\n\", '')\n",
"        return parser_dict\n",
"\n",
"    # ===============================================\n",
"    # WRITE PARSED DATA (DICTIONARY) TO A CSV FILE\n",
"    # ===============================================\n",
"    def write_csv(self, data):\n",
"        # newline='' lets the csv module control line endings on every platform\n",
"        with open(self._write_file_path, 'w', newline='') as csvfile:\n",
"            data_writer = csv.DictWriter(csvfile, fieldnames=list(data.keys()), lineterminator='\\n')\n",
"            data_writer.writeheader()\n",
"            data_writer.writerows([data])\n",
"        return True\n",
"\n",
"# =================================\n",
"# PROGRAM EXECUTION STARTS HERE\n",
"# =================================\n",
"ph = Parse_HTML(html_data=html_data, write_file_path='parsed_html_results.csv')\n",
"parser_dict = ph.parser()\n",
"confirm = ph.write_csv(data=parser_dict)\n",
"if confirm:\n",
"    print('File Created')\n",
"else:\n",
"    print('Not Able to Create a File')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Learn more real-time scenarios related to CSV - refer to <a href=\"https://cybersecpy.in/handle-csv-files-using-python/\">cybersecpy.in</a></b>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
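
The CSV-writing step in the notebook can be checked without touching the disk. Below is a minimal sketch, assuming an in-memory buffer in place of the output file; the helper names `dict_to_csv_text` and `csv_text_to_dicts` and the sample `row` are hypothetical and not part of the notebook. It shows that a value containing a comma (such as the City field) is quoted on write and comes back unchanged on read.

```python
import csv
import io

# Hypothetical sample shaped like the notebook's parsed dictionary (illustration only).
row = {
    "ISP": "Chunghwa Telecom Co. Ltd.",
    "Domain Name": "cht.com.tw",
    "City": "Taichung, Taichung",  # contains a comma, so the csv module must quote it
}


def dict_to_csv_text(data):
    """Serialize one dictionary as a header line plus one data line."""
    buffer = io.StringIO(newline="")  # newline="" disables newline translation
    writer = csv.DictWriter(
        buffer, fieldnames=list(data.keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(data)
    return buffer.getvalue()


def csv_text_to_dicts(text):
    """Parse CSV text back into a list of dictionaries, one per data row."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


csv_text = dict_to_csv_text(row)
round_trip = csv_text_to_dicts(csv_text)
print(csv_text)
```

Swapping the `io.StringIO` buffer for `open(path, 'w', newline='')` should give the same bytes in a file; the round trip through `csv.DictReader` is a quick way to confirm that no field was split on an embedded comma.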