@oscarzapi
Last active June 15, 2020 09:56
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Covid19 Data Analysis and Predictive Modeling (Machine Learning)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"The world has never seen anything quite like what is going on right now. Covid19 has been a game changer, and its effects have been devastating in both human and economic terms. The number of people who have been infected is uncertain, the spread rate is not yet under control, and the deaths caused in such a short period of time have made the spread of infectious diseases one of the main threats the world must tackle from now on.\n",
"\n",
"With all that happening, a lot of data has been generated, and with it many studies have been published to try to understand and analyze how this virus affects people and how it can be stopped.\n",
"\n",
"So, in this work, I will go through some statistical analysis and predictive modeling to try to gain better insights into the covid19 outbreak."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[1. Exploratory Data Analysis (EDA):](#part1)\n",
"\n",
" - [1.1. Analysis of features](#part1_1)\n",
" - [1.2. Null values](#part1_2)\n",
" - [1.3. Most influential variables on new_deaths](#part1_3)\n",
" - [1.4. Correlation between Features](#part1_4)\n",
"\n",
"[2. Feature Engineering and Data Cleaning:](#part2)\n",
" - [2.1. Dummy variables](#part2_1)\n",
" - [2.2. Variable selection with low variances](#part2_2)\n",
" - [2.3. Univariate Variable Selection](#part2_3)\n",
" - [2.4. Variable selection based on percentile score](#part2_4)\n",
"\n",
"[3. Predictive Modelling (Scikit-Learn)](#part3)\n",
" - [3.1. Linear Regression](#part3_1)\n",
" - [3.2. Ridge regression (regularization)](#part3_2)\n",
" - [3.3. Lasso regression (regularization)](#part3_3)\n",
" - [3.4. Elastic Net (regularization)](#part3_4)\n",
" - [3.5. Model evaluation by polynomial degree](#part3_5)\n",
" - [3.6. Validation curves for new_deaths](#part3_6)\n",
" - [3.7. Conclusions](#part3_7)\n",
"\n",
"[4. Predictive Modelling (TensorFlow)](#part4)\n",
" - [4.1. TensorFlow for Polynomial Linear Regressions to new_deaths](#part4_1)\n",
" \n",
"[5. Conclusions](#part5)\n",
"\n",
"[6. References](#part6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Exploratory Data Analysis (EDA):<a id='part1'></a>\n",
"\n",
"Let's dig into the data structure, see how each column is distributed, and get some statistics out of it."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"# Importing libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"plt.style.use(\"fivethirtyeight\")\n",
"import warnings\n",
"import pymongo\n",
"import pylab \n",
"import scipy.stats as stats\n",
"warnings.filterwarnings(\"ignore\")\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"# Raise the pandas display limits so that wide and long tables are shown in full.\n",
"pd.set_option('display.max_rows', 500)\n",
"pd.set_option('display.max_columns', 50)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"# Set up the connection to the local MongoDB instance.\n",
"from pymongo import MongoClient\n",
"client = MongoClient('mongodb://localhost:27017/')"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"# Connecting to Covid19 DB inside the MongoDB connection\n",
"db = client[\"covid19\"]"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"# Getting the collection of cases from the collections inside covid19 DB \n",
"cases = db.cases"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>_id</th>\n",
" <th>iso_code</th>\n",
" <th>location</th>\n",
" <th>date</th>\n",
" <th>total_cases</th>\n",
" <th>new_cases</th>\n",
" <th>total_deaths</th>\n",
" <th>new_deaths</th>\n",
" <th>total_cases_per_million</th>\n",
" <th>new_cases_per_million</th>\n",
" <th>total_deaths_per_million</th>\n",
" <th>new_deaths_per_million</th>\n",
" <th>total_tests</th>\n",
" <th>new_tests</th>\n",
" <th>total_tests_per_thousand</th>\n",
" <th>new_tests_per_thousand</th>\n",
" <th>new_tests_smoothed</th>\n",
" <th>new_tests_smoothed_per_thousand</th>\n",
" <th>tests_units</th>\n",
" <th>stringency_index</th>\n",
" <th>population</th>\n",
" <th>population_density</th>\n",
" <th>median_age</th>\n",
" <th>aged_65_older</th>\n",
" <th>aged_70_older</th>\n",
" <th>gdp_per_capita</th>\n",
" <th>extreme_poverty</th>\n",
" <th>cvd_death_rate</th>\n",
" <th>diabetes_prevalence</th>\n",
" <th>female_smokers</th>\n",
" <th>male_smokers</th>\n",
" <th>handwashing_facilities</th>\n",
" <th>__v</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5ee30fbc82af1b3050d28c61</td>\n",
" <td>AFG</td>\n",
" <td>Afghanistan</td>\n",
" <td>2019-12-31</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>38928341.0</td>\n",
" <td>54.422</td>\n",
" <td>18.6</td>\n",
" <td>2.581</td>\n",
" <td>1.337</td>\n",
" <td>1803.987</td>\n",
" <td>NaN</td>\n",
" <td>597.029</td>\n",
" <td>9.59</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>37.746</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5ee30fbc82af1b3050d28c62</td>\n",
" <td>AFG</td>\n",
" <td>Afghanistan</td>\n",
" <td>2020-01-01</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>0.0</td>\n",
" <td>38928341.0</td>\n",
" <td>54.422</td>\n",
" <td>18.6</td>\n",
" <td>2.581</td>\n",
" <td>1.337</td>\n",
" <td>1803.987</td>\n",
" <td>NaN</td>\n",
" <td>597.029</td>\n",
" <td>9.59</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>37.746</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5ee30fbc82af1b3050d28c63</td>\n",
" <td>AFG</td>\n",
" <td>Afghanistan</td>\n",
" <td>2020-01-02</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>0.0</td>\n",
" <td>38928341.0</td>\n",
" <td>54.422</td>\n",
" <td>18.6</td>\n",
" <td>2.581</td>\n",
" <td>1.337</td>\n",
" <td>1803.987</td>\n",
" <td>NaN</td>\n",
" <td>597.029</td>\n",
" <td>9.59</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>37.746</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>5ee30fbc82af1b3050d28c64</td>\n",
" <td>AFG</td>\n",
" <td>Afghanistan</td>\n",
" <td>2020-01-03</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>0.0</td>\n",
" <td>38928341.0</td>\n",
" <td>54.422</td>\n",
" <td>18.6</td>\n",
" <td>2.581</td>\n",
" <td>1.337</td>\n",
" <td>1803.987</td>\n",
" <td>NaN</td>\n",
" <td>597.029</td>\n",
" <td>9.59</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>37.746</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5ee30fbc82af1b3050d28c65</td>\n",
" <td>AFG</td>\n",
" <td>Afghanistan</td>\n",
" <td>2020-01-04</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>0.0</td>\n",
" <td>38928341.0</td>\n",
" <td>54.422</td>\n",
" <td>18.6</td>\n",
" <td>2.581</td>\n",
" <td>1.337</td>\n",
" <td>1803.987</td>\n",
" <td>NaN</td>\n",
" <td>597.029</td>\n",
" <td>9.59</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>37.746</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" _id iso_code location date total_cases \\\n",
"0 5ee30fbc82af1b3050d28c61 AFG Afghanistan 2019-12-31 0 \n",
"1 5ee30fbc82af1b3050d28c62 AFG Afghanistan 2020-01-01 0 \n",
"2 5ee30fbc82af1b3050d28c63 AFG Afghanistan 2020-01-02 0 \n",
"3 5ee30fbc82af1b3050d28c64 AFG Afghanistan 2020-01-03 0 \n",
"4 5ee30fbc82af1b3050d28c65 AFG Afghanistan 2020-01-04 0 \n",
"\n",
" new_cases total_deaths new_deaths total_cases_per_million \\\n",
"0 0 0 0 0.0 \n",
"1 0 0 0 0.0 \n",
"2 0 0 0 0.0 \n",
"3 0 0 0 0.0 \n",
"4 0 0 0 0.0 \n",
"\n",
" new_cases_per_million total_deaths_per_million new_deaths_per_million \\\n",
"0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 \n",
"\n",
" total_tests new_tests total_tests_per_thousand new_tests_per_thousand \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"\n",
" new_tests_smoothed new_tests_smoothed_per_thousand tests_units \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" stringency_index population population_density median_age \\\n",
"0 NaN 38928341.0 54.422 18.6 \n",
"1 0.0 38928341.0 54.422 18.6 \n",
"2 0.0 38928341.0 54.422 18.6 \n",
"3 0.0 38928341.0 54.422 18.6 \n",
"4 0.0 38928341.0 54.422 18.6 \n",
"\n",
" aged_65_older aged_70_older gdp_per_capita extreme_poverty \\\n",
"0 2.581 1.337 1803.987 NaN \n",
"1 2.581 1.337 1803.987 NaN \n",
"2 2.581 1.337 1803.987 NaN \n",
"3 2.581 1.337 1803.987 NaN \n",
"4 2.581 1.337 1803.987 NaN \n",
"\n",
" cvd_death_rate diabetes_prevalence female_smokers male_smokers \\\n",
"0 597.029 9.59 NaN NaN \n",
"1 597.029 9.59 NaN NaN \n",
"2 597.029 9.59 NaN NaN \n",
"3 597.029 9.59 NaN NaN \n",
"4 597.029 9.59 NaN NaN \n",
"\n",
" handwashing_facilities __v \n",
"0 37.746 0 \n",
"1 37.746 0 \n",
"2 37.746 0 \n",
"3 37.746 0 \n",
"4 37.746 0 "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Converting the previous collection into a DataFrame\n",
"df = pd.DataFrame(list(cases.find()))\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"# Setting date column as index in the dataframe\n",
"df.date = pd.to_datetime(df.date)\n",
"df.set_index('date', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Total</th>\n",
" <th>Percent</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>new_tests</th>\n",
" <td>17495</td>\n",
" <td>0.754648</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_tests_per_thousand</th>\n",
" <td>17495</td>\n",
" <td>0.754648</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_tests</th>\n",
" <td>16841</td>\n",
" <td>0.726437</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_tests_per_thousand</th>\n",
" <td>16841</td>\n",
" <td>0.726437</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_tests_smoothed</th>\n",
" <td>16316</td>\n",
" <td>0.703792</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_tests_smoothed_per_thousand</th>\n",
" <td>16316</td>\n",
" <td>0.703792</td>\n",
" </tr>\n",
" <tr>\n",
" <th>handwashing_facilities</th>\n",
" <td>13852</td>\n",
" <td>0.597507</td>\n",
" </tr>\n",
" <tr>\n",
" <th>extreme_poverty</th>\n",
" <td>9297</td>\n",
" <td>0.401027</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male_smokers</th>\n",
" <td>6389</td>\n",
" <td>0.275590</td>\n",
" </tr>\n",
" <tr>\n",
" <th>female_smokers</th>\n",
" <td>6199</td>\n",
" <td>0.267394</td>\n",
" </tr>\n",
" <tr>\n",
" <th>stringency_index</th>\n",
" <td>4633</td>\n",
" <td>0.199845</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aged_65_older</th>\n",
" <td>2466</td>\n",
" <td>0.106371</td>\n",
" </tr>\n",
" <tr>\n",
" <th>gdp_per_capita</th>\n",
" <td>2398</td>\n",
" <td>0.103438</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aged_70_older</th>\n",
" <td>2282</td>\n",
" <td>0.098434</td>\n",
" </tr>\n",
" <tr>\n",
" <th>median_age</th>\n",
" <td>2175</td>\n",
" <td>0.093819</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cvd_death_rate</th>\n",
" <td>2153</td>\n",
" <td>0.092870</td>\n",
" </tr>\n",
" <tr>\n",
" <th>diabetes_prevalence</th>\n",
" <td>1480</td>\n",
" <td>0.063840</td>\n",
" </tr>\n",
" <tr>\n",
" <th>population_density</th>\n",
" <td>978</td>\n",
" <td>0.042186</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_deaths_per_million</th>\n",
" <td>282</td>\n",
" <td>0.012164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_deaths_per_million</th>\n",
" <td>282</td>\n",
" <td>0.012164</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Total Percent\n",
"new_tests 17495 0.754648\n",
"new_tests_per_thousand 17495 0.754648\n",
"total_tests 16841 0.726437\n",
"total_tests_per_thousand 16841 0.726437\n",
"new_tests_smoothed 16316 0.703792\n",
"new_tests_smoothed_per_thousand 16316 0.703792\n",
"handwashing_facilities 13852 0.597507\n",
"extreme_poverty 9297 0.401027\n",
"male_smokers 6389 0.275590\n",
"female_smokers 6199 0.267394\n",
"stringency_index 4633 0.199845\n",
"aged_65_older 2466 0.106371\n",
"gdp_per_capita 2398 0.103438\n",
"aged_70_older 2282 0.098434\n",
"median_age 2175 0.093819\n",
"cvd_death_rate 2153 0.092870\n",
"diabetes_prevalence 1480 0.063840\n",
"population_density 978 0.042186\n",
"new_deaths_per_million 282 0.012164\n",
"total_deaths_per_million 282 0.012164"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking for null values by column\n",
"# Percentage of missing values.\n",
"total = df.isnull().sum().sort_values(ascending=False)\n",
"percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)\n",
"missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])\n",
"missing_data.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x34b83dc588>"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Bar chart of missing values per column\n",
"total_missing = total[total > 0].sort_values()\n",
"total_missing.plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the previous results we can see that several columns contain a large share of null values: the testing-related columns are missing in more than 70% of the rows. We will deal with this in the following steps."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many people died after testing positive?"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The number of people who died after being tested positive was: 416430 / 7343562 ( 5.67 %)\n"
]
}
],
"source": [
"# Keep only per-country rows: the aggregated 'World' rows would double-count every case and death\n",
"new_deaths = np.sum(df[\"new_deaths\"][df[\"location\"] != \"World\"])\n",
"new_cases = np.sum(df[\"new_cases\"][df[\"location\"] != \"World\"])\n",
"deaths_percentage = (new_deaths/new_cases)*100\n",
"print(\"The number of people who died after being tested positive was: \", new_deaths, \"/\", new_cases, \"( \", np.round(deaths_percentage, 2), \"%)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After seeing this number, we need to dig deeper to find out which kinds of people died: their age, country, economic situation, etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.1 Analysis of features <a id='part1_1'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1.1. Ordinal Features\n",
"\n",
"An ordinal feature is one that takes a limited number of values that have a natural order. For example, size is an ordinal variable whose values can be huge, big, medium and small; these values can be sorted by their meaning.\n",
"\n",
"This dataset does not contain any ordinal features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1.2. Continuous Features\n",
"\n",
"A continuous feature is one that can take any value between a minimum and a maximum. Its values are numeric (integer or float) and give measurable information.\n",
"\n",
"Continuous features in this dataset are: **'_id',
" 'date',\n",
" 'total_cases',\n",
" 'new_cases',\n",
" 'total_deaths',\n",
" 'new_deaths',\n",
" 'total_cases_per_million',\n",
" 'new_cases_per_million',\n",
" 'total_deaths_per_million',\n",
" 'new_deaths_per_million',\n",
" 'total_tests',\n",
" 'new_tests',\n",
" 'total_tests_per_thousand',\n",
" 'new_tests_per_thousand',\n",
" 'new_tests_smoothed',\n",
" 'new_tests_smoothed_per_thousand',\n",
" 'stringency_index',\n",
" 'population',\n",
" 'population_density',\n",
" 'aged_65_older',\n",
" 'aged_70_older',\n",
" 'gdp_per_capita',\n",
" 'extreme_poverty',\n",
" 'cvd_death_rate',\n",
" 'diabetes_prevalence',\n",
" 'female_smokers',\n",
" 'male_smokers',\n",
" 'handwashing_facilities'**"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([dtype('O'), dtype('int64'), dtype('float64')], dtype=object)"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# data types in this dataset\n",
"df.dtypes.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1.3. Categorical Features\n",
"\n",
"A categorical feature is one that takes a limited number of values. For example, vehicle is a categorical variable whose values can be car, truck, motorbike and bicycle. These values cannot be sorted and have no numeric relationship to each other.\n",
"\n",
"One way to find the categorical features in a dataset is to inspect the column types and keep the columns whose type is \"string\" or equivalent (`object` in pandas).\n",
"\n",
"Categorical features in this dataset are: **iso_code, location, tests_units, median_age**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyzing categorical features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Location -> Categorical Feature"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new_deaths</th>\n",
" </tr>\n",
" <tr>\n",
" <th>location</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>United States</th>\n",
" <td>112924</td>\n",
" </tr>\n",
" <tr>\n",
" <th>United Kingdom</th>\n",
" <td>41128</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Brazil</th>\n",
" <td>39680</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Italy</th>\n",
" <td>34114</td>\n",
" </tr>\n",
" <tr>\n",
" <th>France</th>\n",
" <td>29319</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Spain</th>\n",
" <td>27136</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mexico</th>\n",
" <td>15357</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Belgium</th>\n",
" <td>9629</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Germany</th>\n",
" <td>8755</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Iran</th>\n",
" <td>8506</td>\n",
" </tr>\n",
" <tr>\n",
" <th>India</th>\n",
" <td>8102</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Canada</th>\n",
" <td>7960</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Russia</th>\n",
" <td>6358</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Netherlands</th>\n",
" <td>6042</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Peru</th>\n",
" <td>5903</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sweden</th>\n",
" <td>4795</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Turkey</th>\n",
" <td>4746</td>\n",
" </tr>\n",
" <tr>\n",
" <th>China</th>\n",
" <td>4638</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ecuador</th>\n",
" <td>3720</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Chile</th>\n",
" <td>2475</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Pakistan</th>\n",
" <td>2356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Indonesia</th>\n",
" <td>1959</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ireland</th>\n",
" <td>1695</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Switzerland</th>\n",
" <td>1674</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Portugal</th>\n",
" <td>1495</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colombia</th>\n",
" <td>1433</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Romania</th>\n",
" <td>1360</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Egypt</th>\n",
" <td>1342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>South Africa</th>\n",
" <td>1210</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Poland</th>\n",
" <td>1206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Philippines</th>\n",
" <td>1027</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bangladesh</th>\n",
" <td>1012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Japan</th>\n",
" <td>920</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ukraine</th>\n",
" <td>833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Saudi Arabia</th>\n",
" <td>819</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Argentina</th>\n",
" <td>735</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>732</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Austria</th>\n",
" <td>673</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Denmark</th>\n",
" <td>593</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hungary</th>\n",
" <td>551</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dominican Republic</th>\n",
" <td>550</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bolivia</th>\n",
" <td>512</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Iraq</th>\n",
" <td>426</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Panama</th>\n",
" <td>413</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>405</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sudan</th>\n",
" <td>389</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nigeria</th>\n",
" <td>382</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Moldova</th>\n",
" <td>371</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Czech Republic</th>\n",
" <td>330</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Finland</th>\n",
" <td>324</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guatemala</th>\n",
" <td>316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Israel</th>\n",
" <td>299</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Honduras</th>\n",
" <td>290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Belarus</th>\n",
" <td>288</td>\n",
" </tr>\n",
" <tr>\n",
" <th>United Arab Emirates</th>\n",
" <td>284</td>\n",
" </tr>\n",
" <tr>\n",
" <th>South Korea</th>\n",
" <td>276</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kuwait</th>\n",
" <td>275</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Serbia</th>\n",
" <td>251</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Norway</th>\n",
" <td>239</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Armenia</th>\n",
" <td>227</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Morocco</th>\n",
" <td>210</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cameroon</th>\n",
" <td>208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Greece</th>\n",
" <td>183</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bulgaria</th>\n",
" <td>167</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Macedonia</th>\n",
" <td>164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bosnia and Herzegovina</th>\n",
" <td>160</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Puerto Rico</th>\n",
" <td>143</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Yemen</th>\n",
" <td>129</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Malaysia</th>\n",
" <td>118</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Luxembourg</th>\n",
" <td>110</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Slovenia</th>\n",
" <td>109</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Croatia</th>\n",
" <td>106</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Australia</th>\n",
" <td>102</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Azerbaijan</th>\n",
" <td>102</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mali</th>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Democratic Republic of Congo</th>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kenya</th>\n",
" <td>89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Somalia</th>\n",
" <td>85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Oman</th>\n",
" <td>84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cuba</th>\n",
" <td>83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Lithuania</th>\n",
" <td>74</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Chad</th>\n",
" <td>72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Estonia</th>\n",
" <td>69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kazakhstan</th>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Qatar</th>\n",
" <td>66</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Niger</th>\n",
" <td>65</td>\n",
" </tr>\n",
" <tr>\n",
" <th>El Salvador</th>\n",
" <td>64</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mauritania</th>\n",
" <td>61</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Haiti</th>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Thailand</th>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nicaragua</th>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Senegal</th>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Burkina Faso</th>\n",
" <td>53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Andorra</th>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sierra Leone</th>\n",
" <td>50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Tunisia</th>\n",
" <td>49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Tajikistan</th>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ghana</th>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>San Marino</th>\n",
" <td>42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cote d'Ivoire</th>\n",
" <td>41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ethiopia</th>\n",
" <td>35</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Djibouti</th>\n",
" <td>34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bahrain</th>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kosovo</th>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Liberia</th>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jersey</th>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Lebanon</th>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Slovakia</th>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Latvia</th>\n",
" <td>26</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kyrgyzstan</th>\n",
" <td>26</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Singapore</th>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Congo</th>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Isle of Man</th>\n",
" <td>24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Uruguay</th>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Venezuela</th>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guinea</th>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New Zealand</th>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gabon</th>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Tanzania</th>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Uzbekistan</th>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>South Sudan</th>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cyprus</th>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nepal</th>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sint Maarten (Dutch part)</th>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Georgia</th>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guernsey</th>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Togo</th>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Equatorial Guinea</th>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guyana</th>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guinea-Bissau</th>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sao Tome and Principe</th>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Costa Rica</th>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sri Lanka</th>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Paraguay</th>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bahamas</th>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Zambia</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jamaica</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Madagascar</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mauritius</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Iceland</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bermuda</th>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jordan</th>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Malta</th>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Montenegro</th>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Trinidad and Tobago</th>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Maldives</th>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Barbados</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Taiwan</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>International</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Central African Republic</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>United States Virgin Islands</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Syria</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Myanmar</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cape Verde</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Monaco</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Palestine</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Libya</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guam</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Benin</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Zimbabwe</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Malawi</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Swaziland</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Aruba</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Belize</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mozambique</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Brunei</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Suriname</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Rwanda</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Comoros</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Northern Mariana Islands</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>British Virgin Islands</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Western Sahara</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cayman Islands</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Curacao</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Liechtenstein</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Montserrat</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gambia</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Burundi</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Turks and Caicos Islands</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Botswana</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Grenada</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Uganda</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Anguilla</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hong Kong</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vietnam</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Laos</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mongolia</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Lesotho</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vatican</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Greenland</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Saint Lucia</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gibraltar</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Namibia</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Saint Kitts and Nevis</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Papua New Guinea</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Seychelles</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dominica</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New Caledonia</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Eritrea</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Faeroe Islands</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Timor</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Falkland Islands</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cambodia</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fiji</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Saint Vincent and the Grenadines</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>French Polynesia</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bhutan</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bonaire Sint Eustatius and Saba</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new_deaths\n",
"location \n",
"United States 112924\n",
"United Kingdom 41128\n",
"Brazil 39680\n",
"Italy 34114\n",
"France 29319\n",
"Spain 27136\n",
"Mexico 15357\n",
"Belgium 9629\n",
"Germany 8755\n",
"Iran 8506\n",
"India 8102\n",
"Canada 7960\n",
"Russia 6358\n",
"Netherlands 6042\n",
"Peru 5903\n",
"Sweden 4795\n",
"Turkey 4746\n",
"China 4638\n",
"Ecuador 3720\n",
"Chile 2475\n",
"Pakistan 2356\n",
"Indonesia 1959\n",
"Ireland 1695\n",
"Switzerland 1674\n",
"Portugal 1495\n",
"Colombia 1433\n",
"Romania 1360\n",
"Egypt 1342\n",
"South Africa 1210\n",
"Poland 1206\n",
"Philippines 1027\n",
"Bangladesh 1012\n",
"Japan 920\n",
"Ukraine 833\n",
"Saudi Arabia 819\n",
"Argentina 735\n",
"Algeria 732\n",
"Austria 673\n",
"Denmark 593\n",
"Hungary 551\n",
"Dominican Republic 550\n",
"Bolivia 512\n",
"Iraq 426\n",
"Panama 413\n",
"Afghanistan 405\n",
"Sudan 389\n",
"Nigeria 382\n",
"Moldova 371\n",
"Czech Republic 330\n",
"Finland 324\n",
"Guatemala 316\n",
"Israel 299\n",
"Honduras 290\n",
"Belarus 288\n",
"United Arab Emirates 284\n",
"South Korea 276\n",
"Kuwait 275\n",
"Serbia 251\n",
"Norway 239\n",
"Armenia 227\n",
"Morocco 210\n",
"Cameroon 208\n",
"Greece 183\n",
"Bulgaria 167\n",
"Macedonia 164\n",
"Bosnia and Herzegovina 160\n",
"Puerto Rico 143\n",
"Yemen 129\n",
"Malaysia 118\n",
"Luxembourg 110\n",
"Slovenia 109\n",
"Croatia 106\n",
"Australia 102\n",
"Azerbaijan 102\n",
"Mali 96\n",
"Democratic Republic of Congo 95\n",
"Kenya 89\n",
"Somalia 85\n",
"Oman 84\n",
"Cuba 83\n",
"Lithuania 74\n",
"Chad 72\n",
"Estonia 69\n",
"Kazakhstan 67\n",
"Qatar 66\n",
"Niger 65\n",
"El Salvador 64\n",
"Mauritania 61\n",
"Haiti 58\n",
"Thailand 58\n",
"Nicaragua 55\n",
"Senegal 54\n",
"Burkina Faso 53\n",
"Andorra 51\n",
"Sierra Leone 50\n",
"Tunisia 49\n",
"Tajikistan 48\n",
"Ghana 48\n",
"San Marino 42\n",
"Cote d'Ivoire 41\n",
"Ethiopia 35\n",
"Djibouti 34\n",
"Albania 34\n",
"Bahrain 31\n",
"Kosovo 31\n",
"Liberia 31\n",
"Jersey 30\n",
"Lebanon 30\n",
"Slovakia 28\n",
"Latvia 26\n",
"Kyrgyzstan 26\n",
"Singapore 25\n",
"Congo 25\n",
"Isle of Man 24\n",
"Uruguay 23\n",
"Venezuela 23\n",
"Guinea 23\n",
"New Zealand 22\n",
"Gabon 22\n",
"Tanzania 21\n",
"Uzbekistan 19\n",
"South Sudan 19\n",
"Cyprus 18\n",
"Nepal 15\n",
"Sint Maarten (Dutch part) 15\n",
"Georgia 13\n",
"Guernsey 13\n",
"Togo 13\n",
"Equatorial Guinea 12\n",
"Guyana 12\n",
"Guinea-Bissau 12\n",
"Sao Tome and Principe 12\n",
"Costa Rica 12\n",
"Sri Lanka 11\n",
"Paraguay 11\n",
"Bahamas 11\n",
"Zambia 10\n",
"Jamaica 10\n",
"Madagascar 10\n",
"Mauritius 10\n",
"Iceland 10\n",
"Bermuda 9\n",
"Jordan 9\n",
"Malta 9\n",
"Montenegro 9\n",
"Trinidad and Tobago 8\n",
"Maldives 8\n",
"Barbados 7\n",
"Taiwan 7\n",
"International 7\n",
"Central African Republic 6\n",
"United States Virgin Islands 6\n",
"Syria 6\n",
"Myanmar 6\n",
"Cape Verde 5\n",
"Monaco 5\n",
"Palestine 5\n",
"Libya 5\n",
"Guam 5\n",
"Angola 4\n",
"Benin 4\n",
"Zimbabwe 4\n",
"Malawi 4\n",
"Antigua and Barbuda 3\n",
"Swaziland 3\n",
"Aruba 3\n",
"Belize 2\n",
"Mozambique 2\n",
"Brunei 2\n",
"Suriname 2\n",
"Rwanda 2\n",
"Comoros 2\n",
"Northern Mariana Islands 2\n",
"British Virgin Islands 1\n",
"Western Sahara 1\n",
"Cayman Islands 1\n",
"Curacao 1\n",
"Liechtenstein 1\n",
"Montserrat 1\n",
"Gambia 1\n",
"Burundi 1\n",
"Turks and Caicos Islands 1\n",
"Botswana 1\n",
"Grenada 0\n",
"Uganda 0\n",
"Anguilla 0\n",
"Hong Kong 0\n",
"Vietnam 0\n",
"Laos 0\n",
"Mongolia 0\n",
"Lesotho 0\n",
"Vatican 0\n",
"Greenland 0\n",
"Saint Lucia 0\n",
"Gibraltar 0\n",
"Namibia 0\n",
"Saint Kitts and Nevis 0\n",
"Papua New Guinea 0\n",
"Seychelles 0\n",
"Dominica 0\n",
"New Caledonia 0\n",
"Eritrea 0\n",
"Faeroe Islands 0\n",
"Timor 0\n",
"Falkland Islands 0\n",
"Cambodia 0\n",
"Fiji 0\n",
"Saint Vincent and the Grenadines 0\n",
"French Polynesia 0\n",
"Bhutan 0\n",
"Bonaire Sint Eustatius and Saba 0"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Getting the ranking of countries by number of deaths \n",
"pd.DataFrame(df[df[\"location\"] != \"World\"].groupby([\"location\"])[\"new_deaths\"].sum().sort_values(ascending=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to this data, people died most when their median ages were between 35 and 45 years old."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Median_age -> Categorical Feature"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let´s map the median_age column to an age status so that we can see better if the categories map accordingly to the data we think we have."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"df[\"age_status\"] = df[\"median_age\"].apply(lambda x: \"Very old\" if x > 80 else \"Old\" if x > 60 else \"Adult\" if x > 40 else \"Millenial\" if x > 30 else \"Teenager\")"
]
},
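  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same thresholds can also be expressed with `pd.cut`, which keeps a missing `median_age` as missing instead of letting it fall through to the last label of the chained conditional. This is only a minimal sketch on a hypothetical sample series (the `median_age` values below are made up), not the approach used in this notebook:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical country-level median ages, including one missing value.\n",
    "median_age = pd.Series([19.6, 29.3, 38.3, 47.9, np.nan])\n",
    "\n",
    "# Same cut points and labels as the lambda: x > 30, x > 40, x > 60, x > 80.\n",
    "# right=True makes each bin closed on the right, e.g. (30, 40].\n",
    "bins = [-np.inf, 30, 40, 60, 80, np.inf]\n",
    "labels = ['Teenager', 'Millenial', 'Adult', 'Old', 'Very old']\n",
    "age_status = pd.cut(median_age, bins=bins, labels=labels, right=True)\n",
    "\n",
    "print(age_status.tolist())\n",
    "```\n",
    "\n",
    "With this variant, rows without a median age stay null and can be handled explicitly later, rather than being counted in the youngest category."
   ]
  },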
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new_deaths</th>\n",
" </tr>\n",
" <tr>\n",
" <th>age_status</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Millenial</th>\n",
" <td>186868</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Adult</th>\n",
" <td>180766</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Teenager</th>\n",
" <td>48796</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new_deaths\n",
"age_status \n",
"Millenial 186868\n",
"Adult 180766\n",
"Teenager 48796"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Getting the ranking of people by their median age when they died\n",
"pd.DataFrame(df[df[\"location\"] != \"World\"].groupby([\"age_status\"])[\"new_deaths\"].sum().sort_values(ascending=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to this analysis, people have died most where countries median age were 30 or below. Here, Millenial (30-40) and Teenager(<20) make up most of the deaths worlwide "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2. Null values <a id='part1_2'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, if we check if the values corresponding to the test-related columns have something to do with the data not being populated, then we will need to do so with zero´s. It will not have any effect into the total counts of these columns"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"values = {'total_tests':-1, 'new_tests':-1, 'total_tests_per_thousand':-1, 'new_tests_per_thousand':-1, 'new_tests_smoothed':-1, 'new_tests_smoothed_per_thousand':-1}\n",
"df = df.fillna(value=values)"
]
},
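  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate why a -1 sentinel needs care when aggregating, here is a small sketch on a hypothetical series of test counts (the numbers are made up and only stand in for a column such as `total_tests`):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical test counts with one unreported day.\n",
    "total_tests = pd.Series([100.0, np.nan, 50.0])\n",
    "filled = total_tests.fillna(-1)\n",
    "\n",
    "# The sentinel leaks into a naive sum ...\n",
    "naive_sum = filled.sum()\n",
    "# ... so mask it out before aggregating.\n",
    "clean_sum = filled[filled >= 0].sum()\n",
    "\n",
    "print(naive_sum, clean_sum)\n",
    "```\n",
    "\n",
    "The two sums differ by the number of filled rows, which is why any later total over these columns should filter out the sentinel first."
   ]
  },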
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Later, all columns related to cases, and deaths per million will be filled also with zeros so that we don´t have any null values in those. It makes sense to substitute this values by zero as they don´t affect other analysis."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"values = {'total_deaths_per_million':-1, 'new_deaths_per_million':-1, 'new_cases_per_million':-1, 'total_cases_per_million':-1, }\n",
"df = df.fillna(value=values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns related to facts that can not change over time fast can be filled with their mode values. In this dataset, we have: \n",
"- extreme_poverty\n",
"- male_smokers\n",
"- female_smokers\n",
"- stringency_index\n",
"- aged_65_older\n",
"- gdp_per_capita\n",
"- aged_70_older\n",
"- cvd_death_rate\n",
"- median_age\n",
"- diabetes_prevalence\n",
"- population_density\n",
"- population"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"cols = [\"extreme_poverty\", \"male_smokers\", \"female_smokers\", \"stringency_index\", \"aged_65_older\", \"gdp_per_capita\", \"aged_70_older\", \"cvd_death_rate\", \"median_age\", \"diabetes_prevalence\", \"population_density\", \"population\"]\n",
"df[cols]=df[cols].fillna(df.mode().iloc[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For handwashing_facilities, we can see the trends of this column, because we may suspect that data is being filled over time, and the values are increasing as people is installing these devices according to government laws."
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x34b8745fc8>"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"df.groupby([\"date\"])[\"handwashing_facilities\"].sum().plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we were suspecting, this assumption is true. Now we will apply the mean value by country to the null ones, so that each country has a corresponding value according to the rest."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"#df.groupby(\"location\")[\"handwashing_facilities\"].apply(lambda x: x.fillna(x.mean()))\n",
"df[\"handwashing_facilities\"] = df[\"handwashing_facilities\"].fillna(df[\"handwashing_facilities\"].mean())"
]
},
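  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A per-country fill can be sketched with `groupby(...).transform('mean')`, falling back to the overall mean for countries that have no observed value at all. The toy frame below is hypothetical; only the column names `location` and `handwashing_facilities` come from this notebook:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data: country C has no observed value.\n",
    "toy = pd.DataFrame({\n",
    "    'location': ['A', 'A', 'B', 'B', 'C'],\n",
    "    'handwashing_facilities': [10.0, np.nan, 30.0, 50.0, np.nan],\n",
    "})\n",
    "\n",
    "col = 'handwashing_facilities'\n",
    "# Per-country mean, aligned with the original rows.\n",
    "country_mean = toy.groupby('location')[col].transform('mean')\n",
    "# Country mean first, then the overall mean of the observed values\n",
    "# for countries without any data.\n",
    "toy[col] = toy[col].fillna(country_mean).fillna(toy[col].mean())\n",
    "\n",
    "print(toy)\n",
    "```\n",
    "\n",
    "The fallback matters because a country whose values are all null has a null group mean, so a per-country fill alone would leave those rows empty."
   ]
  },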
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Total</th>\n",
" <th>Percent</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>age_status</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_tests_smoothed</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>iso_code</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>location</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_cases</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_cases</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_deaths</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_deaths</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_cases_per_million</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_cases_per_million</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_deaths_per_million</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_deaths_per_million</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_tests</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_tests</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_tests_per_thousand</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_tests_per_thousand</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new_tests_smoothed_per_thousand</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>__v</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>tests_units</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>stringency_index</th>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Total Percent\n",
"age_status 0 0.0\n",
"new_tests_smoothed 0 0.0\n",
"iso_code 0 0.0\n",
"location 0 0.0\n",
"total_cases 0 0.0\n",
"new_cases 0 0.0\n",
"total_deaths 0 0.0\n",
"new_deaths 0 0.0\n",
"total_cases_per_million 0 0.0\n",
"new_cases_per_million 0 0.0\n",
"total_deaths_per_million 0 0.0\n",
"new_deaths_per_million 0 0.0\n",
"total_tests 0 0.0\n",
"new_tests 0 0.0\n",
"total_tests_per_thousand 0 0.0\n",
"new_tests_per_thousand 0 0.0\n",
"new_tests_smoothed_per_thousand 0 0.0\n",
"__v 0 0.0\n",
"tests_units 0 0.0\n",
"stringency_index 0 0.0"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking for null values by column\n",
"# Percentage of missing values.\n",
"total = df.isnull().sum().sort_values(ascending=False)\n",
"percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)\n",
"missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])\n",
"missing_data.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x34b87ae488>"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"df.groupby([\"date\"])[\"handwashing_facilities\"].sum().plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.3. Most influential variables on new_deaths <a id='part1_3'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3.1. Target variable: new_deaths"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To know which variables affect most on the new_deaths column, it is necessary to address an analysis on new_deaths ranges, average death age, visualize outliers on this variable..."
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 23183.000000\n",
"mean 35.925463\n",
"std 332.225249\n",
"min -1918.000000\n",
"25% 0.000000\n",
"50% 0.000000\n",
"75% 1.000000\n",
"max 10520.000000\n",
"Name: new_deaths, dtype: float64"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"new_deaths\"].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As this statistical information shows, this variable has large differences between min and max values, and the mean value is far from the minimum and maximum figures on deaths. It may mean that there are different values that may correspond to outliers. To check this fact, let´s show the boxplot below:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x34b87aef08>"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEDCAYAAAAx/aOOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAWaElEQVR4nO3df5wc9X3f8ddqDwtZP05wIB+gJK4x/iKUELsoSNRK4zRY1LFpbAQ6/QCLYBT6SEKjxyNuHm7jNHbaJmla22psk/jhIIhsP2wUB9UWaRyKkwAmSFRg96RL+FacMPLJEtEv7tBVdyfdbf+Y2fV87/Yk3Umr051fz390+53Zme93PrPzntldzZYqlQqSJFVNm+gOSJIuLAaDJClhMEiSEgaDJClhMEiSEk2nmtjd3e1XliRpimtubi4VH3vFIElKGAySpERDg2H37t2NXPyEcmyTz1QdFzi2yepCHZtXDJKkhMEgSUoYDJKkhMEgSUoYDJKkhMEgSUoYDJKkhMEgSUoYDJKkhMEgSUoYDJKkhMEgSUoYDJKkhMEgSUoYDJKkhMEgSUoYDJKkhMEgSUoYDJKkRNNEd2CyWb9+ffJ4w4YNE9QTSWoMrxgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUMBgkSQmDQZKUaGr0Crq7u9m0aRNr165lzpw5jV5dw1THMdz69esnoDeZ5uZmuru7k7YVK1bwta99jTe+8Y0cPXqUlpYWZsyYweDgIIcOHeLEiRO1561YsYKnnnqKV199laamJtavX8+sWbN45JFH6OnpYe7cubz22mvMmzePVatWsXnzZgYHBwEol8vccccdbNmyhbVr11KpVNi4cSODg4OUy2U+9KEPMWfOnLr17+7uZuPGjQC1+YrtJ0+epFQqJcvp6uri05/+NJdffjm33norDz30EG1tbTzyyCPcf//9zJo1i02bNvGBD3yALVu2cNttt/Hoo4+ybNkyNm7cSHNzM0ePHuXee+/l8ccfH7E/jtbPalulUqkt/8tf/jIHDx6kpaUFgMOHDzNv3jxuvfVWHnzwQSqVCm9605tYtWoVjz76aG2Zzz//PF/4whcol8sMDg6ydu1annjiCQ4dOkRbWxubN29OxlLsS3H8K1euTLb7pk2buOGGG/jGN77BbbfdxubNm2vbtjq9uqzitl+xYsWIeevV5UxVt1d121fX2dXVxWc+8xnuv/9+rrrqqtp81Vo18tgwvK5T7XjUqHGUP/axj406sb+/f/SJZ+DIkSM888wztLe3MzAwwMKFC89mcRNq69attLe3T3Q3Ev39/SPaOjo6GBwcpK+vD4Djx4/T09PDsWPHGBoaSp7X0dHBsWPHABgaGmLPnj0cPXqUGCODg4P09vYyODhIT08Pe/bsYd++ffT09NDT00N3dzd79uyhq6uLgYEBOjs72blzZ21atd7V7Vas/9atW9m5c2cyX7H99ddfH7GcBx54gO7ubnp6eujo6KCvr49du3Zx4sSJWr/b29trfers7K
Srq4tdu3bR19dHb28vQ0NDdHR0cPDgwRH742j9rLZ1dnbWlr9//34GBwc5duwYx44dq22jjo4O+vv7GRoaqm2z6vZZuHAhn/jEJ6hUKlQqFQB27dpFd3c3g4ODdcdS7Etx/MO3e3t7O/v372f//v10dnayb9++2rarTi/Wo7rtqzUtzluvLmequr2q275Yu9dee409e/awdOnS2nzDt89ojhw5UgvhsRpe13p1nkjjHdu5HsfFF1/88eLjhr6V1Nvby/bt26lUKjz33HP09PQ0cnUN093dXRvHVHbgwAGeeeaZUafVa6tUKmzfvp1t27Yl07Zv386+fftG1L+6LYvz1WsvTn/xxReT9R8/fhygdvVy4MABnn32WSqVSq1P1X+r8xafO3x/LNZ3eD+r46v+XW87DO/X8O3z3HPP8a1vfavW36ri4+JYtm3blvSlq6srWW9xu1f7deTIkRH927ZtWzKuaj2Kyylu52effTZ5PJbXa3F7FcddrN2BAweIMdadrxHHhuF17erqmnLHo0aNo6FXDI899hgHDx6kUqlQKpXo7++/IFJ6rLZu3UpXV9eUD4bxGhoaGrFtqlcgvb29Sf07OzvZu3dvMl/1bLXYXpze0dHByZMnT9mHsdamuD8W61vsZ7Wt3vjGuq6Ojo4znr+6rmpfnnzyydqVXdHp+lW8OimVSrz00ku8/vrrdecdvqxKpTKms9F6r5FSqcSuXbuS2lWvaIfPd6pjw9mcVRfr2tnZOWJ/nOjj0XjGVm9/PdtxnNcrhupbEpCdEe3YsaORq2uYHTt2jDjb0+kdOHBgRP137Ngx4mA2WnvV8DPxc6G4PxbrW+xnseZnEwzj3XeqfTnVVcrp+lWdPjg4WDtDPxOVSmVMr9d6r5HBwcG6V2z15mvEsWF4Xevtj5NRvf31XGtoMIQQKJfLQPZh5aJFixq5uoZZtGhRbRw6c62trSPqv2jRIkqlUjLfaO1VM2bMOOd9K+6PxfoW+1ms+Wh9O9N1nU0fW1tbR53ndP2qTi+Xy7S2tp7xOEql0pher/VeI+VyeUTtZsyYUXe+Rhwbhte13v44GdXbX8+1hgZD8cU+bdo0brnllkaurmGWLVt2VgeGqa6pqWnEi72pqYm77rprRP2XLVuWzNvU1FS3vTh97dq1p+3DtGlj25WL+2OxvsV+VtvqjW+s67r99tvPeP7quqp9ufPOO+vOd7p+lcvlZFl33XXXqPM3NTUl27BcLo/p9VrvNTJt2rQRtbv77rvrzteIY8Pwut55551T7njUqHE0NBhmzpzJ4sWLKZVK3HjjjZP262HNzc21cUxlra2tvPOd7xx1Wr22UqnE4sWLWbJkSTJt8eLFXHXVVSPqX92WxfnqtRenX3vttcn6q2eh1YNca2srN910E6VSqdan6r/1zliH74/F+g7vZ3V81b9PdfY+fF3VPtx4440sXbq07pny8L9bW1tZsmRJ0pf58+cn6y1u92q/Lr300hH9W7JkSTKuaj2Kyylu55tuuil5PJbXa3F7FcddrF1rayshhLrzNeLYMLyu8+fPn3LHo0aNo+FfV73++uvZu3cvy5cvZ/r06WezuAk1f/589u7dy9GjRye6KzXNzc0jvrK6YsUKXnrpJebMmUNfXx8tLS20tLQwc+bM2lcpq89bsWIFhw8fpre3l4suuoj77ruPBQsW1D4wbGlp4cSJE1xxxRWsWbOGffv2MXv2bObMmcMll1zCqlWrePXVV1m+fDlXX301L7/8MrNnz+aSSy7hjjvuYPr06bXtVqz//Pnzefnll5k7d25tvmL7rFmzaG5uTpbz5je/meeff54rrriCtrY2du3axerVq4kxsm7dOhYsWMDevXtpa2vj1VdfZeXKlRw4cIDly5ezc+dOWlpaGBgY4J577uHIkSMj9sfR+lltu/rqq2vLf+WVV+jv72fevHnMnj2bgYGBWr/a29uZNm0aV155JWvWrKn1Yfr06Vx22WW0t7dTLp
epVCp88IMf5NChQwwMDLBq1aoRYyn2pTj+1atXJ9t97969LF26lL6+PlauXMm+fftq27Y6vbqs4rZfvXr1iHnr1eVMVbdXcdtXa/fCCy+wbt26WtAVa3W6Y8PZfF11eF3r1XkijXds53ocwz98Lp3qw6ju7u6z+hrO7t27ueaaa85mERec4f+hbcOGDRPUk8aZinWDqTsucGyT1YUytubm5uTtEG+JIUlKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKGAySpITBIElKNE10ByabDRs2ALB7926uueaaCe6NJJ17XjFIkhIGgyQpYTBIkhIGgyQpYTBIkhIGgyQpYTBIkhIGgyQpYTBIkhIGgyQpYTBIkhIGgyQpYTBIkhIGgyQpYTBIkhIGgyQpYTBIkhIGgyQpYTBIkhKlSqUy6sTu7u7RJ0qSpoTm5uZS8bFXDJKkhMEgSUqc8q0kSdIPH68YJEmJpvE+MYTwTuBTwBuAw8A9McZXQghzgS8BbwEOAitijAdCCG8AHgQWAceB1THGF0MIJeC/Au8DhoB1McZnzmZQjRRCWA18FLgI2BBj/OwEd+mMhBB+G1iRP/yLGONvhBBuBj4JzAAeiTF+NJ/37cCfAHOAp4B/HWM8GUL4UeCLwDwgAmtijMfO81DqCiH8N+CyGOPdY+3/aPvshAxkmBDCrcBvAzOBx2OMvzZV6hZCuBP4d/nDv4wxfngy1y6EMAf4O+B9Mcbvnqs6TcQYz+aK4UvAvTHGt+d//2He/p+Ap2OMC4DPA/89b/83QG/evh54OG9fDiwArgPeDzwcQhh3YDVSCOEq4D8DS4G3A78UQrhuYnt1evkOugx4B1m/bwghrAI2Ar9Atv1/KoTwnvwpXwR+Ncb4NqAErMvbHwAeiDFeC+wAfuv8jWJ0IYSfA9YWmsba/9H22QkVQngL8Mdkr4vrgX+a12jS1y2E8EayY8bPAD8J/HS+n07K2oUQFgPfAt6WP57BuavTeR/juIIhhDAd+GiMsT1vagd+NP/7vWRBAfBl4D0hhIuK7THGp4DL84R8L/CVGONQjPH/AnuBfzaefp0HNwN/HWM8EmPsBb4K3D7BfToT+4FfjzEOxBhPAP9AtgPvjjG+HGM8Sbaz3hFC+DFgRoxxW/7ch/P2i4B/TjbmWvt5HENdIYRLycL6d/PH4+n/aPvsRPsA2ZlmV163NuD/MQXqBpTJjj8zya6+LwJOMHlrtw74FeD7+eMbOXd1Ou9jHFcwxBj7Y4xfBAghTAM+BvyPfPKVZAci8g3SA1xebM/tB+afov1CNJn6WhNj7KjuiCGEa8jeUhpibPW4DOjJa1psn2ifA34TOJo/Hk//R9tnJ9pbgXII4eshhO8Av8zYX0cXZN1ijK+TnRG/CHQB3wUGmKS1izHeG2N8utB0Lut03sd42rdsQgh3kH2WUPRijPHm/HODP82X87v5tNKweUtkB6FpQGUM7ReiydTXEUIIC4G/AP4tcJL8sjc31jrBBI89hHAv8L0Y4zdDCHfnzePp/2j77ERrIjuLfBdwDPg62edzY3kdXXB1Aw
ghXA/cA/wY0E12Rr2MqVO7sR7vLqgxnjYYYox/BvzZ8PYQwiyyHfUw8Av5pS7APqAV6Mo/K5idz9MFXAF05vO1kl12VdsZ1n4h6gJ+uvD4Qu5rIv+ywJ8D62OMXwkh/Az1t/to9fhHoDmEUI4xDubzTPTY24Ar8rPpS4FZZC+usfZ/tH12oh0AnogxHgQIIWwhe3thsDDPZKwbwC3AN2OM/wgQQngY+DBTp3aj1WM8dTrvYzybD5+/CLwEtMUY+wvt/xP4YP53G9mHJieK7SGEpUBfjHFv3r4mhFAOIbyV7Cz2f59FvxrpCeDnQgiX5x+eLQe+McF9Oq0Qwo+QvdW3Osb4lbx5ezYpvDWEUAZWk30z5BWgLw8SgLvy9hPA02Q1hayWf3neBlFHjPHdMcYfz78A8R+Ar8cYf5Gx93+0fXaiPQbcEkKYm9foPWTvQU/quuX+D3BzCGFm/s3EW4EnmTq1O5evr/M+xnF9+yeE8A6yT9v/HnghhADw/Rjjz5O9b/hwCKEDeA1Ykz/t08Dn8vZ+sg0C2Y6+mOwDbIAPxRiPj6dfjRZj3BdC+E3gb8i+pvsnMcbnJrhbZ+LDwMXAJ/NaQfZtl7vJriIuJtv5qh98rQE+n3/97gV+8I2zXwb+NITwUbIvCaw6H50fh7H2f7R9dkLFGLeHEP6A7NsuFwH/C/gjsvflJ3XdYoyP58eR58k+dH4O+H1gC1Ojdn3525vnok7nfYz+z2dJUsL/+SxJShgMkqSEwSBJShgMkqSEwSBJShgMkqSEwSAVhBAeK9xeYzzP/6kQwh/nf78rhLDrnHVOOk8MBuncWsgFcJM66WxckL97oB9eIYR3kd1Gew/w42T/4/c+stuk/Bey+/eXgW+T/cbHPcANMca78lsRHwZ+Lcb4UH7rlU/EGBefYn1Xkt0I8krgFbIfSalOW0B27/uWfJ1/GGPcmN9R+FPAErL71pSAe8n+t+rvkN3z5qF8ubNCCF8BriX7H7DrYoxP5337ZL7cCvB7McY/P5ttJ50rXjHoQrSY7ID+DuAhsjv3foTsjrA3xBh/kuwGY78PPEp2P6FpZD+g1Au8O1/OvyK7JcGpfBbYFmNcSBY01wLkNyv7KvCRGOMNZIH04RDCkrx/VwI3xRivIwuAj8QYv0d2z6an83s2QXb18Kn8fk6fI7tFPcDHgU/my74H+Bdj3kpSgxgMuhC9EmP8Tv73C2R3Tn0f2f25vp3fTfX9wHX5jRi/B9wA/Evg94CfzW/MdibBcDP5rwnGGF8C/jpvfxtwNbAxX9+TZD/R+I4Y47NkP+96X8h+UvR2sju71tMZY9ye//0dfnBFshn4bAjhS3nf//1p+imdNwaDLkTFmyhWyN6qKZO9RfT2/Oz7Rn7w63lbgJ8nu5//V8neEmoDjscYOzm16vKrqj+UUga6q+vL17kEeCiE8F6y37UA+BrZDQmH3zO/qngXzNq6YoyfA36C7MZ4twDtIYSLT9NX6bwwGDRZ/BXwqyGEN+RvG32e7OoAsreTVgPTYozfBx4H/oDTXy1Adtv0XwLIf2r2Z/P2CBwP2Q/WV29dvovs7P7dwNYY4x+R/Tbv+8mCBLJgOe3PLoYQ/o7s6uPhfP1zye65L004g0GTxX8k+/nHb5Pd7r0E/DpAjPHvyc7Gv5nP+1fAj3BmwfArwHUhhH8AHiR7u4cY4wDZW1f3hhDaycLmt2KMz5BdIbwrhLCT7K2uTuCf5IG1DXhLCOHR06z3N4DfCSF8G/hb4OMxxu+eQX+lhvO225KkhF9X1ZQWsl8memSUyTHG2DbKNOmHllcMkqSEnzFIkhIGgyQpYTBIkhIGgyQpYTBIkhL/H3yRK1TMzDi9AAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.boxplot(df[\"new_deaths\"], color=\"y\")"
]
},
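  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The whiskers of a boxplot are commonly based on the 1.5 x IQR rule. As a rough sketch, the same rule can be applied by hand to list the flagged values. The series below is hypothetical and only stands in for `df['new_deaths']`:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical daily death counts, including one negative correction.\n",
    "new_deaths = pd.Series([0, 0, 0, 1, 2, 3, 5, 250, -4])\n",
    "\n",
    "q1 = new_deaths.quantile(0.25)\n",
    "q3 = new_deaths.quantile(0.75)\n",
    "iqr = q3 - q1\n",
    "\n",
    "# Values outside [q1 - 1.5 * iqr, q3 + 1.5 * iqr] are flagged as outliers.\n",
    "lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr\n",
    "outliers = new_deaths[(new_deaths < lower) | (new_deaths > upper)]\n",
    "\n",
    "print(lower, upper)\n",
    "print(outliers)\n",
    "```\n",
    "\n",
    "Note that a negative count is not necessarily flagged by this rule, so invalid values such as negative deaths are better handled with an explicit check."
   ]
  },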
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to the values shown previously, we will delete the register where the number of deaths is lower than zero, it does not make sense to have negative values for new deaths."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outlier analysis will be based over the most important variables. A dataframe will be created to select the observations to be deleted from the original dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3.2. Outliers of \"new_deaths\" variable"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"# outliers removal in original data\n",
"df = df.drop(df[df[\"new_deaths\"]<0].index)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x34b884e508>"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAaUAAAEDCAYAAACVlxtdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAacElEQVR4nO3de5RdZZnn8e+pSoUkIIlAroSbbXgCNBhEQAzQ4RIRvHTLRZSIYmNoR1nDcppRHEBbZxxtpgVkDajLqHFWzEhz62bsBJBby+gQOtMktAIPNTZEQoCImCA0udf8sXcVpyqV1KmkktpV5/tZi8XZz7v32e+7d+CXvc979ql1dHQgSVIVtAx2ByRJ6mQoSZIqw1CSJFWGoSRJqgxDSZJUGSMGuwO9Wbt2rVMCJWmYGzt2bK1nzSslSVJlGEqSpMpoulBqb28f7C4MGsfevJp5/I59aGm6UJIkVZehJEmqDENJklQZhpIkqTIMJUlSZRhKkqTKMJQkSZVRyccM7Uqv1fYgf79xq/o+o1oYP7p1EHokSerUdKG0dlONZas3bFU/YcJIQ0mSBpm37yRJlWEoSZIqw1CSJFWGoSRJqgxDSZJUGYaSJKkyDCVJUmUYSpKkyjCUJEmVYShJkirDUJIkVYahJEmqDENJklQZhpIkqTIMJUlSZRhKkqTKMJQkSZVhKEmSKsNQkiRVhqEkSaoMQ0mSVBmGkiSpMgwlSVJljGhkpYi4ALgKaAOuz8wbe7TPAOYBewM/Az6VmZsi4kBgATABSGBOZr5at91U4DHg7Zn5zM4PR5I0lPV5pRQR+wNfBU4EZgCXRMThPVZbAFyamYcCNWBuWb8JuCkzpwNLgavr3reFIshG7uwgJEnDQyO3704H7s/MlzPzNeBW4NzOxog4CBidmQ+XpfnAeRHRBpxcrt9Vr3vfzwH3Ai/tzAAkScNHI6E0BXi+bvl5YGoD7fsBr2Tmpp7bRcQxwKnAtTvWbUnScNTIZ0otQEfdcg3Y0kB7zzrAlogYQ3Fb77zM3BIR2915e3t7A13sj1G8uPrFraovdLTR8tK6Ad5X9Qz88Rw6mnns0Nzjd+zVMW3atO22NxJKK4GT6pYnAat6tE/upX01MDYiWjNzc7nOqvK9JgJ3loE0BVgUER/MzOzvAPpr1ZPPMnHCxK3qkyaMZNqb2wZ0X1XT3t4+4MdzqGjmsUNzj9+xD62xN3L77l7gtIgYX17lnAPc1dmYmSuAdRExsyxdCCzOzI3AQ8D5Zf1jZf3uzDw4M2dk5gyKoDqrt0CSJDWXPkMpM58DrgQeAJYBCzPzkYhYFBHvKFebA1wXEU8CewE3lPVPU8zWe5ziCumqgR6AJGn4aOh7Spm5EFjYo3ZW3evlwHG9bLcCmNXHex/cSB8kScOfT3SQJFWGoSRJqgxDSZJUGYaSJKkyDCVJUmUYSpKkyjCUJEmVYShJkirDUJIkVYahJEmqDENJklQZhpIkqTIMJUlSZRhKkqTKMJQkSZVhKEmSKsNQkiRVhqEkSaoMQ0mSVBmGkiSpMgwlSVJlGEqSpMowlCRJlWEoSZIqw1CSJFWGoSRJqgxDSZJUGYaSJKkyDCVJUmUYSpKkyjCUJEmVYShJkirDUJIkVYahJEmqDENJklQZhpIkqTJGNLJSRFwAXAW0Addn5o092mcA84C9gZ8Bn8rMTRFxILAAmAAkMCczX42Iw8v19wReBi7KzBUDNCZJ0hDV55VSROwPfBU4EZgBXFKGSr0FwKWZeShQA+aW9ZuAmzJzOrAUuLqs3wh8JTPfBtwMfG1nByJJGvoauX13OnB/Zr6cma8BtwLndjZGxEHA6Mx8uCzNB86LiDbg5HL9rnr5enZm3hURLcBBwO93diCSpKGvkdt3U4Dn65afB47ro30qsB/wSmZu6lGnvLU3DngcGAPM2pHOS5KGl0ZCqQXoqFuuAVsaaO9Zp367zFwDTI
mI9wB3RsQhmbm5587b29sb6GJ/jOLF1S9uVX2ho42Wl9YN8L6qZ+CP59DRzGOH5h6/Y6+OadOmbbe9kVBaCZxUtzwJWNWjfXIv7auBsRHRWobN5M7tIuJDwC2Z2VHexhsNvBl4qb8D6K9VTz7LxAkTt6pPmjCSaW9uG9B9VU17e/uAH8+hopnHDs09fsc+tMbeyGdK9wKnRcT4iBgDnAPc1dlYzppbFxEzy9KFwOLM3Ag8BJxf1j8GLC5fXw58ECAiTgFeysytAkmS1Fz6DKXMfA64EngAWAYszMxHImJRRLyjXG0OcF1EPAnsBdxQ1j9NMVvvcYqrravK+kXAf4iIZcBfUTdxQpLUvBr6nlJmLgQW9qidVfd6Od0nP3TWV9DLJIbMfJxiirkkSV18ooMkqTIMJUlSZRhKkqTKMJQkSZVhKEmSKsNQkiRVhqEkSaoMQ0mSVBmGkiSpMgwlSVJlGEqSpMowlCRJlWEoSZIqw1CSJFWGoSRJqgxDSZJUGYaSJKkyDCVJUmUYSpKkyjCUJEmVYShJkirDUJIkVYahJEmqDENJklQZhpIkqTIMJUlSZRhKkqTKMJQkSZVhKEmSKsNQkiRVhqEkSaoMQ0mSVBmGkiSpMgwlSVJlGEqSpMowlCRJlTGikZUi4gLgKqANuD4zb+zRPgOYB+wN/Az4VGZuiogDgQXABCCBOZn5akQcBnynXP914N9l5rIBGpMkaYjq80opIvYHvgqcCMwALomIw3ustgC4NDMPBWrA3LJ+E3BTZk4HlgJXl/XvAn+dmTOAK4Ef7uxAJElDXyO3704H7s/MlzPzNeBW4NzOxog4CBidmQ+XpfnAeRHRBpxcrt9VL1/PA+4qXz8GHLgTY5AkDRON3L6bAjxft/w8cFwf7VOB/YBXMnNTjzqZOb9u/a8Af7etnbe3tzfQxf4YxYurX9yq+kJHGy0vrRvgfVXPwB/PoaOZxw7NPX7HXh3Tpk3bbnsjodQCdNQt14AtDbT3rFO/XUTUgP8GvBM4ZVs772sA/bXqyWeZOGHiVvVJE0Yy7c1tA7qvqmlvbx/w4zlUNPPYobnH79iH1tgbuX23EphctzwJWNVA+2pgbES0lvXJndtFxAjgR8CxwCmZuXaHei9JGlYaCaV7gdMiYnxEjAHO4Y3Pg8jMFcC6iJhZli4EFmfmRuAh4Pyy/jFgcfn6byhm3r3bQJIkdeozlDLzOYoZcg8Ay4CFmflIRCyKiHeUq80BrouIJ4G9gBvK+qcpZus9DpwEXBUR44FLgQCWRMSyiHA6uCSpse8pZeZCYGGP2ll1r5fTffJDZ30FMGtH9ytJai4+0UGSVBmGkiSpMgwlSVJlGEqSpMowlCRJlWEoSZIqw1CSJFWGoSRJqgxDSZJUGYaSJKkyDCVJUmUYSpKkyjCUJEmVYShJkirDUJIkVYahJEmqDENJklQZhpIkqTIMJUlSZRhKkqTKMJQkSZVhKEmSKsNQkiRVhqEkSaoMQ0mSVBmGkiSpMgwlSVJlGEqSpMoYMdgd2F2effZZ5s2bxwMPPsjocftxwFHHceTssxk7ccpgd02SVGqKK6U77riDD3/4w9x9991sWL+etS8+xy9/egd/e+XFPPfEssHuniSpNOxD6ZFHHuGaa65h8+bNW7Vt3riBu66/mt8+89Qg9EyS1NOwDqXVq1fzxS9+kY6Ojm2us3Hdv3H3N7/E+nWv78aeSZJ6M6xD6dprr2XNmjVdy7VajTPPm8MRp/1pt/VefXk1i2778e7uniSph2E70eGpp57iwQcf7FabO3cuf/Su2Txd2wfo4Ff33dnVtuiWhVx8/gcZP3787u2oJKnLsL1SmjdvXrfl6dOnc9FFF3UtH3fuxYx609iu5Q3r1/G9731vd3VPktSLhkIpIi6IiMcjoj0iPtNL+4yIWBoRT0XEvIgYUdYPjIifRcSTEfH3EbFXj+0ujoj5AzKSOmvWrOHRRx/tVrvkkktoaXljuHuM2Yt3fP
Dj3dZZvHgxa9euHejuSJIa1GcoRcT+wFeBE4EZwCURcXiP1RYAl2bmoUANmFvWbwJuyszpwFLg6vI9R0XE14HrB2QUPYwbN47bb7+diy++mD333JMjjjiCE044Yav1Dp/1Pvbab2LX8vr167nzzju3Wk+StHs0cqV0OnB/Zr6cma8BtwLndjZGxEHA6Mx8uCzNB86LiDbg5HL9rnr5+uRy35/b2QFsy5ve9Cbmzp3L7bffztVXX02tVttqnZbWVo449QPdarfddluv08clSbteI6E0BXi+bvl5YGoD7fsBr2Tmpp7bZeY9mfk5YJfPwx47diwHH3zwNtsP+5MzaW0b2bX8wgsv8POf/3xXd0uS1ItGZt+1APVf9KkBWxpo71mnx3YNaW9v7+8mfRjFi6tf7FaZOuNdrPinB7uWb775ZiZPnjzA+62GgT+eQ0czjx2ae/yOvTqmTZu23fZGQmklcFLd8iRgVY/2yb20rwbGRkRrZm4u16nfriF9DaC/Vj35LBMnTOxWe/uZ53YLpeXLlzNhwgTGjh3LcNLe3j7gx3OoaOaxQ3OP37EPrbE3cvvuXuC0iBgfEWOAc4C7OhszcwWwLiJmlqULgcWZuRF4CDi/rH8MWDxgPR9AE94ynYn7v3FHctOmTdx///2D2CNJak59hlJmPgdcCTwALAMWZuYjEbEoIt5RrjYHuC4ingT2Am4o65+mmK33OMXV1lUDPYCBUKvVeNcp7+5WW7y4kvkpScNaQ090yMyFwMIetbPqXi8HjutluxXArO2873yKWXmD7l2nnsEdC77ftfzYY4+xatUqpkzxpy0kaXcZtk906K8Jk6dw5JFHdqvdd999g9QbSWpOhlKd2bNnd1v2cyVJ2r0MpTqnnHJKty/ZPvHEEzz33HOD2CNJai6GUp3x48dz1FFHdat5tSRJu4+h1MNpp53WbdlQkqTdx1DqobdbeKtW9fs7v5KkHWAo9dDbLTxn4UnS7mEo9cJbeJI0OAylXngLT5IGh6HUC2/hSdLgMJS24dRTT+227C08Sdr1DKVt8BaeJO1+htI2TJgwwS/SStJuZihtR89beH6uJEm7lqG0Hd7Ck6Tdy1Dajt5u4d1zzz2D1BtJGv4MpT6cfvrp3ZYXL15MR0fHIPVGkoY3Q6kPs2fPprW1tWt5xYoVPPHEE4PYI0kavgylPowbN46ZM2d2qy1atGiQeiNJw5uh1IAzzzyz2/JPf/pTNmzYMEi9kaThy1BqwMyZM9l77727lteuXcuDDz44eB2SpGHKUGrAyJEjOeOMM7rVbrvttkHqjSQNX4ZSg84+++xuy8uXL6e9vX2QeiNJw5Oh1KBDDjmEY445plvNqyVJGliGUj+cc8453ZYXLVrESy+9NEi9kaThx1Dqh5NPPpmJEyd2LW/YsIGFCxcOYo8kaXgxlPphxIgRXHjhhd1qt99+O2vWrBmkHknS8GIo9dP73/9+9t13367ldevWMX/+/MHrkCQNI4ZSP+2xxx7MmTOnW+2WW27h6aefHqQeSdLwYSjtgLPPPptJkyZ1LW/evJlvfOMbPqhVknaSobQDRo0axWWXXdattnTpUqeIS9JOMpR20KxZszj22GO71b75zW/y1FNPDVKPJGnoM5R2UK1W4/Of/zxjxozpqm3cuJHLL7/cX6eVpB1kKO2EqVOn8oUvfKFbbfXq1Vx66aWsXLlykHolSUNX04fSlo4ONm7Z8QkKs2fP5txzz+1WW7VqFZ/4xCd8krgk9dOIwe7AYFry4nrueOZ11m2Gffdo4a1jR3DkPm2cNHkPTpw0kn1Htfb9JsBnP/tZXnnlFe65556u2h/+8AeuuOIKjj/+eD7+8Y9z9NFHU6vVdtVQJGlYaCiUIuIC4CqgDbg+M2/s0T4DmAfsDfwM+FRmboqIA4EFwAQggTmZ+WpEjAN+BLwF+C3wocx8YYDG1KeOjg7uXrmeu55d11X73fot/G71Bpas3sC8J18DiqA6ZO9WJo
9pZdzIFsaObGHcHi2MG1ljZGuNVzd28NrGLYweUePQCy5n1Wub+OXP7++2ryVLlrBkyRImT5nC2497J0cefhgHH3gABxxwAPvss49BJUl1an19tyYi9gf+N3AMsB74BfCRzHy8bp1fAp/MzIcj4nvA0sz8VkT8BFiQmT+OiKuBvTLz8xHx34GVmfn1iLgQeF9mnt/5fmvXrt0lX/hZs34LR9+yirYRrax+fcvA76Cjgzf/y0+Y8PD/oNbR9/tvaW1jy8gxMHIMI0aNptbSyiZa2EgLLa2t7DGilbbWFjbTwqbyiLTWarTWoPMAdXQUr3sesFrP1zXYsnkzra2tW69Xq99m4EJyW3nba7n2Rn1LR/lPuW5nvVbXz65+lwPvOh71/65r27JlM60trd060PW+vfWt80WPA9vzOHdsq20bJ6Sz7x3A5i3Undfin85NOzroOiadY+3tuDX6H8rmXs59w29Q2/pY1Xppq6pNmzYzYkRjdz2Gm50de0utxox927rVLrvsMg488MCd7RoAY8eO3eqPTyNXSqcD92fmywARcStwLvCVcvkgYHRmPlyuPx/4ckTMA04G/qyu/o/A54H3lm0A/xO4MSLaMnNj/4fVuHF7tPD0R6fuyl0Anyr/kST1VyMTHaYAz9ctPw9MbaB9P+CVzNzUy3Zd25TtrwDj+9t5SdLw0kgotdD9Ir9GcWelr/aedeq263nJ1vM9JUlNqJHbdyuBk+qWJwGrerRP7qV9NTA2Ilozc3O5Tud2z5XrrYyIEcCbgN91vkFv9xklScNfI1dK9wKnRcT4iBgDnAPc1dmYmSuAdRExsyxdCCwuPx96COicwPAxYHH5elG5TNn+0K7+PEmSVH19Xill5nMRcSXwADASmJeZj0TEIuCLmbkUmAN8NyL2Bv4ZuKHc/NPADyPiKuA3wEfK+tXA/Ij4FbCm3H6X6mta+1AVEV8CPlQu/kNmfi4iTgeuBUYDN2fmVeW6/Zq6v5uHssMi4m+A/TLzoqH29YSdERHvB74E7Anck5mXNcu5j4iPAp2PU1mcmZcP93Nf/v/1FxSzlZ8ZqHNdtePQ0BMdMnNhZv5xZh6amdeUtbPKQCIzl2fmcZk5PTMvyMz1ZX1FZs7KzMMz8z2Z+fuy/nJmfiAzj8jMmZn5zC4aH9A1rf2rwInADOCSiDh8V+5zdyj/UL4bOJpiXMdExEeA7wN/ChwGHBsRZ5abLAAuzcxDKT7Hm1vWbwJuyszpwFKKvzQMCRFxGvDxulJ/x/hfKK7UDwO+C3xzt3R8J0XEW4BvU8xuPQp4e3meh/25L+/Y3AD8CfA24KTyv4Vhe+4j4niKr+YcWi6PZuDOdaWOQ7M8ZqhrWntmvgZ0Tmsf6p4H/jIzN5S3P5+g+EPbnplPlzMbFwDnbWPq/nkR0UYxPf/W+vpuHMMOi4h9KP6y8V/L5R0Z43sp/pYIxdcTzizXr7oPUvzteGV57s8H/o3mOPetFP/v2pPizkcbsJHhfe7nAp/hjc/lj2PgznWljkOzhFJf09qHpMz8VecfvoiYRnEbbwu9j3VHpu5X3XeAK4Hfl8vN9PWEtwKtEXFnRCyjuFW+rfEPq3OfmX+g+Fv+kxQTrZ4BNjCMz31mfjIzH6orDeS5rtRxaJZQ6mta+5AWEUcAPwX+I/CvNDZFv5Gp+5UVEZ8Ens3M++rKzfT1hBEUdwAuBk4Ajqf4TKAZzv1RwJ8DB1H8D3UzxW3sZjn30Pg5HXLHoVlCaVvT1oe8ctbjfcAVmflDtj3WPqful/X6qftVdj7w7vIq4SvAB4BP0v8xdn49gd6+nlBhLwD3ZuZvM/N14A6KkGqGc38GcF9mri4/v54PzKJ5zj0M7H/nlToOzRJK253WPlRFxAHA3wEXZOaPy/KSoineWv4BvIBidtKOTN2vrMycXU6+mQF8EbgzMz9B83w94SfAGRExrjzPZ1J8XjDszz2wHDg9IvaMiBrwfopHmDXLuYeB/e+8UsehKX
66YlvT2ge5WwPhcmAUcG1EdNa+DVwE3Fa2LeKNDzf7O3V/KKr81xMGQmYuiYhrKGZktVHcvv0Wxecsw/rcZ+Y9EXE08H8pJjg8Anyd4mpx2J97gMxcFxEXMTDnulLHoc+nhEuStLs0y+07SdIQYChJkirDUJIkVYahJEmqDENJklQZhpIkqTIMJakiIuIn5XdPdnT7YyPi2+XrWRHxywHrnLSbGErS8HEEQ+CBqtL2NMUTHaRGRMQsip/C+FfgjymelPAXwD8Bf03x+z2twKPAv6d4KOgxmXlh+aj/3wGXZeYPIuJE4BuZefx29jcF+CHFQ0VXUPz4WmfbYRS/a7Nvuc8bMvP7EdECXAe8k+IZZTWKZ/79huIZgGMj4gfl++4VET8GplN8639uZj5U9u3a8n07gK9l5m07c+ykgeKVktTd8RRhcjTwA4rfaroC2EQRQG+jeJDl14HbKZ4/10LxA5KvAbPL9/kAxSNgtudG4OHMPIIi5KZD10Mxb6V4yO4xFGF4eUS8s+zfFOCEzDycInyuyMxnKZ4B+FD5DEAorpquK58P+B3gr8r6l4Fry/f+c+DUfh8laRcxlKTuVmTmsvL1PwP7AO+j+IXPR8unkv8ZcHhm/gZ4FjgGeA/wNeCU8iGhjYTS6RRPuCYz/x9wf1k/FPgj4Pvl/v6R4ievj87M/wNcBfxF+TPw5wJ7beP9f52ZS8rXy3jjSuxvgRsj4kdl3/9TH/2UdhtDSeru9brXHRS3x1opbsvNKK86juONXy6+AziL4vd8bqW4DXc+8Hpm/rqPfXW+f6fOH2BrBdZ27q/c5zuBH0TEe4F/KNf7e4oH8Pb8PZxO9U967tpXZn4HOJLiIa5nAI9FxKg++irtFoaS1Le7gUsjYmR5q+67FFdFUNzCuwBoycxVwD3ANfR9lQTFz6dcAhARBwKnlPUEXo+Ij5ZtBwC/pLiqmQ38r8z8FrCU4qqt8zdyNlF8DrZdEfELiquu+eX+x1H+no402AwlqW//meIntx8FHqe44vhLgMx8nOIqpPMXcO8GDqCxUPoMcHhEPAF8j+IWG5m5geJ24Scj4jGKoLs6M39OcWU0KyL+heL24q+BQ8qwfBh4S0Tc3sd+Pwd8JSIeBR4EvpyZzzTQX2mX86crJEmV4ZRwaReJ4pcXb95Gc2bm+dtok5qWV0qSpMrwMyVJUmUYSpKkyjCUJEmVYShJkirDUJIkVcb/B00Dd6nMySLXAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from scipy.stats import norm\n",
"sns.distplot(df['new_deaths'], fit=norm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distribution chart for new_deaths doesn´t seem to follow a normal distribution, so we will try to study its properties through probabilistic methods as Shapiro"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.08056902885437012, 0.0)"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import scipy\n",
"from scipy import stats\n",
"\n",
"# let´s apply Shapiro-Wilks test supposing a null hypothesis where the distribution of this variable is normal.\n",
"x = df[\"new_deaths\"]\n",
"scipy.stats.shapiro(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly, Shapiro shows this data doesn´t follow a normal distribution as its values are very close to 0."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAbYAAAEUCAYAAABH3ROVAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deZhcVbX38W91p7sJhDTXMAYEiSQL0EgQBDSCGEBBUMQrQYkgs4gIDoCKIMMLiBMIXoPKkChRMMJVuBAcICoYBBkEIoFFaxgEQoAgCQmhx3r/2Kc6p6trON1dc/8+z9NP6uyz65xVJfbqPZy9U+l0GhERkUbRVO0ARERESkmJTUREGooSm4iINBQlNhERaShKbCIi0lCU2EREpKGMqXYAIrXEzJ4CtokV9QLPAb8CznH3NcO87lxgnLt/fJjvPxc4yN13zXP+KOC77r6xme0N/BHY0N1XR5/pu+7+P2Y2DjjM3a8eRgyZ68atBR4Gvu/uv4zq9d8vwTVbgOPdffZQ4xHJRy02kcHOBLaIfrYBjgE+BVxWzaCK+CWwY55z7wKuiV5/GfjsCO/1VtZ9PzsCtwLXm9mew7jW4cD5I4xHZAC12EQGe83dX4gdP2dmlwFfAY6rUkwFuftaQusp17mXYoepEtzuRXdfHTu+wMyOAP4buGuI1ypFPCIDKLGJJNMDdEJ/t+DuhB6PPYBTgJ8BnwdOBrYGHgfOdPcFsWusb2Y/Bz4GvAB8xd3nR9ccA5wHzAK2BF4Brge+5O690fubogR7DLAa+Ka7Xx69/yiirsjswDNdg9F7zonK0tFnuBeY7O7/jMrHAi8CH3H37G7HRN9PjvsfDnwVmAw8A1zk7j+NujbnxOJ5v7v/aQj3FMlJXZEiBZhZk5ntRkhav4md2h+4k5DYFhC6L88DvgG8I6p7s5ntFHvPAcAKYGfgUuA6M3tvdO504Ajg04QEcDrwOeCjsffvTOj+2x04DbjYzGYN4eP8EvgeYUxsC+AB4AngE7E6BwMrgT8nuaCZrW9mXwR2AG7KcX4WIXldQfhefgBcaWYHAncDXyAk8S2iY5ERU4tNZLDvmdnF0es2IA3cTOiKzFhLaDH1mVmK8Av6Qne/Pjp/rpntDpxBaIUBOHCqu6eBx83s/YTxrr8AjwJHuXsmoTxlZqcTxrBujMr+AxwdTWBZYma7AicBP0/yodx9rZmtBnoyXa1RC/ITwAVRtcOB69y9r8ClXjCzzOsNCC28U9w9V2L6InClu18RHXeY2dsIrdlbzWwlkM7q+hUZESU2kcG+CcyLXncBy909u5vtydgv/02BjYG/ZtX5C3Bo7PjeKKll3AfMBHD3m83s/Wb2bWAKoXXzFqA5Vn9x1qzM+4Cjh/LBcpgHnGdmbweeBz5IaHUW8h7gdULCX1MkKb2N0DqN+wvwyeGFK1KcEpvIYC9nxpwKWJvndVyKgd39vVnnmwiJMzNudyph9uKNhNbh9Vn1875/uNx9qZndDRxGGP/6p7s/VORtS7MmjxSS67vJ/l5ESkr/cYmMkLuvIrR23p116j2ESSQZ78g6vwewJHr9OeDL7v5ld78WeIrwqEF81uCOZtac5/1J5dqnah7w4egnUbfmEDxG4e9F+2ZJyanFJlIaFwPnm9mzhEkZhxG69faO1dk5GrubCxwUnd85OrcCONDM7gTGA+cC/0UY48vYBLg66q7cjfDoQXxySRKrgc3NbBLwjLv3ECaVfJ/QBXrKEK9XzMXADWb2KHA7sA9wLGFmZyaecWa2I6El+EaJ7y+jkFpsIqXxP8C3o5/FhNmFB7l7/LmueYABDwEnAIe6+z+ic0cB20bv/Q2hxXY1sEvs/X8Auglja/+PMGHjt0OM8wbgNUJLb2cAd38F+B3woLs/NcTrFeTu/0eY4PIl4B+E2aXHu3umZXgH8Pfo58BS3ltGr5R20BYRM7sPuMrdf1ztWERGSl
2RIqOYmR1AGPPaDriuyuGIlIQSm8jo9nmi8bpoEoxI3VNXpIiINBRNHhERkYbSsF2RK1euVFNURKTBtbe3D9ohQi02ERFpKEpsIiLSUJTYEujo6Kh2CInUS5ygWMuhXuKE+om1XuIExRqnxCYiIg1FiU1ERBqKEpuIiDQUJTYREWkoDfscm4iI1JZFi5czf+FSVqzsZEL7y8ycMYnpUzcr+X2U2EREpOwWLV7O1bc4Xd19AKxY2cnVtzhAyZObuiJFRKTs5i9c2p/UMrq6+5i/cGnJ76XEJiIiZbdiZeeQykdCiU1ERMpuQnvbkMpHQolNRETKbuaMSbS2DEw5rS1NzJwxqeT30uQREREpu8wEkXWzIts0K1JEROrb9KmbMX3qZnR0dDB58uSy3UddkSIi0lDK2mIzs/HA3cBB7v6Ume0LXAKMBX7p7mdF9aYBVwHjgTuBE929x8y2BuYBmwIOzHL31Wa2EfBzYBLwEjDT3V8o52cREZH6ULYWm5ntDvwFmBIdjwWuAQ4GdgDeZWYHRNXnASe7+xQgBRwflc8GZrv79sD9wNlR+QXAXe6+A3AlcFm5PoeIiNSXcnZFHg98Dng+Ot4N6HD3J929h5DMDjWzbYCx7n5PVG9uVN4C7AXcEC+PXh9IaLEBXAccENUXEZFRrmxdke5+HICZZYomAstiVZYBWxUo3xhYFSXBePmAa0VdlquATViXRAcoxaZ29bKJX73ECYq1HOolTqifWOslThg9sRabeFLJWZFNQDp2nAL6hlBOVJ6pE5eKnRtkpLNvyj2Dp1TqJU5QrOVQL3FC/cRaL3GCYo2r5KzIZ4EtYsebE1pY+cpfBNrNrDkq34J1LbLnonqY2RhgQ2BF2SIXEZG6UcnEdi9gZrZdlKwOB25z96eBN8xselTviKi8G7gLOCwqPxK4LXq9IDomOn9XVF9EREa5iiU2d38DOAq4EVgCPM66iSGzgEvN7HFgHHB5VH4ScIKZLQH2BM6Kys8G9jCzR6M6n6vEZxARkdpX9jE2d39L7PUdwE456jxMmDWZXf40sHeO8leAj5QyThERaQxaeURERBqKEpuIiDQUJTYREWkoSmwiItJQlNhERKShKLGJiEhDUWITEZGGosQmIiINRYlNREQaihKbiIg0FCU2ERFpKEpsIiLSUJTYRESkoSixiYhIQ1FiExGRhlL2/dhERKTxLVq8nPkLl7JiZScT2tuYOWMS06duVpVYlNhERGREFi1eztW3OF3dfQCsWNnJ1bc4QFWSmxKbiIgMSXbrrLOrtz+pZXR19zF/4VIlNhERqW25Wmf5FDpXTpo8IiIiic1fuHRQ6yyfCe1tZY4mNyU2ERFJLGkrrLWliZkzJpU5mtyU2EREJLF8rbANxjb3n5vQ3saxB5lmRYqISO2bOWPSgDE2CK2zI/efUrVElk2JTUREEsskr1p5Zi0XJTYRERmS6VM3q6lElk1jbCIi0lCU2EREpKFUpSvSzD4FfC06vM3dTzOzacBVwHjgTuBEd+8xs62BecCmgAOz3H21mW0E/ByYBLwEzHT3Fyr9WUREpLZUvMVmZusDlwPvA3YC9jSzfQnJ62R3nwKkgOOjt8wGZrv79sD9wNlR+QXAXe6+A3AlcFnlPoWIiNSqanRFNkf33QBoiX66gbHufk9UZy5wqJm1AHsBN8TLo9cHElpsANcBB0T1RURkFKt4YnP31witrseBZ4GngC5gWazaMmArYGNglbv3ZJUDTMy8Jzq/CtikzOGLiEiNq/gYm5m9AzgG2AZYSeiC/ACQjlVLAX2ExJvOukRfrE5cKnZugI6OjpEFXaJrVEK9xAmKtRzqJU6on1jrJU4YPbFOnjy54PlqTB75IHCHu78IYGZzgdOALWJ1NgeeB14E2s2s2d17ozrPR3Wei+o9a2ZjgA2BFbluWOxLKKajo2PE16iEeokTFGs51EucUD+x1kucoFjjqjHG9jCwr5ltYGYp4MPAn4E3zGx6VO
cIwmzJbuAu4LCo/Ejgtuj1guiY6PxdUX0RERnFKt5ic/ffm9nOwAOESSN/Ay4Gfg1caWbjgQcJMycBTgJ+amZnAc8An4zKzwbmmtmjwKvArMp9ChGR0SN7Y9FaW0IrW1WeY3P3bwHfyip+GNgtR92ngb1zlL8CfKQc8YmISEhoP/vtE6xZ29tftmJlJ1ff4gA1m9y08oiIiAyS2Sk7ntQyurr7mL9waRWiSkaJTUREBim2U3bSDUerQav7i4gIMHAsrZh8G47WAiU2ERHp73os1ErLaG1pYuaMSRWIaniU2EREpGjXY8a4sWM4Yv/JNTtxBJTYRERGvUWLlyfqfvzsITvUdELLUGITERnF5ix4gjvuf75ovQntbXWR1CDBrEgz28zMPhK9/paZ3WFmO5U/NBERKadFi5cnSmq1PqaWLcl0/7nAW81sBrA/cC3rVgUREZE6leRZtAntbRx7kNVNaw2SdUVOcPdLzew7wC/cfa6Zfa7cgYmISHkVGleb0N7GZae+u4LRlE6SFltrtIHnAcDt0Q7Y48obloiIlFuhZ9HqqesxW5LEdhPwEvCyuz9AWLT4F2WNSkREym7mjEm0tgxOA/vsOrGuuh6zFe2KdPdzzOxKd382Kjrc3R8pc1wiIlJmmeRVTyv3J1E0sZlZE/AJM3s78HngQDN7NNr4U0RE6tj0qZvVfSLLlmTyyHeATYB3ASnCzMgtgFPKGJeIiMiwJBlj2wc4CnjD3VcBHwD2K2dQIiIiw5Wkxdbt7n1mBoC7d5pZT3nDEhGRcsneQLQe1n8ciiSJ7R/Rc2vNFrLbl4CHyhuWiIiUw6LFy/nJTY/RG1vvePXaHq68+XGgdnfFHookXZGnAu8ENgMWEZ5h+0I5gxIRkfKYv3DpgKSW0dObruldsYciyXT/VcCxFYhFRETKqNgq/rW8K/ZQJJnun3NdSHfXrEgRkTqRZBX/Wt4VeyiSdEWuiP28BrwPSJczKBERKZ0kSW1Mc6qul9GKS9IVeV782MwuBm4uW0QiIlIySbemOf4j2zfExBFI1mIbwN1fA7YsQywiIlJiSbemaZSkBkMfY0sBuwCPlS0iEREpmWITQhqpCzIjyXNsK2Kv04SNRn9ennBERKSUUilI55kVsV5rM0cfOKWhWmswjDE2ERGpD3MWPJE3qQFc9dU9KxdMBeVNbGb2GrlnP6aAtLuPH+5NzezDwDnABsDv3f1UM9sXuAQYC/zS3c+K6k4DrgLGA3cCJ7p7j5ltDcwDNgUcmOXuq4cbk4hIIyk2E7JRpvbnUmjyyNuBqTl+MuXDYmaTgB8BHwXeAbzTzA4ArgEOBnYA3hWVQUheJ7v7FEJSPT4qnw3MdvftgfuBs4cbk4hII0kyE7LRxtXi8iY2d3868wO8Cdga2AaYRFjhf7gOIbTInnX3buAw4HWgw92fdPceQjI71My2Aca6+z3Re+dG5S3AXsAN8fIRxCQi0jCKzYTcYGxzw42rxSWZFXkloSW1HvA8sB3wF+DKYd5zO6DLzG4mJMtbgEeBZbE6y4CtgIl5yjcGVkVJMF4uIjLqFZsJeeT+UyoUSXUkmRW5H7AtoevvfODNwBkjvOdewN7AasLD3msZOJ6XAvoILcok5UTlOXV0dIwg3NJdoxLqJU5QrOVQL3FC/cRaL3HCulhT5F8eatftxrLpeqvo6FhVsbhyGcn3Onny5ILnkyS2Ze6+xsweB6a6+2/yrR+Z0AvA7e7+EoCZ/ZrQjdgbq7M5oXX4LGG37uzyF4F2M2t2996oTt4O5WJfQjEdHR0jvkYl1EucoFjLoV7ihPqJtV7ihHWxLlq8nDTP5a33hcN3r2BUuZX7e02y8kiXme0FLAH2N7N2wtY1w3UL8EEz28jMmoEDCGNlZmbbRWWHA7dF43tvmNn06L1HROXdwF2E8TmAI4HbRhCTiEjdW7R4OVf8Ov/6GY08EzIuSWL7CvAZYAEwDXiZMLljWNz9XuDbhHG6JcDTwBXAUc
CNUdnjrJsYMgu4NGoxjgMyrcWTgBPMbAmwJ3DWcGMSEal3i59aw09uKrwoVCPPhIwr9BzbTu7+cDQjMTMrcQ8za3f3lSO5qbtfQ5jeH3cHsFOOug8Du+Uof5owTiciMurd8fBrOTcQjWvkmZBxhcbYbjczB34A3JiZgTjSpCYiIqW38vXeguc3GNtcoUiqr1BX5JbADwndkE+b2XlmNrEyYYmIyFCMbU0VPN/oU/zjCj2g3eXu17n7DEKX31jgPjObH00mERGRGtHVnX9RyH12nThquiEh4X5s7t7h7mcQVh15FlhY1qhERCSxM2bfS2+BxY6P/tDoaa1BsufYiJa2Opowc3Ep8IkyxiQiIgnNWfAEz7+8Nu/50TLFP67QrMg24GPAsYTNRecBH3L3JRWKTUREChjtix3nU6jFtozQ7XgF8FFtCSMiUjuKbUuTMZrG1jIKJbaPuvudFYtEREQSuejah1jy5KtF6+2z6+icyF5oVqSSmohIjZmz4IlESS3F6Js0kpFoVqSIiNSGPz5QvPsR4MRDdihzJLVLiU1EpI70FZjWn7HjthuNyrG1jEKzIgs+hK2uShGR2rPjthtx5hHTqh1GVRWaPPLD6N/1gW0Iu1z3AFMJK/CP7m9ORKTCTr5kUcHzSmpBockjU919KnA/sJe77+TuuwB7AP+qVIAiIhImjby6ujvv+ZYmlNQiScbYzN3vzhy4+4PAduULSUREshV7Zu3Du29UoUhqX5Iltdaa2VHAtYQZpMcBxeeaiohIScxZ8ETROlPfskEFIqkPSVpsxwCnAJ3AWsJ6kUeXMSYREYkp1lrbcVu11uKKttjc/THgnWb2puj4lbJHJSIiQLLW2plHTKOjo6MC0dSHoi02M9vczG4F7gHGmNnvzGyL8ocmIiILizyQ3VR4f9FRKUlX5GzgN4RuyP8ADwFXlTMoEREJ0kUeyH7/LqNzPchCkiS2t7j7lUCfu3e7+1eArcscl4iIFNHakhq160EWkiSx9ZlZfz0z2zDh+0REZISaC/y2Pfag7SsXSB1JkqD+F/g50G5mnwEWAvPLGpWIiACwXltzzvLWltSoXg+ykKKJzd0vAhYA9wH7AT8Bzi9zXCIiAqxZ25uzvKs7wWrIo1TR6f5m9jN3P5LwgLaIiFTIosXL856b0N5WwUjqS5KuyGlmpgmlIiIVNn/h0rznZs6YVMFI6kuSJbWeBx41s3uA1ZlCdz+lbFGJiAgrVnbmPafxtfySJLa/Rj8iIlJBE9rbciY3dUMWlmRJrfPMbCxhRf9HgfXc/fWR3tjMvgts7O5Hmdk0wkPf44E7gRPdvcfMtgbmAZsCDsxy99VmthFhpuYk4CVgpru/MNKYRERqybTJE3KuEzlt8oQqRFM/kiyptTth/7VbgYnAv83sPSO5qZntA3w6VjQPONndpxB2EDg+Kp8NzHb37Qn7wp0dlV8A3OXuOwBXApeNJB4RkVp076MvDqlcgiSTR74L7AuscPdngSMYQSKJFlO+ELgoOt4GGOvu90RV5gKHmlkLsBdwQ7w8en0gocUGcB1wQFRfRKRhrF7bM6RyCZKMsa3v7kvMDAB3X2BmF47gnj8Gvg68OTqeCCyLnV8GbAVsDKxy956s8gHvibosVwGbECa6DFKKVa/rZeXseokTFGs51EucUD+xVivOn91RuFWWK656+U5hZLFOnjy54Pkkia3bzP4LSANYJsMNg5kdB/zb3e+INi+F0GqMP2mYAvpylBOVZ+rEpWLnBin2JRTT0dEx4mtUQr3ECYq1HOolTqifWKsZ55PXPVfwfHZc9fKdQvljTZLYLgD+DGxuZtcBHwBOGOb9DgO2MLOHgDcB4wjJK74NzuaElteLhGW8mt29N6qTaZE9F9V71szGABsCK4YZk4hIXdlgbO5ltiRIsqTWLcDHgHOARcB73f3G4dzM3fdz97e7+zTgG8DN7n408IaZTY+qHQHc5u7dwF2EZAhwJHBb9HpBdEx0/q
6ovohIwztyf63oX0jeFls01T6jizArsv+cuz9TwjhmAVea2XjgQeDyqPwk4KdmdhbwDPDJqPxsYK6ZPQq8Gr1fRKQhFFpKC/RwdjGFuiIfJXQTNgFjgdeAXmAjQjfhiHbRdve5hJmOuPvDwG456jwN7J2j/BXgIyO5v4hIrSq0lJYUl7cr0t03dPfxhGn1s9x9I3efABzCui5BEREpsUJLaWl8rbgkz7Ht6u7XZw7c/WZgWvlCEhGRfDS+VlySxNZkZntnDsxsfwpMrRcRkeHT+NrIJZnu/3ngV2bWRXheLAV8tKxRiYiMUnNufaLaIdS9JIltArA1MDU6fiS2GoiIiJTQG125d8wGaG3R1phJJElsF7n7TYRp+CIiUiXHHrR9tUOoC0kS22Iz+zrhYen4RqNKdCIiJaTxtdJIkth2j36Oi5WlCXuhiYhIiVz72/pZxLiWJdlodNtKBCIiMtppO5rSKJjYzGwi8DXgvYRW2iLgW9G+bCIiUiET2tuqHULdyPscm5m9GfgbYRmtswmbg6aAv0Wbg4qISIXMnKHRn6QKtdguAL7m7tfGym40sweic0eUNTIRkVGmrSVFZ3f2NpTQ3KSJI0NRaOWRd2YlNQDcfQ45FiwWEZHyGNOs59eGolBiK/RN5l+hU0REhiVXa61QueRWKLH1RJNHBojKlNhERKQmFUpsPwLmRJt/AmBmmwLXArPLHZiIyGjTlmfJrHzlkluh/dh+BCwGnjOze83sQeBfwD3ROJuIiJTQmDG5fyXnK5fcCj7H5u6nmdmlhJVHICS158sflojI6LNmbe4FkPOVS25JVh55DvjfCsQiIjJqzVmQf7saPZw9NGrfiojUgIUP5O8M08PZQ6PEJiJSA9IFZvTr4eyhUWITEZGGosQmIiINRYlNRKTKim0wKkOjxCYiUmVzbs0/I1KGTolNRKTK3ujK/5zaBmObKxhJY1BiExGpYUfuP6XaIdSdog9ol4OZnQPMjA5vdfczzGxf4BJgLPBLdz8rqjsNuAoYD9wJnOjuPWa2NTAP2BRwYJa7r67wRxERGZFi42ua6j90FW+xRQnsA8DOwDRgFzP7JHANcDCwA/AuMzsgess84GR3n0LYSuf4qHw2MNvdtwfuJ+zyLSJSVzS+VnrV6IpcBnzZ3bvcvRt4DJgCdLj7k+7eQ0hmh5rZNsBYd78neu/cqLwF2Au4IV5ewc8gIlIShcbXZHgq3hXp7o9mXpvZZEKX5A8ICS9jGbAVMDFP+cbAqigJxstFRBpGq7arGZaqjLEBmNnbgFuB04EeQqstIwX0EVqU6QTlROU5dXR0jDjeUlyjEuolTlCs5VAvcUL9xFrOOG+975WC5w/atX1I96+X7xRGFuvkyZMLnq/W5JHpwI3AF9z9ejN7H7BFrMrmwPPAs3nKXwTazazZ3XujOnlXEC32JRTT0dEx4mtUQr3ECYq1HOolTqifWMsd5wPX/6ng+Y/tNy3xterlO4Xyx1qNySNvBn4DHO7u10fF94ZTtp2ZNQOHA7e5+9PAG1EiBDgiKu8G7gIOi8qPBG6r2IcQESmBQgsfy/BVo8V2GrAecImZZcp+BBxFaMWtByxg3cSQWcCVZjYeeBC4PCo/CfipmZ0FPAN8shLBi4iUwhmz7y14XuNrw1eNySOnAqfmOb1TjvoPA7vlKH8a2LukwYmIVMjzL68teP7Yg7avUCSNRyuPiIjUID2YPXxKbCIiFXbRtQ8VPD9x47EViqQxKbGJiFTYkidfLXj+2yftXqFIGpMSm4hIBRVbG1KTRkZOiU1EpIKuvPnxguc1aWTklNhERCpk0eLl9PQWfnhNk0ZGTolNRKRCfvTrxwqeVydkaSixiYhUSLGFRk48ZIeKxNHolNhERCrgU+f/qWgddUOWhhKbiEiZJUlqG41rKX8go4QSm4hIGSVJagD/86XpxStJIkpsIiJlcvIlixLV+6zG1kpKiU1EpAzmLHiCV1d3J6qrsbXSqtoO2i
Iijeqiax8qumxWhlprpafEJiJSQknH1CBMGFFrrfTUFSkiUiJDSWpNKU0YKRclNhGREhhKUgP42dl7lyUOUVekiMiIHH3hn+juHdp75n1j77LEIoESm4jIMA21lQaw47YblT4QGUBdkSIiwzDcpHbmEdNKH4wMoBabiMgQDGUqf5y6HytHiU1EJKHhtNJgdCS1dDpNuquLvq4u+jq76OvsjB130tfZFR13kt5yYlljUWITEUlgOEmtKVUbsx/T6TTpnp51ySZKPH1dUbKJEk8mCWWXpbsyrzsLJq2kWr96Whk/rRKbiEhei59aw3nX/WlY7/3sITskfvg63dOzLkl0dZGOJZ7+ZBNLPIPrdNL9yn94sqUllpAGJijSxXaDq6DuZEuNDZcSm4hIDpkWWirdR0u6lzF9PbSke2jp66El3Rv928OYdA+tfeHflr5eWtI9HPLuifT9/TmevSdXyybW+olaT/QO8XmBPNaU5CoVoMQmIpJbuq+PdHf3upZLjpZNaN3k6mpbV+eRx17oT1pjoqR1ajqTuPqGHNfLtz5chk9b+1ItLTS1tpJqa6WprY2m1tbw09YWylrbaGprZU1zeVOPEpuIlEU6nSbd3RMby1k3PtP35JOsWrlqwESDAeM9UXm6WHfcEMZ1Ctm2JFepbanmZlJtIbH0J5tY4mlqaw3HmYTUX5Z9HK8z8DqppmRPkHV0dJT1syqxiYxC6XSadG/voHGaAcd5xnuyWz+5JiBkElKhcZ1nKvh5a14qFSWOTMtmYIsnZ0KKl7W2sWzFy2y17bZZdaJ6LS2kxoyeX/d1/UnN7HDgLKAF+L67/7BU1160eDnzFy5lxcpO2tdv5vAPjs85EByvN6G9jZkzJgH0l20wtpkUKVav7RnwuikFfWUZy32uHBctE8WaTxjXGTyes66rrDtn+d3971k33pM9NpQZE2qihiYT1LA00J0aQ3fTmOjfZnpSY/rLdn7b5v1dbNmtmVR/cmob2LrJqpNqbiaVSo0ozhc7Ohg3eXJpPnSdq9vEZmZbAhcCuwCdwN1m9kd3XzLSay9avJyrb3G6ukPf+srXe7n6FgcGbgiYXW/Fyk5+ctNjpFIpenrDL401a9cNCsdflyepSSWk0kPnLGAAAAtHSURBVH394zCZBDJgAkFWwlmXXGLlWeM52UmsmaGP64xW3anm/oTTnWqhp6l5YCJKNdPdNCYko1hyytTpidXJlcB6Us2QI+kMZdajVFbdJjZgX2Chu78CYGY3AB8Hzh/phecvXNqfrDK6uvuYv3DpgP+Qc9Xr7QP0l3D1pNMhWQxIKgOPB8xoy2rdDCrP1VpKl2YG22jQQ1N/ohiUcJpCQunpT0A5WkV5ElFXU0g43akxOZNOOY2Gh63rXT0ntonAstjxMmC3XBWHOlC5YmVn3vL4tfLVkzzSaZrpG5BwWtPdgxJPKB/chbYuuQxORJmk05ruqfanrBt9pOjqTzpjYi2frOSRlYjWtXIKtICiuulU4yxHe84ntwTKP/FhJGo5tmwjiXVykS7Xek5sTQxsGqUgd/9NsS8h24T2l3MmrQntbQOula9evWpK9xXtOsvZAoqN3WSP76xLRKGOxnWSSQNdgxJFsi60wV1xoayrP4GFc30NlHTKYd439qajo2PIvz+qRbGuU8+J7Vlgz9jx5sDzpbjwzBmTBoydAbS2NPVPDClUr7mJAWNspTKch0QHTzgY3AKKX69ZSSexrlSu1kyOLrRYnZ6mMVFyaR6UiLqyxoZ6aap4F9topG7FxlTPie124Fwz24TwwP1/AyeU4sKZcbSBsyKnDBoo7q93x79Y9eoaNh3XzMF7TCTV3cXtf32K119by4YtfbT09dLzRifjxoTXfV1d/V1tY3IkqUGJaJgPiY5W3anm3F1oWS2WeAuoJ0fdfC2gfJMJpLYoaY1eqXQtrR82RNF0/zOBVuAqd/925tzKlStL8sFW/P52lt93P+u3tpb9IdHRIPOQaF9zE6
3rr1/Vh0STqpcunnqJE+on1nqJE0ZvrO3t7YP+yqznFhvu/gvgF+W8x9onn6LPn2B1OW9SK0rwkOiApXQKPCRaT/8nFJH6UteJrRKa2lqrHUKQSoXkkpVMUrFWy+rOTjbaeOOqPiQqIlJtSmxFpFqTJbb+7rQBC36uSx7Zi4KmsrrRQtJqG3gcTzwtLUWTTkdHBxPVChKRUU6JrYiNpr+HleM2YMttthmYpOKtn5aWko/riIjI8CixFTF2m61p7upkQ7WERETqgpoZIiLSUJTYRESkoSixiYhIQ1FiExGRhqLEJiIiDaWul9QqpFRLaomISO3KtaSWWmwiItJQlNhERKShNGxXpIiIjE5qsYmISEPRklpDYGY7A/e4e1u1Y8nHzPYEvk/Yo+5J4NPu/p/qRjWYmU0HLiXEuQI4xt2frm5UhZnZ/wN63f3caseSLdqb8CygBfi+u/+wyiHlZWbjgbuBg9z9qSqHk5eZnQPMjA5vdfczqhlPPmZ2PvBxIA1c7e6XVDmkoszsu8DG7n5UOa6vFltCZrY+8APCL+JaNgc4wt2nAkuA06scTz4/B45z92nR68urHE9eZtZuZlcDX652LLmY2ZbAhcB7gWnACWa2Y3Wjys3Mdgf+AkypdiyFmNm+wAeAnQnf6S5mdkh1oxrMzN4HzADeAewKfN7MrLpRFWZm+wCfLuc9lNiS+x6hJVTrdnD3JWbWAmwJ1GJrrQ04y90fiYoeAbauYkjFHAx0EP4bqEX7Agvd/RV3XwPcQPgLvhYdD3wOeL7agRSxDPiyu3e5ezfwGDX436i7/xl4v7v3AJsSeuHWVDeq/MzsTYQ/wi4q533UFZmAmX0EWN/db6jxP4Zw924zmwrcDnQDZ1Y5pEHcvROYB2BmTcC5wG+qGVMh7v4zADM7t8qh5DOR8Is4YxmwW5ViKcjdjwOog/8fPZp5bWaTCV2S06sXUX7R/+fPA04DfgU8V+WQCvkx8HXgzeW8iRJbjJkdShj3iXscGE/4q7hm5IvV3fd198XAZmb2GeCXwHsqHmCkUJxm1gr8lPDfYVn/gkuiUKzViGcImgjjKxkpoK9KsTQUM3sbcCtwurt3VDuefNz9HDP7FvB/hFbxT6oc0iBmdhzwb3e/w8yOKue9lNhi3P1XhL94+kX/Y3wNuDPzV6aZPQTs6e6vVTzISJ5Y1zOzj7p7pvUzjyp3n+WKE8DMxgE3EyaOHBx191RVvljrwLPAnrHjzan9rr6aF01wuhH4grtfX+14cjGz7YH13P0hd3/dzP6XMN5Wiw4Dtoh+f74JGGdml7r7F0t9IyW2Itz9KuCqzLGZpaMJD7WoG/ihmf3b3R8gdJ/8pcox5TMP+CdworurdTEytwPnmtkmhPGV/wZOqG5I9c3M3kzoHj/M3RdWO54CJgHnmdl7Ca32g4FrqhtSbu6+X+Z11GLbuxxJDTR5pKG4ey/hr6KfRH8VfRw4rrpRDRY9NnEwYcziQTN7yMwWVDmsuuXuzxHGLf4IPAT8wt3/Vt2o6t5pwHrAJdF/nw+Z2YnVDiqbuy8gdJX+HXgAuLtWW5eVpJVHRESkoajFJiIiDUWJTUREGooSm4iINBQlNhERaShKbCIi0lD0HJuMKmZ2ObBXdLgjYQeEtdHxu4HXgU3c/eUqxPZ74HB3fzl6/OE0d18yjOvMBf7h7t8tdYxDiOEbwMPuflO0+vw/3f1nZpamSt+vjB5KbDKquPspmddm9hQwy93vj5VVIap+/Q+wuvuHqhlICcwg7C6Bu3+jyrHIKKPEJjLYeWa2BzAB+E5mbzMzOxY4idCFvwI42d0fN7N24IeE7U3SwG3Ame7eY2adwE3ATsAswsogl0XXbgYud/drzGxOdO8/mtmHgLuAj7v7/WZ2DGHLnF7gZcKWH88R1rXcA9iQsD7kce6+KN+Hira3mUtYNPnp6Ho3uvvc7JZU5hh4Jd99opbhKmAqYVHbR4Ajo/h2Bb5jZr2Eh/EHtSALfJ
/vBS6Jvp808E13vzHv/1oiWTTGJjLYUnffBTgE+J6ZtUT7Xn2asEbozsC3gV9H9S8n/GKeSviFvhNh5QoI+/f9n7sbYVWQG4CvRtd/H3Came3h7kdH9d/v7v/OBGJmOwHfAvZ393cQ1tf8OrA7IUG92913JCwm/dUin+sKwka5bwO+QGhVFVPsPrsA+wM7AG8BDo3+ELifsHDwr8mhyPd5HnBJ9B0dkzBOkX5qsYkM9ovo34eANsLuDgcC2wF3x7or/yvaX+oAYLq7p4FOM/sRIXFcHNW7K/p3CvBW4JrYNcYSNrO8J08s+wC/yyQ7d+/fE9DMzgI+Y2ZvBfYGii3KPQP4YnSdJ8zsD0Xq4+5/LXKf30bbEGFmiwmL2yZR6PucT1jz9MOEdTBrbuslqW1qsYkM1g0QJSoI3W/NwLXuPi1aBPudhNbZfxi8bUwT0BI7Xh392wyszFwjus4ehF3P8+mJX9vMxprZ9mZ2IGGNQAhdnT+K4ixkbVadrqzzqege/bvEJ7jP2tjrdIIYMvJ+n+7+Y0Lr9w/AB4FHzGy9hNcVUWITSeh3wCfNbIvo+ETgjti5k80sFe0OfgLhl3I2B9aa2aegfwX5fxC68yCMebVkveePwL6x+36G0G23H6GL8wpCt99HCcmikFujuDGzrRi4x+BLhMQCcHisfDj3gZCQsz9LXN7v08zuBnZ297mE73IjwlY8IokosYkk4O6/J4x1/cHMHiH88v9Y1Ko7BdgUWBz9OHBhjmt0ESZSHBdd4/fA2bEJH78C/mxmb4+9ZzFwOvBbM3uYMJ51IqHltHfU/fcg8C9g22hH8ny+CLw5es9c4JnYuVMI3X8PEsbLMjtyD+c+EMYCv2lmn851ssj3eQZwvpn9HfgTcJ67P1XkfiL9tLq/yChlZrcAN0QtI5GGoRabiIg0FLXYRESkoajFJiIiDUWJTUREGooSm4iINBQlNhERaShKbCIi0lCU2EREpKH8f4/B2wGnwP6nAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# q-norm normality test\n",
"\n",
"stats.probplot(x, dist=\"norm\", plot=pylab)\n",
"pylab.show()"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"# Creating an auxiliar column which is a copy of new_deaths to compare the real values against the logarithm of those.\n",
"train_outliers = pd.DataFrame()\n",
"train_outliers[\"new_deaths\"] = df[\"new_deaths\"]"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Count of outliers over Ls: 0\n"
]
}
],
"source": [
"# observations to be deleted\n",
"print(\" Count of outliers over Ls: \" + str(train_outliers[\"new_deaths\"][train_outliers[\"new_deaths\"]<0].count()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By looking at the chart above, there are not enough evidences to accept the normality of this data as the p-value is very small, declining the null hypothesis. In this chart, the edges are far from the theoretical quantiles of the normal distribution.\n",
"\n",
"so, we will proceed with previous process of transforming the data into its logarithm form."
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"# data logarithm\n",
"train_outliers['new_deaths'] = np.log(df['new_deaths'])"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# q-norm normality test\n",
"\n",
"y = train_outliers[\"new_deaths\"]\n",
"stats.probplot(y, dist=\"norm\", fit=True, plot=pylab)\n",
"pylab.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we can see a better trend in the data points corresponding to the new_deaths column, but can not be sure to classify it as a normal distribution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.4. Correlation between Features <a id='part1_4'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"let´s divide first the variables between categorical and numerical columns as:"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"# categorical and numerical variables\n",
"categorical = df.select_dtypes(include = ['object']).copy()\n",
"numerical = df.select_dtypes(include = ['int64','float64']).copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4.1. Numerical variables analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A study will be done to measure the correlation between all these variables together."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"num_corr = numerical.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There will be a first exploration of data between \"new_deaths\" and the rest"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"new_deaths 1.000000\n",
"new_cases 0.923135\n",
"total_deaths 0.794581\n",
"total_cases 0.783721\n",
"population 0.625588\n",
"new_deaths_per_million 0.099934\n",
"new_tests_smoothed 0.087396\n",
"total_deaths_per_million 0.078711\n",
"total_tests 0.069418\n",
"new_tests 0.058344\n",
"aged_70_older 0.045385\n",
"aged_65_older 0.044243\n",
"total_cases_per_million 0.040277\n",
"new_cases_per_million 0.039541\n",
"female_smokers 0.036656\n",
"median_age 0.034299\n",
"gdp_per_capita 0.028032\n",
"handwashing_facilities 0.019772\n",
"diabetes_prevalence 0.014399\n",
"male_smokers 0.004023\n",
"extreme_poverty -0.003281\n",
"total_tests_per_thousand -0.003870\n",
"new_tests_smoothed_per_thousand -0.007553\n",
"new_tests_per_thousand -0.009656\n",
"population_density -0.017467\n",
"stringency_index -0.026154\n",
"cvd_death_rate -0.032237\n",
"__v NaN\n",
"Name: new_deaths, dtype: float64"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# correlation with new_deaths\n",
"num_corr[\"new_deaths\"].sort_values(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make variable selection let´s pick the most three significative variables that have a correlation higher than $±0.5$ to make a correlation matrix limited to the variables which meet this criteria"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 648x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cm = numerical[[\"new_deaths\",\"new_cases\",\"total_deaths\",\"total_cases\", \"population\"]].corr()\n",
"sns.set(font_scale=1)\n",
"f, ax = plt.subplots(figsize=(9, 5))\n",
"hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2g')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Furthermore, let´s do an analysis more in detail about each of them: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.1.1 new_cases"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of new cases seems to be the one more obvious to have relation with the number of deaths. The graphical representation of this variables looks like the following:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By applying the ANOVA statistical proceeding, it indicates how the line fits correctly the relation between two variables, calculating $R^2$ which is a value between 0 and 1, where 0 indicates that there is no correlation at all, and 1 there is perfect fitting, or saying how good the model would fit through regression line over the data:"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>new_deaths</td> <th> R-squared: </th> <td> 0.852</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.852</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td>1.312e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 15 Jun 2020</td> <th> Prob (F-statistic):</th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>13:48:12</td> <th> Log-Likelihood: </th> <td>-1.4277e+05</td>\n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 22762</td> <th> AIC: </th> <td>2.855e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 22760</td> <th> BIC: </th> <td>2.856e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 1</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 1.2759</td> <td> 0.855</td> <td> 1.492</td> <td> 0.136</td> <td> -0.400</td> <td> 2.952</td>\n",
"</tr>\n",
"<tr>\n",
" <th>new_cases</th> <td> 0.0556</td> <td> 0.000</td> <td> 362.229</td> <td> 0.000</td> <td> 0.055</td> <td> 0.056</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>36755.892</td> <th> Durbin-Watson: </th> <td> 0.357</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>282581302.648</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td> 9.644</td> <th> Prob(JB): </th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td>548.508</td> <th> Cond. No. </th> <td>5.61e+03</td> \n",
"</tr>\n",
"</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 5.61e+03. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: new_deaths R-squared: 0.852\n",
"Model: OLS Adj. R-squared: 0.852\n",
"Method: Least Squares F-statistic: 1.312e+05\n",
"Date: Mon, 15 Jun 2020 Prob (F-statistic): 0.00\n",
"Time: 13:48:12 Log-Likelihood: -1.4277e+05\n",
"No. Observations: 22762 AIC: 2.855e+05\n",
"Df Residuals: 22760 BIC: 2.856e+05\n",
"Df Model: 1 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 1.2759 0.855 1.492 0.136 -0.400 2.952\n",
"new_cases 0.0556 0.000 362.229 0.000 0.055 0.056\n",
"==============================================================================\n",
"Omnibus: 36755.892 Durbin-Watson: 0.357\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 282581302.648\n",
"Skew: 9.644 Prob(JB): 0.00\n",
"Kurtosis: 548.508 Cond. No. 5.61e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 5.61e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
"\"\"\""
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import statsmodels.formula.api as smf\n",
"est = smf.ols(formula='new_deaths ~ new_cases', data=numerical).fit()\n",
"est.summary()\n",
"# Take a look at R"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$R^2$ shows a significant value to make new_cases as a strong variable to determine the number of deaths"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.1.2 total_deaths"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of total_deaths seems to be other obvious variable to have relation with the number of deaths. The statistics of this variable looks like the following:"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>new_deaths</td> <th> R-squared: </th> <td> 0.631</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.631</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td>3.898e+04</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 15 Jun 2020</td> <th> Prob (F-statistic):</th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>13:48:12</td> <th> Log-Likelihood: </th> <td>-1.5317e+05</td>\n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 22762</td> <th> AIC: </th> <td>3.063e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 22760</td> <th> BIC: </th> <td>3.064e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 1</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 10.1511</td> <td> 1.348</td> <td> 7.530</td> <td> 0.000</td> <td> 7.509</td> <td> 12.793</td>\n",
"</tr>\n",
"<tr>\n",
" <th>total_deaths</th> <td> 0.0169</td> <td> 8.58e-05</td> <td> 197.434</td> <td> 0.000</td> <td> 0.017</td> <td> 0.017</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>45511.193</td> <th> Durbin-Watson: </th> <td> 0.177</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>220606812.373</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td>16.141</td> <th> Prob(JB): </th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td>484.210</td> <th> Cond. No. </th> <td>1.58e+04</td> \n",
"</tr>\n",
"</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 1.58e+04. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: new_deaths R-squared: 0.631\n",
"Model: OLS Adj. R-squared: 0.631\n",
"Method: Least Squares F-statistic: 3.898e+04\n",
"Date: Mon, 15 Jun 2020 Prob (F-statistic): 0.00\n",
"Time: 13:48:12 Log-Likelihood: -1.5317e+05\n",
"No. Observations: 22762 AIC: 3.063e+05\n",
"Df Residuals: 22760 BIC: 3.064e+05\n",
"Df Model: 1 \n",
"Covariance Type: nonrobust \n",
"================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"--------------------------------------------------------------------------------\n",
"Intercept 10.1511 1.348 7.530 0.000 7.509 12.793\n",
"total_deaths 0.0169 8.58e-05 197.434 0.000 0.017 0.017\n",
"==============================================================================\n",
"Omnibus: 45511.193 Durbin-Watson: 0.177\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 220606812.373\n",
"Skew: 16.141 Prob(JB): 0.00\n",
"Kurtosis: 484.210 Cond. No. 1.58e+04\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 1.58e+04. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
"\"\"\""
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"est = smf.ols(formula='new_deaths ~ total_deaths', data=numerical).fit()\n",
"est.summary()\n",
"# Take a look at R"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Still, $R^2$ shows a medium-high value so we will consider this variable as a good candidate to get in the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.1.3 total_cases"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of total_cases follows the relation with the number of new_cases. The statistics of this variable looks like the following:"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>new_deaths</td> <th> R-squared: </th> <td> 0.614</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.614</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td>3.624e+04</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 15 Jun 2020</td> <th> Prob (F-statistic):</th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>13:48:12</td> <th> Log-Likelihood: </th> <td>-1.5369e+05</td>\n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 22762</td> <th> AIC: </th> <td>3.074e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 22760</td> <th> BIC: </th> <td>3.074e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 1</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 10.4463</td> <td> 1.379</td> <td> 7.575</td> <td> 0.000</td> <td> 7.743</td> <td> 13.149</td>\n",
"</tr>\n",
"<tr>\n",
" <th>total_cases</th> <td> 0.0011</td> <td> 5.62e-06</td> <td> 190.360</td> <td> 0.000</td> <td> 0.001</td> <td> 0.001</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>44157.847</td> <th> Durbin-Watson: </th> <td> 0.175</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>211754889.465</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td>15.018</td> <th> Prob(JB): </th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td>474.561</td> <th> Cond. No. </th> <td>2.47e+05</td> \n",
"</tr>\n",
"</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 2.47e+05. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: new_deaths R-squared: 0.614\n",
"Model: OLS Adj. R-squared: 0.614\n",
"Method: Least Squares F-statistic: 3.624e+04\n",
"Date: Mon, 15 Jun 2020 Prob (F-statistic): 0.00\n",
"Time: 13:48:12 Log-Likelihood: -1.5369e+05\n",
"No. Observations: 22762 AIC: 3.074e+05\n",
"Df Residuals: 22760 BIC: 3.074e+05\n",
"Df Model: 1 \n",
"Covariance Type: nonrobust \n",
"===============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"-------------------------------------------------------------------------------\n",
"Intercept 10.4463 1.379 7.575 0.000 7.743 13.149\n",
"total_cases 0.0011 5.62e-06 190.360 0.000 0.001 0.001\n",
"==============================================================================\n",
"Omnibus: 44157.847 Durbin-Watson: 0.175\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 211754889.465\n",
"Skew: 15.018 Prob(JB): 0.00\n",
"Kurtosis: 474.561 Cond. No. 2.47e+05\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 2.47e+05. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
"\"\"\""
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"est = smf.ols(formula='new_deaths ~ total_cases', data=numerical).fit()\n",
"est.summary()\n",
"# Take a look at R"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Still, $R^2$ shows a medium-high value so we will consider this variable as a good candidate to get in the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.4.1.4 population"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The population of the country may have an effect on new_cases. The statistics of this variable looks like the following:"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>new_deaths</td> <th> R-squared: </th> <td> 0.391</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.391</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td>1.463e+04</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 15 Jun 2020</td> <th> Prob (F-statistic):</th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>13:48:12</td> <th> Log-Likelihood: </th> <td>-1.5888e+05</td>\n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 22762</td> <th> AIC: </th> <td>3.178e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 22760</td> <th> BIC: </th> <td>3.178e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 1</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 3.8596</td> <td> 1.744</td> <td> 2.213</td> <td> 0.027</td> <td> 0.440</td> <td> 7.279</td>\n",
"</tr>\n",
"<tr>\n",
" <th>population</th> <td> 3.097e-07</td> <td> 2.56e-09</td> <td> 120.975</td> <td> 0.000</td> <td> 3.05e-07</td> <td> 3.15e-07</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>29764.648</td> <th> Durbin-Watson: </th> <td> 0.117</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>32479463.590</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td> 6.684</td> <th> Prob(JB): </th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td>187.573</td> <th> Cond. No. </th> <td>6.89e+08</td> \n",
"</tr>\n",
"</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 6.89e+08. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: new_deaths R-squared: 0.391\n",
"Model: OLS Adj. R-squared: 0.391\n",
"Method: Least Squares F-statistic: 1.463e+04\n",
"Date: Mon, 15 Jun 2020 Prob (F-statistic): 0.00\n",
"Time: 13:48:12 Log-Likelihood: -1.5888e+05\n",
"No. Observations: 22762 AIC: 3.178e+05\n",
"Df Residuals: 22760 BIC: 3.178e+05\n",
"Df Model: 1 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 3.8596 1.744 2.213 0.027 0.440 7.279\n",
"population 3.097e-07 2.56e-09 120.975 0.000 3.05e-07 3.15e-07\n",
"==============================================================================\n",
"Omnibus: 29764.648 Durbin-Watson: 0.117\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 32479463.590\n",
"Skew: 6.684 Prob(JB): 0.00\n",
"Kurtosis: 187.573 Cond. No. 6.89e+08\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 6.89e+08. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
"\"\"\""
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"est = smf.ols(formula='new_deaths ~ population', data=numerical).fit()\n",
"est.summary()\n",
"# Take a look at R"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The value obtained for $R^2$ is low compared to the other variables, so we will avoid including this variable into the models"
]
},
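{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check on the single-variable regressions above: for an OLS fit with one predictor and an intercept, $R^2$ equals the squared Pearson correlation between predictor and target. So the $R^2$ values of 0.631 and 0.614 reported above correspond to correlations of roughly 0.79 and 0.78 with new_deaths.\n",
"\n",
"The sketch below illustrates this identity on synthetic data. The arrays `x` and `y` are made up for illustration and are not taken from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Synthetic data for illustration only (not the Covid19 dataset)\n",
"rng = np.random.default_rng(0)\n",
"x = rng.normal(size=200)\n",
"y = 3.0 * x + rng.normal(size=200)\n",
"\n",
"# R^2 from a least-squares line with intercept\n",
"slope, intercept = np.polyfit(x, y, 1)\n",
"residuals = y - (slope * x + intercept)\n",
"r_squared = 1 - residuals.var() / y.var()\n",
"\n",
"# Squared Pearson correlation between x and y\n",
"r = np.corrcoef(x, y)[0, 1]\n",
"\n",
"# The two printed values agree up to floating-point rounding\n",
"print(r_squared, r ** 2)"
]
},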
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4.2 Categorical variables analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The proceeding consists of transforming the categorical value into a numerical one, study the correlations and then isolate the variables more correlated with each other.\n",
"\n",
"**note**: the data to work with needs to have no-null values, as explained before. For this reason, we can apply the proceeding to numerical categories with no problem at all."
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [],
"source": [
"# transformation to numerical values:\n",
"for i in categorical.columns:\n",
" categorical[i] = categorical[i].astype('category')\n",
"\n",
"columns = []\n",
"for i in categorical.columns:\n",
" columns.append(i)\n",
"\n",
"for i in columns:\n",
" categorical[i] = categorical[i].cat.codes"
]
},
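{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the encoding step concrete, here is a minimal sketch on a made-up series (`toy` is hypothetical and not part of the dataset). pandas assigns the codes following the sorted order of the categories, so the resulting numbers only identify the category and carry no other meaning."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical example values, not taken from the dataset\n",
"toy = pd.Series(['Spain', 'Italy', 'Spain', 'France'])\n",
"encoded = toy.astype('category').cat.codes\n",
"\n",
"# Each distinct label gets one integer code; repeated labels share the same code\n",
"print(pd.DataFrame({'label': toy, 'code': encoded}))"
]
},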
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>_id</th>\n",
" <th>iso_code</th>\n",
" <th>location</th>\n",
" <th>tests_units</th>\n",
" <th>age_status</th>\n",
" </tr>\n",
" <tr>\n",
" <th>date</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2019-12-31</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-01</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-02</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-03</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-04</th>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" _id iso_code location tests_units age_status\n",
"date \n",
"2019-12-31 0 2 0 0 2\n",
"2020-01-01 1 2 0 0 2\n",
"2020-01-02 2 2 0 0 2\n",
"2020-01-03 3 2 0 0 2\n",
"2020-01-04 4 2 0 0 2"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"categorical.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [],
"source": [
"# adding the new_deaths column\n",
"new_deaths = numerical[['new_deaths']]\n",
"categorical['new_deaths'] = new_deaths"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_id 0.118170\n",
"iso_code 0.049627\n",
"location 0.116461\n",
"tests_units -0.010107\n",
"age_status -0.053804\n",
"new_deaths 1.000000\n",
"Name: new_deaths, dtype: float64\n"
]
}
],
"source": [
"# significative correlations with the new_deaths column\n",
"cat_corr = categorical.corr()\n",
"corr_sp = cat_corr['new_deaths']\n",
"print(corr_sp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The only variable which shows some correlation values is location, which is the country itself. Let´s get the statistics from this variable:"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>new_deaths</td> <th> R-squared: </th> <td> 0.014</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.014</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 312.9</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 15 Jun 2020</td> <th> Prob (F-statistic):</th> <td>1.46e-69</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>13:48:13</td> <th> Log-Likelihood: </th> <td>-1.6437e+05</td>\n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 22762</td> <th> AIC: </th> <td>3.287e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 22760</td> <th> BIC: </th> <td>3.288e+05</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 1</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> -30.9271</td> <td> 4.380</td> <td> -7.062</td> <td> 0.000</td> <td> -39.511</td> <td> -22.343</td>\n",
"</tr>\n",
"<tr>\n",
" <th>location</th> <td> 0.6372</td> <td> 0.036</td> <td> 17.690</td> <td> 0.000</td> <td> 0.567</td> <td> 0.708</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>44721.657</td> <th> Durbin-Watson: </th> <td> 0.077</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>87127003.641</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td>15.985</td> <th> Prob(JB): </th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td>304.403</td> <th> Cond. No. </th> <td> 243.</td> \n",
"</tr>\n",
"</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: new_deaths R-squared: 0.014\n",
"Model: OLS Adj. R-squared: 0.014\n",
"Method: Least Squares F-statistic: 312.9\n",
"Date: Mon, 15 Jun 2020 Prob (F-statistic): 1.46e-69\n",
"Time: 13:48:13 Log-Likelihood: -1.6437e+05\n",
"No. Observations: 22762 AIC: 3.287e+05\n",
"Df Residuals: 22760 BIC: 3.288e+05\n",
"Df Model: 1 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept -30.9271 4.380 -7.062 0.000 -39.511 -22.343\n",
"location 0.6372 0.036 17.690 0.000 0.567 0.708\n",
"==============================================================================\n",
"Omnibus: 44721.657 Durbin-Watson: 0.077\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 87127003.641\n",
"Skew: 15.985 Prob(JB): 0.00\n",
"Kurtosis: 304.403 Cond. No. 243.\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"est = smf.ols(formula='new_deaths ~ location', data=categorical).fit()\n",
"est.summary()\n",
"# Take a look at R"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The low value of $R^2$ corresponding to this variable shows no correlation at all. So the country is not a definitive variable to look at in terms of prediction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting to this point, we will select for the moment the following variables to include into the prediction models: \n",
"\n",
"- \"new_deaths\"\n",
"- \"new_cases\"\n",
"- \"total_deaths\"\n",
"- \"total_cases\"\n",
"\n",
"Maybe later we can include \"population\" or other variables to see if we can improve our models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4.3 Plotting each of these variables to see visually any correlation\n",
"\n",
"Now we will use scatter plots to try to detect any potential correlations between variables"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"features = [\"new_deaths\",\"new_cases\",\"total_deaths\",\"total_cases\"]"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1008x1008 with 16 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from pandas.plotting import scatter_matrix\n",
"\n",
"scatter_matrix(df[features], figsize = (14,14), diagonal = 'kde');\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4.4. WoE (Weight of Evidence)\n",
"\n",
"This factor shows the weight that each category of a variable has in terms of relation with the target variable to predict, and therefore can be used to reduce the number of categories and make simpler the prediction model with less categories to take into account"
]
},
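{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a common formulation of WoE for a category $j$, given a binary target, is\n",
"\n",
"$WoE_j = \\ln\\left(\\dfrac{\\text{share of non-events in } j}{\\text{share of events in } j}\\right)$\n",
"\n",
"where each share is the count of that class in category $j$ divided by the total count of that class. The sketch below implements this formulation on made-up data. The function name `woe_table` and the columns `grp` and `event` are assumptions for illustration, and this is not necessarily the exact computation used in the rest of this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def woe_table(data, var, target):\n",
"    # Counts of non-events (target == 0) and events (target == 1) per category\n",
"    counts = pd.crosstab(data[var], data[target])\n",
"    # Share of each class that falls into each category\n",
"    dist = counts / counts.sum(axis=0)\n",
"    # WoE = ln(share of non-events / share of events)\n",
"    return np.log(dist[0] / dist[1])\n",
"\n",
"# Hypothetical toy data: 'event' is a binary flag, 'grp' a categorical variable\n",
"toy = pd.DataFrame({\n",
"    'grp': ['low', 'low', 'low', 'low', 'high', 'high', 'high', 'high'],\n",
"    'event': [0, 0, 0, 1, 0, 1, 1, 1],\n",
"})\n",
"print(woe_table(toy, 'grp', 'event'))"
]
},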
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new_cases</th>\n",
" </tr>\n",
" <tr>\n",
" <th>new_cases</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>-2461</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-1480</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-713</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-525</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-209</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>126684</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>127662</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>127796</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>132786</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>133510</th>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2060 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" new_cases\n",
"new_cases \n",
"-2461 1\n",
"-1480 1\n",
"-713 1\n",
"-525 1\n",
"-209 1\n",
"... ...\n",
" 126684 1\n",
" 127662 1\n",
" 127796 1\n",
" 132786 1\n",
" 133510 1\n",
"\n",
"[2060 rows x 1 columns]"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(df[\"new_cases\"].groupby(df[\"new_cases\"]).count())"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df[\"new_cases\"].hist(bins=200)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distribution for new_deaths is clearly unbalanced because there are so many values close to 0 cases that we may need to categorize as follows:"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"def get_WoE_one(data, var, target):\n",
" crosstab = pd.crosstab(data[target], data[var])\n",
" \n",
" print(\"Obtaining WoE for variable \", var, \":\")\n",
" \n",
" for col in crosstab.columns:\n",
" if crosstab[col][1] == 0:\n",
" print(\" WoE for \", col, \"[\", sum(crosstab[col]), \"] is infinity\")\n",
" else:\n",
" print(\" WoE for \", col, \"[\", sum(crosstab[col]), \"] is\", np.log(float(crosstab[col][0]) / float(crosstab[col][1])))"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
"df.loc[:, \"new_cases_grp\"] = df[\"new_cases\"].map(lambda x: \"<5\" if x <5 else \"<50\" if x < 50 else \"<100\" if x <100 else \"<250\" if x <250 else \"<500\" if x <500 else \"<5000\" if x <5000 else \">5000\")\n",
"df.loc[:, \"total_cases_grp\"] = df[\"total_cases\"].map(lambda x: \"<5\" if x <5 else \"<50\" if x < 50 else \"<100\" if x <100 else \"<250\" if x <250 else \"<500\" if x <500 else \"<5000\" if x <5000 else \">5000\")\n",
"df.loc[:, \"total_deaths_grp\"] = df[\"total_deaths\"].map(lambda x: \"<5\" if x <5 else \"<50\" if x < 50 else \"<100\" if x <100 else \"<250\" if x <250 else \"<500\" if x <500 else \"<5000\" if x <5000 else \">5000\")"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Obtaining WoE for variable new_cases_grp :\n",
" WoE for <100 [ 1280 ] is 0.6931471805599453\n",
" WoE for <250 [ 1509 ] is 0.4126079956205445\n",
" WoE for <5 [ 12172 ] is 3.5658776284485927\n",
" WoE for <50 [ 4701 ] is 1.3644684116915666\n",
" WoE for <500 [ 990 ] is 0.36772478012531734\n",
" WoE for <5000 [ 1754 ] is 0.9098182173685376\n",
" WoE for >5000 [ 356 ] is infinity\n"
]
}
],
"source": [
"get_WoE_one(df, 'new_cases_grp','new_deaths')"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Obtaining WoE for variable total_cases_grp :\n",
" WoE for <100 [ 1445 ] is 2.4810766948957013\n",
" WoE for <250 [ 1914 ] is 2.1613435619435557\n",
" WoE for <5 [ 4711 ] is 5.889729928565458\n",
" WoE for <50 [ 4450 ] is 3.553720721830038\n",
" WoE for <500 [ 1562 ] is 1.7038900913277886\n",
" WoE for <5000 [ 4738 ] is 1.0360514565081735\n",
" WoE for >5000 [ 3942 ] is 0.8353840932613548\n"
]
}
],
"source": [
"get_WoE_one(df, 'total_cases_grp','new_deaths')"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Obtaining WoE for variable total_deaths_grp :\n",
" WoE for <100 [ 1114 ] is 0.5635543515143511\n",
" WoE for <250 [ 1261 ] is 0.6654226325450905\n",
" WoE for <5 [ 12380 ] is 3.3786403471830004\n",
" WoE for <50 [ 5079 ] is 1.3120954592093432\n",
" WoE for <500 [ 727 ] is 0.29334780998745824\n",
" WoE for <5000 [ 1450 ] is 1.5668782980153044\n",
" WoE for >5000 [ 751 ] is 0.3364722366212129\n"
]
}
],
"source": [
"get_WoE_one(df, 'total_deaths_grp','new_deaths')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By looking at the previous values, we can conclude all levels of these columns can have sufficient weight to have positive impact when including it into the future model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4.5 Conclusions"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 648x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# correlation matrix with most relevant variables to include in the model\n",
"cm = df[[\"new_deaths\",\"new_cases\",\"total_deaths\",\"total_cases\"]].corr()\n",
"sns.set(font_scale=1)\n",
"f, ax = plt.subplots(figsize=(9,5))\n",
"hm = sns.heatmap(cm, cbar=True, annot= True, square= True, fmt ='.2g')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly all variables show strong correlation against the new_deaths target variable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Feature Engineering and Data Cleaning: <a id='part2'></a>\n",
"\n",
"With this transformed variable, we are going to obtain the dummy variables, delete colineal variables and separate the datasets into training and validation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1. Dummy variables <a id='part2_1'></a>\n",
"\n",
"Creating dummy variables means that every category group will be transformed into a unique number or index so that the future model and feature selection can work with."
]
},
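{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of what dummy encoding produces, the sketch below applies `pd.get_dummies` to a made-up column (`toy` is hypothetical and not part of the dataset). Each distinct label becomes its own indicator column, and every row has exactly one active indicator. Because the indicators of one group always sum to 1, they are perfectly collinear with a constant term, which is one reason a collinearity check is useful after this step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical example values using the same bucket labels as the grouped columns\n",
"toy = pd.DataFrame({'new_cases_grp': ['<5', '<50', '<5', '>5000']})\n",
"dummies = pd.get_dummies(toy['new_cases_grp'], prefix='new_cases_grp')\n",
"\n",
"# One indicator column per distinct label\n",
"print(dummies)\n",
"# Exactly one indicator is active per row, so each row sums to 1\n",
"print(dummies.sum(axis=1))"
]
},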
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [],
"source": [
"data_model = pd.concat(((pd.get_dummies(df['new_cases_grp'], prefix= 'new_cases_grp')),\n",
" (pd.get_dummies(df['total_cases_grp'], prefix= 'total_cases_grp')),\n",
" (pd.get_dummies(df['total_deaths_grp'], prefix= 'total_deaths_grp'))), axis = 1)"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new_cases_grp_&lt;100</th>\n",
" <th>new_cases_grp_&lt;250</th>\n",
" <th>new_cases_grp_&lt;5</th>\n",
" <th>new_cases_grp_&lt;50</th>\n",
" <th>new_cases_grp_&lt;500</th>\n",
" <th>new_cases_grp_&lt;5000</th>\n",
" <th>new_cases_grp_&gt;5000</th>\n",
" <th>total_cases_grp_&lt;100</th>\n",
" <th>total_cases_grp_&lt;250</th>\n",
" <th>total_cases_grp_&lt;5</th>\n",
" <th>total_cases_grp_&lt;50</th>\n",
" <th>total_cases_grp_&lt;500</th>\n",
" <th>total_cases_grp_&lt;5000</th>\n",
" <th>total_cases_grp_&gt;5000</th>\n",
" <th>total_deaths_grp_&lt;100</th>\n",
" <th>total_deaths_grp_&lt;250</th>\n",
" <th>total_deaths_grp_&lt;5</th>\n",
" <th>total_deaths_grp_&lt;50</th>\n",
" <th>total_deaths_grp_&lt;500</th>\n",
" <th>total_deaths_grp_&lt;5000</th>\n",
" <th>total_deaths_grp_&gt;5000</th>\n",
" </tr>\n",
" <tr>\n",
" <th>date</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2019-12-31</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-01</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-02</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-03</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2020-01-04</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new_cases_grp_<100 new_cases_grp_<250 new_cases_grp_<5 \\\n",
"date \n",
"2019-12-31 0 0 1 \n",
"2020-01-01 0 0 1 \n",
"2020-01-02 0 0 1 \n",
"2020-01-03 0 0 1 \n",
"2020-01-04 0 0 1 \n",
"\n",
" new_cases_grp_<50 new_cases_grp_<500 new_cases_grp_<5000 \\\n",
"date \n",
"2019-12-31 0 0 0 \n",
"2020-01-01 0 0 0 \n",
"2020-01-02 0 0 0 \n",
"2020-01-03 0 0 0 \n",
"2020-01-04 0 0 0 \n",
"\n",
" new_cases_grp_>5000 total_cases_grp_<100 total_cases_grp_<250 \\\n",
"date \n",
"2019-12-31 0 0 0 \n",
"2020-01-01 0 0 0 \n",
"2020-01-02 0 0 0 \n",
"2020-01-03 0 0 0 \n",
"2020-01-04 0 0 0 \n",
"\n",
" total_cases_grp_<5 total_cases_grp_<50 total_cases_grp_<500 \\\n",
"date \n",
"2019-12-31 1 0 0 \n",
"2020-01-01 1 0 0 \n",
"2020-01-02 1 0 0 \n",
"2020-01-03 1 0 0 \n",
"2020-01-04 1 0 0 \n",
"\n",
" total_cases_grp_<5000 total_cases_grp_>5000 \\\n",
"date \n",
"2019-12-31 0 0 \n",
"2020-01-01 0 0 \n",
"2020-01-02 0 0 \n",
"2020-01-03 0 0 \n",
"2020-01-04 0 0 \n",
"\n",
" total_deaths_grp_<100 total_deaths_grp_<250 total_deaths_grp_<5 \\\n",
"date \n",
"2019-12-31 0 0 1 \n",
"2020-01-01 0 0 1 \n",
"2020-01-02 0 0 1 \n",
"2020-01-03 0 0 1 \n",
"2020-01-04 0 0 1 \n",
"\n",
" total_deaths_grp_<50 total_deaths_grp_<500 \\\n",
"date \n",
"2019-12-31 0 0 \n",
"2020-01-01 0 0 \n",
"2020-01-02 0 0 \n",
"2020-01-03 0 0 \n",
"2020-01-04 0 0 \n",
"\n",
" total_deaths_grp_<5000 total_deaths_grp_>5000 \n",
"date \n",
"2019-12-31 0 0 \n",
"2020-01-01 0 0 \n",
"2020-01-02 0 0 \n",
"2020-01-03 0 0 \n",
"2020-01-04 0 0 "
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_model.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will remove collinear variables using Linear Regression. For that we define a function that computes the VIF (Variance Inflation Factor), which quantifies multicollinearity: how much the variance of a coefficient estimate is inflated because that variable is collinear with the others. For variable $i$, $R_i^2$ is the coefficient of determination obtained by regressing it on all the remaining variables. For example, $R_i^2 = 0.8$ gives a VIF of 5.\n",
"\n",
"$VIF_i = \\dfrac{1}{1 - R_i^2}$"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"def calculateVIF(data):\n",
"    # Returns a one-row DataFrame with the VIF of every column in data\n",
"    features = list(data.columns)\n",
"\n",
"    model = LinearRegression()\n",
"\n",
"    result = pd.DataFrame(index = ['VIF'], columns = features)\n",
"    result = result.fillna(0)\n",
"\n",
"    for y_feature in features:\n",
"        # Regress each variable on all the remaining ones\n",
"        x_features = features[:]\n",
"        x_features.remove(y_feature)\n",
"\n",
"        model.fit(data[x_features], data[y_feature])\n",
"        r2 = model.score(data[x_features], data[y_feature])\n",
"        # A perfect fit (R2 = 1) makes the denominator zero: depending on the float type,\n",
"        # this either raises ZeroDivisionError (VIF set to 5) or evaluates to inf\n",
"        try:\n",
"            result[y_feature] = 1/(1 - r2)\n",
"        except ZeroDivisionError:\n",
"            result[y_feature] = 5\n",
"    return result\n",
"\n",
"def selectDataUsingVIF(data, max_VIF = 5):\n",
"    # Drops the column with the highest VIF until every VIF is at most max_VIF\n",
"    result = data.copy(deep = True)\n",
"\n",
"    VIF = calculateVIF(result)\n",
"\n",
"    while VIF.to_numpy().max() > max_VIF:\n",
"        col_max = np.where(VIF == VIF.to_numpy().max())[1][0]\n",
"        features = list(result.columns)\n",
"        features.remove(features[col_max])\n",
"        result = result[features]\n",
"\n",
"        VIF = calculateVIF(result)\n",
"\n",
"    return result"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new_cases_grp_&lt;250</th>\n",
" <th>new_cases_grp_&lt;50</th>\n",
" <th>new_cases_grp_&lt;500</th>\n",
" <th>new_cases_grp_&lt;5000</th>\n",
" <th>new_cases_grp_&gt;5000</th>\n",
" <th>total_cases_grp_&lt;250</th>\n",
" <th>total_cases_grp_&lt;5</th>\n",
" <th>total_cases_grp_&lt;50</th>\n",
" <th>total_cases_grp_&lt;500</th>\n",
" <th>total_cases_grp_&lt;5000</th>\n",
" <th>total_deaths_grp_&lt;250</th>\n",
" <th>total_deaths_grp_&lt;50</th>\n",
" <th>total_deaths_grp_&lt;500</th>\n",
" <th>total_deaths_grp_&lt;5000</th>\n",
" <th>total_deaths_grp_&gt;5000</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>VIF</th>\n",
" <td>1.361721</td>\n",
" <td>1.384048</td>\n",
" <td>1.339346</td>\n",
" <td>2.03413</td>\n",
" <td>1.573668</td>\n",
" <td>1.770469</td>\n",
" <td>3.06613</td>\n",
" <td>2.848104</td>\n",
" <td>1.686894</td>\n",
" <td>2.524725</td>\n",
" <td>1.40448</td>\n",
" <td>1.81879</td>\n",
" <td>1.392213</td>\n",
" <td>1.995398</td>\n",
" <td>1.978372</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new_cases_grp_<250 new_cases_grp_<50 new_cases_grp_<500 \\\n",
"VIF 1.361721 1.384048 1.339346 \n",
"\n",
" new_cases_grp_<5000 new_cases_grp_>5000 total_cases_grp_<250 \\\n",
"VIF 2.03413 1.573668 1.770469 \n",
"\n",
" total_cases_grp_<5 total_cases_grp_<50 total_cases_grp_<500 \\\n",
"VIF 3.06613 2.848104 1.686894 \n",
"\n",
" total_cases_grp_<5000 total_deaths_grp_<250 total_deaths_grp_<50 \\\n",
"VIF 2.524725 1.40448 1.81879 \n",
"\n",
" total_deaths_grp_<500 total_deaths_grp_<5000 total_deaths_grp_>5000 \n",
"VIF 1.392213 1.995398 1.978372 "
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# selectDataUsingVIF drops, one at a time, the variable with the highest VIF (including infinite values)\n",
"# until every remaining VIF is at most max_VIF; calculateVIF then shows the VIF of the variables that remain.\n",
"# calculateVIF has a try/except that assigns a VIF of 5 when the denominator is zero and the division fails.\n",
"\n",
"model_vars = selectDataUsingVIF(data_model)\n",
"calculateVIF(model_vars)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the remaining variables have a VIF below the threshold of 5, so they will be considered for further analysis and predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2. Variable selection with low variances <a id='part2_2'></a>\n",
"\n",
"This step removes the variables whose variance falls below a chosen threshold (0.2 here), since nearly constant variables carry little information."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original variables 3\n",
"Final variables 3\n",
"Variable List ['new_cases' 'total_deaths' 'total_cases']\n"
]
}
],
"source": [
"from sklearn.feature_selection import VarianceThreshold\n",
"features.remove(\"new_deaths\")\n",
"x = df[features]\n",
"y = df[\"new_deaths\"]\n",
"var_th = VarianceThreshold(threshold = 0.2)\n",
"x_var = var_th.fit_transform(x)\n",
"\n",
"print(\"Original variables \", x.shape[1])\n",
"print(\"Final variables \", x_var.shape[1])\n",
"\n",
"print(\"Variable List \", np.asarray(list(x))[var_th.get_support()])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3. Univariate Variable Selection <a id='part2_3'></a>\n",
"\n",
"In this step, each variable is tested individually to see how much of the variance of the target variable it can explain on its own."
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(22762, 3)\n",
"Variable list ['new_cases' 'total_deaths' 'total_cases']\n"
]
}
],
"source": [
"from sklearn.feature_selection import SelectKBest\n",
"from sklearn.feature_selection import f_regression\n",
"from sklearn.feature_selection import chi2\n",
"\n",
"# For regression models use f_regression\n",
"\n",
"S_f3 = SelectKBest(f_regression, k = 3)\n",
"X_f3 = S_f3.fit_transform(x, y)\n",
"\n",
"print(X_f3.shape)\n",
"print(\"Variable list \", np.asarray(list(x))[S_f3.get_support()])\n",
"\n",
"# For classification models use chi2\n",
"#S_chi3 = SelectKBest(chi2, k = 3)\n",
"#X_chi3 = S_chi3.fit_transform(x, y)\n",
"\n",
"#print(X_chi3.shape)\n",
"#print(\"Variable list \", np.asarray(list(x))[S_chi3.get_support()])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4. Variable selection based on percentile scores <a id='part2_4'></a>"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(22762, 1)\n",
"Variable list ['new_cases']\n",
"(22762, 2)\n",
"Variable list ['new_cases' 'total_deaths']\n"
]
}
],
"source": [
"from sklearn.feature_selection import SelectPercentile\n",
"from sklearn.feature_selection import f_regression\n",
"\n",
"# Keep the variables whose F-score is in the top 50%\n",
"S_per5 = SelectPercentile(f_regression, percentile = 50)\n",
"X_per5 = S_per5.fit_transform(x, y)\n",
"\n",
"print(X_per5.shape)\n",
"print(\"Variable list \", np.asarray(list(x))[S_per5.get_support()])\n",
"\n",
"# Keep the variables whose F-score is in the top 70%\n",
"S_per7 = SelectPercentile(f_regression, percentile = 70)\n",
"X_per7 = S_per7.fit_transform(x, y)\n",
"\n",
"print(X_per7.shape)\n",
"print(\"Variable list \", np.asarray(list(x))[S_per7.get_support()])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Predictive Modelling (Scikit-Learn):<a id='part3'></a>\n",
"\n",
"Now that we have explored the data with the EDA methods, we will build predictive models to check whether the selected variables are useful for predicting new_deaths, and how large the prediction error is. As we want to predict a numeric value, we will use regression algorithms such as:\n",
"\n",
"- Linear Regression\n",
"- Ridge\n",
"- Lasso\n",
"\n",
"\n",
"The following algorithms are typically used for classification models (although several of them also have regression variants), so they will not be used here:\n",
"\n",
"- Logistic Regression\n",
"- Support Vector Machines (Linear and radial)\n",
"- Random Forest\n",
"- K-Nearest Neighbours\n",
"- Naive Bayes\n",
"- Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1. Linear Regression<a id='part3_1'></a>"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2 in training data is: 0.9194404012881845\n",
"R2 in validation is: 0.8745137899078896\n"
]
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split the dataset into training and validation sets\n",
"x_train, x_test, y_train, y_test = train_test_split(x, y)\n",
"\n",
"# Create the model\n",
"model = LinearRegression()\n",
"model.fit(x_train, y_train)\n",
"\n",
"predit_train = model.predict(x_train)\n",
"predit_test = model.predict(x_test)\n",
"\n",
"# R2 evaluation\n",
"print('R2 in training data is: ', model.score(x_train, y_train))\n",
"print('R2 in validation is: ', model.score(x_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we get high R2 values: about 0.92 on the training data and 0.87 on the validation data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will apply regularization, which penalizes the size of the coefficients of the linear regression model, using the Ridge and Lasso models.\n",
"\n",
"### 3.2. Ridge regression (regularization)<a id='part3_2'></a>\n",
"\n",
"#### Regularized Linear Models\n",
"\n",
"A good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data. For example, a simple way to regularize a polynomial model is to reduce the number of polynomial degrees.\n",
"\n",
"For a linear model, regularization is typically achieved by constraining the weights of the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.\n",
"\n",
"\n",
"###### Ridge Regression\n",
"Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term equal to $\\alpha \\sum_{i=1}^{n} \\theta_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model’s performance using the unregularized performance measure.\n",
"\n",
"NOTE\n",
"It is quite common for the cost function used during training to be different from the performance measure used for testing. Apart from regularization, another reason why they might be different is that a good training cost function should have optimization-friendly derivatives, while the performance measure used for testing should be as close as possible to the final objective. A good example of this is a classifier trained using a cost function such as the log loss but evaluated using precision/recall.\n",
"\n",
"The hyperparameter α controls how much you want to regularize the model. If α = 0 then Ridge Regression is just Linear Regression. If α is very large, then all weights end up very close to zero and the result is a flat line going through the data’s mean. Equation 4-8 presents the Ridge Regression cost function.\n",
"\n",
"Equation 4-8. Ridge Regression cost function\n",
"<img src=\"./ridge_eq.png\"></img>\n",
"\n",
"Note that the bias term $\\theta_0$ is not regularized (the sum starts at i = 1, not 0). If we define w as the vector of feature weights ($\\theta_1$ to $\\theta_n$), then the regularization term is simply equal to $\\frac{1}{2}(\\lVert \\mathbf{w} \\rVert_2)^2$, where $\\lVert \\mathbf{w} \\rVert_2$ represents the ℓ2 norm of the weight vector. For Gradient Descent, just add αw to the MSE gradient vector (Equation 4-6).\n",
"\n",
"WARNING\n",
"\n",
"It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.\n",
"\n",
"Figure 4-17 shows several Ridge models trained on some linear data using different α values. On the left, plain Ridge models are used, leading to linear predictions. On the right, the data is first expanded using PolynomialFeatures(degree=10), then it is scaled using a StandardScaler, and finally the Ridge models are applied to the resulting features: this is Polynomial Regression with Ridge regularization. Note how increasing α leads to flatter (i.e., less extreme, more reasonable) predictions; this reduces the model’s variance but increases its bias.\n",
"\n",
"\n",
"<img src=\"./ridge_pic.png\"></img>\n",
"\n",
"As with Linear Regression, we can perform Ridge Regression either by computing a closed-form equation or by performing Gradient Descent. The pros and cons are the same. Equation 4-9 shows the closed-form solution (where A is the (n + 1) × (n + 1) identity matrix except with a 0 in the top-left cell, corresponding to the bias term).\n",
"\n",
"\n",
"<img src=\"./ridge_closed_eq.png\"></img>"
]
},
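{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, and assuming the images above show the standard formulation that the surrounding text describes, the Ridge Regression cost function and its closed-form solution can be written as follows. The symbols $\\mathbf{X}$ (feature matrix with a leading column of ones) and $\\mathbf{y}$ (target vector) are notation introduced here for illustration:\n",
"\n",
"$$J(\\boldsymbol{\\theta}) = \\mathrm{MSE}(\\boldsymbol{\\theta}) + \\alpha \\dfrac{1}{2} \\sum_{i=1}^{n} \\theta_i^2$$\n",
"\n",
"$$\\hat{\\boldsymbol{\\theta}} = \\left(\\mathbf{X}^T \\mathbf{X} + \\alpha \\mathbf{A}\\right)^{-1} \\mathbf{X}^T \\mathbf{y}$$\n",
"\n",
"The factor $\\frac{1}{2}$ is a convention that simplifies the gradient: differentiating the penalty with respect to the weights gives exactly $\\alpha \\mathbf{w}$, the term added to the MSE gradient. With $\\alpha = 0$ both expressions reduce to plain Linear Regression."
]
},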
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2 in training is: 0.9194404012881845\n",
"R2 in validation is: 0.8745137899078871\n",
"b_0 is: -0.6070945465987521 and b_1 is: 0.08142462773339545\n",
"[ 0.08142463 0.0234741 -0.00208438]\n",
"R2 scores mean is: 0.106540017150107\n",
"R2 scores are: [ 0.59743214 0.27258325 0.76480953 -1.96319703 0.86107221]\n"
]
}
],
"source": [
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import cross_val_score\n",
"model_ridge = Ridge(alpha = 0.01)\n",
"model_ridge.fit(x_train, y_train)\n",
"\n",
"predit_train = model_ridge.predict(x_train)\n",
"predit_test = model_ridge.predict(x_test)\n",
"\n",
"# R2 evaluation\n",
"print('R2 in training is: ', model_ridge.score(x_train, y_train))\n",
"print('R2 in validation is: ', model_ridge.score(x_test, y_test))\n",
"print(\"b_0 is:\", model_ridge.intercept_, \"and b_1 is:\", model_ridge.coef_[0])\n",
"print(model_ridge.coef_)\n",
"\n",
"scores = cross_val_score(model_ridge, x, y, cv = 5)\n",
"\n",
"print('R2 scores mean is: ', scores.mean())\n",
"print('R2 scores are: ', scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.3. Lasso regression (regularization)<a id='part3_3'></a>\n",
"\n",
"Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm (see Equation 4-10).\n",
"\n",
"\n",
"<img src=\"./lasso_eq.png\"></img>\n",
"\n",
"<img src=\"./lasso_pics.png\"></img>\n",
"\n",
"An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). For example, the dashed line in the right plot on Figure 4-18 (with $\\alpha = 10^{-7}$) looks quadratic, almost linear: all the weights for the high-degree polynomial features are equal to zero. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights)."
]
},
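{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, and assuming the equation image above shows the standard formulation described in the text, the Lasso Regression cost function replaces the squared penalty of Ridge with the sum of absolute values of the weights:\n",
"\n",
"$$J(\\boldsymbol{\\theta}) = \\mathrm{MSE}(\\boldsymbol{\\theta}) + \\alpha \\sum_{i=1}^{n} \\left| \\theta_i \\right|$$\n",
"\n",
"As with Ridge, the sum starts at $i = 1$, so the bias term $\\theta_0$ is not penalized. Because the absolute value keeps a constant slope as a weight approaches zero, the penalty can push small weights to exactly zero, which is what produces the sparse models mentioned above."
]
},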
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.4. Elastic Net (regularization)<a id='part3_4'></a>\n",
"\n",
"Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression (see Equation 4-12).\n",
"\n",
"\n",
"<img src=\"./elastic_net_eq.png\"></img>\n",
"\n",
"So when should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features’ weights down to zero as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.\n"
]
},
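{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, and assuming the equation image above shows the standard formulation described in the text, the Elastic Net cost function combines both penalties through the mix ratio $r$:\n",
"\n",
"$$J(\\boldsymbol{\\theta}) = \\mathrm{MSE}(\\boldsymbol{\\theta}) + r \\alpha \\sum_{i=1}^{n} \\left| \\theta_i \\right| + \\dfrac{1 - r}{2} \\alpha \\sum_{i=1}^{n} \\theta_i^2$$\n",
"\n",
"Setting $r = 0$ removes the ℓ1 term and leaves the Ridge penalty $\\alpha \\frac{1}{2} \\sum_{i=1}^{n} \\theta_i^2$, while $r = 1$ removes the ℓ2 term and leaves the Lasso penalty $\\alpha \\sum_{i=1}^{n} |\\theta_i|$. In Scikit-Learn, the mix ratio corresponds to the l1_ratio parameter of ElasticNet."
]
},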
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2 in training is: 0.9194404012881412\n",
"R2 in validation is: 0.8745137784758792\n",
"b_0 is: -0.6070917910116265 and b_1 is: 0.08142460579829844\n",
"[ 0.08142461 0.02347407 -0.00208437]\n",
"R2 scores mean is: 0.10654021668708305\n",
"R2 scores are: [ 0.59743234 0.27258365 0.76480959 -1.96319595 0.86107145]\n"
]
}
],
"source": [
"from sklearn.linear_model import Lasso\n",
"\n",
"model_lasso = Lasso(alpha = 0.1)\n",
"model_lasso.fit(x_train, y_train)\n",
"\n",
"predit_train = model_lasso.predict(x_train)\n",
"predit_test = model_lasso.predict(x_test)\n",
"\n",
"# R2 evaluation\n",
"print('R2 in training is: ', model_lasso.score(x_train, y_train))\n",
"print('R2 in validation is: ', model_lasso.score(x_test, y_test))\n",
"print(\"b_0 is:\", model_lasso.intercept_, \"and b_1 is:\", model_lasso.coef_[0])\n",
"print(model_lasso.coef_)\n",
"\n",
"scores = cross_val_score(model_lasso, x, y, cv = 5)\n",
"print('R2 scores mean is: ', scores.mean())\n",
"print('R2 scores are: ', scores)"
]
},
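{
"cell_type": "markdown",
"metadata": {},
"source": [
"For these values of alpha, R2 on the train/validation split is again close to 0.9 (about 0.92 and 0.87), practically identical to plain Linear Regression. However, the 5-fold cross-validation scores vary widely (from -1.96 to 0.86, with a mean of about 0.11), so the result depends strongly on how the data is split."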
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.5. Model evaluation through a polynomial degree function <a id='part3_5'></a>\n",
"\n",
"We will create a function that fits a polynomial of a given degree and uses cross-validation to estimate its error, so that different degrees can be compared."
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [],
"source": [
"\n",
"#X = np.arange(1,n_samples, 1)\n",
"#y = list(y)\n",
"y_result = df.groupby([\"date\"])[\"new_deaths\"].sum().values\n",
"n_samples = len(y_result)\n",
"X = np.arange(1,n_samples+1, 1)\n",
"#y = df[\"new_deaths\"].values"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"\n",
"def evaluateFit(degrees, X, y):\n",
"    # Fit a polynomial regression of the given degree on the whole series\n",
"    polynomial_features = PolynomialFeatures(degree = degrees, include_bias = False)\n",
"\n",
"    linear_regression = LinearRegression()\n",
"    pipeline = Pipeline([(\"polynomial_features\", polynomial_features), (\"linear_regression\", linear_regression)])\n",
"    pipeline.fit(X[:, np.newaxis], y)\n",
"\n",
"    # Estimate the error with 10-fold cross-validation (scores are negative MSE)\n",
"    scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring = \"neg_mean_squared_error\", cv = 10)\n",
"\n",
"    # Plot the fitted curve against the observed values\n",
"    X_test = np.arange(1, len(y) + 1, 1)\n",
"    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label=\"Model\")\n",
"    plt.plot(X_test, y, label = \"True function\")\n",
"    plt.scatter(X, y, label = \"Samples\")\n",
"    plt.xlabel(\"x\")\n",
"    plt.ylabel(\"y\")\n",
"    plt.legend(loc=\"best\")\n",
"    plt.title(\"Degree {}\\nMSE = {:.2e}(+/- {:.2e})\".format(degrees, -scores.mean(), scores.std()))"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [],
"source": [
"df.reset_index(inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.close()\n",
"evaluateFit(1, X, y_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the previous plot shows, a simple Linear Regression (degree 1) only follows the overall trend of the data."
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.close()\n",
"evaluateFit(5, X, y_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the previous plot shows, a degree-5 polynomial regression follows the trend of the data much better than the simple Linear Regression."
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.close()\n",
"evaluateFit(15, X, y_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the previous plot shows, a degree-15 polynomial regression follows the data even more closely than the degree-5 one. Note that a closer fit to the training data can also be a sign of overfitting, so the cross-validated MSE shown in each plot title should be compared as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.6. Validation curves for new_deaths <a id='part3_6'></a>\n",
"\n",
"As the data may be noisy and unbalanced, we should compute the validation curve for a class of models. Here we will again use the polynomial regression model, varying the degree hyperparameter.\n",
"\n",
"The question is which degree gives a suitable trade-off between bias (under-fitting) and variance (over-fitting).\n",
"Let's use validation_curve from sklearn. \n",
"\n",
"The more erroneous the assumptions with respect to the true relationship, the higher the bias, and vice versa. A low-bias method fits training data very well.\n",
"\n",
"<img src=\"./low_bias_high_variance.png\">\n",
"\n",
"You can see that a low-bias method captures most of the differences (even the minor ones) between the different training sets. The fitted model varies a lot as we change training sets, and this indicates high variance.\n",
"\n",
"\n",
"The reverse also holds: the greater the bias, the lower the variance. A high-bias method builds simplistic models that generally don’t fit training data well. As we change training sets, the models we get from a high-bias algorithm are, generally, not very different from one another.\n",
"\n",
"\n",
"<img src=\"./high_bias_low_variance.png\">\n",
"\n",
"In practice, however, we need to accept a trade-off. We can’t have both low bias and low variance, so we want to aim for something in the middle.\n",
"\n",
"\n",
"<img src=\"./acceptable_bias_variance.png\">\n",
"\n",
"Given a model, data, a parameter name and a range of values to explore, validation_curve automatically computes both the training scores and the validation scores across the range of hyperparameter values:"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [],
"source": [
"x_features = df[['new_cases', 'total_deaths', 'total_cases']]\n",
"y_feature = df[\"new_deaths\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-Learn provides a polynomial preprocessor (PolynomialFeatures) that can be combined with LinearRegression in a pipeline."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"def PolynomialRegression(degree=2, **kwargs):\n",
"    return make_pipeline(PolynomialFeatures(degree),\n",
"                         LinearRegression(**kwargs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"learning_curve() returns the scores of a model for several training set sizes. With the neg_mean_squared_error scoring these are negative MSE values, so we flip their sign to obtain the MSE. As we specify 8 training set sizes and cv = 5, it returns two arrays (training and validation scores), each with eight rows and five columns. This is because learning_curve() runs a k-fold cross-validation under the hood, where the value of k is given by the cv parameter. In our case cv = 5, so there are five splits, and for each split an estimator is trained for every training set size specified. Each column in the two arrays corresponds to a split, and each row corresponds to a training set size."
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [],
"source": [
"train_sizes = [1, 10, 40, 60, 80, 100, 120, 140]\n",
"from sklearn.model_selection import learning_curve\n",
"\n",
"train_sizes, train_scores, val_scores = learning_curve(estimator=PolynomialRegression(2), X=x_features, y=y_feature, train_sizes=train_sizes, cv=5, scoring=\"neg_mean_squared_error\")\n",
"#train_sizes, train_scores, val_scores = learning_curve(estimator=LinearRegression(), X=x_features, y=y_feature, train_sizes=train_sizes, cv=5, scoring=\"neg_mean_squared_error\")\n",
"\n",
"train_scores_mean = -train_scores.mean(axis = 1)\n",
"validation_scores_mean = -val_scores.mean(axis = 1)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x34b8eee288>"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.style.use('seaborn')\n",
"plt.plot(train_sizes, train_scores_mean, label = 'Training error')\n",
"plt.plot(train_sizes, validation_scores_mean, label = 'Validation error')\n",
"plt.ylabel('MSE', fontsize = 14)\n",
"plt.xlabel('Training set size', fontsize = 14)\n",
"plt.title('Learning curves for a linear regression model', fontsize = 18, y = 1.03)\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MSE is not very high here, and the degree hyperparameter of PolynomialFeatures() can be tuned to lower the error in this graph further. As the training set size increases, the error peaks at a training set size of around 100 and then drops again. To see clearer trends we would need more data points, because we are using fewer than 200 here."
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"def plot_learning_curves(model, X, y):\n",
"    # Train on growing subsets of the training set and track the RMSE on both sets\n",
"    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)\n",
"    train_errors, val_errors = [], []\n",
"    for m in range(1, len(X_train)):\n",
"        model.fit(X_train[:m], y_train[:m])\n",
"        y_train_predict = model.predict(X_train[:m])\n",
"        y_val_predict = model.predict(X_val)\n",
"        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))\n",
"        val_errors.append(mean_squared_error(y_val, y_val_predict))\n",
"    plt.plot(np.sqrt(train_errors), \"r-+\", linewidth=2, label=\"train\")\n",
"    plt.plot(np.sqrt(val_errors), \"b-\", linewidth=3, label=\"val\")"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"y_result = df.groupby([\"date\"])[\"new_deaths\"].sum().values\n",
"n_samples = len(y_result)\n",
"X = np.arange(1,n_samples+1, 1).reshape(-1, 1)\n",
"lin_reg = LinearRegression()\n",
"plot_learning_curves(lin_reg, X, y_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When there are just one or two instances in the training set, the model can fit them perfectly, which is why the curve starts at zero. But as new instances are added to the training set, it becomes impossible for the model to fit the training data perfectly, both because the data is noisy and because it is not linear at all. So the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set doesn’t make the average error much better or worse. Now let’s look at the performance of the model on the validation data. When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error is initially quite big. Then as the model is shown more training examples, it learns and thus the validation error slowly goes down. However, once again a straight line cannot do a good job modeling the data, so the error ends up at a plateau, crossing the training curve at some points.\n",
"\n",
"These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close and fairly high."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let’s look at the learning curves of a 10th-degree polynomial model on the same data:"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.pipeline import Pipeline\n",
"\n",
"polynomial_regression = Pipeline([\n",
" (\"poly_features\", PolynomialFeatures(degree=10, include_bias=False)),\n",
" (\"lin_reg\", LinearRegression()),\n",
" ])\n",
"\n",
"plot_learning_curves(polynomial_regression, X, y_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These learning curves look a bit like the previous ones, but there are two very important differences:\n",
"\n",
"The error on the training data is much lower than with the Linear Regression model.\n",
"\n",
"There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model. However, if you used a much larger training set, the two curves would continue to get closer.\n",
"\n",
"One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.6.1. The Bias/Variance Tradeoff\n",
"\n",
"An important theoretical result of statistics and Machine Learning is the fact that a model’s generalization error can be expressed as the sum of three very different errors:\n",
"\n",
"**Bias**\n",
"This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.\n",
"\n",
"**Variance**\n",
"This part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.\n",
"\n",
"**Irreducible error**\n",
"This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).\n",
"\n",
"Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a tradeoff."
]
},
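{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tradeoff can be illustrated with a small sketch on synthetic data. The quadratic data set, the chosen degrees and all `_demo` names below are illustrative assumptions and are not part of the covid19 data. A degree-1 model should underfit (high bias), a degree-2 model matches the process that generated the data, and a degree-30 model is more prone to overfitting (high variance), which typically shows up as a low training error combined with a higher validation error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
"\n",
"# Synthetic quadratic data with Gaussian noise (illustrative only)\n",
"rng_demo = np.random.RandomState(42)\n",
"X_demo = 6 * rng_demo.rand(200, 1) - 3\n",
"y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + 2 + rng_demo.randn(200)\n",
"\n",
"X_demo_train, X_demo_val, y_demo_train, y_demo_val = train_test_split(\n",
"    X_demo, y_demo, test_size=0.2, random_state=42)\n",
"\n",
"for degree_demo in (1, 2, 30):\n",
"    # Polynomial features, scaled, followed by plain linear regression\n",
"    model_demo = Pipeline([\n",
"        ('poly_features', PolynomialFeatures(degree=degree_demo, include_bias=False)),\n",
"        ('scaler', StandardScaler()),\n",
"        ('lin_reg', LinearRegression()),\n",
"    ])\n",
"    model_demo.fit(X_demo_train, y_demo_train)\n",
"    train_rmse_demo = np.sqrt(mean_squared_error(y_demo_train, model_demo.predict(X_demo_train)))\n",
"    val_rmse_demo = np.sqrt(mean_squared_error(y_demo_val, model_demo.predict(X_demo_val)))\n",
"    print('degree=%d  train RMSE=%.3f  validation RMSE=%.3f'\n",
"          % (degree_demo, train_rmse_demo, val_rmse_demo))"
]
},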
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solution (a variant of the normal equation for Ridge Regression, solved with a matrix factorization technique by André-Louis Cholesky):\n"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
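"from sklearn.linear_model import Ridge, SGDRegressor\n",
"\n",
"# Fit Ridge Regression with the closed-form Cholesky solver\n",
"ridge_reg = Ridge(alpha=1, solver=\"cholesky\")\n",
"ridge_reg.fit(X, y_result)\n",
"y_ridge_pred = ridge_reg.predict(X)\n",
"\n",
"# Note: the polynomial pipeline is fitted to the Ridge predictions here, not to y_result\n",
"plot_learning_curves(polynomial_regression, X, y_ridge_pred)"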
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And using Stochastic Gradient Descent:"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
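"# Linear model trained with Stochastic Gradient Descent and an l2 (Ridge) penalty\n",
"sgd_reg = SGDRegressor(penalty=\"l2\")\n",
"sgd_reg.fit(X, y_result)\n",
"y_sgd_pred = sgd_reg.predict(X)\n",
"\n",
"# Note: the polynomial pipeline is fitted to the SGD predictions here, not to y_result\n",
"plot_learning_curves(polynomial_regression, X, y_sgd_pred)"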
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The penalty hyperparameter sets the type of regularization term to use. Specifying \"l2\" indicates that you want SGD to add a regularization term to the cost function equal to half the square of the ℓ2 norm of the weight vector: this is simply Ridge Regression."
]
},
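{
"cell_type": "markdown",
"metadata": {},
"source": [
"In equation form, with $\\alpha$ controlling the regularization strength and $\\theta_1, \\dots, \\theta_n$ denoting the feature weights (the bias term $\\theta_0$ is not regularized), the Ridge cost function can be written as follows. The notation is assumed from the *Hands-On Machine Learning* chapter listed in the references:\n",
"\n",
"$$J(\\theta) = \\mathrm{MSE}(\\theta) + \\alpha \\frac{1}{2} \\sum_{i=1}^{n} \\theta_i^2$$"
]
},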
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell fits a Lasso regression model:"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
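"from sklearn.linear_model import Lasso\n",
"\n",
"# Fit Lasso Regression (l1 penalty)\n",
"lasso_reg = Lasso(alpha=0.1)\n",
"lasso_reg.fit(X, y_result)\n",
"y_lasso_pred = lasso_reg.predict(X)\n",
"\n",
"# Note: the polynomial pipeline is fitted to the Lasso predictions here, not to y_result\n",
"plot_learning_curves(polynomial_regression, X, y_lasso_pred)"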
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell fits an ElasticNet estimator:"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
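"from sklearn.linear_model import ElasticNet\n",
"\n",
"# Fit Elastic Net (mix of l1 and l2 penalties, weighted by l1_ratio)\n",
"elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)\n",
"elastic_net.fit(X, y_result)\n",
"y_elastNet_pred = elastic_net.predict(X)\n",
"\n",
"# Note: the polynomial pipeline is fitted to the Elastic Net predictions here, not to y_result\n",
"plot_learning_curves(polynomial_regression, X, y_elastNet_pred)"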
]
},
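{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four cells above fit `polynomial_regression` to the predictions of each regularized model. A more direct way to compare the regularized estimators is to score each of them against held-out targets. The following is a minimal sketch on synthetic data; the data set, the `alpha` values, the polynomial degree and all `_demo` names are illustrative assumptions rather than settings tuned for the covid19 data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import ElasticNet, Lasso, Ridge\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
"\n",
"# Synthetic quadratic data with Gaussian noise (illustrative only)\n",
"rng_reg_demo = np.random.RandomState(0)\n",
"X_reg_demo = 6 * rng_reg_demo.rand(200, 1) - 3\n",
"y_reg_demo = 0.5 * X_reg_demo[:, 0] ** 2 + X_reg_demo[:, 0] + 2 + rng_reg_demo.randn(200)\n",
"\n",
"X_reg_train, X_reg_val, y_reg_train, y_reg_val = train_test_split(\n",
"    X_reg_demo, y_reg_demo, test_size=0.2, random_state=0)\n",
"\n",
"estimators_demo = {\n",
"    'Ridge': Ridge(alpha=1, solver='cholesky'),\n",
"    'Lasso': Lasso(alpha=0.1),\n",
"    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5),\n",
"}\n",
"\n",
"for name_demo, estimator_demo in estimators_demo.items():\n",
"    # Scaling the polynomial features keeps the penalty comparable across terms\n",
"    pipeline_demo = Pipeline([\n",
"        ('poly_features', PolynomialFeatures(degree=10, include_bias=False)),\n",
"        ('scaler', StandardScaler()),\n",
"        ('regressor', estimator_demo),\n",
"    ])\n",
"    # Fit on the training targets and score on the held-out targets\n",
"    pipeline_demo.fit(X_reg_train, y_reg_train)\n",
"    val_rmse_reg_demo = np.sqrt(\n",
"        mean_squared_error(y_reg_val, pipeline_demo.predict(X_reg_val)))\n",
"    print('%s validation RMSE: %.3f' % (name_demo, val_rmse_reg_demo))"
]
},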
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.7. Conclusions <a id='part3_7'></a>\n",
"\n",
"As we can see, the models improved when moving from plain `LinearRegression()` (where the training and validation errors converged at around 3000) to `PolynomialFeatures` (where they converged at around 500), and the curves for the regularization methods combined with `PolynomialFeatures` (Ridge, Lasso, Stochastic Gradient Descent and Elastic Net) converged at errors close to 100. Note, however, that those last curves were obtained by fitting the polynomial pipeline to each regularized model's predictions rather than to `y_result`, so their error values are not directly comparable with the first two."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Predictive Modelling (TensorFlow)<a id='part4'></a>\n",
"\n",
"In this section we will use the TensorFlow library to build predictive models for the new_deaths variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1. TensorFlow for Polynomial Linear Regressions to new_deaths <a id='part4_1'></a>"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [],
"source": [
"from __future__ import absolute_import, division, print_function"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WARNING:tensorflow:From C:\\Users\\oscar\\anaconda3\\lib\\site-packages\\tensorflow_core\\python\\compat\\v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.\n",
"Instructions for updating:\n",
"non-resource variables are not supported in the long term\n"
]
}
],
"source": [
"# importing TensorFlow Libraries\n",
"# Here we will use TensorFlow v1\n",
"import tensorflow.compat.v1 as tf\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"from sklearn.metrics import r2_score as r2\n",
"# Prevent TensorFlow from using v2\n",
"tf.disable_v2_behavior() \n",
"\n",
"rng = np.random"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [],
"source": [
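"# Setting up the tuning parameters\n",
"alpha = 0.01   # learning rate\n",
"epochs = 200   # number of gradient descent iterations\n",
"m = len(x)     # sample count used to normalize the cost\n",
"errors = []    # cost recorded after each epoch"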
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [],
"source": [
"# Generating the data series for new_deaths\n",
"Y = df.groupby([\"date\"])[\"new_deaths\"].sum().values\n",
"#Y = Y[np.where(Y>50)]\n",
"n_samples = len(Y)\n",
"X = np.arange(1,n_samples+1, 1)"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [],
"source": [
"predDegree = 5\n",
"\n",
"# Reshape the data series for both X (array from range(1, len(y))) and Y data\n",
"x_reshaped = X.reshape((len(X), 1))\n",
"y_reshaped = Y.reshape((len(Y), 1))\n",
"\n",
"# Splitting the data into training and test size chunks\n",
"x_train, x_test, y_train, y_test = train_test_split(x_reshaped,y_reshaped, test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [],
"source": [
"# First, scale the X data\n",
"scaler = StandardScaler().fit(x_train)\n",
"x_train = scaler.transform(x_train)\n",
"x_test = scaler.transform(x_test)"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [],
"source": [
"# Create TensorFlow placeholders for the inputs and targets\n",
"# (note that this rebinds X and Y, which previously held the NumPy arrays)\n",
"X = tf.placeholder(tf.float32, shape=[None, 1], name='x-input')\n",
"Y = tf.placeholder(tf.float32, shape=[None, 1], name='y-input')\n",
"\n",
"# One weight per polynomial term (powers 5 down to 1) plus a bias term\n",
"theta_1 = tf.Variable(tf.zeros([1, 1]))\n",
"theta_2 = tf.Variable(tf.zeros([1, 1]))\n",
"theta_3 = tf.Variable(tf.zeros([1, 1]))\n",
"theta_4 = tf.Variable(tf.zeros([1, 1]))\n",
"theta_5 = tf.Variable(tf.zeros([1, 1]))\n",
"theta_6 = tf.Variable(tf.zeros([1, 1]))"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [],
"source": [
"# Degree-5 polynomial model: theta_1*X^5 + theta_2*X^4 + theta_3*X^3 + theta_4*X^2 + theta_5*X + theta_6\n",
"model = tf.matmul(tf.pow(X, 5), theta_1) + tf.matmul(tf.pow(X, 4), theta_2) + tf.matmul(tf.pow(X, 3), theta_3) + tf.matmul(tf.pow(X, 2), theta_4) + tf.matmul(X, theta_5) + theta_6\n",
"\n",
"# Cost function: sum of squared errors divided by 2*m\n",
"cost = tf.reduce_sum(tf.square(Y-model))/(2*m)\n",
"\n",
"# Gradient descent optimizer that updates the thetas to minimize the cost\n",
"optimizer = tf.train.GradientDescentOptimizer(alpha).minimize(cost)\n",
"\n",
"# Operation that initializes the TensorFlow variables defined above\n",
"init = tf.global_variables_initializer()"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2 Correlation: -0.6940568618959166\n"
]
}
],
"source": [
"with tf.Session() as sess:\n",
"    sess.run(init)\n",
"    for i in range(epochs):\n",
"        sess.run(optimizer, feed_dict={X:x_train, Y:y_train})\n",
"        loss = sess.run(cost, feed_dict={X:x_train, Y:y_train})\n",
"        errors.append(loss)\n",
"\n",
"    theta1, theta2, theta3, theta4, theta5, theta6 = sess.run([theta_1, theta_2, theta_3, theta_4, theta_5, theta_6])\n",
"\n",
"plt.plot(list(range(epochs)), errors)\n",
"plt.title(\"Cost vs Iteration\")\n",
"plt.show()\n",
"\n",
"x = scaler.transform(x_reshaped)\n",
"pred = theta1 * x**5 + theta2 * x**4 + theta3 * x**3 + theta4 * x**2 + theta5 * x + theta6\n",
"\n",
"plt.plot(x, pred, 'red', label=\"Prediction\")\n",
"plt.plot(x, y_reshaped, 'blue', label=\"True Values\")\n",
"plt.legend()\n",
"plt.title(\"new_deaths: Prediction vs True Values\")\n",
"plt.show()\n",
"\n",
"print(\"R2 Correlation: \", r2(y_reshaped, pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model does not fit the data well: the R2 score is negative, meaning it performs worse than always predicting the mean. Tuning the hyperparameters and the terms of the model would be the next step to improve its fit and predictions."
]
},
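{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check how much of the poor fit comes from the optimization rather than from the model family is to fit the same degree-5 polynomial in closed form with `np.polyfit`, which returns the least-squares coefficients directly. The sketch below uses a synthetic bell-shaped series as a hypothetical stand-in for the daily `new_deaths` totals; all `_demo` names and the shape of the series are assumptions for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import r2_score\n",
"\n",
"# Hypothetical stand-in for the daily new_deaths series (bell-shaped curve plus noise)\n",
"rng_fit_demo = np.random.RandomState(1)\n",
"days_demo = np.arange(1, 151, dtype=float)\n",
"deaths_demo = 5000 * np.exp(-((days_demo - 90) / 25) ** 2) + 100 * rng_fit_demo.randn(150)\n",
"\n",
"# Standardize the day index, mirroring the StandardScaler step used above\n",
"days_scaled_demo = (days_demo - days_demo.mean()) / days_demo.std()\n",
"\n",
"# Least-squares coefficients of the degree-5 polynomial, highest power first\n",
"coeffs_demo = np.polyfit(days_scaled_demo, deaths_demo, deg=5)\n",
"pred_demo = np.polyval(coeffs_demo, days_scaled_demo)\n",
"\n",
"print('Closed-form R2 score:', r2_score(deaths_demo, pred_demo))"
]
},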
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conclusions<a id='part5'></a>\n",
"\n",
"At this point, we can conclude that diving into machine learning algorithms is not always comfortable, but it is interesting to see how we can estimate the trend of a specific variable and try to understand its behaviour and possible future values. Here we used scikit-learn and TensorFlow to build models that fit the data and make predictions, although the analysis still contains mistakes and the input data is of limited quality. Still, I hope this serves as an initial guide to the fantastic and still largely unexplored field of machine learning and deep learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. References<a id='part6'></a>\n",
"\n",
" - https://www.kaggle.com/ash316/eda-to-prediction-dietanic\n",
" - https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html\n",
" - https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch04.html\n",
" - https://github.com/aymericdamien/TensorFlow-Examples/\n",
" - https://towardsdatascience.com/linear-regression-from-scratch-with-tensorflow-2-part-1-3e2443804df0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}