@memonkey01
Created April 6, 2019 02:33
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PCA - Replicating the Dow Jones\n",
"## Principal component analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Author: Guillermo Izquierdo  \n",
"This code is for educational purposes only  "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In statistics, principal component analysis (PCA; ACP in Spanish) is a technique for describing a data set in terms of new, uncorrelated variables (\"components\"). The components are ordered by the amount of the original variance they explain, which makes the technique useful for reducing the dimensionality of a data set.\n",
"\n",
"Technically, PCA seeks the projection under which the data are best represented in the least-squares sense. It converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.\n",
"\n",
"PCA is used mainly in exploratory data analysis and for building predictive models. It involves computing the eigenvalue decomposition of the covariance matrix, usually after centering the data on the mean of each attribute.\n",
"\n",
"It should be distinguished from factor analysis, with which it shares formal similarities and in which it can be used as an approximation method for factor extraction."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandas_datareader import data as pdr\n",
"import numpy as np\n",
"import datetime as date"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# '^DJI' is the index itself, followed by 30 component stocks\n",
"tickers = ['^DJI','JNJ','WMT','KO','DIS','PG','PFE','HD','UTX','VZ','MRK','MCD','WBA','CVX','BA','AAPL','MMM','NKE',\n",
"           'UNH','INTC','XOM','IBM','CAT','DWDP','V','CSCO','JPM','GS','TRV','MSFT','AXP']\n",
"\n",
"enddate = date.datetime(2019,11,1)\n",
"startdate = date.datetime(2016,1,1)\n",
"\n",
"# Download the daily closing prices, one column per ticker\n",
"data = pd.DataFrame()\n",
"for tick in tickers:\n",
"    data[tick] = pdr.get_data_yahoo(tick, start=startdate, end=enddate)['Close']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ^DJI JNJ WMT KO DIS \\\n",
"Date \n",
"2016-01-04 17148.939453 100.480003 61.459999 42.400002 102.980003 \n",
"2016-01-05 17158.660156 100.900002 62.919998 42.549999 100.900002 \n",
"2016-01-06 16906.509766 100.389999 63.549999 42.320000 100.360001 \n",
"2016-01-07 16514.099609 99.220001 65.029999 41.619999 99.500000 \n",
"2016-01-08 16346.450195 98.160004 63.540001 41.509998 99.250000 \n",
"\n",
" PG PFE HD UTX VZ ... \\\n",
"Date ... \n",
"2016-01-04 78.370003 31.950001 131.070007 95.570000 45.869999 ... \n",
"2016-01-05 78.620003 32.180000 130.429993 95.720001 46.500000 ... \n",
"2016-01-06 77.860001 31.610001 129.080002 93.120003 45.520000 ... \n",
"2016-01-07 77.180000 31.400000 125.400002 91.900002 45.270000 ... \n",
"2016-01-08 75.970001 31.000000 123.900002 90.400002 44.830002 ... \n",
"\n",
" IBM CAT DWDP V CSCO JPM \\\n",
"Date \n",
"2016-01-04 135.949997 67.989998 49.930000 75.699997 26.410000 63.619999 \n",
"2016-01-05 135.850006 67.279999 49.549999 76.269997 26.290001 63.730000 \n",
"2016-01-06 135.169998 66.220001 48.459999 75.269997 26.010000 62.810001 \n",
"2016-01-07 132.860001 63.939999 46.639999 73.790001 25.410000 60.270000 \n",
"2016-01-08 131.630005 63.290001 46.279999 72.879997 24.780001 58.919998 \n",
"\n",
" GS TRV MSFT AXP \n",
"Date \n",
"2016-01-04 177.139999 109.970001 54.799999 67.589996 \n",
"2016-01-05 174.089996 110.470001 55.049999 66.550003 \n",
"2016-01-06 169.839996 109.040001 54.049999 64.419998 \n",
"2016-01-07 164.619995 106.440002 52.169998 63.840000 \n",
"2016-01-08 163.940002 105.989998 52.330002 63.630001 \n",
"\n",
"[5 rows x 31 columns]\n"
]
}
],
"source": [
"print(data.head())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" JNJ WMT KO DIS PG \\\n",
"Date \n",
"2016-01-04 100.480003 61.459999 42.400002 102.980003 78.370003 \n",
"2016-01-05 100.900002 62.919998 42.549999 100.900002 78.620003 \n",
"2016-01-06 100.389999 63.549999 42.320000 100.360001 77.860001 \n",
"2016-01-07 99.220001 65.029999 41.619999 99.500000 77.180000 \n",
"2016-01-08 98.160004 63.540001 41.509998 99.250000 75.970001 \n",
"\n",
" PFE HD UTX VZ MRK ... \\\n",
"Date ... \n",
"2016-01-04 31.950001 131.070007 95.570000 45.869999 52.480000 ... \n",
"2016-01-05 32.180000 130.429993 95.720001 46.500000 53.150002 ... \n",
"2016-01-06 31.610001 129.080002 93.120003 45.520000 52.419998 ... \n",
"2016-01-07 31.400000 125.400002 91.900002 45.270000 51.959999 ... \n",
"2016-01-08 31.000000 123.900002 90.400002 44.830002 51.080002 ... \n",
"\n",
" IBM CAT DWDP V CSCO JPM \\\n",
"Date \n",
"2016-01-04 135.949997 67.989998 49.930000 75.699997 26.410000 63.619999 \n",
"2016-01-05 135.850006 67.279999 49.549999 76.269997 26.290001 63.730000 \n",
"2016-01-06 135.169998 66.220001 48.459999 75.269997 26.010000 62.810001 \n",
"2016-01-07 132.860001 63.939999 46.639999 73.790001 25.410000 60.270000 \n",
"2016-01-08 131.630005 63.290001 46.279999 72.879997 24.780001 58.919998 \n",
"\n",
" GS TRV MSFT AXP \n",
"Date \n",
"2016-01-04 177.139999 109.970001 54.799999 67.589996 \n",
"2016-01-05 174.089996 110.470001 55.049999 66.550003 \n",
"2016-01-06 169.839996 109.040001 54.049999 64.419998 \n",
"2016-01-07 164.619995 106.440002 52.169998 63.840000 \n",
"2016-01-08 163.940002 105.989998 52.330002 63.630001 \n",
"\n",
"[5 rows x 30 columns]\n"
]
}
],
"source": [
"# Split the index off from the component prices; `data` keeps the 30 stocks\n",
"dji = pd.DataFrame(data.pop('^DJI'))\n",
"print(data.head())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" JNJ WMT KO DIS PG PFE \\\n",
"Date \n",
"2016-01-04 -2.207090 -1.651418 -1.024570 -0.177897 -1.314746 -0.928036 \n",
"2016-01-05 -2.170048 -1.525128 -0.950779 -0.486168 -1.267351 -0.867616 \n",
"2016-01-06 -2.215027 -1.470633 -1.063926 -0.566201 -1.411434 -1.017352 \n",
"2016-01-07 -2.318214 -1.342614 -1.408289 -0.693659 -1.540350 -1.072519 \n",
"2016-01-08 -2.411699 -1.471498 -1.462403 -0.730711 -1.769746 -1.177597 \n",
"\n",
" HD UTX VZ MRK ... IBM \\\n",
"Date ... \n",
"2016-01-04 -1.037887 -1.576620 -1.387128 -1.456222 ... -0.995113 \n",
"2016-01-05 -1.062302 -1.564745 -1.206041 -1.348584 ... -1.002137 \n",
"2016-01-06 -1.113801 -1.770590 -1.487732 -1.465862 ... -1.049908 \n",
"2016-01-07 -1.254183 -1.867180 -1.559592 -1.539763 ... -1.212188 \n",
"2016-01-08 -1.311404 -1.985937 -1.686065 -1.681138 ... -1.298596 \n",
"\n",
" CAT DWDP V CSCO JPM GS \\\n",
"Date \n",
"2016-01-04 -1.470666 -1.359375 -1.140246 -1.320966 -1.334517 -0.923029 \n",
"2016-01-05 -1.494247 -1.407173 -1.116722 -1.338359 -1.328873 -1.005359 \n",
"2016-01-06 -1.529452 -1.544277 -1.157992 -1.378941 -1.376073 -1.120083 \n",
"2016-01-07 -1.605178 -1.773203 -1.219072 -1.465903 -1.506388 -1.260990 \n",
"2016-01-08 -1.626766 -1.818485 -1.256627 -1.557213 -1.575649 -1.279345 \n",
"\n",
" TRV MSFT AXP \n",
"Date \n",
"2016-01-04 -1.432633 -1.079536 -0.968762 \n",
"2016-01-05 -1.378234 -1.067135 -1.032557 \n",
"2016-01-06 -1.533815 -1.116740 -1.163216 \n",
"2016-01-07 -1.816688 -1.209997 -1.198794 \n",
"2016-01-08 -1.865647 -1.202060 -1.211676 \n",
"\n",
"[5 rows x 30 columns]\n"
]
}
],
"source": [
"# Z-score each column: subtract its mean and divide by its standard deviation\n",
"normalize = lambda x: (x - x.mean()) / x.std()\n",
"a = data.apply(normalize)\n",
"print(a.head())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.decomposition import KernelPCA\n",
"# KernelPCA defaults to a linear kernel, which corresponds to standard PCA\n",
"pca = KernelPCA().fit(a)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"422\n"
]
}
],
"source": [
"# Number of eigenvalues kept (scikit-learn 1.0 renamed `lambdas_` to `eigenvalues_`)\n",
"print(len(pca.lambdas_))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[14795.849 2613.479 1955.464 1108.871 857.244]\n"
]
}
],
"source": [
"print(pca.lambdas_[:5].round(3))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.64301823 0.11358013 0.08498322 0.04819082 0.03725528]\n"
]
}
],
"source": [
"# Express each eigenvalue as a share of the total, i.e. the fraction of variance explained\n",
"recompose = lambda x: x / x.sum()\n",
"print(recompose(pca.lambdas_)[:5])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
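"# Refit on the normalized prices, keeping only the first component\n",
"pca = KernelPCA(n_components = 1).fit(data.apply(normalize))\n",
"# The sign of a principal component is arbitrary; negating the input flips it.\n",
"# Note that the raw prices, not the normalized ones, are projected here.\n",
"dji['PCA_1'] = pca.transform(-data)"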
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"# Normalize both series so the index and the first component share one scale\n",
"dji.apply(normalize).plot()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ^DJI PCA_1\n",
"Date \n",
"2016-01-04 17148.939453 390.770622\n",
"2016-01-05 17158.660156 390.832219\n",
"2016-01-06 16906.509766 384.531322\n",
"2016-01-07 16514.099609 374.238852\n",
"2016-01-08 16346.450195 370.496940\n"
]
}
],
"source": [
"print(dji.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
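
The notebook's introduction describes PCA as an eigenvalue decomposition of the covariance matrix after centering the data. As a rough illustration of that description, the sketch below does the same steps by hand with NumPy on synthetic prices. Nothing here comes from the notebook itself: the function name `pca_eigen`, the simulated `market` factor and all the numeric settings are assumptions made for the example. For a linear kernel, the variance shares computed this way should match the normalized kernel eigenvalues, although this sketch does not check that against `KernelPCA`.

```python
import numpy as np


def pca_eigen(X):
    """Linear PCA of the columns of X via the covariance eigendecomposition."""
    # Z-score every column, as the notebook's `normalize` lambda does
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance of z-scored data is the correlation matrix
    cov = np.cov(Z, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()   # share of variance per component
    scores = Z @ eigvecs               # data projected onto the components
    return eigvals, ratios, scores


# Synthetic stand-in for the price table: 30 series driven by one common factor
rng = np.random.default_rng(0)
n_days, n_stocks = 500, 30
market = np.cumsum(rng.normal(size=n_days))
noise = np.cumsum(rng.normal(scale=0.3, size=(n_days, n_stocks)), axis=0)
loadings = rng.uniform(0.5, 1.5, size=n_stocks)
prices = 100 + market[:, None] * loadings + noise

eigvals, ratios, scores = pca_eigen(prices)
pc1 = scores[:, 0]
# The sign of a component is arbitrary, so compare in absolute value
corr = abs(np.corrcoef(pc1, market)[0, 1])

print(ratios[:5].round(3))
print(round(corr, 3))
```

Because the simulated series share a single driver, the first component should capture most of the variance and track that driver closely, which is the same effect the notebook relies on when it compares `PCA_1` with the index.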