Created
October 13, 2018 21:17
Save ClebsonDantasUchoa/7b7c30ed55e24e13c724b006e5949764 to your computer and use it in GitHub Desktop.
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Solution to the challenge: https://www.codenation.com.br/journey/data-science/challenge/enem-4.html\n", | |
| "\n", | |
| "## Definition: \n", | |
| "In this challenge you must find out which students are taking the exam only as practice.\n", | |
| "\n", | |
| "Some students decide to take the ENEM exam early, as a trial run (column IN_TREINEIRO). In this challenge, you must build a binary classification model to infer that column. The possible values of your answer are “0” or “1”.\n", | |
| "\n", | |
| "Save your answer in a file named answer.csv with two columns: NU_INSCRICAO and IN_TREINEIRO.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Importing the libraries" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%matplotlib inline\n", | |
| "import pandas as pd\n", | |
| "import numpy as np\n", | |
| "from sklearn import linear_model\n", | |
| "from sklearn import metrics\n", | |
| "import matplotlib.pyplot as plt\n", | |
| "from sklearn import tree\n", | |
| "from sklearn import svm\n", | |
| "from sklearn import neighbors\n", | |
| "from sklearn.ensemble import GradientBoostingRegressor\n", | |
| "from sklearn import model_selection" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Reading the training file" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "train = pd.read_csv('train.csv')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "columns=[\n", | |
| " 'NU_NOTA_CN','NU_NOTA_CH','NU_NOTA_LC','NU_NOTA_REDACAO','TP_ST_CONCLUSAO','IN_TREINEIRO'\n", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "train = train[columns]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Inspecting the data after filtering the columns" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>NU_NOTA_CN</th>\n", | |
| " <th>NU_NOTA_CH</th>\n", | |
| " <th>NU_NOTA_LC</th>\n", | |
| " <th>NU_NOTA_REDACAO</th>\n", | |
| " <th>TP_ST_CONCLUSAO</th>\n", | |
| " <th>IN_TREINEIRO</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>0</th>\n", | |
| " <td>436.3</td>\n", | |
| " <td>495.4</td>\n", | |
| " <td>581.2</td>\n", | |
| " <td>520.0</td>\n", | |
| " <td>1</td>\n", | |
| " <td>0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1</th>\n", | |
| " <td>474.5</td>\n", | |
| " <td>544.1</td>\n", | |
| " <td>599.0</td>\n", | |
| " <td>580.0</td>\n", | |
| " <td>2</td>\n", | |
| " <td>0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>2</th>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>3</td>\n", | |
| " <td>0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>3</th>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>1</td>\n", | |
| " <td>0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>4</th>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>1</td>\n", | |
| " <td>0</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " NU_NOTA_CN NU_NOTA_CH NU_NOTA_LC NU_NOTA_REDACAO TP_ST_CONCLUSAO \\\n", | |
| "0 436.3 495.4 581.2 520.0 1 \n", | |
| "1 474.5 544.1 599.0 580.0 2 \n", | |
| "2 NaN NaN NaN NaN 3 \n", | |
| "3 NaN NaN NaN NaN 1 \n", | |
| "4 NaN NaN NaN NaN 1 \n", | |
| "\n", | |
| " IN_TREINEIRO \n", | |
| "0 0 \n", | |
| "1 0 \n", | |
| "2 0 \n", | |
| "3 0 \n", | |
| "4 0 " | |
| ] | |
| }, | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "train.head()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Handling missing data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "NU_NOTA_CN 3389\n", | |
| "NU_NOTA_CH 3389\n", | |
| "NU_NOTA_LC 3597\n", | |
| "NU_NOTA_REDACAO 3597\n", | |
| "TP_ST_CONCLUSAO 0\n", | |
| "IN_TREINEIRO 0\n", | |
| "dtype: int64\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(train.isnull().sum())" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "train.fillna(0, inplace=True)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "NU_NOTA_CN 0\n", | |
| "NU_NOTA_CH 0\n", | |
| "NU_NOTA_LC 0\n", | |
| "NU_NOTA_REDACAO 0\n", | |
| "TP_ST_CONCLUSAO 0\n", | |
| "IN_TREINEIRO 0\n", | |
| "dtype: int64\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(train.isnull().sum())" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Splitting the training data into training and test sets" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "X = train.values[:, :-1]\n", | |
| "y = train.values[:, -1]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, \n", | |
| " test_size=0.3, random_state=1)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Training and evaluating the models" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## DecisionTree" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " precision recall f1-score support\n", | |
| "\n", | |
| " 0.0 0.98 0.96 0.97 3588\n", | |
| " 1.0 0.78 0.86 0.82 531\n", | |
| "\n", | |
| "avg / total 0.95 0.95 0.95 4119\n", | |
| "\n", | |
| "accuracy: \n", | |
| "0.9502306385044914\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "dt = tree.DecisionTreeClassifier()\n", | |
| "dt.fit(X_train, y_train)\n", | |
| "resposta_dt = dt.predict(X_test)\n", | |
| "print(metrics.classification_report(y_test, resposta_dt))\n", | |
| "accuracy_dt = metrics.accuracy_score(y_test, resposta_dt)\n", | |
| "print('accuracy: ')\n", | |
| "print(accuracy_dt)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## LogisticRegression" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " precision recall f1-score support\n", | |
| "\n", | |
| " 0.0 0.89 0.95 0.91 3588\n", | |
| " 1.0 0.33 0.18 0.23 531\n", | |
| "\n", | |
| "avg / total 0.81 0.85 0.83 4119\n", | |
| "\n", | |
| "accuracy: \n", | |
| "0.8468074775430929\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "lr = linear_model.LogisticRegression()\n", | |
| "lr.fit(X_train, y_train)\n", | |
| "resposta_lr = lr.predict(X_test)\n", | |
| "print(metrics.classification_report(y_test, resposta_lr))\n", | |
| "accuracy_lr = metrics.accuracy_score(y_test, resposta_lr)\n", | |
| "print('accuracy: ')\n", | |
| "print(accuracy_lr)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## SVC" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " precision recall f1-score support\n", | |
| "\n", | |
| " 0.0 0.89 0.99 0.93 3588\n", | |
| " 1.0 0.61 0.14 0.23 531\n", | |
| "\n", | |
| "avg / total 0.85 0.88 0.84 4119\n", | |
| "\n", | |
| "accuracy: \n", | |
| "0.8778829813061423\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "svc = svm.SVC()\n", | |
| "svc.fit(X_train, y_train)\n", | |
| "resposta_svc = svc.predict(X_test)\n", | |
| "print(metrics.classification_report(y_test, resposta_svc))\n", | |
| "accuracy_svc = metrics.accuracy_score(y_test, resposta_svc)\n", | |
| "print('accuracy: ')\n", | |
| "print(accuracy_svc)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## KNN" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 14, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " precision recall f1-score support\n", | |
| "\n", | |
| " 0.0 0.89 0.97 0.93 3588\n", | |
| " 1.0 0.46 0.17 0.25 531\n", | |
| "\n", | |
| "avg / total 0.83 0.87 0.84 4119\n", | |
| "\n", | |
| "accuracy: \n", | |
| "0.8676863316338917\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "knn = neighbors.KNeighborsClassifier()\n", | |
| "knn.fit(X_train, y_train)\n", | |
| "resposta_knn = knn.predict(X_test)\n", | |
| "print(metrics.classification_report(y_test, resposta_knn))\n", | |
| "accuracy_knn = metrics.accuracy_score(y_test, resposta_knn)\n", | |
| "print('accuracy: ')\n", | |
| "print(accuracy_knn)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## GradientBoostingRegressor" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 15, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " precision recall f1-score support\n", | |
| "\n", | |
| " 0.0 0.89 0.97 0.93 3588\n", | |
| " 1.0 0.46 0.17 0.25 531\n", | |
| "\n", | |
| "avg / total 0.83 0.87 0.84 4119\n", | |
| "\n", | |
| "accuracy: \n", | |
| "0.8676863316338917\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "modeloGBR = GradientBoostingRegressor()\n", | |
| "modeloGBR.fit(X_train, y_train)\n", | |
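| "# Predict with the gradient-boosting model itself (knn.predict was called here, so the saved output repeats the KNN scores);\n", | |
| "# the regressor returns continuous scores, so threshold them at 0.5 to get 0/1 labels\n", | |
| "resposta_gbr = (modeloGBR.predict(X_test) >= 0.5).astype(float)\n", | |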
| "print(metrics.classification_report(y_test, resposta_gbr))\n", | |
| "accuracy_gbr = metrics.accuracy_score(y_test, resposta_gbr)\n", | |
| "print('accuracy: ')\n", | |
| "print(accuracy_gbr)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Model comparison" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 16, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "comparacao = pd.DataFrame(data=[[\n", | |
| " accuracy_dt, accuracy_lr, accuracy_svc, accuracy_knn, accuracy_gbr]],\n", | |
| " columns=['DT', 'LR', 'SVC', 'KNN', 'GBR'])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>DT</th>\n", | |
| " <th>LR</th>\n", | |
| " <th>SVC</th>\n", | |
| " <th>KNN</th>\n", | |
| " <th>GBR</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>0</th>\n", | |
| " <td>0.950231</td>\n", | |
| " <td>0.846807</td>\n", | |
| " <td>0.877883</td>\n", | |
| " <td>0.867686</td>\n", | |
| " <td>0.867686</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " DT LR SVC KNN GBR\n", | |
| "0 0.950231 0.846807 0.877883 0.867686 0.867686" | |
| ] | |
| }, | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "comparacao.head()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 18, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "comparacao = comparacao.transpose()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 19, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "<Figure size 432x288 with 1 Axes>" | |
| ] | |
| }, | |
| "metadata": {}, | |
| "output_type": "display_data" | |
| } | |
| ], | |
| "source": [ | |
| "comparacao.plot(kind='bar', grid=True);" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Building the chosen model" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 20, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "test = pd.read_csv('test.csv')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 21, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "columns=['NU_NOTA_CN','NU_NOTA_CH', 'NU_NOTA_LC', 'NU_NOTA_REDACAO', 'TP_ST_CONCLUSAO']" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 22, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "test = test[columns]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Handling missing data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 23, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "NU_NOTA_CN 1112\n", | |
| "NU_NOTA_CH 1112\n", | |
| "NU_NOTA_LC 1170\n", | |
| "NU_NOTA_REDACAO 1170\n", | |
| "TP_ST_CONCLUSAO 0\n", | |
| "dtype: int64\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(test.isnull().sum())" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 24, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "test = test.fillna(0)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 25, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "NU_NOTA_CN 0\n", | |
| "NU_NOTA_CH 0\n", | |
| "NU_NOTA_LC 0\n", | |
| "NU_NOTA_REDACAO 0\n", | |
| "TP_ST_CONCLUSAO 0\n", | |
| "dtype: int64\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(test.isnull().sum())" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Making predictions with the model" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 26, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "definitivo = tree.DecisionTreeClassifier()\n", | |
| "definitivo.fit(X, y)\n", | |
| "resposta_definitivo = definitivo.predict(test.values)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Creating the CSV file" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 27, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "answer = pd.DataFrame()\n", | |
| "answer['NU_INSCRICAO'] = pd.read_csv('test.csv')['NU_INSCRICAO']\n", | |
| "answer['IN_TREINEIRO'] = resposta_definitivo" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 28, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "(4570, 2)" | |
| ] | |
| }, | |
| "execution_count": 28, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "answer.shape" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 29, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>NU_INSCRICAO</th>\n", | |
| " <th>IN_TREINEIRO</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>0</th>\n", | |
| " <td>ba0cc30ba34e7a46764c09dfc38ed83d15828897</td>\n", | |
| " <td>0.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1</th>\n", | |
| " <td>177f281c68fa032aedbd842a745da68490926cd2</td>\n", | |
| " <td>0.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>2</th>\n", | |
| " <td>6cf0d8b97597d7625cdedc7bdb6c0f052286c334</td>\n", | |
| " <td>0.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>3</th>\n", | |
| " <td>5c356d810fa57671402502cd0933e5601a2ebf1e</td>\n", | |
| " <td>0.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>4</th>\n", | |
| " <td>df47c07bd881c2db3f38c6048bf77c132ad0ceb3</td>\n", | |
| " <td>0.0</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " NU_INSCRICAO IN_TREINEIRO\n", | |
| "0 ba0cc30ba34e7a46764c09dfc38ed83d15828897 0.0\n", | |
| "1 177f281c68fa032aedbd842a745da68490926cd2 0.0\n", | |
| "2 6cf0d8b97597d7625cdedc7bdb6c0f052286c334 0.0\n", | |
| "3 5c356d810fa57671402502cd0933e5601a2ebf1e 0.0\n", | |
| "4 df47c07bd881c2db3f38c6048bf77c132ad0ceb3 0.0" | |
| ] | |
| }, | |
| "execution_count": 29, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "answer.head()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 30, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "answer.to_csv('answer.csv', index=False)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## After submitting the 'answer.csv' file to Codenation, the score obtained was 95%" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.5" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 2 | |
| } |
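The classification reports in the notebook show that the held-out split is imbalanced (3588 rows with IN_TREINEIRO = 0 against 531 with IN_TREINEIRO = 1), so accuracy alone can flatter a model. As a rough sketch, the snippet below compares the reported accuracies with the accuracy of always predicting the majority class, and recomputes the F1 score of the positive class from the rounded precision and recall printed in the reports. The helper names `majority_baseline_accuracy` and `f1_from` are illustrative assumptions and are not part of the notebook. The GBR column is left out because that cell's recorded output came from `knn.predict`.

```python
# Illustrative helpers (not part of the notebook); all numbers are copied
# from the classification reports and the comparison table.


def majority_baseline_accuracy(support_by_class):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(support_by_class.values())
    return max(support_by_class.values()) / total


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class supports of the held-out split, as printed in every report.
support = {0: 3588, 1: 531}
baseline = majority_baseline_accuracy(support)

# Accuracies reported by the notebook, rounded to six decimals.
reported_accuracy = {
    "DT": 0.950231,
    "LR": 0.846807,
    "SVC": 0.877883,
    "KNN": 0.867686,
}

print(f"majority-class baseline: {baseline:.4f}")
for name, acc in reported_accuracy.items():
    print(f"{name}: accuracy={acc:.4f}, gain over baseline={acc - baseline:+.4f}")

# F1 of the IN_TREINEIRO = 1 class, recomputed from rounded precision/recall.
dt_f1 = f1_from(0.78, 0.86)  # decision tree
lr_f1 = f1_from(0.33, 0.18)  # logistic regression
print(f"DT  F1 (class 1): {dt_f1:.2f}")
print(f"LR  F1 (class 1): {lr_f1:.2f}")
```

Under these numbers, LR and KNN fall slightly below the majority-class baseline and SVC sits just above it, which is consistent with their low recall on class 1. Only the decision tree clearly improves on always predicting 0.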