Skip to content

Instantly share code, notes, and snippets.

@mrocklin
Forked from TomAugspurger/dplyr_pandas.ipynb
Last active August 29, 2015 14:07
Show Gist options
  • Select an option

  • Save mrocklin/a191fc2d6dfff2a5a855 to your computer and use it in GitHub Desktop.

Select an option

Save mrocklin/a191fc2d6dfff2a5a855 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"kernelspec": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"display_name": "IPython (Python 3)",
"language": "python",
"name": "python3"
},
"name": "",
"signature": "sha256:4268d1a235f92d01f44af2ef9ef4690744226b9e7307b2f7756fc3c184d0454f"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook compares [pandas](http://pandas.pydata.org)\n",
"and [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html).\n",
"The comparison is just on syntax (verbage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.\n",
"\n",
"We'll work through the [introductory dplyr vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html) to analyze some flight data.\n",
"\n",
"I'm working on a better layout to show the two packages side by side.\n",
"But for now I'm just putting the ``dplyr`` code in a comment above each python call."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Some prep work to get the data from R and into pandas\n",
"%matplotlib inline\n",
"%load_ext rmagic\n",
"\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"pd.set_option(\"display.max_rows\", 5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"/Users/tom/Envs/py3/lib/python3.4/site-packages/IPython/extensions/rmagic.py:693: UserWarning: The rmagic extension in IPython is deprecated in favour of rpy2.ipython. If available, that will be loaded instead.\n",
"http://rpy.sourceforge.net/\n",
" warnings.warn(\"The rmagic extension in IPython is deprecated in favour of \"\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%R\n",
"library(\"nycflights13\")\n",
"write.csv(flights, \"flights.csv\")"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Data: nycflights13"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"flights = pd.read_csv(\"flights.csv\", index_col=0)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# dim(flights) <--- The R code\n",
"flights.shape # <--- The python code"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"(336776, 16)"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# head(flights)\n",
"flights.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 542</td>\n",
" <td> 2</td>\n",
" <td> 923</td>\n",
" <td> 33</td>\n",
" <td> AA</td>\n",
" <td> N619AA</td>\n",
" <td> 1141</td>\n",
" <td> JFK</td>\n",
" <td> MIA</td>\n",
" <td> 160</td>\n",
" <td> 1089</td>\n",
" <td> 5</td>\n",
" <td> 42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 544</td>\n",
" <td>-1</td>\n",
" <td> 1004</td>\n",
" <td>-18</td>\n",
" <td> B6</td>\n",
" <td> N804JB</td>\n",
" <td> 725</td>\n",
" <td> JFK</td>\n",
" <td> BQN</td>\n",
" <td> 183</td>\n",
" <td> 1576</td>\n",
" <td> 5</td>\n",
" <td> 44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 554</td>\n",
" <td>-6</td>\n",
" <td> 812</td>\n",
" <td>-25</td>\n",
" <td> DL</td>\n",
" <td> N668DN</td>\n",
" <td> 461</td>\n",
" <td> LGA</td>\n",
" <td> ATL</td>\n",
" <td> 116</td>\n",
" <td> 762</td>\n",
" <td> 5</td>\n",
" <td> 54</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier tailnum \\\n",
"1 2013 1 1 517 2 830 11 UA N14228 \n",
"2 2013 1 1 533 4 850 20 UA N24211 \n",
"3 2013 1 1 542 2 923 33 AA N619AA \n",
"4 2013 1 1 544 -1 1004 -18 B6 N804JB \n",
"5 2013 1 1 554 -6 812 -25 DL N668DN \n",
"\n",
" flight origin dest air_time distance hour minute \n",
"1 1545 EWR IAH 227 1400 5 17 \n",
"2 1714 LGA IAH 227 1416 5 33 \n",
"3 1141 JFK MIA 160 1089 5 42 \n",
"4 725 JFK BQN 183 1576 5 44 \n",
"5 461 LGA ATL 116 762 5 54 "
]
}
],
"prompt_number": 5
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Single table verbs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"``dplyr`` has a small set of nicely defined verbs. I've listed their closest pandas verbs.\n",
"\n",
"<table>\n",
" <tr>\n",
" <td><b>dplyr</b></td>\n",
" <td><b>pandas</b></td>\n",
" </tr>\n",
" <tr>\n",
" <td>filter() (and slice())</td>\n",
" <td>query() (and loc[], iloc[])</td>\n",
" </tr>\n",
" <tr>\n",
" <td>arrange()</td>\n",
" <td>sort()</td>\n",
" </tr>\n",
" <tr>\n",
" <td>select() (and rename())</td>\n",
" <td>\\_\\_getitem\\_\\_ (and rename())</td>\n",
" </tr>\n",
" <tr>\n",
" <td>distinct()</td>\n",
" <td>drop_duplicates()</td>\n",
" </tr>\n",
" <tr>\n",
" <td>mutate() (and transmute())</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <td>summarise()</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <td>sample_n() and sample_frac()</td>\n",
" <td>None</td>\n",
"</table>\n",
"\n",
"Some of the \"missing\" verbs in pandas are because there are other, different ways of achieving the same goal. For example `summarise` is spread across `mean`, `std`, etc. Others, like `sample_n`, just haven't been implemented yet."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Filter rows with filter(), query()"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# filter(flights, month == 1, day == 1)\n",
"flights.query(\"month == 1 & day == 1\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>841</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> AA</td>\n",
" <td> N3EVAA</td>\n",
" <td> 1925</td>\n",
" <td> LGA</td>\n",
" <td> MIA</td>\n",
" <td> NaN</td>\n",
" <td> 1096</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>842</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> B6</td>\n",
" <td> N618JB</td>\n",
" <td> 125</td>\n",
" <td> JFK</td>\n",
" <td> FLL</td>\n",
" <td> NaN</td>\n",
" <td> 1069</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>842 rows \u00d7 16 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 1 1 517 2 830 11 UA \n",
"2 2013 1 1 533 4 850 20 UA \n",
".. ... ... ... ... ... ... ... ... \n",
"841 2013 1 1 NaN NaN NaN NaN AA \n",
"842 2013 1 1 NaN NaN NaN NaN B6 \n",
"\n",
" tailnum flight origin dest air_time distance hour minute \n",
"1 N14228 1545 EWR IAH 227 1400 5 17 \n",
"2 N24211 1714 LGA IAH 227 1416 5 33 \n",
".. ... ... ... ... ... ... ... ... \n",
"841 N3EVAA 1925 LGA MIA NaN 1096 NaN NaN \n",
"842 N618JB 125 JFK FLL NaN 1069 NaN NaN \n",
"\n",
"[842 rows x 16 columns]"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The more verbose version:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# flights[flights$month == 1 & flights$day == 1, ]\n",
"flights[(flights.month == 1) & (flights.day == 1)]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>841</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> AA</td>\n",
" <td> N3EVAA</td>\n",
" <td> 1925</td>\n",
" <td> LGA</td>\n",
" <td> MIA</td>\n",
" <td> NaN</td>\n",
" <td> 1096</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>842</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> B6</td>\n",
" <td> N618JB</td>\n",
" <td> 125</td>\n",
" <td> JFK</td>\n",
" <td> FLL</td>\n",
" <td> NaN</td>\n",
" <td> 1069</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>842 rows \u00d7 16 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 1 1 517 2 830 11 UA \n",
"2 2013 1 1 533 4 850 20 UA \n",
".. ... ... ... ... ... ... ... ... \n",
"841 2013 1 1 NaN NaN NaN NaN AA \n",
"842 2013 1 1 NaN NaN NaN NaN B6 \n",
"\n",
" tailnum flight origin dest air_time distance hour minute \n",
"1 N14228 1545 EWR IAH 227 1400 5 17 \n",
"2 N24211 1714 LGA IAH 227 1416 5 33 \n",
".. ... ... ... ... ... ... ... ... \n",
"841 N3EVAA 1925 LGA MIA NaN 1096 NaN NaN \n",
"842 N618JB 125 JFK FLL NaN 1069 NaN NaN \n",
"\n",
"[842 rows x 16 columns]"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# slice(flights, 1:10)\n",
"flights.iloc[:9]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 557</td>\n",
" <td>-3</td>\n",
" <td> 709</td>\n",
" <td>-14</td>\n",
" <td> EV</td>\n",
" <td> N829AS</td>\n",
" <td> 5708</td>\n",
" <td> LGA</td>\n",
" <td> IAD</td>\n",
" <td> 53</td>\n",
" <td> 229</td>\n",
" <td> 5</td>\n",
" <td> 57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 557</td>\n",
" <td>-3</td>\n",
" <td> 838</td>\n",
" <td> -8</td>\n",
" <td> B6</td>\n",
" <td> N593JB</td>\n",
" <td> 79</td>\n",
" <td> JFK</td>\n",
" <td> MCO</td>\n",
" <td> 140</td>\n",
" <td> 944</td>\n",
" <td> 5</td>\n",
" <td> 57</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9 rows \u00d7 16 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 1 1 517 2 830 11 UA \n",
"2 2013 1 1 533 4 850 20 UA \n",
".. ... ... ... ... ... ... ... ... \n",
"8 2013 1 1 557 -3 709 -14 EV \n",
"9 2013 1 1 557 -3 838 -8 B6 \n",
"\n",
" tailnum flight origin dest air_time distance hour minute \n",
"1 N14228 1545 EWR IAH 227 1400 5 17 \n",
"2 N24211 1714 LGA IAH 227 1416 5 33 \n",
".. ... ... ... ... ... ... ... ... \n",
"8 N829AS 5708 LGA IAD 53 229 5 57 \n",
"9 N593JB 79 JFK MCO 140 944 5 57 \n",
"\n",
"[9 rows x 16 columns]"
]
}
],
"prompt_number": 8
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Arrange rows with arrange(), sort()"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# arrange(flights, year, month, day) \n",
"flights.sort(['year', 'month', 'day'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>111295</th>\n",
" <td> 2013</td>\n",
" <td> 12</td>\n",
" <td> 31</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> UA</td>\n",
" <td> NaN</td>\n",
" <td> 219</td>\n",
" <td> EWR</td>\n",
" <td> ORD</td>\n",
" <td> NaN</td>\n",
" <td> 719</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>111296</th>\n",
" <td> 2013</td>\n",
" <td> 12</td>\n",
" <td> 31</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> UA</td>\n",
" <td> NaN</td>\n",
" <td> 443</td>\n",
" <td> JFK</td>\n",
" <td> LAX</td>\n",
" <td> NaN</td>\n",
" <td> 2475</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>336776 rows \u00d7 16 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 1 1 517 2 830 11 UA \n",
"2 2013 1 1 533 4 850 20 UA \n",
"... ... ... ... ... ... ... ... ... \n",
"111295 2013 12 31 NaN NaN NaN NaN UA \n",
"111296 2013 12 31 NaN NaN NaN NaN UA \n",
"\n",
" tailnum flight origin dest air_time distance hour minute \n",
"1 N14228 1545 EWR IAH 227 1400 5 17 \n",
"2 N24211 1714 LGA IAH 227 1416 5 33 \n",
"... ... ... ... ... ... ... ... ... \n",
"111295 NaN 219 EWR ORD NaN 719 NaN NaN \n",
"111296 NaN 443 JFK LAX NaN 2475 NaN NaN \n",
"\n",
"[336776 rows x 16 columns]"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# arrange(flights, desc(arr_delay))\n",
"flights.sort('arr_delay', ascending=False)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>7073 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 9</td>\n",
" <td> 641</td>\n",
" <td> 1301</td>\n",
" <td> 1242</td>\n",
" <td> 1272</td>\n",
" <td> HA</td>\n",
" <td> N384HA</td>\n",
" <td> 51</td>\n",
" <td> JFK</td>\n",
" <td> HNL</td>\n",
" <td> 640</td>\n",
" <td> 4983</td>\n",
" <td> 6</td>\n",
" <td> 41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>235779</th>\n",
" <td> 2013</td>\n",
" <td> 6</td>\n",
" <td> 15</td>\n",
" <td> 1432</td>\n",
" <td> 1137</td>\n",
" <td> 1607</td>\n",
" <td> 1127</td>\n",
" <td> MQ</td>\n",
" <td> N504MQ</td>\n",
" <td> 3535</td>\n",
" <td> JFK</td>\n",
" <td> CMH</td>\n",
" <td> 74</td>\n",
" <td> 483</td>\n",
" <td> 14</td>\n",
" <td> 32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336775</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> MQ</td>\n",
" <td> N511MQ</td>\n",
" <td> 3572</td>\n",
" <td> LGA</td>\n",
" <td> CLE</td>\n",
" <td> NaN</td>\n",
" <td> 419</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336776</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> MQ</td>\n",
" <td> N839MQ</td>\n",
" <td> 3531</td>\n",
" <td> LGA</td>\n",
" <td> RDU</td>\n",
" <td> NaN</td>\n",
" <td> 431</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>336776 rows \u00d7 16 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"7073 2013 1 9 641 1301 1242 1272 HA \n",
"235779 2013 6 15 1432 1137 1607 1127 MQ \n",
"... ... ... ... ... ... ... ... ... \n",
"336775 2013 9 30 NaN NaN NaN NaN MQ \n",
"336776 2013 9 30 NaN NaN NaN NaN MQ \n",
"\n",
" tailnum flight origin dest air_time distance hour minute \n",
"7073 N384HA 51 JFK HNL 640 4983 6 41 \n",
"235779 N504MQ 3535 JFK CMH 74 483 14 32 \n",
"... ... ... ... ... ... ... ... ... \n",
"336775 N511MQ 3572 LGA CLE NaN 419 NaN NaN \n",
"336776 N839MQ 3531 LGA RDU NaN 431 NaN NaN \n",
"\n",
"[336776 rows x 16 columns]"
]
}
],
"prompt_number": 10
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Select columns with select(), []"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# select(flights, year, month, day) \n",
"flights[['year', 'month', 'day']]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336775</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336776</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>336776 rows \u00d7 3 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
" year month day\n",
"1 2013 1 1\n",
"2 2013 1 1\n",
"... ... ... ...\n",
"336775 2013 9 30\n",
"336776 2013 9 30\n",
"\n",
"[336776 rows x 3 columns]"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# select(flights, year:day) \n",
"\n",
"# No real equivalent here. Although I think this is OK.\n",
"# Typically I'll have the columns I want stored in a list\n",
"# somewhere, which can be passed right into __getitem__ ([])."
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# select(flights, -(year:day)) \n",
"\n",
"# Again, simliar story. I would just use\n",
"# flights.drop(cols_to_drop, axis=1)\n",
"# or fligths[flights.columns - pd.Index(cols_to_drop)]\n",
"# point to dplyr!"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# select(flights, tail_num = tailnum)\n",
"flights.rename(columns={'tailnum': 'tail_num'})['tail_num']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"1 N14228\n",
"...\n",
"336776 N839MQ\n",
"Name: tail_num, Length: 336776, dtype: object"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But like Hadley mentions, not that useful since it only returns the one column. ``dplyr`` and ``pandas`` compare well here."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# rename(flights, tail_num = tailnum)\n",
"flights.rename(columns={'tailnum': 'tail_num'})"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tail_num</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336775</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> MQ</td>\n",
" <td> N511MQ</td>\n",
" <td> 3572</td>\n",
" <td> LGA</td>\n",
" <td> CLE</td>\n",
" <td> NaN</td>\n",
" <td> 419</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336776</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> MQ</td>\n",
" <td> N839MQ</td>\n",
" <td> 3531</td>\n",
" <td> LGA</td>\n",
" <td> RDU</td>\n",
" <td> NaN</td>\n",
" <td> 431</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>336776 rows \u00d7 16 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 1 1 517 2 830 11 UA \n",
"2 2013 1 1 533 4 850 20 UA \n",
"... ... ... ... ... ... ... ... ... \n",
"336775 2013 9 30 NaN NaN NaN NaN MQ \n",
"336776 2013 9 30 NaN NaN NaN NaN MQ \n",
"\n",
" tail_num flight origin dest air_time distance hour minute \n",
"1 N14228 1545 EWR IAH 227 1400 5 17 \n",
"2 N24211 1714 LGA IAH 227 1416 5 33 \n",
"... ... ... ... ... ... ... ... ... \n",
"336775 N511MQ 3572 LGA CLE NaN 419 NaN NaN \n",
"336776 N839MQ 3531 LGA RDU NaN 431 NaN NaN \n",
"\n",
"[336776 rows x 16 columns]"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas is more verbose, but the the argument to `columns` can be any mapping. So it's often used with a function to perform a common task, say `df.rename(columns=lambda x: x.replace('-', '_'))` to replace any dashes with underscores. Also, ``rename`` (the pandas version) can be applied to the Index."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Extract distinct (unique) rows "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# distinct(select(flights, tailnum))\n",
"flights.tailnum.unique()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
"array(['N14228', 'N24211', 'N619AA', ..., 'N776SK', 'N785SK', 'N557AS'], dtype=object)"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"FYI this returns a numpy array instead of a Series."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# distinct(select(flights, origin, dest))\n",
"flights[['origin', 'dest']].drop_duplicates()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>255456</th>\n",
" <td> EWR</td>\n",
" <td> ANC</td>\n",
" </tr>\n",
" <tr>\n",
" <th>275946</th>\n",
" <td> EWR</td>\n",
" <td> LGA</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>224 rows \u00d7 2 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
" origin dest\n",
"1 EWR IAH\n",
"2 LGA IAH\n",
"... ... ...\n",
"255456 EWR ANC\n",
"275946 EWR LGA\n",
"\n",
"[224 rows x 2 columns]"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, so ``dplyr`` wins there from a consistency point of view. ``unique`` is only defined on Series, not DataFrames. The original intention for `drop_duplicates` is to check for records that were accidentally included twice. This feels a bit hacky using it to select the distinct combinations, but it works!"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Add new columns with mutate() "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# mutate(flights,\n",
"# gain = arr_delay - dep_delay,\n",
"# speed = distance / air_time * 60)\n",
"\n",
"flights['gain'] = flights.arr_delay - flights.dep_delay\n",
"flights['speed'] = flights.distance / flights.air_time * 60\n",
"flights"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" <th>gain</th>\n",
" <th>speed</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" <td> 9</td>\n",
" <td> 370.044053</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" <td> 16</td>\n",
" <td> 374.273128</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336775</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> MQ</td>\n",
" <td> N511MQ</td>\n",
" <td> 3572</td>\n",
" <td> LGA</td>\n",
" <td> CLE</td>\n",
" <td> NaN</td>\n",
" <td> 419</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336776</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> MQ</td>\n",
" <td> N839MQ</td>\n",
" <td> 3531</td>\n",
" <td> LGA</td>\n",
" <td> RDU</td>\n",
" <td> NaN</td>\n",
" <td> 431</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>336776 rows \u00d7 18 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 1 1 517 2 830 11 UA \n",
"2 2013 1 1 533 4 850 20 UA \n",
"... ... ... ... ... ... ... ... ... \n",
"336775 2013 9 30 NaN NaN NaN NaN MQ \n",
"336776 2013 9 30 NaN NaN NaN NaN MQ \n",
"\n",
" tailnum flight origin dest air_time distance hour minute gain \\\n",
"1 N14228 1545 EWR IAH 227 1400 5 17 9 \n",
"2 N24211 1714 LGA IAH 227 1416 5 33 16 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"336775 N511MQ 3572 LGA CLE NaN 419 NaN NaN NaN \n",
"336776 N839MQ 3531 LGA RDU NaN 431 NaN NaN NaN \n",
"\n",
" speed \n",
"1 370.044053 \n",
"2 374.273128 \n",
"... ... \n",
"336775 NaN \n",
"336776 NaN \n",
"\n",
"[336776 rows x 18 columns]"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# mutate(flights,\n",
"# gain = arr_delay - dep_delay,\n",
"# gain_per_hour = gain / (air_time / 60)\n",
"# )\n",
"\n",
"flights['gain'] = flights.arr_delay - flights.dep_delay\n",
"flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)\n",
"flights"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" <th>gain</th>\n",
" <th>speed</th>\n",
" <th>gain_per_hour</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 517</td>\n",
" <td> 2</td>\n",
" <td> 830</td>\n",
" <td> 11</td>\n",
" <td> UA</td>\n",
" <td> N14228</td>\n",
" <td> 1545</td>\n",
" <td> EWR</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1400</td>\n",
" <td> 5</td>\n",
" <td> 17</td>\n",
" <td> 9</td>\n",
" <td> 370.044053</td>\n",
" <td> 2.378855</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 533</td>\n",
" <td> 4</td>\n",
" <td> 850</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N24211</td>\n",
" <td> 1714</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 227</td>\n",
" <td> 1416</td>\n",
" <td> 5</td>\n",
" <td> 33</td>\n",
" <td> 16</td>\n",
" <td> 374.273128</td>\n",
" <td> 4.229075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336775</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> MQ</td>\n",
" <td> N511MQ</td>\n",
" <td> 3572</td>\n",
" <td> LGA</td>\n",
" <td> CLE</td>\n",
" <td> NaN</td>\n",
" <td> 419</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336776</th>\n",
" <td> 2013</td>\n",
" <td> 9</td>\n",
" <td> 30</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td> MQ</td>\n",
" <td> N839MQ</td>\n",
" <td> 3531</td>\n",
" <td> LGA</td>\n",
" <td> RDU</td>\n",
" <td> NaN</td>\n",
" <td> 431</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>336776 rows \u00d7 19 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 1 1 517 2 830 11 UA \n",
"2 2013 1 1 533 4 850 20 UA \n",
"... ... ... ... ... ... ... ... ... \n",
"336775 2013 9 30 NaN NaN NaN NaN MQ \n",
"336776 2013 9 30 NaN NaN NaN NaN MQ \n",
"\n",
" tailnum flight origin dest air_time distance hour minute gain \\\n",
"1 N14228 1545 EWR IAH 227 1400 5 17 9 \n",
"2 N24211 1714 LGA IAH 227 1416 5 33 16 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"336775 N511MQ 3572 LGA CLE NaN 419 NaN NaN NaN \n",
"336776 N839MQ 3531 LGA RDU NaN 431 NaN NaN NaN \n",
"\n",
" speed gain_per_hour \n",
"1 370.044053 2.378855 \n",
"2 374.273128 4.229075 \n",
"... ... ... \n",
"336775 NaN NaN \n",
"336776 NaN NaN \n",
"\n",
"[336776 rows x 19 columns]"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"``dplyr's`` approach may be nicer here since you get to refer to the variables in subsequent statements within the ``mutate()``. To achieve this with pandas, you have to add the `gain` variable as another column in ``flights``. If I don't want it around I would have to explicitly drop it."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# transmute(flights,\n",
"# gain = arr_delay - dep_delay,\n",
"# gain_per_hour = gain / (air_time / 60)\n",
"# )\n",
"\n",
"flights['gain'] = flights.arr_delay - flights.dep_delay\n",
"flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)\n",
"flights[['gain', 'gain_per_hour']]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>gain</th>\n",
" <th>gain_per_hour</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> 9</td>\n",
" <td> 2.378855</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> 16</td>\n",
" <td> 4.229075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336775</th>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336776</th>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>336776 rows \u00d7 2 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
" gain gain_per_hour\n",
"1 9 2.378855\n",
"2 16 4.229075\n",
"... ... ...\n",
"336775 NaN NaN\n",
"336776 NaN NaN\n",
"\n",
"[336776 rows x 2 columns]"
]
}
],
"prompt_number": 20
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Summarise values with summarise()"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"flights.dep_delay.mean()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"12.639070257304708"
]
}
],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Randomly sample rows with sample_n() and sample_frac()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's an open PR on [Github](https://github.com/pydata/pandas/pull/7274) to make this nicer (closer to ``dplyr``). For now you can drop down to numpy."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# sample_n(flights, 10)\n",
"flights.loc[np.random.choice(flights.index, 10)]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" <th>gain</th>\n",
" <th>speed</th>\n",
" <th>gain_per_hour</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3186 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 4</td>\n",
" <td> 1450</td>\n",
" <td> 4</td>\n",
" <td> 1809</td>\n",
" <td> 12</td>\n",
" <td> UA</td>\n",
" <td> N437UA</td>\n",
" <td> 477</td>\n",
" <td> LGA</td>\n",
" <td> IAH</td>\n",
" <td> 226</td>\n",
" <td> 1416</td>\n",
" <td> 14</td>\n",
" <td> 50</td>\n",
" <td> 8</td>\n",
" <td> 375.929204</td>\n",
" <td> 2.123894</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32341 </th>\n",
" <td> 2013</td>\n",
" <td> 10</td>\n",
" <td> 6</td>\n",
" <td> 1823</td>\n",
" <td>-6</td>\n",
" <td> 2031</td>\n",
" <td> 0</td>\n",
" <td> US</td>\n",
" <td> N187US</td>\n",
" <td> 1751</td>\n",
" <td> EWR</td>\n",
" <td> CLT</td>\n",
" <td> 79</td>\n",
" <td> 529</td>\n",
" <td> 18</td>\n",
" <td> 23</td>\n",
" <td> 6</td>\n",
" <td> 401.772152</td>\n",
" <td> 4.556962</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14127 </th>\n",
" <td> 2013</td>\n",
" <td> 1</td>\n",
" <td> 17</td>\n",
" <td> 746</td>\n",
" <td> 1</td>\n",
" <td> 1135</td>\n",
" <td> 10</td>\n",
" <td> AA</td>\n",
" <td> N352AA</td>\n",
" <td> 59</td>\n",
" <td> JFK</td>\n",
" <td> SFO</td>\n",
" <td> 386</td>\n",
" <td> 2586</td>\n",
" <td> 7</td>\n",
" <td> 46</td>\n",
" <td> 9</td>\n",
" <td> 401.968912</td>\n",
" <td> 1.398964</td>\n",
" </tr>\n",
" <tr>\n",
" <th>177917</th>\n",
" <td> 2013</td>\n",
" <td> 4</td>\n",
" <td> 14</td>\n",
" <td> 1556</td>\n",
" <td> 0</td>\n",
" <td> 1720</td>\n",
" <td> -7</td>\n",
" <td> UA</td>\n",
" <td> N73276</td>\n",
" <td> 1524</td>\n",
" <td> EWR</td>\n",
" <td> ORD</td>\n",
" <td> 116</td>\n",
" <td> 719</td>\n",
" <td> 15</td>\n",
" <td> 56</td>\n",
" <td>-7</td>\n",
" <td> 371.896552</td>\n",
" <td>-3.620690</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows \u00d7 19 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 22,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"3186 2013 1 4 1450 4 1809 12 UA \n",
"32341 2013 10 6 1823 -6 2031 0 US \n",
"... ... ... ... ... ... ... ... ... \n",
"14127 2013 1 17 746 1 1135 10 AA \n",
"177917 2013 4 14 1556 0 1720 -7 UA \n",
"\n",
" tailnum flight origin dest air_time distance hour minute gain \\\n",
"3186 N437UA 477 LGA IAH 226 1416 14 50 8 \n",
"32341 N187US 1751 EWR CLT 79 529 18 23 6 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"14127 N352AA 59 JFK SFO 386 2586 7 46 9 \n",
"177917 N73276 1524 EWR ORD 116 719 15 56 -7 \n",
"\n",
" speed gain_per_hour \n",
"3186 375.929204 2.123894 \n",
"32341 401.772152 4.556962 \n",
"... ... ... \n",
"14127 401.968912 1.398964 \n",
"177917 371.896552 -3.620690 \n",
"\n",
"[10 rows x 19 columns]"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# sample_frac(flights, 0.01)\n",
"flights.iloc[np.random.randint(0, len(flights),\n",
" .1 * len(flights))]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" <th>gain</th>\n",
" <th>speed</th>\n",
" <th>gain_per_hour</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>236290</th>\n",
" <td> 2013</td>\n",
" <td> 6</td>\n",
" <td> 16</td>\n",
" <td> 809</td>\n",
" <td> -6</td>\n",
" <td> 1038</td>\n",
" <td> 5</td>\n",
" <td> DL</td>\n",
" <td> N351NW</td>\n",
" <td> 914</td>\n",
" <td> LGA</td>\n",
" <td> DEN</td>\n",
" <td> 234</td>\n",
" <td> 1620</td>\n",
" <td> 8</td>\n",
" <td> 9</td>\n",
" <td> 11</td>\n",
" <td> 415.384615</td>\n",
" <td> 2.820513</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121118</th>\n",
" <td> 2013</td>\n",
" <td> 2</td>\n",
" <td> 12</td>\n",
" <td> 1202</td>\n",
" <td> -3</td>\n",
" <td> 1522</td>\n",
" <td> 2</td>\n",
" <td> AA</td>\n",
" <td> N486AA</td>\n",
" <td> 743</td>\n",
" <td> LGA</td>\n",
" <td> DFW</td>\n",
" <td> 229</td>\n",
" <td> 1389</td>\n",
" <td> 12</td>\n",
" <td> 2</td>\n",
" <td> 5</td>\n",
" <td> 363.930131</td>\n",
" <td> 1.310044</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68206 </th>\n",
" <td> 2013</td>\n",
" <td> 11</td>\n",
" <td> 14</td>\n",
" <td> 1010</td>\n",
" <td> 0</td>\n",
" <td> 1131</td>\n",
" <td> -4</td>\n",
" <td> MQ</td>\n",
" <td> N519MQ</td>\n",
" <td> 3675</td>\n",
" <td> LGA</td>\n",
" <td> BNA</td>\n",
" <td> 103</td>\n",
" <td> 764</td>\n",
" <td> 10</td>\n",
" <td> 10</td>\n",
" <td> -4</td>\n",
" <td> 445.048544</td>\n",
" <td>-2.330097</td>\n",
" </tr>\n",
" <tr>\n",
" <th>127355</th>\n",
" <td> 2013</td>\n",
" <td> 2</td>\n",
" <td> 19</td>\n",
" <td> 1131</td>\n",
" <td> 35</td>\n",
" <td> 1458</td>\n",
" <td> 20</td>\n",
" <td> UA</td>\n",
" <td> N36447</td>\n",
" <td> 1667</td>\n",
" <td> EWR</td>\n",
" <td> SFO</td>\n",
" <td> 349</td>\n",
" <td> 2565</td>\n",
" <td> 11</td>\n",
" <td> 31</td>\n",
" <td>-15</td>\n",
" <td> 440.974212</td>\n",
" <td>-2.578797</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>33677 rows \u00d7 19 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"236290 2013 6 16 809 -6 1038 5 DL \n",
"121118 2013 2 12 1202 -3 1522 2 AA \n",
"... ... ... ... ... ... ... ... ... \n",
"68206 2013 11 14 1010 0 1131 -4 MQ \n",
"127355 2013 2 19 1131 35 1458 20 UA \n",
"\n",
" tailnum flight origin dest air_time distance hour minute gain \\\n",
"236290 N351NW 914 LGA DEN 234 1620 8 9 11 \n",
"121118 N486AA 743 LGA DFW 229 1389 12 2 5 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"68206 N519MQ 3675 LGA BNA 103 764 10 10 -4 \n",
"127355 N36447 1667 EWR SFO 349 2565 11 31 -15 \n",
"\n",
" speed gain_per_hour \n",
"236290 415.384615 2.820513 \n",
"121118 363.930131 1.310044 \n",
"... ... ... \n",
"68206 445.048544 -2.330097 \n",
"127355 440.974212 -2.578797 \n",
"\n",
"[33677 rows x 19 columns]"
]
}
],
"prompt_number": 23
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Grouped operations "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# planes <- group_by(flights, tailnum)\n",
"# delay <- summarise(planes,\n",
"# count = n(),\n",
"# dist = mean(distance, na.rm = TRUE),\n",
"# delay = mean(arr_delay, na.rm = TRUE))\n",
"# delay <- filter(delay, count > 20, dist < 2000)\n",
"\n",
"planes = flights.groupby(\"tailnum\")\n",
"delay = planes.agg({\"year\": \"count\",\n",
" \"distance\": \"mean\",\n",
" \"arr_delay\": \"mean\"})\n",
"delay.query(\"year > 20 & distance < 2000\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>distance</th>\n",
" <th>arr_delay</th>\n",
" </tr>\n",
" <tr>\n",
" <th>tailnum</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>N0EGMQ</th>\n",
" <td> 371</td>\n",
" <td> 676.188679</td>\n",
" <td> 9.982955</td>\n",
" </tr>\n",
" <tr>\n",
" <th>N10156</th>\n",
" <td> 153</td>\n",
" <td> 757.947712</td>\n",
" <td> 12.717241</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>N999DN</th>\n",
" <td> 61</td>\n",
" <td> 895.459016</td>\n",
" <td> 14.311475</td>\n",
" </tr>\n",
" <tr>\n",
" <th>N9EAMQ</th>\n",
" <td> 248</td>\n",
" <td> 674.665323</td>\n",
" <td> 9.235294</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2961 rows \u00d7 3 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
" year distance arr_delay\n",
"tailnum \n",
"N0EGMQ 371 676.188679 9.982955\n",
"N10156 153 757.947712 12.717241\n",
"... ... ... ...\n",
"N999DN 61 895.459016 14.311475\n",
"N9EAMQ 248 674.665323 9.235294\n",
"\n",
"[2961 rows x 3 columns]"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For me, dplyr's ``n()`` looked is a bit starge at first, but it's already growing on me.\n",
"\n",
"I think pandas is more difficult for this particular example.\n",
"There isn't as natural a way to mix column-agnostic aggregations (like ``count``) with column-specific aggregations like the other two. You end up writing could like `.agg{'year': 'count'}` which reads, \"I want the count of `year`\", even though you don't care about `year` specifically.\n",
"Additionally assigning names can't be done as cleanly in pandas; you have to just follow it up with a ``rename`` like before."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# destinations <- group_by(flights, dest)\n",
"# summarise(destinations,\n",
"# planes = n_distinct(tailnum),\n",
"# flights = n()\n",
"# )\n",
"\n",
"destinations = flights.groupby('dest')\n",
"destinations.agg({\n",
" 'tailnum': lambda x: len(x.unique()),\n",
" 'year': 'count'\n",
" }).rename(columns={'tailnum': 'planes',\n",
" 'year': 'flights'})"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>planes</th>\n",
" <th>flights</th>\n",
" </tr>\n",
" <tr>\n",
" <th>dest</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>ABQ</th>\n",
" <td> 108</td>\n",
" <td> 254</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ACK</th>\n",
" <td> 58</td>\n",
" <td> 265</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>TYS</th>\n",
" <td> 273</td>\n",
" <td> 631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>XNA</th>\n",
" <td> 176</td>\n",
" <td> 1036</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>105 rows \u00d7 2 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 25,
"text": [
" planes flights\n",
"dest \n",
"ABQ 108 254\n",
"ACK 58 265\n",
"... ... ...\n",
"TYS 273 631\n",
"XNA 176 1036\n",
"\n",
"[105 rows x 2 columns]"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to how ``dplyr`` provides optimized C++ versions of most of the `summarise` functions, pandas uses [cython](http://cython.org) optimized versions for most of the `agg` methods."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# daily <- group_by(flights, year, month, day)\n",
"# (per_day <- summarise(daily, flights = n()))\n",
"\n",
"daily = flights.groupby(['year', 'month', 'day'])\n",
"per_day = daily['distance'].count()\n",
"per_day"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
"year month day\n",
"2013 1 1 842\n",
"...\n",
"2013 12 31 776\n",
"Name: distance, Length: 365, dtype: int64"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# (per_month <- summarise(per_day, flights = sum(flights)))\n",
"per_month = per_day.groupby(level=['year', 'month']).sum()\n",
"per_month"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": [
"year month\n",
"2013 1 27004\n",
"...\n",
"2013 12 28135\n",
"Name: distance, Length: 12, dtype: int64"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# (per_year <- summarise(per_month, flights = sum(flights)))\n",
"per_year = per_month.sum()\n",
"per_year"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"336776"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'm not sure how ``dplyr`` is handling the other columns, like `year`, in the last example. With pandas, it's clear that we're grouping by them since they're included in the groupby. For the last example, we didn't group by anything, so they aren't included in the result."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Chaining"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Any follower of Hadley's [twitter account](https://twitter.com/hadleywickham/) will know how much R users *love* the ``%>%`` (pipe) operator. And for good reason!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# flights %>%\n",
"# group_by(year, month, day) %>%\n",
"# select(arr_delay, dep_delay) %>%\n",
"# summarise(\n",
"# arr = mean(arr_delay, na.rm = TRUE),\n",
"# dep = mean(dep_delay, na.rm = TRUE)\n",
"# ) %>%\n",
"# filter(arr > 30 | dep > 30)\n",
"\n",
"flights.groupby(['year', 'month', 'day'])\\\n",
" [['arr_delay', 'dep_delay']]\\\n",
" .mean()\\\n",
" .query('arr_delay > 30 | dep_delay > 30')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th>arr_delay</th>\n",
" <th>dep_delay</th>\n",
" </tr>\n",
" <tr>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">2013</th>\n",
" <th rowspan=\"2\" valign=\"top\">1 </th>\n",
" <th>16</th>\n",
" <td> 34.247362</td>\n",
" <td> 24.612865</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td> 32.602854</td>\n",
" <td> 28.658363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1 </th>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">12</th>\n",
" <th>17</th>\n",
" <td> 55.871856</td>\n",
" <td> 40.705602</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td> 32.226042</td>\n",
" <td> 32.254149</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>49 rows \u00d7 2 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 29,
"text": [
" arr_delay dep_delay\n",
"year month day \n",
"2013 1 16 34.247362 24.612865\n",
" 31 32.602854 28.658363\n",
"... ... ...\n",
" 12 17 55.871856 40.705602\n",
" 23 32.226042 32.254149\n",
"\n",
"[49 rows x 2 columns]"
]
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All those backslashes are a bit of an eyesore to me, so I'd still probably break it into two expressions."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Other Data Sources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas has tons [IO tools](http://pandas.pydata.org/pandas-docs/version/0.15.0/io.html) to help you get data in and out, including SQL databases via [SQLAlchemy](http://www.sqlalchemy.org)."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I think pandas held up pretty well, considering this was a vignette written for dplyr. I found the degree of similarity more interesting than the differences. The most difficult task was renaming of columns within an operation; they had to be followed up with a call to ``rename`` *after* the operation, which isn't that burdensome honestly.\n",
"\n",
"More and more it looks like we're moving towards future where being a language or package partisan just doesn't make sense. Not when you can load up a [Jupyter](http://jupyter.org) (formerly IPython) notebook to call up a library written in R, and hand those results off to python or Julia or whatever for followup, before going back to R to make a cool [shiny](http://shiny.rstudio.com) web app.\n",
"\n",
"There will always be a place for your \"utility belt\" package like dplyr or pandas, but it wouldn't hurt to be familiar with both.\n",
"\n",
"If you want to contribute to pandas, we're always looking for help at https://github.com/pydata/pandas/.\n",
"You can get ahold of me directly on [twitter](https://twitter.com/tomaugspurger)."
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment