@kunzimariano
Created March 2, 2016 18:25
Jupyter Test
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we get started, a couple of reminders to keep in mind when using iPython notebooks:\n",
"\n",
"- Remember that you can see from the left side of a code cell when it was last run if there is a number within the brackets.\n",
"- When you start a new notebook session, make sure you run all of the cells up to the point where you last left off. Even if the output is still visible from when you ran the cells in your previous session, the kernel starts in a fresh state so you'll need to reload the data, etc. on a new session.\n",
"- The previous point is useful to keep in mind if your answers do not match what is expected in the lesson's quizzes. Try reloading the data and run all of the processing steps one by one in order to make sure that you are working with the same variables and data that are at each quiz stage.\n",
"\n",
"\n",
"## Load Data from CSVs"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import unicodecsv\n",
"\n",
"## Longer version of code (replaced with shorter, equivalent version below)\n",
"\n",
"# enrollments = []\n",
"# f = open('enrollments.csv', 'rb')\n",
"# reader = unicodecsv.DictReader(f)\n",
"# for row in reader:\n",
"# enrollments.append(row)\n",
"# f.close()\n",
"\n",
"#with open('enrollments.csv', 'rb') as f:\n",
"# reader = unicodecsv.DictReader(f)\n",
"# enrollments = list(reader)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#####################################\n",
"# 1 #\n",
"#####################################\n",
"import unicodecsv\n",
"\n",
"## Read in the data from daily_engagement.csv and project_submissions.csv \n",
"## and store the results in the below variables.\n",
"## Then look at the first row of each table.\n",
"def read_csv(filename):\n",
" with open(filename, 'rb') as f:\n",
" reader = unicodecsv.DictReader(f)\n",
" return list(reader)\n",
"\n",
"enrollments = read_csv('enrollments.csv')\n",
"daily_engagement = read_csv('daily_engagement.csv')\n",
"project_submissions = read_csv('project_submissions.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fixing key with different name"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n"
]
}
],
"source": [
"#####################################\n",
"# 3 #\n",
"#####################################\n",
"\n",
"## Rename the \"acct\" column in the daily_engagement table to \"account_key\".\n",
"for e in daily_engagement:\n",
" e['account_key'] = e['acct']\n",
" del e['acct'] \n",
"\n",
"print daily_engagement[0]['account_key'] "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fixing Data Types"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{u'account_key': u'448',\n",
" u'cancel_date': datetime.datetime(2015, 1, 14, 0, 0),\n",
" u'days_to_cancel': 65,\n",
" u'is_canceled': True,\n",
" u'is_udacity': True,\n",
" u'join_date': datetime.datetime(2014, 11, 10, 0, 0),\n",
" u'status': u'canceled'}"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from datetime import datetime as dt\n",
"\n",
"# Takes a date as a string, and returns a Python datetime object. \n",
"# If there is no date given, returns None\n",
"def parse_date(date):\n",
" if date == '':\n",
" return None\n",
" else:\n",
" return dt.strptime(date, '%Y-%m-%d')\n",
" \n",
"# Takes a string which is either an empty string or represents an integer,\n",
"# and returns an int or None.\n",
"def parse_maybe_int(i):\n",
" if i == '':\n",
" return None\n",
" else:\n",
" return int(i)\n",
"\n",
"# Clean up the data types in the enrollments table\n",
"for enrollment in enrollments:\n",
" enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])\n",
" enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])\n",
" enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'\n",
" enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'\n",
" enrollment['join_date'] = parse_date(enrollment['join_date'])\n",
" \n",
"enrollments[0]"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'account_key': u'0',\n",
" u'lessons_completed': 0,\n",
" u'num_courses_visited': 1,\n",
" u'projects_completed': 0,\n",
" u'total_minutes_visited': 11.6793745,\n",
" u'utc_date': datetime.datetime(2015, 1, 9, 0, 0)}"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Clean up the data types in the engagement table\n",
"for engagement_record in daily_engagement:\n",
" engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))\n",
" engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))\n",
" engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))\n",
" engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])\n",
" engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])\n",
" \n",
"daily_engagement[0]"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{u'account_key': u'256',\n",
" u'assigned_rating': u'UNGRADED',\n",
" u'completion_date': datetime.datetime(2015, 1, 16, 0, 0),\n",
" u'creation_date': datetime.datetime(2015, 1, 14, 0, 0),\n",
" u'lesson_key': u'3176718735',\n",
" u'processing_state': u'EVALUATED'}"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Clean up the data types in the submissions table\n",
"for submission in project_submissions:\n",
" submission['completion_date'] = parse_date(submission['completion_date'])\n",
" submission['creation_date'] = parse_date(submission['creation_date'])\n",
"\n",
"project_submissions[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note when running the above cells that we are actively changing the contents of our data variables. If you try to run these cells multiple times in the same session, an error will occur.\n",
"\n",
"## Investigating the Data"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"####enrollment####\n",
"1640\n",
"1302\n",
"####daily_engagement####\n",
"136240\n",
"1237\n",
"####project_submissions####\n",
"3642\n",
"743\n"
]
}
],
"source": [
"#####################################\n",
"# 2 #\n",
"#####################################\n",
"\n",
"## Find the total number of rows and the number of unique students (account keys)\n",
"## in each table.\n",
"def get_unique_count(dic):\n",
" d = set()\n",
" for e in dic:\n",
" d.add(e['account_key'])\n",
" return len(d)\n",
"\n",
"enrollment_num_rows = len(enrollments)\n",
"enrollment_num_unique_students = get_unique_count(enrollments) \n",
"print '####enrollment####'\n",
"print enrollment_num_rows\n",
"print enrollment_num_unique_students\n",
"\n",
"engagement_num_rows = len(daily_engagement)\n",
"engagement_num_unique_students = get_unique_count(daily_engagement)\n",
"print '####daily_engagement####'\n",
"print engagement_num_rows\n",
"print engagement_num_unique_students\n",
"\n",
"submission_num_rows = len(project_submissions)\n",
"submission_num_unique_students = get_unique_count(project_submissions)\n",
"print '####project_submissions####'\n",
"print submission_num_rows\n",
"print submission_num_unique_students\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems in the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing Engagement Records"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{u'status': u'canceled', u'is_udacity': False, u'is_canceled': True, u'join_date': datetime.datetime(2014, 11, 12, 0, 0), u'account_key': u'1219', u'cancel_date': datetime.datetime(2014, 11, 12, 0, 0), u'days_to_cancel': 0}\n"
]
}
],
"source": [
"#####################################\n",
"# 4 #\n",
"#####################################\n",
"\n",
"## Find any one student enrollments where the student is missing from the daily engagement table.\n",
"## Output that enrollment.\n",
"\n",
"## set of accounts keys\n",
"keys = set()\n",
"for de in daily_engagement:\n",
" keys.add(de['account_key'])\n",
"\n",
"for e in enrollments:\n",
" if e['account_key'] not in keys:\n",
" result = e\n",
" break\n",
"print result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking for More Problem Records"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{u'status': u'canceled', u'is_udacity': True, u'is_canceled': True, u'join_date': datetime.datetime(2015, 1, 10, 0, 0), u'account_key': u'1304', u'cancel_date': datetime.datetime(2015, 3, 10, 0, 0), u'days_to_cancel': 59}\n",
"{u'status': u'canceled', u'is_udacity': True, u'is_canceled': True, u'join_date': datetime.datetime(2015, 3, 10, 0, 0), u'account_key': u'1304', u'cancel_date': datetime.datetime(2015, 6, 17, 0, 0), u'days_to_cancel': 99}\n",
"{u'status': u'current', u'is_udacity': True, u'is_canceled': False, u'join_date': datetime.datetime(2015, 2, 25, 0, 0), u'account_key': u'1101', u'cancel_date': None, u'days_to_cancel': None}\n"
]
}
],
"source": [
"#####################################\n",
"# 5 #\n",
"#####################################\n",
"\n",
"## Find the number of surprising data points (enrollments missing from\n",
"## the engagement table) that remain, if any.\n",
"for e in enrollments:\n",
" if e['account_key'] not in keys and e['days_to_cancel'] != 0:\n",
" print e"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tracking Down the Remaining Problems"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a set of the account keys for all Udacity test accounts\n",
"udacity_test_accounts = set()\n",
"for enrollment in enrollments:\n",
" if enrollment['is_udacity']:\n",
" udacity_test_accounts.add(enrollment['account_key'])\n",
"len(udacity_test_accounts)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Given some data with an account_key field, removes any records corresponding to Udacity test accounts\n",
"def remove_udacity_accounts(data):\n",
" non_udacity_data = []\n",
" for data_point in data:\n",
" if data_point['account_key'] not in udacity_test_accounts:\n",
" non_udacity_data.append(data_point)\n",
" return non_udacity_data"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1622\n",
"135656\n",
"3634\n"
]
}
],
"source": [
"# Remove Udacity test accounts from all three tables\n",
"non_udacity_enrollments = remove_udacity_accounts(enrollments)\n",
"non_udacity_engagement = remove_udacity_accounts(daily_engagement)\n",
"non_udacity_submissions = remove_udacity_accounts(project_submissions)\n",
"\n",
"print len(non_udacity_enrollments)\n",
"print len(non_udacity_engagement)\n",
"print len(non_udacity_submissions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Refining the Question"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"995\n"
]
}
],
"source": [
"#####################################\n",
"# 6 #\n",
"#####################################\n",
"\n",
"## Create a dictionary named paid_students containing all students who either\n",
"## haven't canceled yet or who remained enrolled for more than 7 days. The keys\n",
"## should be account keys, and the values should be the date the student enrolled.\n",
"\n",
"paid_students = {}\n",
"for e in non_udacity_enrollments:\n",
" days_to_cancel = e['days_to_cancel']\n",
" if days_to_cancel == None or days_to_cancel > 7:\n",
" account_key = e['account_key']\n",
" join_date = e['join_date']\n",
" if account_key not in paid_students or \\\n",
" join_date > paid_students[account_key]:\n",
" paid_students[account_key] = join_date\n",
"print len(paid_students)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Data from First Week"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Takes a student's join date and the date of a specific engagement record,\n",
"# and returns True if that engagement record happened within one week\n",
"# of the student joining.\n",
"def within_one_week(join_date, engagement_date):\n",
" time_delta = engagement_date - join_date\n",
" return time_delta.days < 7"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#####################################\n",
"# 7 #\n",
"#####################################\n",
"\n",
"## Create a list of rows from the engagement table including only rows where\n",
"## the student is one of the paid students you just found, and the date is within\n",
"## one week of the student's join date.\n",
"\n",
"paid_engagement_in_first_week = "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring Student Engagement"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"# Create a dictionary of engagement grouped by student.\n",
"# The keys are account keys, and the values are lists of engagement records.\n",
"engagement_by_account = defaultdict(list)\n",
"for engagement_record in paid_engagement_in_first_week:\n",
" account_key = engagement_record['account_key']\n",
" engagement_by_account[account_key].append(engagement_record)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Create a dictionary with the total minutes each student spent in the classroom during the first week.\n",
"# The keys are account keys, and the values are numbers (total minutes)\n",
"total_minutes_by_account = {}\n",
"for account_key, engagement_for_student in engagement_by_account.items():\n",
" total_minutes = 0\n",
" for engagement_record in engagement_for_student:\n",
" total_minutes += engagement_record['total_minutes_visited']\n",
" total_minutes_by_account[account_key] = total_minutes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Summarize the data about minutes spent in the classroom\n",
"total_minutes = total_minutes_by_account.values()\n",
"print 'Mean:', np.mean(total_minutes)\n",
"print 'Standard deviation:', np.std(total_minutes)\n",
"print 'Minimum:', np.min(total_minutes)\n",
"print 'Maximum:', np.max(total_minutes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Debugging Data Analysis Code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#####################################\n",
"# 8 #\n",
"#####################################\n",
"\n",
"## Go through a similar process as before to see if there is a problem.\n",
"## Locate at least one surprising piece of data, output it, and take a look at it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lessons Completed in First Week"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#####################################\n",
"# 9 #\n",
"#####################################\n",
"\n",
"## Adapt the code above to find the mean, standard deviation, minimum, and maximum for\n",
"## the number of lessons completed by each student during the first week. Try creating\n",
"## one or more functions to re-use the code above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of Visits in First Week"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"######################################\n",
"# 10 #\n",
"######################################\n",
"\n",
"## Find the mean, standard deviation, minimum, and maximum for the number of\n",
"## days each student visits the classroom during the first week."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Splitting out Passing Students"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"######################################\n",
"# 11 #\n",
"######################################\n",
"\n",
"## Create two lists of engagement data for paid students in the first week.\n",
"## The first list should contain data for students who eventually pass the\n",
"## subway project, and the second list should contain data for students\n",
"## who do not.\n",
"\n",
"subway_project_lesson_keys = ['746169184', '3176718735']\n",
"\n",
"passing_engagement =\n",
"non_passing_engagement ="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparing the Two Student Groups"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"######################################\n",
"# 12 #\n",
"######################################\n",
"\n",
"## Compute some metrics you're interested in and see how they differ for\n",
"## students who pass the subway project vs. students who don't. A good\n",
"## starting point would be the metrics we looked at earlier (minutes spent\n",
"## in the classroom, lessons completed, and days visited)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making Histograms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"######################################\n",
"# 13 #\n",
"######################################\n",
"\n",
"## Make histograms of the three metrics we looked at earlier for both\n",
"## students who passed the subway project and students who didn't. You\n",
"## might also want to make histograms of any other metrics you examined."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Improving Plots and Sharing Findings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"######################################\n",
"# 14 #\n",
"######################################\n",
"\n",
"## Make a more polished version of at least one of your visualizations\n",
"## from earlier. Try importing the seaborn library to make the visualization\n",
"## look better, adding axis labels and a title, and changing one or more\n",
"## arguments to the hist() function."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
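
The notebook's "Getting Data from First Week" and "Exploring Student Engagement" cells filter engagement records to each paid student's first week, group them by account, and sum a field per student. A minimal sketch of that pipeline follows, run on a few made-up records rather than the real CSVs. The helper names `first_week_engagement`, `group_by_account`, and `sum_field` are hypothetical and do not appear in the notebook, and this version of `within_one_week` assumes that records dated before the join date should be excluded. The code is written to run under both Python 2 and Python 3.

```python
from __future__ import print_function

from collections import defaultdict
from datetime import datetime


def within_one_week(join_date, engagement_date):
    """Return True if the engagement happened 0 to 6 days after joining."""
    time_delta = engagement_date - join_date
    return 0 <= time_delta.days < 7


def first_week_engagement(engagement, paid_students):
    """Keep records of paid students that fall in their first week.

    paid_students maps account key -> join date (hypothetical helper).
    """
    kept = []
    for record in engagement:
        join_date = paid_students.get(record['account_key'])
        if join_date is not None and within_one_week(join_date, record['utc_date']):
            kept.append(record)
    return kept


def group_by_account(records):
    """Group records into a dict of account key -> list of records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record['account_key']].append(record)
    return grouped


def sum_field(grouped, field):
    """Sum one numeric field over each account's records."""
    return {key: sum(r[field] for r in recs) for key, recs in grouped.items()}


# Made-up sample data, for illustration only.
paid_students = {'1': datetime(2015, 1, 1), '2': datetime(2015, 1, 5)}
engagement = [
    {'account_key': '1', 'utc_date': datetime(2015, 1, 1), 'total_minutes_visited': 10.0},
    {'account_key': '1', 'utc_date': datetime(2015, 1, 7), 'total_minutes_visited': 5.0},
    # Seven days after joining: outside the first week.
    {'account_key': '1', 'utc_date': datetime(2015, 1, 8), 'total_minutes_visited': 99.0},
    # One day before joining: excluded by the non-negative check.
    {'account_key': '2', 'utc_date': datetime(2015, 1, 4), 'total_minutes_visited': 50.0},
    {'account_key': '2', 'utc_date': datetime(2015, 1, 6), 'total_minutes_visited': 20.0},
    # Account '3' is not a paid student.
    {'account_key': '3', 'utc_date': datetime(2015, 1, 6), 'total_minutes_visited': 30.0},
]

first_week = first_week_engagement(engagement, paid_students)
total_minutes_by_account = sum_field(group_by_account(first_week), 'total_minutes_visited')
print(sorted(total_minutes_by_account.items()))
```

Because `sum_field` takes the field name as an argument, the same grouping could be reused for `lessons_completed` in exercise 9 without copying the loop.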