Created on Skills Network Labs
{"cells":[{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<center>\n"," <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n","</center>\n","\n","# Waffle Charts, Word Clouds, and Regression Plots\n","\n","Estimated time needed: **30** minutes\n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","- Create word clouds and waffle charts\n","- Create regression plots with the Seaborn library\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## Table of Contents\n","\n","<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n","\n","1. [Exploring Datasets with _pandas_](#0)<br>\n","2. [Downloading and Prepping Data](#2)<br>\n","3. [Visualizing Data using Matplotlib](#4) <br>\n","4. [Waffle Charts](#6) <br>\n","5. [Word Clouds](#8) <br>\n","6. 
[Regression Plots](#10) <br> \n"," </div>\n"," <hr>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Exploring Datasets with _pandas_ and Matplotlib<a id=\"0\"></a>\n","\n","Toolkits: The course heavily relies on [_pandas_](http://pandas.pydata.org?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) and [**Numpy**](http://www.numpy.org?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) for data wrangling, analysis, and visualization. 
The primary plotting library we will explore in the course is [Matplotlib](http://matplotlib.org?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n","\n","Dataset: Immigration to Canada from 1980 to 2013 - [International migration flows to and from selected countries - The 2015 revision](http://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.shtml?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) from the United Nations' website.\n","\n","The dataset contains annual data on the flows of international migrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship, or place of previous / next residence, both for foreigners and nationals. 
In this lab, we will focus on the Canadian Immigration data.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Downloading and Prepping Data <a id=\"2\"></a>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Import Primary Modules:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["import numpy as np # useful for scientific computing in Python\n","import pandas as pd # primary data structure library\n","from PIL import Image # converting images into arrays"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's download and import our primary Canadian Immigration dataset using the _pandas_ `read_excel()` method. Normally, before we can do that, we would need to download a module which _pandas_ requires to read in Excel files. This module is **xlrd**. For your convenience, we have pre-installed this module, so you do not have to worry about that. 
Otherwise, you would need to run the following line of code to install the **xlrd** module:\n","\n","```\n","!conda install -c anaconda xlrd --yes\n","```\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Download the dataset and read it into a _pandas_ dataframe:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df_can = pd.read_excel('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.xlsx',\n"," sheet_name='Canada by Citizenship',\n"," skiprows=range(20),\n"," skipfooter=2)\n","\n","print('Data downloaded and read into a dataframe!')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's take a look at the first five items in our dataset\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df_can.head()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's find out how many entries there are in our dataset\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# print the dimensions of the dataframe\n","print(df_can.shape)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Clean up data. We will make some modifications to the original dataset to make it easier to create our visualizations. 
Refer to _Introduction to Matplotlib and Line Plots_ and _Area Plots, Histograms, and Bar Plots_ for a detailed description of this preprocessing.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# clean up the dataset to remove unnecessary columns (e.g. REG)\n","df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)\n","\n","# let's rename the columns so that they make sense\n","df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)\n","\n","# for the sake of consistency, let's also make all column labels of type string\n","df_can.columns = list(map(str, df_can.columns))\n","\n","# set the country name as index - useful for quickly looking up countries using the .loc method\n","df_can.set_index('Country', inplace=True)\n","\n","# add total column (numeric_only=True skips the text columns Continent and Region)\n","df_can['Total'] = df_can.sum(axis=1, numeric_only=True)\n","\n","# years that we will be using in this lesson - useful for plotting later on\n","years = list(map(str, range(1980, 2014)))\n","print ('data dimensions:', df_can.shape)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Visualizing Data using Matplotlib<a id=\"4\"></a>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Import `matplotlib`:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["%matplotlib inline\n","\n","import matplotlib as mpl\n","import matplotlib.pyplot as plt\n","import matplotlib.patches as mpatches # needed for waffle charts\n","\n","mpl.style.use('ggplot') # optional: for ggplot-like style\n","\n","# check for latest version of Matplotlib\n","print ('Matplotlib version: ', mpl.__version__) # >= 
2.0.0"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Waffle Charts <a id=\"6\"></a>\n","\n","A `waffle chart` is an interesting visualization that is normally created to display progress toward goals. It is commonly an effective option when you are trying to add interesting visualization features to a visual that consists mainly of cells, such as an Excel dashboard.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's revisit the previous case study about Denmark, Norway, and Sweden.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# let's create a new dataframe for these three countries \n","df_dsn = df_can.loc[['Denmark', 'Norway', 'Sweden'], :]\n","\n","# let's take a look at our dataframe\n","df_dsn"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Unfortunately, unlike R, `waffle` charts are not built into any of the Python visualization libraries. 
Therefore, we will learn how to create them from scratch.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Step 1.** The first step in creating a waffle chart is determining the proportion of each category with respect to the total.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# compute the proportion of each category with respect to the total\n","total_values = sum(df_dsn['Total'])\n","category_proportions = [(float(value) / total_values) for value in df_dsn['Total']]\n","\n","# print out proportions\n","for i, proportion in enumerate(category_proportions):\n","    print (df_dsn.index.values[i] + ': ' + str(proportion))"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Step 2.** The second step is defining the overall size of the `waffle` chart.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["width = 40 # width of chart\n","height = 10 # height of chart\n","\n","total_num_tiles = width * height # total number of tiles\n","\n","print ('Total number of tiles is ', total_num_tiles)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Step 3.** The third step is using the proportion of each category to determine its respective number of tiles.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# compute the number of tiles for each category\n","tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]\n","\n","# print out number of tiles per category\n","for i, tiles in enumerate(tiles_per_category):\n","    print 
(df_dsn.index.values[i] + ': ' + str(tiles))"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Based on the calculated proportions, Denmark will occupy 129 tiles of the `waffle` chart, Norway will occupy 77 tiles, and Sweden will occupy 194 tiles.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Step 4.** The fourth step is creating a matrix that resembles the `waffle` chart and populating it.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# initialize the waffle chart as an empty matrix\n","waffle_chart = np.zeros((height, width))\n","\n","# define indices to loop through waffle chart\n","category_index = 0\n","tile_index = 0\n","\n","# populate the waffle chart column by column\n","for col in range(width):\n","    for row in range(height):\n","        tile_index += 1\n","\n","        # if tile_index has passed the cumulative tile allocation of the categories seen so far...\n","        if tile_index > sum(tiles_per_category[0:category_index]):\n","            # ...proceed to the next category\n","            category_index += 1\n","\n","        # set the class value to an integer, which increases with class\n","        waffle_chart[row, col] = category_index\n","\n","print ('Waffle chart populated!')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's take a peek at what the matrix looks like.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["waffle_chart"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["As expected, the matrix consists of three categories and the total number of each category's instances matches the total number of tiles 
allocated to each category.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Step 5.** Map the `waffle` chart matrix into a visual.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# instantiate a new figure object\n","fig = plt.figure()\n","\n","# use matshow to display the waffle chart\n","colormap = plt.cm.coolwarm\n","plt.matshow(waffle_chart, cmap=colormap)\n","plt.colorbar()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Step 6.** Prettify the chart.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# instantiate a new figure object\n","fig = plt.figure()\n","\n","# use matshow to display the waffle chart\n","colormap = plt.cm.coolwarm\n","plt.matshow(waffle_chart, cmap=colormap)\n","plt.colorbar()\n","\n","# get the axis\n","ax = plt.gca()\n","\n","# set minor ticks\n","ax.set_xticks(np.arange(-.5, (width), 1), minor=True)\n","ax.set_yticks(np.arange(-.5, (height), 1), minor=True)\n"," \n","# add gridlines based on minor ticks\n","ax.grid(which='minor', color='w', linestyle='-', linewidth=2)\n","\n","plt.xticks([])\n","plt.yticks([])"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Step 7.** Create a legend and add it to chart.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# instantiate a new figure object\n","fig = plt.figure()\n","\n","# use matshow to display the waffle chart\n","colormap = plt.cm.coolwarm\n","plt.matshow(waffle_chart, cmap=colormap)\n","plt.colorbar()\n","\n","# get the axis\n","ax = plt.gca()\n","\n","# set minor 
ticks\n","ax.set_xticks(np.arange(-.5, (width), 1), minor=True)\n","ax.set_yticks(np.arange(-.5, (height), 1), minor=True)\n"," \n","# add gridlines based on minor ticks\n","ax.grid(which='minor', color='w', linestyle='-', linewidth=2)\n","\n","plt.xticks([])\n","plt.yticks([])\n","\n","# compute cumulative sum of individual categories to match color schemes between chart and legend\n","# (.values gives a plain NumPy array, so the positional indexing below is unambiguous)\n","values_cumsum = np.cumsum(df_dsn['Total'].values)\n","total_values = values_cumsum[-1]\n","\n","# create legend\n","legend_handles = []\n","for i, category in enumerate(df_dsn.index.values):\n","    label_str = category + ' (' + str(df_dsn['Total'].iloc[i]) + ')'\n","    color_val = colormap(float(values_cumsum[i])/total_values)\n","    legend_handles.append(mpatches.Patch(color=color_val, label=label_str))\n","\n","# add legend to chart\n","plt.legend(handles=legend_handles,\n","           loc='lower center',\n","           ncol=len(df_dsn.index.values),\n","           bbox_to_anchor=(0., -0.2, 0.95, .1)\n","          )"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["And there you go! What a good looking _delicious_ `waffle` chart, don't you think?\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Now it would be very inefficient to repeat these seven steps every time we wish to create a `waffle` chart. So let's combine all seven steps into one function called _create_waffle_chart_. This function takes the following parameters as input:\n","\n","> 1. **categories**: Unique categories or classes in dataframe.\n","> 2. **values**: Values corresponding to categories or classes.\n","> 3. **height**: Defined height of waffle chart.\n","> 4. **width**: Defined width of waffle chart.\n","> 5. **colormap**: Colormap class\n","> 6. 
**value_sign**: In order to make our function more generalizable, we will add this parameter to address signs that could be associated with a value such as %, $, and so on. **value_sign** defaults to an empty string.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["def create_waffle_chart(categories, values, height, width, colormap, value_sign=''):\n","\n","    # accept a list, array, or pandas Series and index it by position\n","    values = list(values)\n","\n","    # compute the proportion of each category with respect to the total\n","    total_values = sum(values)\n","    category_proportions = [(float(value) / total_values) for value in values]\n","\n","    # compute the total number of tiles\n","    total_num_tiles = width * height # total number of tiles\n","    print ('Total number of tiles is', total_num_tiles)\n","\n","    # compute the number of tiles for each category\n","    tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]\n","\n","    # print out number of tiles per category (use the categories argument, not the global df_dsn)\n","    for i, tiles in enumerate(tiles_per_category):\n","        print (categories[i] + ': ' + str(tiles))\n","\n","    # initialize the waffle chart as an empty matrix\n","    waffle_chart = np.zeros((height, width))\n","\n","    # define indices to loop through waffle chart\n","    category_index = 0\n","    tile_index = 0\n","\n","    # populate the waffle chart column by column\n","    for col in range(width):\n","        for row in range(height):\n","            tile_index += 1\n","\n","            # if tile_index has passed the cumulative tile allocation\n","            # of the categories seen so far...\n","            if tile_index > sum(tiles_per_category[0:category_index]):\n","                # ...proceed to the next category\n","                category_index += 1\n","\n","            # set the class value to an integer, which increases with class\n","            waffle_chart[row, col] = category_index\n","\n","    # instantiate a new figure object\n","    fig = plt.figure()\n","\n","    # use matshow to display the waffle chart with the colormap passed in as an argument\n","    plt.matshow(waffle_chart, cmap=colormap)\n","    plt.colorbar()\n","\n","    # get the axis\n","    ax = plt.gca()\n","\n","    # set minor ticks\n","    ax.set_xticks(np.arange(-.5, (width), 1), minor=True)\n","    ax.set_yticks(np.arange(-.5, (height), 1), minor=True)\n","\n","    # add gridlines based on minor ticks\n","    ax.grid(which='minor', color='w', linestyle='-', linewidth=2)\n","\n","    plt.xticks([])\n","    plt.yticks([])\n","\n","    # compute cumulative sum of individual categories to match color schemes between chart and legend\n","    values_cumsum = np.cumsum(values)\n","    total_values = values_cumsum[len(values_cumsum) - 1]\n","\n","    # create legend\n","    legend_handles = []\n","    for i, category in enumerate(categories):\n","        if value_sign == '%':\n","            label_str = category + ' (' + str(values[i]) + value_sign + ')'\n","        else:\n","            label_str = category + ' (' + value_sign + str(values[i]) + ')'\n","\n","        color_val = colormap(float(values_cumsum[i])/total_values)\n","        legend_handles.append(mpatches.Patch(color=color_val, label=label_str))\n","\n","    # add legend to chart\n","    plt.legend(\n","        handles=legend_handles,\n","        loc='lower center',\n","        ncol=len(categories),\n","        bbox_to_anchor=(0., -0.2, 0.95, .1)\n","    )"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Now to create a `waffle` chart, all we have to do is call the function `create_waffle_chart`. 
Let's define the input parameters:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["width = 40 # width of chart\n","height = 10 # height of chart\n","\n","categories = df_dsn.index.values # categories\n","values = df_dsn['Total'] # corresponding values of categories\n","\n","colormap = plt.cm.coolwarm # color map class"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["And now let's call our function to create a `waffle` chart.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["create_waffle_chart(categories, values, height, width, colormap)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["There seems to be a new Python package for generating `waffle charts` called [PyWaffle](https://github.com/ligyxy/PyWaffle), but it looks like the repository is still being built. Feel free to check it out and play with it.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Word Clouds <a id=\"8\"></a>\n","\n","`Word` clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Luckily, a Python package already exists for generating `word` clouds. The package, called `word_cloud`, was developed by **Andreas Mueller**. 
You can learn more about the package by following this [link](https://github.com/amueller/word_cloud/).\n","\n","Let's use this package to learn how to generate a word cloud for a given text document.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["First, let's install the package.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# install wordcloud\n","!conda install -c conda-forge wordcloud==1.4.1 --yes\n","\n","# import package and its set of stopwords\n","from wordcloud import WordCloud, STOPWORDS\n","\n","print ('Wordcloud is installed and imported!')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["`Word` clouds are commonly used to perform high-level analysis and visualization of text data. Accordingly, let's digress from the immigration dataset and work with an example that involves analyzing text data. Let's try to analyze a short novel written by **Lewis Carroll** titled _Alice's Adventures in Wonderland_. Let's go ahead and download a _.txt_ file of the novel.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# download file and save as alice_novel.txt\n","!wget --quiet https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/alice_novel.txt\n","\n","# open the file, read it into a variable alice_novel, and close the file again\n","with open('alice_novel.txt', 'r') as f:\n","    alice_novel = f.read()\n","\n","print ('File downloaded and saved!')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Next, let's use the stopwords that we imported from `word_cloud`. 
We use the function _set_ to remove any redundant stopwords; it also gives us our own copy that we can extend later.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["stopwords = set(STOPWORDS)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Create a word cloud object and generate a word cloud. For simplicity, let's cap the cloud at 2000 words: `max_words` sets the maximum number of words the cloud displays, not how much of the novel is read.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# instantiate a word cloud object\n","alice_wc = WordCloud(\n","    background_color='white',\n","    max_words=2000,\n","    stopwords=stopwords\n",")\n","\n","# generate the word cloud\n","alice_wc.generate(alice_novel)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Awesome! Now that the `word` cloud is created, let's visualize it.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["# display the word cloud\n","plt.imshow(alice_wc, interpolation='bilinear')\n","plt.axis('off')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Interesting! So among the words displayed, the most common ones in the novel are **Alice**, **said**, **little**, **Queen**, and so on. 
Let's resize the cloud so that we can see the less frequent words a little better.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["fig = plt.figure()\n","fig.set_figwidth(14) # set width\n","fig.set_figheight(18) # set height\n","\n","# display the cloud\n","plt.imshow(alice_wc, interpolation='bilinear')\n","plt.axis('off')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Much better! However, **said** isn't really an informative word. So let's add it to our stopwords and re-generate the cloud.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["stopwords.add('said') # add the word 'said' to stopwords\n","\n","# re-generate the word cloud\n","alice_wc.generate(alice_novel)\n","\n","# display the cloud\n","fig = plt.figure()\n","fig.set_figwidth(14) # set width\n","fig.set_figheight(18) # set height\n","\n","plt.imshow(alice_wc, interpolation='bilinear')\n","plt.axis('off')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Excellent! This looks really interesting! Another cool thing you can implement with the `word_cloud` package is superimposing the words onto a mask of any shape. Let's use a mask of Alice and her rabbit. 
We already created the mask for you, so let's go ahead and download it and call it _alice_mask.png_.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# download image\n","!wget --quiet https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%204/images/alice_mask.png\n"," \n","# save mask to alice_mask\n","alice_mask = np.array(Image.open('alice_mask.png'))\n"," \n","print('Image downloaded and saved!')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's take a look at what the mask looks like.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["fig = plt.figure()\n","fig.set_figwidth(14) # set width\n","fig.set_figheight(18) # set height\n","\n","plt.imshow(alice_mask, cmap=plt.cm.gray, interpolation='bilinear')\n","plt.axis('off')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Shaping the `word` cloud according to the mask is straightforward using the `word_cloud` package. 
For simplicity, we will continue capping the cloud at 2000 words with `max_words`.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# instantiate a word cloud object\n","alice_wc = WordCloud(background_color='white', max_words=2000, mask=alice_mask, stopwords=stopwords)\n","\n","# generate the word cloud\n","alice_wc.generate(alice_novel)\n","\n","# display the word cloud\n","fig = plt.figure()\n","fig.set_figwidth(14) # set width\n","fig.set_figheight(18) # set height\n","\n","plt.imshow(alice_wc, interpolation='bilinear')\n","plt.axis('off')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Really impressive!\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Unfortunately, our immigration data does not have any text data, but where there is a will there is a way. 
Let's generate sample text data from our immigration dataset, say text data of 90 words.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's recall what our data looks like.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df_can.head()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["And what was the total immigration from 1980 to 2013?\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["total_immigration = df_can['Total'].sum()\n","total_immigration"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Using countries with single-word names, let's duplicate each country's name in proportion to its share of the total immigration.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["max_words = 90\n","word_string = ''\n","for country in df_can.index.values:\n","    # check if country's name is a single-word name\n","    if len(country.split(' ')) == 1:\n","        # number of repetitions = the country's share of the total, scaled to max_words and truncated\n","        repeat_num_times = int(df_can.loc[country, 'Total']/float(total_immigration)*max_words)\n","        word_string = word_string + ((country + ' ') * repeat_num_times)\n","\n","# display the generated text\n","word_string"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We are not dealing with any stopwords here, so there is no need to pass them when creating the word cloud.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# create the word 
cloud\n","wordcloud = WordCloud(background_color='white').generate(word_string)\n","\n","print('Word cloud created!')"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# display the cloud\n","fig = plt.figure()\n","fig.set_figwidth(14)\n","fig.set_figheight(18)\n","\n","plt.imshow(wordcloud, interpolation='bilinear')\n","plt.axis('off')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["According to the above word cloud, it looks like the majority of the people who immigrated came from one of the 15 countries displayed in the word cloud. One cool visual that you could build is a word cloud superimposed on the map of Canada, using the map as a mask. That would be an interesting visual to build!\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Regression Plots <a id=\"10\"></a>\n","\n","> Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. 
You can learn more about _seaborn_ by following this [link](https://seaborn.pydata.org?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) and more about _seaborn_ regression plots by following this 
[link](http://seaborn.pydata.org/generated/seaborn.regplot.html?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["In lab _Pie Charts, Box Plots, Scatter Plots, and Bubble Plots_, we learned how to create a scatter plot and then fit a regression line. It took ~20 lines of code to create the scatter plot along with the regression fit. 
In this final section, we will explore _seaborn_ and see how efficient it is to create regression lines and fits using this library!\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's first install _seaborn_.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# install seaborn\n","!conda install -c anaconda seaborn --yes\n","\n","# import library\n","import seaborn as sns\n","\n","print('Seaborn installed and imported!')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Create a new dataframe that stores the total number of landed immigrants to Canada per year from 1980 to 2013.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# we can use the sum() method to get the total population per year\n","df_tot = pd.DataFrame(df_can[years].sum(axis=0))\n","\n","# change the years to type float (useful for regression later on)\n","df_tot.index = map(float, df_tot.index)\n","\n","# reset the index to put it back in as a column in the df_tot dataframe\n","df_tot.reset_index(inplace=True)\n","\n","# rename columns\n","df_tot.columns = ['year', 'total']\n","\n","# view the final dataframe\n","df_tot.head()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["With _seaborn_, generating a regression plot is as simple as calling the **regplot** function.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["import seaborn as sns\n","ax = sns.regplot(x='year', y='total', 
data=df_tot)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["This is not magic; it is _seaborn_! You can also customize the color of the scatter plot and regression line. Let's change the color to green.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["import seaborn as sns\n","ax = sns.regplot(x='year', y='total', data=df_tot, color='green')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["You can also customize the marker shape, so instead of circular markers, let's use '+'.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["import seaborn as sns\n","ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's enlarge the plot a little so that it is easier on the eyes.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["plt.figure(figsize=(15, 10))\n","ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["And let's increase the size of the markers so they match the new size of the figure, and add a title and x- and y-labels.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["plt.figure(figsize=(15, 10))\n","ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})\n","\n","ax.set(xlabel='Year', 
ylabel='Total Immigration') # add x- and y-labels\n","ax.set_title('Total Immigration to Canada from 1980 - 2013') # add title"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["And finally increase the font size of the tickmark labels, the title, and the x- and y-labels so they don't feel left out!\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["plt.figure(figsize=(15, 10))\n","\n","sns.set(font_scale=1.5)\n","\n","ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})\n","ax.set(xlabel='Year', ylabel='Total Immigration')\n","ax.set_title('Total Immigration to Canada from 1980 - 2013')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Amazing! A complete scatter plot with a regression fit with 5 lines of code only. 
Isn't this really amazing?\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["If you are not a big fan of the purple background, you can easily change the style to a white plain background.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["plt.figure(figsize=(15, 10))\n","\n","sns.set(font_scale=1.5)\n","sns.set_style('ticks') # change background to white background\n","\n","ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})\n","ax.set(xlabel='Year', ylabel='Total Immigration')\n","ax.set_title('Total Immigration to Canada from 1980 - 2013')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Or to a white background with gridlines.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["plt.figure(figsize=(15, 10))\n","\n","sns.set(font_scale=1.5)\n","sns.set_style('whitegrid')\n","\n","ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})\n","ax.set(xlabel='Year', ylabel='Total Immigration')\n","ax.set_title('Total Immigration to Canada from 1980 - 2013')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Question**: Use seaborn to create a scatter plot with a regression line to visualize the total immigration from Denmark, Sweden, and Norway to Canada from 1980 to 2013.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["### type your answer 
here\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<details><summary>Click here for a sample Python solution</summary>\n","\n","```python\n"," #The correct answer is:\n"," \n"," # create df_countries dataframe\n"," df_countries = df_can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()\n","\n"," # create df_total by summing across three countries for each year\n"," df_total = pd.DataFrame(df_countries.sum(axis=1))\n","\n"," # reset index in place\n"," df_total.reset_index(inplace=True)\n","\n"," # rename columns\n"," df_total.columns = ['year', 'total']\n","\n"," # change column year from string to int to create scatter plot\n"," df_total['year'] = df_total['year'].astype(int)\n","\n"," # define figure size\n"," plt.figure(figsize=(15, 10))\n","\n"," # define background style and font size\n"," sns.set(font_scale=1.5)\n"," sns.set_style('whitegrid')\n","\n"," # generate plot and add title and axes labels\n"," ax = sns.regplot(x='year', y='total', data=df_total, color='green', marker='+', scatter_kws={'s': 200})\n"," ax.set(xlabel='Year', ylabel='Total Immigration')\n"," ax.set_title('Total Immigration from Denmark, Sweden, and Norway to Canada from 1980 - 2013')\n","\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Thank you for completing this lab!\n","\n","## Author\n","\n","<a href=\"https://www.linkedin.com/in/aklson/\" target=\"_blank\">Alex Aklson</a>\n","\n","## Change Log\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ------------- | ---------------------------------- |\n","| 2020-11-03 | 2.1 | Lakshmi Holla | Changed URL of excel file |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","## <h3 align=\"center\"> © IBM Corporation 2020. 
All rights reserved. </h3>\n"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.3"},"widgets":{"state":{},"version":"1.1.2"}},"nbformat":4,"nbformat_minor":2}