{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"name": "i0p16a-dimensionality-reduction.ipynb",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/jandot/ab819eeda99d345f26f9aea042ff4086/copy-of-unsupervised-learning-dimensionality-reduction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# **Unsupervised Learning: Dimensionality Reduction**"
],
"metadata": {
"id": "_7px72X4OP9O"
}
},
{
"cell_type": "markdown",
"source": [
"As part of unsupervised learning, we will focus on Dimensionality Reduction and Clustering. This is the **Dimensionality Reduction** part.\n",
"\n",
"**Dimensionality reduction helps simplify high-dimensional data while preserving important structures and patterns, making it easier to visualize and analyze.**"
],
"metadata": {
"id": "afSWi6Ua_S2f"
}
},
{
"cell_type": "markdown",
"source": [
"# **1. Practical matters**"
],
"metadata": {
"id": "chf0GveEyrcr"
}
},
{
"cell_type": "markdown",
"source": [
"For this exercise, we will use colab to run this python notebook.\n",
"\n",
"**IMPORTANT: First make your own copy of this notebook, using File -> Save a copy in Drive/Github**\n",
"\n",
"\n",
"We will cover several popular methods for dimensionality reduction, including:\n",
"\n",
"1. Singular Value Decomposition (SVD)\n",
"2. Principal Component Analysis (PCA)\n",
"3. t-Distributed Stochastic Neighbor Embedding (t-SNE)\n",
"4. Uniform Manifold Approximation and Projection (UMAP)\n",
"\n",
"Each technique offers unique strengths and is suited to different types of data and analysis objectives.\n",
"\n",
"\n",
"---\n",
"<br>\n",
"\n",
"\n",
"We will mainly use the **Scikit-learn** Python library (https://scikit-learn.org/stable/), which provides all tools we need to do descriptive and predictive analysis on complex data.\n",
"\n",
"Scikit-learn not only offers **machine learning algorithms** but also comes with a **variety of built-in datasets that are perfect for learning and experimentation.**\n",
"\n",
"Scikit-learn contains a **`datasets`** module that provides several ways to load and generate data, we will use both of them:\n",
"\n",
"1. **Built-in Datasets**: Ready-to-use real-world datasets.\n",
"2. **Data Generators**: Functions to create synthetic datasets for specific patterns.\n",
"\n",
"These datasets typically return a special object, we will explain when we load it.\n",
"\n",
"\n",
"---\n",
"<br>\n",
"\n",
"\n",
"\n",
"<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png\" width=400px />\n",
"\n",
"**Before each exercise, make sure to check out the related documentation on the scikit-learn website!**\n",
"\n"
],
"metadata": {
"id": "yffLhMtpswUw"
}
},
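{
"cell_type": "markdown",
"source": [
"Here is a short sketch of both approaches from the `datasets` module; the datasets we will actually work with are introduced properly in section 1.2:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from sklearn import datasets\n",
"\n",
"# Built-in dataset: real measurements, fixed size\n",
"iris_bunch = datasets.load_iris()\n",
"print(iris_bunch.data.shape)   # (150, 4)\n",
"\n",
"# Data generator: synthetic points with a controllable pattern\n",
"X_blobs, y_blobs = datasets.make_blobs(n_samples=100, centers=3, random_state=0)\n",
"print(X_blobs.shape)           # (100, 2)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},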
{
"cell_type": "markdown",
"source": [
"## **1.1 Setting things up**"
],
"metadata": {
"id": "GHuaoW38Eztt"
}
},
{
"cell_type": "markdown",
"source": [
"See https://scikit-learn.org/stable/install.html for instructions on how to install the scikit-learn library. We will also introduce two powerful visualization libraries: **Seaborn** and **Vega-Altair**.\n",
"\n",
"\n",
"**Seaborn** is **built on top of Matplotlib** and provides a **high-level interface for creating attractive statistical graphics**. It comes with built-in themes and color palettes that make your visualizations look professional with minimal effort. Seaborn is particularly well-suited for working with pandas DataFrames and showing the relationship between variables.\n",
"\n",
"\n",
"**Vega-Altair** (or simply Altair) is a declarative statistical visualization library that offers a powerful and **flexible way to create interactive plots**. It follows the grammar of graphics principles and allows you to create complex visualizations with concise, intuitive code.\n",
"\n",
"\n",
"---\n",
"\n",
"<br>\n",
"<br>\n",
"<br>\n",
"Before we begin, let's ensure we have all the required libraries installed. We'll need several libraries for data manipulation, visualization, and dimensionality reduction:"
],
"metadata": {
"id": "_qVc1F0yFwUD"
}
},
{
"cell_type": "code",
"source": [
"try:\n",
" # Data manipulation libraries\n",
" import pandas as pd\n",
" import numpy as np\n",
"\n",
" # Visualization libraries\n",
" import matplotlib.pyplot as plt\n",
" import seaborn as sns\n",
" import altair as alt\n",
"\n",
" # Machine learning\n",
" from sklearn import datasets\n",
" from sklearn.preprocessing import LabelEncoder\n",
" from sklearn.preprocessing import StandardScaler\n",
" from sklearn.decomposition import TruncatedSVD\n",
" from sklearn.decomposition import PCA\n",
" from sklearn.manifold import TSNE\n",
" import umap\n",
"\n",
" print(\"All required libraries are already installed!\")\n",
"except ImportError as e:\n",
" print(f\"Missing library: {e}\")"
],
"metadata": {
"id": "SZlT4cEMgMIy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This is an example of using the pip command to install these libraries. If the cell above returns a \"Missing library\" error, run the cell below (`!pip install ...`) and then retry."
],
"metadata": {
"id": "_9JXi2yugz7k"
}
},
{
"cell_type": "code",
"source": [
"!pip install numpy pandas seaborn matplotlib altair scikit-learn umap-learn"
],
"metadata": {
"id": "IrSLx3kkF3dN",
"collapsed": true
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Make sure you re-import all packages**"
],
"metadata": {
"id": "KarGk4ZeRiv9"
}
},
{
"cell_type": "markdown",
"source": [
"## **1.2 Datasets**"
],
"metadata": {
"id": "bKPrfSIG3hmy"
}
},
{
"cell_type": "markdown",
"source": [
"We will use three datasets:\n",
"1. Fisher iris dataset\n",
"2. Swiss Roll\n",
"3. MNIST (Modified National Institute of Standards and Technology database)"
],
"metadata": {
"id": "b4N0wxHNAmDk"
}
},
{
"cell_type": "markdown",
"source": [
"### **1.2.1 Fisher Iris dataset**"
],
"metadata": {
"id": "8NVHv5nE3qPM"
}
},
{
"cell_type": "markdown",
"source": [
"The iris dataset (https://www.kaggle.com/datasets/uciml/iris) used in a paper by Sir Ronald Fisher in 1936 describes 150 iris flowers, equally distributed across 3 species: Iris setosa, Iris versicolor and Iris virginica. For each flower, the following features are recorded:\n",
"\n",
"1. sepal length\n",
"2. sepal width\n",
"3. petal length\n",
"4. petal width\n",
"5. species\n",
"\n",
"<img src=\"https://miro.medium.com/v2/resize:fit:1400/format:webp/1*f6KbPXwksAliMIsibFyGJw.png\" width=500px >\n"
],
"metadata": {
"id": "vhsRwdyUQtxJ"
}
},
{
"cell_type": "code",
"source": [
"df_iris = pd.read_csv('http://vda-lab.github.io/assets/iris.csv')"
],
"metadata": {
"id": "INjkMFds72aV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print the first 5 records and show the data type for each column\n",
"print(df_iris.head(5))\n",
"print(\"\\nData types of each column:\")\n",
"for col in df_iris.columns:\n",
" print(f'{col}: {df_iris[col].dtype}')"
],
"metadata": {
"id": "BlZnjzqeWjku"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Let's make a scatterplot matrix of the numerical features, coloured by species. This will give us a first idea of the data so that we know what we're working with.\n",
"\n",
"We can do this using a Seaborn `pairplot`. Seaborn is a plotting library in python. See here for more information about the pairplot: https://seaborn.pydata.org/tutorial/introduction.html"
],
"metadata": {
"id": "M_8k9U45Ehea"
}
},
{
"cell_type": "code",
"source": [
"sns.pairplot(df_iris, hue=\"species\")"
],
"metadata": {
"id": "KDVPDTIWQbrh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Since there is one column which is not numeric, we need to create a new data frame with only the numeric features."
],
"metadata": {
"id": "pJoPPEeRjbzs"
}
},
{
"cell_type": "code",
"source": [
"df_iris_numeric = df_iris.select_dtypes(include=[float, int])"
],
"metadata": {
"id": "z0cZV0u3uBAy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about these questions and put your answer on google forms:**\n",
"\n",
"https://forms.gle/BFJ2apm6WBZjEb6h6\n",
"\n",
"(keep this form open, you will need this later)\n",
"\n",
"•\t**E1**: Describe the scatterplot matrix. What patterns do you observe among the different species?\n",
"\n",
"•\t**E2**: Based on the scatterplots, which features seem to separate the species well?"
],
"metadata": {
"id": "C1KWHGGpSATX"
}
},
{
"cell_type": "markdown",
"source": [
"### **1.2.2 Swiss Roll**"
],
"metadata": {
"id": "PDsJWtFv3qYE"
}
},
{
"cell_type": "markdown",
"source": [
"The Swiss Rolls dataset is a simple 1600 x 3 dimension dataset. It consists of 1600 observations of 3 variables which represent 3 dimensional coordinates. The purpose of the data set was to use it for testing dimensionality reduction techniques. It gets its name due to the fact that the 3 dimensional version (which is generated by taking the coordinates in a 2 dimensional plot created from a Gaussian distribution and mapping it to a 3 dimensional table with (x,y) -> (x cos x, y, x sin x)) looks similar to the dessert snack, generically called \"Swiss Rolls\".\n",
"\n",
"\n"
],
"metadata": {
"id": "hSKpx7i-IQ4w"
}
},
{
"cell_type": "code",
"source": [
"# Let's generate our swiss roll dataset\n",
"sr_points, sr_color = datasets.make_swiss_roll(n_samples=1600, random_state=0)"
],
"metadata": {
"id": "7zBlLdDFQ3Ky"
},
"execution_count": null,
"outputs": []
},
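{
"cell_type": "markdown",
"source": [
"As an aside, to make the mapping above concrete, here is a minimal hand-rolled sketch of the same idea. The uniform parameter ranges below are illustrative assumptions, not the exact ones `make_swiss_roll` uses internally:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: build a Swiss roll by hand with the mapping (x, y) -> (x cos x, y, x sin x)\n",
"# NOTE: the uniform ranges below are illustrative assumptions\n",
"rng = np.random.default_rng(0)\n",
"x = rng.uniform(3, 12, size=1600)   # position along the spiral\n",
"y = rng.uniform(0, 21, size=1600)   # position across the width of the roll\n",
"manual_roll = np.column_stack([x * np.cos(x), y, x * np.sin(x)])\n",
"print(manual_roll[:5])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},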
{
"cell_type": "markdown",
"source": [
"Now we have two arrays: one is the coordinates of the Swiss roll in three-dimensional space, and the other is the color of each data point.\n",
"\n",
"Keep in mind the type of input data the library requires, as it could be either a DataFrame or an array. Be careful to use the correct format."
],
"metadata": {
"id": "ytBNZy9QYyLq"
}
},
{
"cell_type": "code",
"source": [
"print (\"The first 5 Swiss roll datapoints:\")\n",
"print (sr_points[:5])\n",
"print (\"\\nThe colour of those 5 datapoints:\")\n",
"print (sr_color[:5])"
],
"metadata": {
"id": "Y4QLchNJbdim"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Let plot this Siwss roll"
],
"metadata": {
"id": "yq5EI-I3cLff"
}
},
{
"cell_type": "code",
"source": [
"fig = plt.figure(figsize=(8, 6))\n",
"ax = fig.add_subplot(111, projection=\"3d\")\n",
"fig.add_axes(ax)\n",
"ax.scatter(\n",
" sr_points[:, 0], sr_points[:, 1], sr_points[:, 2], c=sr_color, s=50, alpha=0.8\n",
")\n",
"ax.set_title(\"Swiss Roll in Ambient Space\")\n",
"ax.view_init(azim=-66, elev=12)\n",
"_ = ax.text2D(0.8, 0.05, s=\"n_samples=1600\", transform=ax.transAxes)"
],
"metadata": {
"id": "pNSgwR5-YnpZ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Let's try to make a pairplot for this data as well."
],
"metadata": {
"id": "hiR7xC4zcbzr"
}
},
{
"cell_type": "code",
"source": [
"df_swiss_roll = pd.DataFrame(sr_points)\n",
"\n",
"# YOUR CODE HERE TO PLOT"
],
"metadata": {
"id": "Ve9YqEGRboXc"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about this question and put your answer on google forms:**:\n",
"\n",
"• **E3**: Does a pairplot for the Swiss roll data make sense? Why or why not?\n",
"\n",
"•\t**E4**: Is there any feature that could be used to interpret this dataset, like we did for the Iris dataset?"
],
"metadata": {
"id": "1mO6w5COSSCx"
}
},
{
"cell_type": "markdown",
"source": [
"### **1.2.3 MNIST**"
],
"metadata": {
"id": "qfTASK-s3qgo"
}
},
{
"cell_type": "markdown",
"source": [
"Beyond dataframes and arrays, there are some bigger and more complex data types used in practice to store images.\n",
"\n",
"The **MNIST** (Modified National Institute of Standards and Technology) dataset is a widely used benchmark in the field of machine learning and computer vision. **It consists of 70,000 grayscale images of handwritten digits, ranging from 0 to 9.** Each image is 28x28 pixels, resulting in a total of 784 pixels per image. The dataset is divided into a training set of 60,000 images and a test set of 10,000 images. Due to its simplicity and the extensive amount of research conducted using this dataset, MNIST serves as a standard for evaluating image classification algorithms, allowing researchers and developers to test and compare their models’ performance on a common task.\n",
"\n",
"**We will use a smaller version of mnist from `sklearn`**, which consists of 8*8 pixel images of digits, instead of 28x28 pixels to speed up. This dataset is available within `sklearn`. Let's start and see what this dataset contais:"
],
"metadata": {
"id": "21QRyrXZLiN-"
}
},
{
"cell_type": "code",
"source": [
"digits = datasets.load_digits()"
],
"metadata": {
"id": "U8UGbkgNRSSL"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The object returned by `datasets.load_digits()` in scikit-learn is a Bunch object, which is similar to a Python dictionary but with some added functionality. The **Bunch object** stores various components of the dataset.\n",
"\n"
],
"metadata": {
"id": "3s0y5x2ULBRL"
}
},
{
"cell_type": "code",
"source": [
"print('The type of object returns by `load_digits` is: ')\n",
"print(type(digits))\n",
"print('\\nDigits dictionary content \\n{}'.format(digits.keys()))\n",
"print('\\n')\n",
"for key in digits.keys():\n",
" print(f\"{key}: {type(digits[key])}\")"
],
"metadata": {
"id": "Gwy2wcgfo-Au"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Each dataset is well-documented and comes with metadata that helps understand its structure and purpose.\n",
"\n",
"\n",
"Let's visualize some examples from the digits dataset. We'll display the first **four handwritten digit images along with their corresponding labels**. Each image is shown in grayscale, where darker pixels indicate the pen strokes that form the digit."
],
"metadata": {
"id": "bkvytyQ5o6YV"
}
},
{
"cell_type": "code",
"source": [
"images_and_labels = list(zip(digits.images, digits.target))\n",
"for index, (image, label) in enumerate(images_and_labels[:4]):\n",
" plt.subplot(2, 4, index + 1)\n",
" plt.axis('off')\n",
" plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')\n",
" plt.title('Training: %i' % label)"
],
"metadata": {
"id": "SERC6PLTRNvY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about these questions and put your answer on google forms:**\n",
"\n",
"• **E5**. Does it make sense to create a `pairplot` for the MNIST dataset? Think about the number of dimensions we have and what they represent exactly.\n",
"\n",
"• **E6**. Given your answer for E5: is there any feature that can distinguish handwritten 0 and 1?\n",
"\n",
"\n",
"•\t**E7**. What challenges do high-dimensional datasets like MNIST pose for machine learning algorithms?"
],
"metadata": {
"id": "rx7ExTnBKAxt"
}
},
{
"cell_type": "markdown",
"source": [
"# **2. `scikit-learn` approach**"
],
"metadata": {
"id": "q10qOdPdqEDT"
}
},
{
"cell_type": "markdown",
"source": [
"In scikit-learn, `fit` and `fit_transform` are fundamental methods that you'll encounter frequently. Understanding these methods is crucial as they follow scikit-learn's consistent API design:\n",
"\n",
"\n",
"* `fit(X[, y])`: This method is used to train or compute the parameters of your model/transformer using your training data. **Think of it as the \"learning\" step where the algorithm learns from your data.**\n",
"* `transform(X)`: After fitting, this method applies the learned transformation to your data. **You can use it on both training data and new, unseen data**\n",
"* `fit_transform(X[, y])`: This is a convenience method that combines both fit and transform in one step. **It's equivalent to calling fit() followed by transform(), but is often more efficient**."
],
"metadata": {
"id": "udtM0r2cqeJ0"
}
},
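{
"cell_type": "markdown",
"source": [
"Here is a minimal sketch of this pattern using `StandardScaler` (imported above); the two code paths produce the same result:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Two-step: learn the column means and standard deviations, then apply them\n",
"scaler = StandardScaler()\n",
"scaler.fit(df_iris_numeric)\n",
"X_two_step = scaler.transform(df_iris_numeric)\n",
"\n",
"# One-step convenience method\n",
"X_one_step = StandardScaler().fit_transform(df_iris_numeric)\n",
"\n",
"print(np.allclose(X_two_step, X_one_step))  # True"
],
"metadata": {},
"execution_count": null,
"outputs": []
},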
{
"cell_type": "markdown",
"source": [
"# **3. Linear Dimensionality Reduction**"
],
"metadata": {
"id": "GTzLJHI96bKL"
}
},
{
"cell_type": "markdown",
"source": [
"Dimensionality reduction techniques can be broadly categorized into linear and nonlinear methods. Linear methods assume that the data lies approximately on a lower-dimensional linear subspace of the original high-dimensional space. Two of the most commonly used linear approaches are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), they both rely heavily on linear algebra techniques for their mathematical formulations and computations."
],
"metadata": {
"id": "IeCEzcCgMOQ2"
}
},
{
"cell_type": "markdown",
"source": [
"### **3.1 Singular value decomposition (SVD)**"
],
"metadata": {
"id": "-bj2N-0Q60A4"
}
},
{
"cell_type": "markdown",
"source": [
"SVD is linear dimensionality reduction technique closely related to PCA but can be more computationally efficient for certain types of data.\n",
"\n",
"SVD decomposes a matrix A (with dimensions m \\times n ) into three other matrices:\n",
"\n",
" A = $\\mathbf{U}$ $\\Sigma$ $\\mathbf{V}^*$\n",
"\n",
"\n",
"* $\\mathbf{U}$ : An $m \\times m$ orthogonal matrix whose columns are called the left singular vectors of A .\n",
"* $\\Sigma$ : An $m \\times n$ diagonal matrix that contains the singular values of A . These singular values are non-negative and typically arranged in descending order.\n",
"* $\\mathbf{V}^*$\n",
" : The transpose of an $n \\times n$ orthogonal matrix whose columns are called the right singular vectors of A.\n",
"The singular values in $\\Sigma$ give us a measure of the importance or “weight” of the corresponding singular vectors in U and V .\n",
"\n",
"\n",
"---\n",
"\n",
"Let's apply SVD for dimensionality reduction using **scikit-learn's TruncatedSVD** on the iris dataset:"
],
"metadata": {
"id": "dA88AH-JXTK-"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.decomposition import TruncatedSVD"
],
"metadata": {
"id": "QHbl43lNsI6K"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"svd = TruncatedSVD(n_components=2)\n",
"sklearn_svd = svd.fit_transform(df_iris_numeric)"
],
"metadata": {
"id": "1zpzqxxzZp2R"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This code performs two main operations:\n",
"\n",
"* First, `fit()` computes the SVD decomposition:\n",
"\n",
"* Finds the singular vectors (U and V^T matrices), calculates the singular values (Σ matrix), and only keeps the top 2 components (as specified by n_components=2)\n",
"\n",
"* Then, `transform()` projects the original data onto these components:\n",
" taking the original data (df_iris_numeric) with 4 features, projects these onto the 2 most important singular vectors, and returns a new array (sklearn_svd) with just 2 dimensions.\n",
"\n",
"The `fit_transform()` method combines these steps efficiently into a single operation. The resulting sklearn_svd array will have the same number of samples as df_iris_numeric but only 2 features, making it suitable for visualization while preserving the most important patterns in the data."
],
"metadata": {
"id": "Y-vqr3QX5WMM"
}
},
{
"cell_type": "code",
"source": [
"# Now let's examine the components of SVD\n",
"VT = svd.components_ # Right singular vectors\n",
"Sigma = svd.singular_values_ # Singular values\n",
"\n",
"# To get U, we need to do more work\n",
"# We know that fit_transform(X) return U*Sigma\n",
"U_Sigma = svd.transform(df_iris_numeric)\n",
"U = U_Sigma/svd.singular_values_\n",
"\n",
"# Print the matrices to understand the decomposition\n",
"print(\"U matrix (left singular vectors): first 5 rows\")\n",
"print(U[:5])\n",
"\n",
"print(\"\\nSigma matrix (diagonal matrix of singular values):\")\n",
"print(np.diag(Sigma))\n",
"\n",
"print(\"\\nV^T matrix (right singular vectors):\")\n",
"print(VT)"
],
"metadata": {
"id": "t17AfGYfbWRO",
"collapsed": true
},
"execution_count": null,
"outputs": []
},
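{
"cell_type": "markdown",
"source": [
"As a sanity check on the decomposition (a sketch reusing the `U`, `Sigma` and `VT` variables from the previous cell): multiplying the three factors back together gives a rank-2 approximation of the original data."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Rank-2 reconstruction: A is approximately U @ diag(Sigma) @ V^T\n",
"A_approx = U @ np.diag(Sigma) @ VT\n",
"\n",
"# With only 2 of the 4 components kept, the reconstruction is approximate\n",
"print('Max absolute reconstruction error:',\n",
"      np.abs(df_iris_numeric.values - A_approx).max())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},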
{
"cell_type": "code",
"source": [
"# Let's see how much variance is explained by each component\n",
"explained_variance_ratio = svd.explained_variance_ratio_\n",
"print(\"\\nExplained variance ratio:\", explained_variance_ratio)\n",
"print(f\"Total variance explained: {sum(explained_variance_ratio)*100:.2f}%\")"
],
"metadata": {
"id": "U52tSlqSv0LN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Key points about SVD for dimensionality reduction:\n",
"\n",
"* The singular values in Σ tell us how important each\n",
"* The explained variance ratio helps us understand how much information we retain\n",
"* The reduced data can be used for visualization or as input to other algorithms\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "xTYlK96hxN8S"
}
},
{
"cell_type": "code",
"source": [
"# Visualize the reduced data\n",
"plt.figure(figsize=(10, 6))\n",
"\n",
"# add plot data\n",
"df_iris['svd1'] = sklearn_svd[:, 0]\n",
"df_iris['svd2'] = sklearn_svd[:, 1]\n",
"\n",
"sns.scatterplot(data=df_iris, x='svd1', y='svd2', hue=df_iris.species,\n",
" palette='Set1', s=100, edgecolor='w', alpha=0.7)\n",
"\n",
"plt.xlabel('First component')\n",
"plt.ylabel('Second component')\n",
"plt.title('Iris Dataset Reduced to 2D Using SVD')\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "YTW4NkCwv6uq"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"In the code above, we wrote `df_iris['svd1'] = something`. What we basically do here, is add new features to the `df_iris` dataset. Have a look at the updated dataset after that addition:"
],
"metadata": {
"id": "3BtEN7rEt05j"
}
},
{
"cell_type": "code",
"source": [
"df_iris"
],
"metadata": {
"id": "6ZsetX-_uETE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Try this and put your answer on google forms:**\n",
"\n",
"\n",
"* E8. Can you try to plot this scatterplot by setting the colour like this: `hue=df_iris.species`? What happens? And why?\n",
"\n",
"\n"
],
"metadata": {
"id": "9RVxpPMCyPqf"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"\n",
"**Alternatives**:\n",
"\n",
"We can also use **`altair`** to plot. This is a python library that uses the **vega** API. See here for more information: https://altair-viz.github.io/"
],
"metadata": {
"id": "w1egmyJXXK6i"
}
},
{
"cell_type": "code",
"source": [
"# Create an interactive scatter plot\n",
"chart = alt.Chart(df_iris).mark_circle(size=60).encode(\n",
" x=alt.X('svd1', title='First Component'),\n",
" y=alt.Y('svd2', title='Second Component'),\n",
" color=alt.Color('species', title='Species Type'),\n",
" tooltip=['species'] # Add tooltip for interactivity\n",
").properties(\n",
" width=500,\n",
" height=400,\n",
" title='Iris Dataset Reduced to 2D Using SVD'\n",
").configure_title(\n",
" fontSize=16,\n",
" anchor='middle'\n",
").interactive()\n",
"\n",
"# Display the chart\n",
"chart"
],
"metadata": {
"id": "JRydSJpqXHeH"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"\n",
"SVD can also be done using **`numpy`** instead of scikit-learn:"
],
"metadata": {
"id": "FVIA1Y9S2Q0g"
}
},
{
"cell_type": "code",
"source": [
"X_standardized = StandardScaler().fit_transform(df_iris_numeric)\n",
"\n",
"# Perform SVD using numpy\n",
"U, Sigma, VT = np.linalg.svd(X_standardized, full_matrices=False)\n",
"\n",
"# Keep only 2 components (equivalent to n_components=2 in TruncatedSVD)\n",
"n_components = 2\n",
"U_numpy = U[:, :n_components]\n",
"Sigma_numpy = Sigma[:n_components]\n",
"VT_numpy = VT[:n_components, :]\n",
"\n",
"# Transform the data (equivalent to fit_transform in sklearn)\n",
"numpy_svd = U_numpy * Sigma_numpy\n",
"\n",
"# Print the matrices to understand the decomposition\n",
"print(\"U matrix (left singular vectors):\")\n",
"print(U_numpy[:5])\n",
"\n",
"print(\"\\nSigma matrix (diagonal matrix of singular values):\")\n",
"print(np.diag(Sigma_numpy))\n",
"\n",
"# Calculate explained variance ratio (equivalent to explained_variance_ratio_ in sklearn)\n",
"explained_variance = (Sigma ** 2) / (len(X_standardized) - 1)\n",
"total_variance = explained_variance.sum()\n",
"explained_variance_ratio = explained_variance / total_variance\n",
"\n",
"print(\"\\nExplained variance ratio:\", explained_variance_ratio[:n_components])\n",
"print(f\"Total variance explained: {sum(explained_variance_ratio[:n_components])*100:.2f}%\")\n",
"\n",
"print(\"\\nV^T matrix (right singular vectors):\")\n",
"print(VT_numpy)\n",
"\n",
"# Visualize the reduced data\n",
"plt.figure(figsize=(10, 6))\n",
"\n",
"df_iris['svd1_numpy'] = numpy_svd[:, 0]\n",
"df_iris['svd2_numpy'] = numpy_svd[:, 1]\n",
"\n",
"sns.scatterplot(data=df_iris, x='svd1_numpy', y='svd2_numpy', hue='species',\n",
" palette='Set1', s=100, edgecolor='w', alpha=0.7)\n",
"\n",
"plt.xlabel('First component')\n",
"plt.ylabel('Second component')\n",
"plt.title('Iris Dataset Reduced to 2D Using NumPy SVD')\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "nH-l5WEg2QMW",
"collapsed": true
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### **3.2 Principal component analysis (PCA)**"
],
"metadata": {
"id": "ef3veY7A6yhD"
}
},
{
"cell_type": "markdown",
"source": [
"PCA involves finding the principal components of the data, which are essentially the eigenvectors of the covariance matrix. The computation of eigenvalues and eigenvectors, a core part of PCA, is a fundamental concept in linear algebra.\n",
"\n",
"PCA transforms the data into a new coordinate system by projecting it onto the directions (principal components) that maximize variance. This transformation is done using matrix operations, another key element of linear algebra.\n",
"\n",
"\n",
"---\n",
"\n",
"\n",
"Let's initialize a PCA object that will reduce the dimensionality of your data to 2 components."
],
"metadata": {
"id": "VuehpLGFNNMY"
}
},
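{
"cell_type": "markdown",
"source": [
"Before we do that, here is a quick `numpy` sketch of the eigenvector idea above: the principal components are the eigenvectors of the covariance matrix (up to sign). You can compare these numbers with `pca.explained_variance_` and `pca.components_` once the PCA is fitted below."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: principal components as eigenvectors of the covariance matrix\n",
"X = df_iris_numeric.values\n",
"X_centered = X - X.mean(axis=0)\n",
"\n",
"cov = np.cov(X_centered, rowvar=False)   # 4x4 covariance matrix\n",
"eigvals, eigvecs = np.linalg.eigh(cov)   # eigh is for symmetric matrices\n",
"\n",
"# eigh returns eigenvalues in ascending order; flip to descending\n",
"eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]\n",
"\n",
"print('Eigenvalues (variance along each component):', eigvals)\n",
"print('First eigenvector (direction of PC1, up to sign):', eigvecs[:, 0])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},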
{
"cell_type": "code",
"source": [
"from sklearn.decomposition import PCA"
],
"metadata": {
"id": "q6aCxR5rsM8Y"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"pca = PCA(n_components=2)"
],
"metadata": {
"id": "sJlOAq7uV73N"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"pca_transformed = pca.fit_transform(df_iris_numeric)"
],
"metadata": {
"id": "aMxBTlCG5GSu"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This function first “fits” the PCA model to your data (df_iris_numeric), meaning it computes the principal components.\n",
"\n",
"After fitting, it “transforms” the data, projecting it from its original higher-dimensional space into a new 2-dimensional space"
],
"metadata": {
"id": "DFHt-Vnq4-AB"
}
},
{
"cell_type": "markdown",
"source": [
"Let's plot the 2D PCA projection of the 4D iris data:"
],
"metadata": {
"id": "PBiXgSyBNlOb"
}
},
{
"cell_type": "code",
"source": [
"# you can get the X and Y to plot pca projection\n",
"df_iris['pca1'] = pca_transformed[:, 0]\n",
"df_iris['pca2'] = pca_transformed[:, 1]"
],
"metadata": {
"id": "23bDUJFtWm6t"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Plot using Seaborn\n",
"plt.figure(figsize=(8, 6))\n",
"sns.scatterplot(data=df_iris, x='pca1', y='pca2', hue='species',\n",
" palette='Set1', s=100, edgecolor='w', alpha=0.7)\n",
"\n",
"# Add labels and title\n",
"plt.xlabel('Principal Component 1')\n",
"plt.ylabel('Principal Component 2')\n",
"plt.title('PCA of Iris Data')\n",
"\n",
"# Show the plot\n",
"plt.show()"
],
"metadata": {
"id": "dqVAEN_BtW9G"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Plot using altair\n",
"pca_chart = alt.Chart(df_iris).mark_circle(\n",
" size=100,\n",
" opacity=0.7\n",
").encode(\n",
" x=alt.X('pca1', title='Principal Component 1'),\n",
" y=alt.Y('pca2', title='Principal Component 2'),\n",
" color=alt.Color('species',\n",
" title='Species',\n",
" scale=alt.Scale(scheme='set1')),\n",
" tooltip=['species']\n",
").properties(\n",
" width=600,\n",
" height=400,\n",
" title='PCA of Iris Data'\n",
")\n",
"\n",
"pca_chart"
],
"metadata": {
"id": "SwW7RmYsYi0L"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**In practice, you can color your datapoint by any feature, it will help you to explore data.**"
],
"metadata": {
"id": "v1rLJlX0vUq5"
}
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "wN30_2enVAjN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Review how PCA works and then interpret the principal components in the plot\n",
"\n",
"pca_test = PCA(n_components=4)\n",
"pca_test_transformed = pca_test.fit_transform(df_iris_numeric)\n",
"\n",
"eigenvalues = pca_test.explained_variance_\n",
"explained_variance_ratio = pca_test.explained_variance_ratio_\n",
"\n",
"# proportion of variance explained by each principal component\n",
"variance_explained = pca_test.explained_variance_ratio_\n",
"\n",
"# print the eigenvalues\n",
"print(\"Eigenvalues (variance explained by each PC):\", eigenvalues)\n",
"\n",
"# Print the explained variance ratio\n",
"print(\"Explained variance ratio:\", variance_explained)\n",
"\n",
"# Plot the explained variance to visualize eigenvalues decay\n",
"plt.figure(figsize=(8, 5))\n",
"plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o', linestyle='-')\n",
"plt.title('Eigenvalues Decay (Variance Explained by Each Principal Component)')\n",
"plt.xlabel('Principal Component')\n",
"plt.ylabel('Eigenvalue')\n",
"plt.show()"
],
"metadata": {
"id": "qW6Qg4oKWteP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about these questions and put your answer on google forms:**\n",
"\n",
"* **E9**. What do the principal components represent in this plot?\n",
"\n",
"* **E10**. How do you interpret the PCA projection plot?\n",
"\n",
"* **E11**. How do you interpret the decay of the eigenvalues in the plot?\n",
"\n",
"* **E12**. What does this tell you about the intrinsic dimensionality of the Iris dataset?\n",
"\n",
"* **E13**. Can you apply PCA to reduce its dimensionality to 3 components? or even more? Then how do you plot it?\n",
"\n",
"* **E14**. How much variance is explained by the first two principal components?"
],
"metadata": {
"id": "8SGa7OYHPpAV"
}
},
{
"cell_type": "markdown",
"source": [
"There is important concept in PCA, which is `Loadings`, which are the coefficients (weights) of the original variables in the principal components. They indicate how much each original variable contributes to each principal component."
],
"metadata": {
"id": "2x9B2pMmlnDE"
}
},
{
"cell_type": "code",
"source": [
"# Check first component's loading, and what kind of information you get?\n",
"loadings = pca.components_\n",
"# Convert loadings to a DataFrame for better readability\n",
"loadings_df = pd.DataFrame(loadings.T, columns=[f'PC{i+1}' for i in range(loadings.shape[0])], index=df_iris_numeric.columns)\n",
"\n",
"# Print the original `pca.components_`\n",
"print(\"PCA components (i.e. loadings):\")\n",
"print(pca.components_)\n",
"\n",
"# Print the loadings for the first principal component (PC1)\n",
"print(\"Loadings for PC1:\\n\")\n",
"print(loadings_df['PC1'])"
],
"metadata": {
"id": "Bpc6ke8OSjpJ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"A biplot is a type of plot that simultaneously displays the scores of the observations (data points) and the loadings of the variables (features) in a principal component analysis (PCA). This can help you visualize the principal components along with the contribution of each original feature to these components."
],
"metadata": {
"id": "9GXUaGXfTpxp"
}
},
{
"cell_type": "code",
"source": [
"# Create a biplot\n",
"plt.figure(figsize=(8, 6))\n",
"sns.scatterplot(data=df_iris, x='pca1', y='pca2', hue='species', palette='Set1', s=100, edgecolor='w', alpha=0.7)\n",
"\n",
"# Plot the loadings (arrows for the original features)\n",
"for i, feature in enumerate(df_iris_numeric):\n",
" plt.arrow(0, 0,\n",
" pca.components_[0, i] * max(df_iris['pca1']),\n",
" pca.components_[1, i] * max(df_iris['pca2']),\n",
" alpha=0.5, width=0.02)\n",
" plt.text(pca.components_[0, i] * max(df_iris['pca1']) * 1.15,\n",
" pca.components_[1, i] * max(df_iris['pca2']) * 1.15,\n",
" feature, ha='center', va='center')\n",
"\n",
"# Label the plot\n",
"plt.xlabel('Principal Component 1')\n",
"plt.ylabel('Principal Component 2')\n",
"plt.title('Biplot of Iris Dataset')\n",
"\n",
"# Add grid and legend\n",
"plt.legend(title='Species')\n",
"plt.show()"
],
"metadata": {
"id": "tQlwzxdmXNou"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Extra exercise when you finish the rest of the exercise:**\n",
"\n",
"Let's create a PCA object that reduces data into 3 components.\n",
"\n",
"Important note:\n",
"While reducing the iris dataset (which has 4 features) to 3 components might seem unnecessary, this example helps illustrate a key point: Dimensionality Reduction (DR) techniques are versatile tools that can be used for various purposes, **not just for visualizing high-dimensional data in 2D**!"
],
"metadata": {
"id": "2IeVqzYgaDIm"
}
},
{
"cell_type": "code",
"source": [
"# YOUR CODE"
],
"metadata": {
"id": "nSfbuyJybgVy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# **4. Non-Linear Dimensionality Reduction**"
],
"metadata": {
"id": "8sNvcW9D6iw_"
}
},
{
"cell_type": "markdown",
"source": [
"### **4.1 Does linear algebra always work in any datasets?**\n",
"\n",
"Can you try to apply PCA on swiss row dataset, and do above analysis by yourself."
],
"metadata": {
"id": "p_Lb3le0VAlE"
}
},
{
"cell_type": "code",
"source": [
"# YOUR CODE"
],
"metadata": {
"id": "tteqBCy9vISt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Can you try to apply PCA on MNIST dataset by yourself?"
],
"metadata": {
"id": "Hxn6y8I7V3ls"
}
},
{
"cell_type": "code",
"source": [
"# YOUR CODE"
],
"metadata": {
"id": "ZVVHfXJmy0nE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about these questions and put your answer on google forms:**\n",
"\n",
"\n",
"* **E15**. Are PCA’s projections on these two datasets meaningful? Is there any useful pattern?\n",
"* **E16**. Does t-SNE Component 1, Component 2 represent the similar meaning as PCA's components?\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "qi0LaSWGpxvx"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"\n",
"\n",
"In the PCA projection, points that are actually close together on the manifold (e.g., along the same part of the roll) may be far apart in the PCA plot, and vice versa. PCA assumes that the data lies approximately on a lower-dimensional linear subspace of the original space. The Swiss roll dataset, however, is inherently non-linear. The data points lie on a two-dimensional manifold that has been “rolled” into a spiral in three-dimensional space. This structure cannot be adequately captured by a linear projection.\n",
"\n",
"PCA will try to find the two principal components that maximize the variance in the data. However, due to the rolled structure, PCA will end up choosing directions that do not meaningfully unroll the dataset into two dimensions. Instead, it will flatten the Swiss roll onto a plane, losing the manifold structure and making points that are distant in the original space appear closer in the PCA-transformed space."
],
"metadata": {
"id": "5IAHOSi2kmOz"
}
},
{
"cell_type": "markdown",
"source": [
"### **4.2 What is a manifold**"
],
"metadata": {
"id": "Bms2NyMn6ss6"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"> In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point. More precisely, an {n}-dimensional manifold, or {n}-manifold for short, is a topological space with the property that each point has a neighborhood that is homeomorphic to an open subset of {n}-dimensional Euclidean space. One-dimensional manifolds include lines and circles, but not self-crossing curves such as a figure 8. Two-dimensional manifolds are also called surfaces. Examples include the plane, the sphere, and the torus, and also the Klein bottle and real projective plane.\n",
"\n",
"<div align=\"right\">-- from [Wikipedia](https://en.wikipedia.org/wiki/Manifold)</div>\n",
"\n",
"\n",
"Take one example from the Swiss roll dataset, it is a synthetic three-dimensional dataset that consists of points lying on a two-dimensional manifold that is rolled up in a spiral shape in three-dimensional space. This means that, locally, around every small neighborhood of a point on the Swiss roll, the surface looks like a flat, two-dimensional plane.\n",
"\n",
"In the PCA projection, **PCA cannot capture the non-linear manifold structure** of the Swiss roll because it only considers linear combinations of the features. It does not account for the fact that the data is rolled up in 3D space."
],
"metadata": {
"id": "D_KAMsVjTlHb"
}
},
{
"cell_type": "code",
"source": [
"# Plotting a line\n",
"plt.figure(figsize=(12, 4))\n",
"\n",
"plt.subplot(1, 3, 1)\n",
"x = np.linspace(-1, 1, 100)\n",
"y = x\n",
"plt.plot(x, y, label='Line')\n",
"plt.title('Line')\n",
"plt.axis('equal')\n",
"\n",
"# Plotting a circle\n",
"plt.subplot(1, 3, 2)\n",
"theta = np.linspace(0, 2 * np.pi, 100)\n",
"x = np.cos(theta)\n",
"y = np.sin(theta)\n",
"plt.plot(x, y, label='Circle')\n",
"plt.title('Circle')\n",
"plt.axis('equal')\n",
"\n",
"# Plotting a figure 8 (not a manifold)\n",
"plt.subplot(1, 3, 3)\n",
"t = np.linspace(0, 2 * np.pi, 100)\n",
"x = np.sin(t)\n",
"y = np.sin(t) * np.cos(t)\n",
"plt.plot(x, y, label='Figure 8')\n",
"plt.title('Figure 8 (Not a Manifold)')\n",
"plt.axis('equal')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
],
"metadata": {
"id": "UpB_hvoXX0cs"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Despite being embedded in three-dimensional space, if we were to “unroll” or “flatten” the Swiss roll, it would become a simple two-dimensional rectangle. This illustrates that the Swiss roll has only two intrinsic dimensions (width and length), not three."
],
"metadata": {
"id": "jhGOugTSsnvv"
}
},
{
"cell_type": "code",
"source": [
"# let's see again this Swiss Roll\n",
"\n",
"fig = plt.figure(figsize=(8, 6))\n",
"ax = fig.add_subplot(111, projection=\"3d\")\n",
"fig.add_axes(ax)\n",
"ax.scatter(\n",
" sr_points[:, 0], sr_points[:, 1], sr_points[:, 2], c=sr_color, s=50, alpha=0.8\n",
")\n",
"ax.set_title(\"Swiss Roll in Ambient Space\")\n",
"ax.view_init(azim=-66, elev=12)\n",
"_ = ax.text2D(0.8, 0.05, s=\"n_samples=1600\", transform=ax.transAxes)"
],
"metadata": {
"id": "yM1wJC6GX3my"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### **4.3 Distance Metric**"
],
"metadata": {
"id": "-Kunf_NAtpZj"
}
},
{
"cell_type": "markdown",
"source": [
"How does nonlinear dimensionality reduction work with distance metrics?\n",
"\n",
"In nonlinear dimensionality reduction algorithms, measuring distances locally is crucial. The choice of the distance metric can significantly affect the performance of these algorithms, as they often rely on capturing the relationships between nearby data points. When data is projected into a lower-dimensional space, preserving local structure (i.e., distances between neighboring points) is essential to ensure that the reduced representation reflects the true geometry of the original data.\n",
"\n",
"**Common Distance Metrics**:\n",
"\n",
"\n",
"* Euclidean Distance: The most commonly used metric, Euclidean distance measures the “straight-line” distance between two points in a Euclidean space. It assumes that the data lies in a space where this straight-line measurement accurately reflects the relationships between points. Euclidean distance is widely applicable, especially when the geometry of the data matters.\n",
"* Cosine Distance: Cosine distance measures the cosine of the angle between two vectors. This metric is particularly useful in high-dimensional spaces (e.g., text analysis or word embeddings), where the magnitude of the vectors is less important than their direction. Cosine distance helps capture the similarity between objects based on their orientation rather than absolute magnitude.\n",
"\n",
"* Jaccard Distance: Used to measure dissimilarity between two sets, Jaccard distance is useful for binary or categorical data. It quantifies the ratio of non-overlapping elements in the two sets. This metric is especially helpful in scenarios like comparing user preferences or attributes of items in recommendation systems.\n",
"\n",
"**Special Distance Metrics**:\n",
"\n",
"* Kullback-Leibler (KL) Divergence:\n",
"Also called relative entropy, KL divergence measures the difference between two probability distributions. It is especially useful in machine learning tasks involving probability distributions, such as in information theory or Bayesian models. Unlike symmetric distance measures, KL divergence is not a true distance metric because it is asymmetric (i.e., KL(p||q) \\neq KL(q||p)).\n",
"* UniFrac: This is a specialized distance metric designed for comparing biological communities. It takes into account the phylogenetic relationships between species, allowing the comparison of communities based on their evolutionary distances. It’s frequently used in ecological and microbiome studies.\n",
"\n",
"**Similarity Metrics**:\n",
"*\tBray-Curtis Dissimilarity: While not strictly a distance metric, Bray-Curtis is a measure of dissimilarity often used to compare the composition of two different sites (or samples) based on counts of species or other features. It’s particularly useful in environmental science and ecology.\n",
"\n",
"Choosing the Right Metric\n",
"\n",
"The selection of an appropriate distance metric is key to ensuring that the dimensionality reduction technique accurately captures the structure of your data. Different distance metrics will emphasize different aspects of the data:\n",
"\n",
"* Geometric relationships (e.g., Euclidean distance) might be best when physical proximity is important.\n",
"* Orientation or similarity (e.g., cosine distance) is more suitable for high-dimensional spaces or vector-based comparisons.\n",
"* Set-based comparisons (e.g., Jaccard distance) are useful for categorical or binary data.\n",
"\n",
"When choosing a distance metric, it’s important to consider the nature of your data and the goals of your analysis. The right metric will ensure that your dimensionality reduction not only simplifies the data but also preserves the relationships that are most relevant to your problem."
],
"metadata": {
"id": "zWX6_JcWtt5n"
}
},
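{
"cell_type": "markdown",
"source": [
"Here is a quick numeric sketch of a few of these metrics using `scipy`; the vectors are made up purely for illustration:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from scipy.spatial import distance\n",
"from scipy.stats import entropy\n",
"\n",
"a = np.array([1.0, 0.0, 2.0, 1.0])\n",
"b = np.array([0.0, 1.0, 2.0, 1.0])\n",
"\n",
"print('Euclidean:', distance.euclidean(a, b))\n",
"print('Cosine:   ', distance.cosine(a, b))\n",
"print('Jaccard:  ', distance.jaccard(a > 0, b > 0))  # on the binarized vectors\n",
"\n",
"# KL divergence is asymmetric: KL(p||q) != KL(q||p)\n",
"p = np.array([0.6, 0.3, 0.1])\n",
"q = np.array([0.2, 0.5, 0.3])\n",
"print('KL(p||q):', entropy(p, q))\n",
"print('KL(q||p):', entropy(q, p))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},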
{
"cell_type": "markdown",
"source": [
"### **4.4 t-SNE**"
],
"metadata": {
"id": "CsPNS6wDrg56"
}
},
{
"cell_type": "markdown",
"source": [
"t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular non-linear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton in 2008 ([original paper](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)). It is especially well-suited for visualizing high-dimensional datasets by reducing them to two or three dimensions while preserving the local structure of the data.\n",
"\n",
"t-SNE calculates a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a higher probability of being chosen. It then defines a similar probability distribution for the points in the lower-dimensional embedding. The algorithm tries to minimize the difference (Kullback-Leibler divergence) between these two probability distributions, effectively positioning similar points together in the low-dimensional space."
],
"metadata": {
"id": "fWL5zbCwxsSE"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.manifold import TSNE"
],
"metadata": {
"id": "kXlkUP-kwuns"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"X_embedded = TSNE(n_components=2, learning_rate='auto',init='random', perplexity=30).fit_transform(df_iris_numeric)\n",
"tsne1 = list(map(lambda x:x[0], X_embedded))\n",
"tsne2 = list(map(lambda x:x[1], X_embedded))\n",
"df_iris['tsne1'] = tsne1\n",
"df_iris['tsne2'] = tsne2"
],
"metadata": {
"id": "lwv7gMOgwwUX"
},
"execution_count": null,
"outputs": []
},
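{
"cell_type": "markdown",
"source": [
"The KL divergence that t-SNE minimizes is exposed by scikit-learn after fitting. A small sketch; here we keep the estimator object so we can read the attribute:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"tsne_model = TSNE(n_components=2, learning_rate='auto', init='random',\n",
"                  perplexity=30, random_state=42)\n",
"X_embedded_check = tsne_model.fit_transform(df_iris_numeric)\n",
"\n",
"# Final value of the objective that was minimized\n",
"print('KL divergence after optimization:', tsne_model.kl_divergence_)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},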
{
"cell_type": "code",
"source": [
"sns.scatterplot(data=df_iris,x='tsne1',y='tsne2',hue=\"species\")"
],
"metadata": {
"id": "NyJfaJ-DX8bo"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Plot using altair\n",
"tsne_chart = alt.Chart(df_iris).mark_circle(\n",
" size=100,\n",
" opacity=0.7\n",
").encode(\n",
" x=alt.X('tsne1', title='t-SNE Component 1'),\n",
" y=alt.Y('tsne2', title='t-SNE Component 2'),\n",
" color=alt.Color('species:N',\n",
" title='Species',\n",
" scale=alt.Scale(scheme='set1')),\n",
" tooltip=['species', 'sepal_length', 'sepal_width','petal_length','petal_width']\n",
").properties(\n",
" width=600,\n",
" height=400,\n",
" title='t-SNE Visualization of Iris Dataset'\n",
")\n",
"\n",
"tsne_chart"
],
"metadata": {
"id": "DmbfFSOhcemF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Apply t-SNE to Swiss roll dataset by yourself\n",
"# YOUR CODE"
],
"metadata": {
"id": "RhcOmqcY0-1u"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Apply t-SNE to MNIST dataset by yourself\n",
"# YOUR CODE\n",
"# tip: You can assign one of key in Bunch object like this: X = digits.data"
],
"metadata": {
"id": "RJ4jwwj-wYGj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"t-SNE is sensitive to its parameters (like perplexity and learning rate), and different settings can produce different visualizations.\n",
"\n",
"Check sklearn's [documentation for t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) , try to play around with different parameters and see how plot change.\n",
"\n",
"Select at least 3 different perplexity values and see if the results are the same.\n",
"\n",
"You can use perplexity values in the range (5 - 50) suggested by van der Maaten & Hinton.\n",
"But you are encouraged to choose perplextiy at a bigger range in this exercise, just get yourself familer with it."
],
"metadata": {
"id": "OjX2sFPGzd1w"
}
},
{
"cell_type": "code",
"source": [
"# perplexity_1\n",
"# YOUR CODE"
],
"metadata": {
"id": "bfodyuNydRxI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# perplexity_2\n",
"# YOUR CODE"
],
"metadata": {
"id": "ILHQ-wqqdUHH"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# perplexity_3\n",
"# YOUR CODE"
],
"metadata": {
"id": "-wK9q74RdVpe"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about these questions and put your answer on google forms:**\n",
"\n",
"\n",
"* **E17**. We are using scatterplot to plot this two dimentions projection, but is that a real scatterplot?\n",
"* **E18**. After playing different `perplexity`, does cluster always keep the same sizes?\n",
"* **E19**. After playing different `perplexity`, does distance between clusters represent the real distance between them?\n",
"* **E20**. After playing different `perplexity`, does noise always a noise?\n"
],
"metadata": {
"id": "AQVMcAeg3Nu-"
}
},
{
"cell_type": "markdown",
"source": [
"Are you always scrolling to compare plots? Let's combine them in a single plot."
],
"metadata": {
"id": "uY8dIableAXV"
}
},
{
"cell_type": "code",
"source": [
"# Create label encoder\n",
"le = LabelEncoder()\n",
"color_labels = le.fit_transform(df_iris.species)"
],
"metadata": {
"id": "zAH7NeJ84VKF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# add three different perplexity values\n",
"perplexities = [ ]\n",
"\n",
"# Create subplots (1 row, 3 columns)\n",
"fig, axes = plt.subplots(1, 3, figsize=(20, 6))\n",
"\n",
"# Loop through each perplexity, fit t-SNE, and plot\n",
"for i, perplexity in enumerate(perplexities):\n",
" tsne = TSNE(n_components=2, perplexity=perplexity, max_iter=1000, random_state=42, metric='cosine')\n",
" X_tsne = tsne.fit_transform(df_iris_numeric)\n",
"\n",
" scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1], c=color_labels, cmap='Spectral', alpha=0.7)\n",
" axes[i].set_title(f't-SNE with Perplexity {perplexity}')\n",
" plt.gca().axes.get_xaxis().set_visible(False)\n",
" plt.gca().axes.get_yaxis().set_visible(False)\n",
"\n",
"# Show the plots\n",
"plt.tight_layout()\n",
"plt.show()"
],
"metadata": {
"id": "umauf70kd-wU",
"collapsed": true
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The 'metric' parameter of TSNE could be a str among {'braycurtis', 'seuclidean', 'minkowski', 'sokalmichener', 'chebyshev', 'sqeuclidean', 'mahalanobis', 'haversine', 'canberra', 'jaccard', 'cityblock', 'nan_euclidean', 'matching', 'l1', 'precomputed', 'dice', 'wminkowski', 'russellrao', 'sokalsneath', 'correlation', 'hamming', 'euclidean', 'rogerstanimoto', 'l2', 'yule', 'cosine', 'manhattan'} or a callable.\n",
"\n",
"**You can also take distance matrix as input data, and set 'metric' as \"precomputed\"**"
],
"metadata": {
"id": "uoCa45RRlmzq"
}
},
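{
"cell_type": "markdown",
"source": [
"A minimal sketch of the precomputed route: pass a distance matrix instead of the raw features. Note that `init='pca'` is not available with precomputed distances, so we use `init='random'`:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from sklearn.metrics import pairwise_distances\n",
"\n",
"# Precompute a full pairwise distance matrix, then hand it to t-SNE\n",
"D = pairwise_distances(df_iris_numeric, metric='euclidean')\n",
"\n",
"tsne_pre = TSNE(n_components=2, metric='precomputed', init='random',\n",
"                perplexity=30, random_state=42)\n",
"X_pre = tsne_pre.fit_transform(D)\n",
"print(X_pre.shape)  # (150, 2)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},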
{
"cell_type": "code",
"source": [
"# add three different metrics you want to explore\n",
"metrics = []\n",
"\n",
"# Create subplots (1 row, 3 columns)\n",
"fig, axes = plt.subplots(1, 3, figsize=(20, 6))\n",
"\n",
"# Loop through each metric, fit t-SNE, and plot\n",
"for i, m in enumerate(metrics):\n",
" tsne = TSNE(n_components=2, perplexity=30, max_iter=1000, random_state=42, metric=m)\n",
" X_tsne = tsne.fit_transform(df_iris_numeric)\n",
"\n",
" scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='Spectral', alpha=0.7)\n",
" axes[i].set_title(f't-SNE with {m} metric')\n",
"\n",
" # Hide x and y axes for each subplot\n",
" axes[i].get_xaxis().set_visible(False)\n",
" axes[i].get_yaxis().set_visible(False)\n",
"\n",
"# Show the plots\n",
"plt.tight_layout()\n",
"plt.show()"
],
"metadata": {
"id": "xCZfVp6Nc7w7",
"collapsed": true
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Here is a conclusion draw from *How to Use t-SNE Effectively* by Distill\n",
"\n",
"\n",
"> There’s a reason that t-SNE has become so popular: it’s incredibly flexible, and can often find structure where other dimensionality-reduction algorithms cannot. Unfortunately, that very flexibility makes it tricky to interpret. Out of sight from the user, the algorithm makes all sorts of adjustments that tidy up its visualizations. Don’t let the hidden “magic” scare you away from the whole technique, though. The good news is that by studying how t-SNE behaves in simple cases, it’s possible to develop an intuition for what’s going on.\n",
"\n",
"Check how they getting there by yourself and how they plot t-SNE projection. [origial website](https://distill.pub/2016/misread-tsne/)\n"
],
"metadata": {
"id": "7W-0f8GR5hrA"
}
},
{
"cell_type": "markdown",
"source": [
"However, by t-sne, global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA (using init='pca')."
],
"metadata": {
"id": "5oteqU9J7pLs"
}
},
{
"cell_type": "code",
"source": [
"# Define different initializations to explore\n",
"initialization = ['random', 'pca']\n",
"\n",
"# Create subplots (1 row, 2 columns)\n",
"fig, axes = plt.subplots(1, 2, figsize=(20, 6))\n",
"\n",
"# Loop through each initialization, fit t-SNE, and plot\n",
"for i, ini in enumerate(initialization):\n",
" tsne = TSNE(n_components=2, perplexity=30, max_iter=1000, random_state=42, init=ini)\n",
" X_tsne = tsne.fit_transform(df_iris_numeric)\n",
"\n",
" scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='Spectral', alpha=0.7)\n",
" axes[i].set_title(f't-SNE with {ini} initialization')\n",
"\n",
" # Hide x and y axes for each subplot\n",
" axes[i].get_xaxis().set_visible(False)\n",
" axes[i].get_yaxis().set_visible(False)\n",
"\n",
"# Show the plots\n",
"plt.tight_layout()\n",
"plt.show()"
],
"metadata": {
"id": "TLVJJBHomQPY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"And what's the differences between random and pca initiazation"
],
"metadata": {
"id": "aZo_VET8-PQe"
}
},
{
"cell_type": "markdown",
"source": [
"Apply t-SNE on Swiss Roll dataset by yourself, also play around with different parameters."
],
"metadata": {
"id": "Sj8XTJMIovG3"
}
},
{
"cell_type": "code",
"source": [
"# Using defult parameters\n",
"# YOUR CODE"
],
"metadata": {
"id": "U_Bss1oWpA4y"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Try different perplexities\n",
"# YOUR CODE"
],
"metadata": {
"id": "gm7qX9QIpLPV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Try different metrics\n",
"# YOUR CODE"
],
"metadata": {
"id": "d000HJ1KpTKb"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Try different initialization\n",
"# YOUR CODE"
],
"metadata": {
"id": "3nuG5QblpXhL"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### 4.5 UMAP"
],
"metadata": {
"id": "ojqVmOpbrk6W"
}
},
{
"cell_type": "markdown",
"source": [
"UMAP (Uniform Manifold Approximation and Projection) is a popular non-linear dimensionality reduction technique that was developed by [Leland McInnes, John Healy, and James Melville in 2018](https://arxiv.org/abs/1802.03426). UMAP is designed to create low-dimensional representations of high-dimensional data while preserving the global and local structure of the data more effectively than traditional methods like PCA or even t-SNE in some scenarios.\n",
"\n",
"check [UMAP's documentation](https://umap-learn.readthedocs.io/en/latest/index.html) by youself.\n"
],
"metadata": {
"id": "cH08_E9V6_cg"
}
},
{
"cell_type": "markdown",
"source": [
"We can also use UMAP.plot to plot result, this built-in function asks for additional libraries: `matplotlib`, `pandas`, `datashader`, `bokeh`, `holoviews`. Make sure you have them all."
],
"metadata": {
"id": "fJxmRCORjIYp"
}
},
{
"cell_type": "code",
"source": [
"import umap\n",
"import umap.plot"
],
"metadata": {
"id": "lXasUc7li7vN",
"collapsed": true
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"! pip install datashader bokeh holoviews"
],
"metadata": {
"collapsed": true,
"id": "z03Yxc7ujgri"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# let's apply UMAP on iris dataset, and plot umap projection\n",
"umap_reducer = umap.UMAP(n_components=2, random_state=42)\n",
"mapper = umap_reducer.fit(df_iris_numeric)\n",
"\n",
"umap.plot.points(mapper)"
],
"metadata": {
"id": "jwPSpXbQ9nOd"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Add the embeddings to df_iris as new columns\n",
"df_iris['umap1'] = mapper.embedding_[:, 0]\n",
"df_iris['umap2'] = mapper.embedding_[:, 1]"
],
"metadata": {
"id": "AxKyEjReuhtT"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about ththis questions and put your answer on google forms:**:\n",
"\n",
"* **E21**. Does UMAP giving a similar projection like PCA and t-SNE?"
],
"metadata": {
"id": "sqplSt0AtoWA"
}
},
{
"cell_type": "code",
"source": [
"# Check UMAP documentation, and apply it to MNIST dataset, then plot the result\n",
"# YOUR CODE"
],
"metadata": {
"id": "99q3ZiCaEeY8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Applying UMAP to Swiss Roll dataset, then plot the result\n",
"# YOUR CODE"
],
"metadata": {
"id": "n2W0YtKmraXv"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Check 'Plotting UMAP results' in UMAP's documentation, and plot the connectivity for both two datasets\n",
"# YOUR CODE"
],
"metadata": {
"id": "JbQ0iP9crmY3"
},
"execution_count": null,
"outputs": []
},
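{
"cell_type": "markdown",
"source": [
"As a hint, one possible sketch is below; `umap.plot.connectivity` takes the *fitted* UMAP object (here `mapper` from the iris cell and `swiss_mapper` from the previous sketch), not the embedding array."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# A sketch: connectivity plots for the two fitted mappers from above\n",
"# (iris `mapper` and Swiss Roll `swiss_mapper`).\n",
"umap.plot.connectivity(mapper, show_points=True)\n",
"umap.plot.connectivity(swiss_mapper, show_points=True)"
],
"metadata": {}
},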
{
"cell_type": "markdown",
"source": [
"It's similar that hyperpemeter is important to UMAP like t-SNE. UMAP has several hyperparameters that can have a significant impact on the resulting embedding. In this notebook we will be covering the four major ones:\n",
"\n",
"\n",
"\n",
"* n_neighbors\n",
"* min_dist\n",
"* n_components\n",
"* metric\n",
"\n",
"\n",
"Using Swiss Roll dataset to play around with them."
],
"metadata": {
"id": "sIKrlNT5J7JQ"
}
},
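{
"cell_type": "markdown",
"source": [
"As a starting point, here is one possible sketch for the `n_neighbors` sweep (assuming `X_swiss` / `t_swiss` from the t-SNE exercise above); the `min_dist` and `metric` sweeps follow the same pattern."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# A sketch of an n_neighbors sweep on the Swiss Roll; assumes X_swiss and\n",
"# t_swiss exist from the t-SNE exercise above (regenerate them if needed).\n",
"n_neighbors_sketch = [5, 15, 50, 200]\n",
"\n",
"fig, axes = plt.subplots(1, len(n_neighbors_sketch), figsize=(20, 5))\n",
"for ax, n in zip(axes, n_neighbors_sketch):\n",
"    emb = umap.UMAP(n_neighbors=n, random_state=42).fit_transform(X_swiss)\n",
"    ax.scatter(emb[:, 0], emb[:, 1], c=t_swiss, cmap='Spectral', s=10)\n",
"    ax.set_title(f'n_neighbors={n}')\n",
"plt.show()"
],
"metadata": {}
},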
{
"cell_type": "code",
"source": [
"# Try different n_neighbors\n",
"n_neighbors_collection=[ ]\n",
"# YOUR CODE"
],
"metadata": {
"id": "ZFnnUnKiq-AE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Try different min_dist\n",
"min_dist_collection=[ ]\n",
"# YOUR CODE"
],
"metadata": {
"id": "fmfJNNuvsACc"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Try different metric\n",
"metric_collection=[ ]\n",
"# YOUR CODE"
],
"metadata": {
"id": "lzqMzkuOsIa5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"[Understand UMAP](https://pair-code.github.io/understanding-umap/)"
],
"metadata": {
"id": "A_RgnvV58ugw"
}
},
{
"cell_type": "markdown",
"source": [
"# **5. How to choose a suitable dimensionality reduction algorithm?**"
],
"metadata": {
"id": "MdEiXKcBp67l"
}
},
{
"cell_type": "markdown",
"source": [
"When we get a high dimention dataset, we do not know which DR tecniq works better, the first thing we can do is trying all of them, and see which one give you a better idea."
],
"metadata": {
"id": "XjZzgJ-vqE3R"
}
},
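{
"cell_type": "markdown",
"source": [
"The cell below compares SVD, PCA, t-SNE, and UMAP on the iris dataset using a **Shepard diagram** (embedded vs. original pairwise distances) and a normalized **stress** score, computed as in the code:\n",
"\n",
"$$\\text{stress} = \\sqrt{\\frac{\\sum_{i<j} (d_{ij} - \\hat{d}_{ij})^2}{\\sum_{i<j} d_{ij}^2}}$$\n",
"\n",
"where $d_{ij}$ are pairwise distances in the original space and $\\hat{d}_{ij}$ in the embedding; lower stress and higher Spearman correlation indicate better distance preservation."
],
"metadata": {}
},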
{
"cell_type": "code",
"source": [
"from scipy.stats import spearmanr\n",
"from sklearn.metrics import pairwise_distances\n",
"\n",
"# Function to calculate stress\n",
"def calculate_stress(orig_dist, embedded_dist):\n",
" return np.sqrt(np.sum((orig_dist - embedded_dist)**2)) / np.sqrt(np.sum(orig_dist**2))\n",
"\n",
"# Function to create Shepard diagram\n",
"def plot_shepard_diagram(ax, original_distances, embedded_distances, method_name):\n",
" # Flatten the distance matrices\n",
" orig_dist_flat = original_distances[np.triu_indices(original_distances.shape[0], k=1)]\n",
" emb_dist_flat = embedded_distances[np.triu_indices(embedded_distances.shape[0], k=1)]\n",
"\n",
" # Calculate stress\n",
" stress = calculate_stress(orig_dist_flat, emb_dist_flat)\n",
"\n",
" # Calculate Spearman correlation\n",
" correlation, _ = spearmanr(orig_dist_flat, emb_dist_flat)\n",
"\n",
" # Create scatter plot\n",
" ax.scatter(orig_dist_flat, emb_dist_flat, alpha=0.5, s=20)\n",
"\n",
" # Add perfect correlation line\n",
" min_dist = min(orig_dist_flat.min(), emb_dist_flat.min())\n",
" max_dist = max(orig_dist_flat.max(), emb_dist_flat.max())\n",
" ax.plot([min_dist, max_dist], [min_dist, max_dist], 'r--', alpha=0.8)\n",
"\n",
" ax.set_xlabel('Original Distances')\n",
" ax.set_ylabel('Embedded Distances')\n",
" ax.set_title(f'{method_name}\\nStress: {stress:.3f}, Correlation: {correlation:.3f}')\n",
"\n",
"# Create label encoder\n",
"le = LabelEncoder()\n",
"color_labels = le.fit_transform(df_iris.species)\n",
"\n",
"# Calculate original distances\n",
"original_distances = pairwise_distances(df_iris_numeric)\n",
"\n",
"# Get embeddings\n",
"embeddings = {\n",
" 'SVD': TruncatedSVD(n_components=2).fit_transform(df_iris_numeric),\n",
" 'PCA': PCA(n_components=2).fit_transform(df_iris_numeric),\n",
" 't-SNE': TSNE(n_components=2, random_state=42).fit_transform(df_iris_numeric),\n",
" 'UMAP': umap.UMAP(n_components=2, random_state=42).fit_transform(df_iris_numeric)\n",
"}\n",
"\n",
"# Calculate embedded distances for each method\n",
"embedded_distances = {\n",
" method: pairwise_distances(embedding)\n",
" for method, embedding in embeddings.items()\n",
"}\n",
"\n",
"# Create figure with 2 rows and 4 columns\n",
"fig = plt.figure(figsize=(20, 10))\n",
"\n",
"# Get unique species for legend\n",
"unique_species = df_iris.species.unique()\n",
"colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_species)))\n",
"\n",
"# Plot projections (top row) and Shepard diagrams (bottom row)\n",
"for idx, (method, embedding) in enumerate(embeddings.items()):\n",
" # Projection plot (top row)\n",
" ax_proj = plt.subplot(2, 4, idx + 1)\n",
"\n",
" # Create scatter plots for each species separately\n",
" for i, species in enumerate(unique_species):\n",
" species_mask = df_iris.species == species\n",
" ax_proj.scatter(embedding[species_mask, 0],\n",
" embedding[species_mask, 1],\n",
" c=[colors[i]],\n",
" label=species,\n",
" s=50,\n",
" alpha=0.7)\n",
"\n",
" ax_proj.set_title(method)\n",
"\n",
" # Hide axes for t-SNE and UMAP\n",
" if method in ['t-SNE', 'UMAP']:\n",
" ax_proj.set_xticks([])\n",
" ax_proj.set_yticks([])\n",
"\n",
" # Add legend to the first plot only\n",
" if idx == 0:\n",
" ax_proj.legend(title=\"Species\",\n",
" loc=\"center left\",\n",
" bbox_to_anchor=(1, 0.5))\n",
"\n",
" # Shepard diagram (bottom row)\n",
" ax_shep = plt.subplot(2, 4, idx + 5)\n",
" plot_shepard_diagram(ax_shep, original_distances, embedded_distances[method], method)\n",
"\n",
"# Print stress values\n",
"print(\"\\nStress values:\")\n",
"print(\"-\" * 50)\n",
"for method, distances in embedded_distances.items():\n",
" stress = calculate_stress(original_distances, distances)\n",
" print(f\"{method}: {stress:.3f}\")\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
],
"metadata": {
"id": "mys-tX1HbuRQ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Create brush selection that will be shared across charts\n",
"brush = alt.selection_interval(resolve='global')\n",
"\n",
"# Base chart function to avoid repetition and ensure consistency\n",
"def make_scatter(data, x, y, title):\n",
" return alt.Chart(data).mark_circle(size=60).encode(\n",
" x=alt.X(x, title=x.replace('_', ' ')),\n",
" y=alt.Y(y, title=y.replace('_', ' ')),\n",
" color=alt.condition(brush, 'species', alt.value('lightgray'),\n",
" scale=alt.Scale(scheme='set1')),\n",
" tooltip=['species', 'sepal_length', 'sepal_width','petal_length','petal_width']\n",
" ).properties(\n",
" width=300,\n",
" height=300,\n",
" title=title\n",
" )\n",
"\n",
"# Create the four plots\n",
"svd_plot = make_scatter(df_iris, 'svd1', 'svd2', 'SVD').add_selection(brush)\n",
"pca_plot = make_scatter(df_iris, 'pca1', 'pca2', 'PCA').add_selection(brush)\n",
"tsne_plot = make_scatter(df_iris, 'tsne1', 'tsne2', 't-SNE').add_selection(brush)\n",
"umap_plot = make_scatter(df_iris, 'umap1', 'umap2', 'UMAP').add_selection(brush)\n",
"\n",
"# Create two rows of plots\n",
"row1 = alt.hconcat(svd_plot, pca_plot)\n",
"row2 = alt.hconcat(tsne_plot, umap_plot)\n",
"\n",
"# Combine rows into a 2x2 grid\n",
"combined_plot = alt.vconcat(\n",
" row1, row2\n",
").properties(\n",
" title=alt.TitleParams(\n",
" text='Comparison of Dimensionality Reduction Methods',\n",
" fontSize=20\n",
" )\n",
").configure_axis(\n",
" labelFontSize=12,\n",
" titleFontSize=14\n",
").configure_legend(\n",
" titleFontSize=12,\n",
" labelFontSize=11\n",
")\n",
"\n",
"# Display the visualization\n",
"combined_plot"
],
"metadata": {
"id": "_OwGvjPLIjSg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Try to compare them on iris and Swiss Roll datasets by yourself. Then try to think how to use different DR, write your answer on google form."
],
"metadata": {
"id": "DEfVcBt8ySUZ"
}
},
{
"cell_type": "code",
"source": [
"# Apply different DR on Swiss Roll dataset\n",
"# YOUR CODE"
],
"metadata": {
"id": "uolqzhD4y0DJ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Apply different DR on iris dataset\n",
"# YOUR CODE"
],
"metadata": {
"id": "xb47Qv-Iy5sD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Think about ththis questions and put your answer on google forms:**:\n",
"\n",
"* **E22**. For MNIST dataset, which one is better? Why?\n",
"* **E23**. For Swiss Roll dataset, which one is better? Why?\n",
"* **E24**. For iris dataset, which one is better? Why?"
],
"metadata": {
"id": "B-19AfHHjh62"
}
},
{
"cell_type": "markdown",
"source": [
"# Feel free to apply these different approach to your own dataset !!!"
],
"metadata": {
"id": "m4eI4di1rCes"
}
},
{
"cell_type": "code",
"source": [
"# Do you have any data sets to work on? Try different DRs and compare them."
],
"metadata": {
"id": "rlQa7ljwJUYi"
},
"execution_count": null,
"outputs": []
}
]
}