@HubLot
Last active August 29, 2015 14:20
Hclust R vs Python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Comparison between hierarchical clustering in R and Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For R, the version used is 3.2.0.\n",
"\n",
"For Python, the [scipy library](http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) is used. scikit-learn is also tested, even though its hierarchical clustering is itself based on scipy.\n",
"\n",
"**An important note:** the input for the scipy functions is a **condensed matrix**, i.e. a flat array containing the upper triangle of the distance matrix. The documentation is wrong on that point ([stackoverflow post](https://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage), [issue](https://github.com/scipy/scipy/issues/2614) and [PR](https://github.com/scipy/scipy/pull/2615) on scipy's github).\n",
"\n",
"For R, the full square distance matrix can be used.\n",
"\n",
"For scikit-learn, we need to set `affinity=\"precomputed\"` when the input is a distance matrix; by default, scikit-learn expects a matrix of observations."
]
},
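{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the condensed format (a hypothetical 3-observation example, not part of the original analysis):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import scipy.spatial.distance as ssd\n",
"\n",
"square = np.array([[0., 2., 4.],\n",
"                   [2., 0., 6.],\n",
"                   [4., 6., 0.]])\n",
"cond = ssd.squareform(square)  # upper triangle, row by row -> [2., 4., 6.]\n",
"back = ssd.squareform(cond)    # round-trip back to the square form\n",
"assert (back == square).all()\n",
"```"
]
},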
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the data"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import PBlib as PB\n",
"import PBclust\n",
"import numpy as np\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"import scipy.cluster.hierarchy as scipyh\n",
"import scipy.spatial.distance as ssd"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def medoids(dist, clusters_sklearn):\n",
"    \"\"\"Return, for each cluster, the index of the sequence that\n",
"    minimizes the summed distance to the cluster's members.\"\"\"\n",
"    medo = []\n",
"    for cluster in np.unique(clusters_sklearn):\n",
"        medo.append(np.argmin(np.sum(dist[clusters_sklearn == cluster], axis=0)))\n",
"    return medo"
]
},
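{
"cell_type": "markdown",
"metadata": {},
"source": [
"A toy check of the helper above, restated as a one-liner for self-containment (hypothetical 4x4 distance matrix with two obvious clusters):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def medoids(dist, labels):\n",
"    return [np.argmin(dist[labels == c].sum(axis=0)) for c in np.unique(labels)]\n",
"\n",
"toy_dist = np.array([[0., 1., 9., 9.],\n",
"                     [1., 0., 9., 9.],\n",
"                     [9., 9., 0., 2.],\n",
"                     [9., 9., 2., 0.]])\n",
"toy_labels = np.array([0, 0, 1, 1])\n",
"assert medoids(toy_dist, toy_labels) == [0, 2]  # one medoid per cluster\n",
"```"
]
},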
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"read 270 sequences in demo2_tmp/psi_md_traj_all.PB.fasta\n"
]
}
],
"source": [
"#Get the data\n",
"header_lst, seq_lst = PB.read_several_fasta([\"demo2_tmp/psi_md_traj_all.PB.fasta\"])\n",
"substitution_mat = PB.load_substitution_matrix(PB.SUBSTITUTION_MATRIX_NAME)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Building distance matrix\n",
"100%\n",
"Building distance matrix\n",
"100%\n"
]
}
],
"source": [
"#Distances\n",
"distance_mat = PB.distance_matrix(seq_lst, substitution_mat)\n",
"\n",
"# Condensed version for scipy\n",
"# The diagonal of the square matrix has to be set to 0 first.\n",
"dist_mat = PB.distance_matrix(seq_lst, substitution_mat)\n",
"np.fill_diagonal(dist_mat, 0)\n",
"dist_mat_cond = ssd.squareform(dist_mat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ward method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For R versions > 3.0.3, the correct Ward method is `ward.D2` (see [stats.exchange](https://stats.stackexchange.com/questions/109949/what-algorithm-does-ward-d-in-hclust-implement-if-it-is-not-wards-criteria) and the corresponding [paper](http://arxiv.org/pdf/1111.6285.pdf)). For R versions <= 3.0.3, the only available method is `ward`.\n",
"\n",
"To obtain the same results with both, the distance matrix has to be squared before using the `ward` method. See my [example notebook](http://nbviewer.ipython.org/gist/HubLot/7f0824c33d0b200bb10b).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### R"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 3: 71 sequences (26%)\n",
"cluster 5: 58 sequences (21%)\n",
"cluster 4: 32 sequences (12%)\n",
"cluster 2: 19 sequences ( 7%)\n",
"Index of medoids: [65, 94, 164, 180, 267]\n",
"\n"
]
}
],
"source": [
"nclusters = 5\n",
"\n",
"#R way\n",
"clusters_R_D2, medoid_R_D2 = PB.hclust(distance_mat, nclusters=nclusters, method='ward.D2')\n",
"\n",
"print(\"R clustering\")\n",
"PBclust.display_clust_report(clusters_R_D2)\n",
"print(\"Index of medoids: {0}\".format(medoid_R_D2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scipy"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "ValueError",
"evalue": "Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'.",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-60-7ab9ead058a3>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# Scipy way condensed\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mdata_link_cond\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscipyh\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mward\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdist_mat_cond\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[0mclusters_scipy_cond\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscipyh\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfcluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata_link_cond\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnclusters\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcriterion\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'maxclust'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mmedoid_scipy_cond\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmedoids\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdist_array\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mclusters_scipy_cond\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/home/temple/santuz/.virtualenvs/pbxplore/local/lib/python2.7/site-packages/scipy/cluster/hierarchy.pyc\u001b[0m in \u001b[0;36mward\u001b[1;34m(y)\u001b[0m\n\u001b[0;32m 463\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 464\u001b[0m \"\"\"\n\u001b[1;32m--> 465\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mlinkage\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'ward'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmetric\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'euclidean'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 466\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 467\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/home/temple/santuz/.virtualenvs/pbxplore/local/lib/python2.7/site-packages/scipy/cluster/hierarchy.pyc\u001b[0m in \u001b[0;36mlinkage\u001b[1;34m(y, method, metric)\u001b[0m\n\u001b[0;32m 625\u001b[0m \u001b[0md\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdistance\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mnum_obs_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 626\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0m_cpy_non_euclid_methods\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 627\u001b[1;33m raise ValueError(\"Valid methods when the raw observations are \"\n\u001b[0m\u001b[0;32m 628\u001b[0m \u001b[1;34m\"omitted are 'single', 'complete', 'weighted', \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 629\u001b[0m \"and 'average'.\")\n",
"\u001b[1;31mValueError\u001b[0m: Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'."
]
}
],
"source": [
"# Scipy way condensed\n",
"data_link_cond = scipyh.ward(dist_mat_cond)\n",
"clusters_scipy_cond = scipyh.fcluster(data_link_cond, nclusters, criterion='maxclust')\n",
"medoid_scipy_cond = medoids(dist_array, clusters_scipy_cond)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You get a nice exception when using a condensed matrix with the Ward method...\n",
"\n",
"Indeed, for now, it's not possible to combine a condensed distance matrix with the Ward method in scipy ([issue on github](https://github.com/scipy/scipy/issues/1798))."
]
},
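{
"cell_type": "markdown",
"metadata": {},
"source": [
"A workaround sketch: give scipy the raw observations instead of a precomputed matrix, so it computes the Euclidean distances itself (random data here, purely illustrative):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import scipy.cluster.hierarchy as scipyh\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.rand(10, 3)          # 10 observations, 3 features\n",
"link = scipyh.ward(X)        # raw observations are accepted\n",
"labels = scipyh.fcluster(link, 2, criterion='maxclust')\n",
"assert link.shape == (9, 4)  # n-1 merge steps\n",
"```\n",
"\n",
"This sidesteps the limitation, but it is only an option when the underlying observations (not just pairwise distances) are available, which is not the case for the sequence distances used here."
]
},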
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### scikit-learn"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "ValueError",
"evalue": "precomputed was provided as affinity. Ward can only work with euclidean distances.",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-113-40de67e77546>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m#The input for scikit-learn is a matrix of observations\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;31m#Since we used a distance matrix, we need to set affinity=\"precomputed\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mward\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mAgglomerativeClustering\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mn_clusters\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlinkage\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'ward'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maffinity\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"precomputed\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdistance_mat\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m/home/temple/santuz/.virtualenvs/pbxplore/local/lib/python2.7/site-packages/sklearn/cluster/hierarchical.pyc\u001b[0m in \u001b[0;36mfit\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 717\u001b[0m raise ValueError(\"%s was provided as affinity. Ward can only \"\n\u001b[0;32m 718\u001b[0m \u001b[1;34m\"work with euclidean distances.\"\u001b[0m \u001b[1;33m%\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 719\u001b[1;33m (self.affinity, ))\n\u001b[0m\u001b[0;32m 720\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 721\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlinkage\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0m_TREE_BUILDERS\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mValueError\u001b[0m: precomputed was provided as affinity. Ward can only work with euclidean distances."
]
}
],
"source": [
"#The input for scikit-learn is a matrix of observations\n",
"#Since we used a distance matrix, we need to set affinity=\"precomputed\"\n",
"ward = AgglomerativeClustering(n_clusters=5, linkage='ward', affinity=\"precomputed\").fit(distance_mat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's not possible with scikit-learn either, as the `ward` linkage only works with `affinity` set to \"euclidean\"."
]
},
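{
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, a sketch of what scikit-learn's Ward linkage does accept: raw observations with the default Euclidean affinity (random data, illustrative only):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.rand(10, 3)  # observations, not a distance matrix\n",
"ward = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)\n",
"assert len(set(ward.labels_)) == 2\n",
"```"
]
},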
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Average method"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"nclusters = 5\n",
"\n",
"#R way\n",
"clusters_R_ave, medoid_R_ave = PB.hclust(distance_mat, nclusters=nclusters, method='average')\n",
"\n",
"# Scipy way\n",
"data_link_ave = scipyh.linkage(dist_mat_cond, method='average')\n",
"clusters_scipy_ave = scipyh.fcluster(data_link_ave, nclusters, criterion='maxclust')\n",
"medoid_scipy_ave = medoids(distance_mat, clusters_scipy_ave)\n",
"\n",
"#scikit-learn\n",
"average = AgglomerativeClustering(n_clusters=5, linkage='average', affinity=\"precomputed\").fit(distance_mat)\n",
"clusters_sklearn_ave = list(average.labels_)\n",
"medoid_sklearn_ave = medoids(distance_mat, clusters_sklearn_ave)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 3: 69 sequences (26%)\n",
"cluster 4: 57 sequences (21%)\n",
"cluster 2: 53 sequences (20%)\n",
"cluster 5: 1 sequences ( 0%)\n",
"Index of medoids: [65, 180, 164, 267, 230]\n",
"\n",
"scipy clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 2: 69 sequences (26%)\n",
"cluster 4: 57 sequences (21%)\n",
"cluster 3: 53 sequences (20%)\n",
"cluster 5: 1 sequences ( 0%)\n",
"Index of medoids: [65, 164, 180, 267, 230]\n",
"\n",
"scikit-learn clustering\n",
"cluster 3: 90 sequences (33%)\n",
"cluster 2: 69 sequences (26%)\n",
"cluster 0: 57 sequences (21%)\n",
"cluster 1: 53 sequences (20%)\n",
"cluster 4: 1 sequences ( 0%)\n",
"Index of medoids: [267, 180, 164, 65, 230]\n"
]
}
],
"source": [
"#Sum up the results\n",
"\n",
"print(\"R clustering\")\n",
"PBclust.display_clust_report(clusters_R_ave)\n",
"print(\"Index of medoids: {0}\".format(medoid_R_ave))\n",
"print\n",
"\n",
"print(\"scipy clustering\")\n",
"PBclust.display_clust_report(clusters_scipy_ave)\n",
"print(\"Index of medoids: {0}\".format(medoid_scipy_ave))\n",
"print\n",
"\n",
"print(\"scikit-learn clustering\")\n",
"PBclust.display_clust_report(clusters_sklearn_ave)\n",
"print(\"Index of medoids: {0}\".format(medoid_sklearn_ave))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Great, same results!"
]
},
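{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that \"same results\" means identical partitions, not identical labels: R numbers clusters from 1, scikit-learn from 0, and the ordering is arbitrary. A small helper (hypothetical label vectors, not the ones above) can check equality up to relabeling:\n",
"\n",
"```python\n",
"def same_partition(a, b):\n",
"    # Identical partitions iff the label-to-label mapping is one-to-one.\n",
"    pairs = set(zip(a, b))\n",
"    return len(pairs) == len(set(a)) == len(set(b))\n",
"\n",
"assert same_partition([1, 1, 2, 3], [0, 0, 2, 1])  # same split\n",
"assert not same_partition([1, 1, 2], [0, 1, 0])    # different split\n",
"```"
]
},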
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Complete method"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"nclusters = 5\n",
"\n",
"#R way\n",
"clusters_R_comp, medoid_R_comp = PB.hclust(distance_mat, nclusters=nclusters, method=\"complete\")\n",
"\n",
"# Scipy way\n",
"data_link_comp = scipyh.linkage(dist_mat_cond, method=\"complete\")\n",
"clusters_scipy_comp = scipyh.fcluster(data_link_comp, nclusters, criterion='maxclust')\n",
"medoid_scipy_comp = medoids(distance_mat, clusters_scipy_comp)\n",
"\n",
"#scikit-learn\n",
"complete = AgglomerativeClustering(n_clusters=5, linkage='complete', affinity=\"precomputed\").fit(distance_mat)\n",
"clusters_sklearn_comp = list(complete.labels_)\n",
"medoid_sklearn_comp = medoids(distance_mat, clusters_sklearn_comp)"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 3: 62 sequences (23%)\n",
"cluster 5: 54 sequences (20%)\n",
"cluster 4: 36 sequences (13%)\n",
"cluster 2: 28 sequences (10%)\n",
"Index of medoids: [65, 94, 164, 180, 267]\n",
"\n",
"scipy clustering\n",
"cluster 3: 90 sequences (33%)\n",
"cluster 4: 62 sequences (23%)\n",
"cluster 2: 54 sequences (20%)\n",
"cluster 1: 36 sequences (13%)\n",
"cluster 5: 28 sequences (10%)\n",
"Index of medoids: [180, 267, 65, 164, 94]\n",
"\n",
"scikit-learn clustering\n",
"cluster 2: 90 sequences (33%)\n",
"cluster 4: 62 sequences (23%)\n",
"cluster 0: 54 sequences (20%)\n",
"cluster 1: 36 sequences (13%)\n",
"cluster 3: 28 sequences (10%)\n",
"Index of medoids: [267, 180, 65, 94, 164]\n"
]
}
],
"source": [
"#Sum up the results\n",
"\n",
"print(\"R clustering\")\n",
"PBclust.display_clust_report(clusters_R_comp)\n",
"print(\"Index of medoids: {0}\".format(medoid_R_comp))\n",
"print\n",
"\n",
"print(\"scipy clustering\")\n",
"PBclust.display_clust_report(clusters_scipy_comp)\n",
"print(\"Index of medoids: {0}\".format(medoid_scipy_comp))\n",
"print\n",
"\n",
"print(\"scikit-learn clustering\")\n",
"PBclust.display_clust_report(clusters_sklearn_comp)\n",
"print(\"Index of medoids: {0}\".format(medoid_sklearn_comp))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Great, same results!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}