@HubLot
Last active August 29, 2015 14:20
Hclust R vs Python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Comparison between hierarchical clustering in R and Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For R, the version used is 3.2.0.\n",
"\n",
"For Python, the [scipy library](http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) is used. scikit-learn is also tested, even though its hierarchical clustering is itself based on scipy.\n",
"\n",
"**An important note:** the input for the scipy functions is a **condensed matrix**, i.e. a flat array containing the upper triangle of the distance matrix. The documentation is wrong on that point ([stackoverflow post](https://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage), [issue](https://github.com/scipy/scipy/issues/2614) and [PR](https://github.com/scipy/scipy/pull/2615) on scipy's github).\n",
"\n",
"For R, the full square distance matrix can be used.\n",
"\n",
"For scikit-learn, we need to set `affinity=\"precomputed\"` when the input is a distance matrix; by default, scikit-learn expects a matrix of observations."
]
},
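{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the condensed format (a hypothetical 3-observation example, not part of the original analysis):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import scipy.spatial.distance as ssd\n",
"\n",
"square = np.array([[0., 2., 4.],\n",
"                   [2., 0., 6.],\n",
"                   [4., 6., 0.]])\n",
"cond = ssd.squareform(square)  # upper triangle, row by row -> [2., 4., 6.]\n",
"back = ssd.squareform(cond)    # round-trip back to the square form\n",
"assert (back == square).all()\n",
"```"
]
},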
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the data"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import PBlib as PB\n",
"import PBclust\n",
"import numpy as np\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"import scipy.cluster.hierarchy as scipyh\n",
"import scipy.spatial.distance as ssd"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def medoids(dist, clusters_sklearn):\n",
"    \"\"\"Return, for each cluster, the index of the sequence that\n",
"    minimizes the summed distance to the cluster's members.\"\"\"\n",
"    medo = []\n",
"    for cluster in np.unique(clusters_sklearn):\n",
"        medo.append(np.argmin(np.sum(dist[clusters_sklearn == cluster], axis=0)))\n",
"    return medo"
]
},
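{
"cell_type": "markdown",
"metadata": {},
"source": [
"A toy check of the helper above, restated as a one-liner for self-containment (hypothetical 4x4 distance matrix with two obvious clusters):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def medoids(dist, labels):\n",
"    return [np.argmin(dist[labels == c].sum(axis=0)) for c in np.unique(labels)]\n",
"\n",
"toy_dist = np.array([[0., 1., 9., 9.],\n",
"                     [1., 0., 9., 9.],\n",
"                     [9., 9., 0., 2.],\n",
"                     [9., 9., 2., 0.]])\n",
"toy_labels = np.array([0, 0, 1, 1])\n",
"assert medoids(toy_dist, toy_labels) == [0, 2]  # one medoid per cluster\n",
"```"
]
},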
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"read 270 sequences in demo2_tmp/psi_md_traj_all.PB.fasta\n"
]
}
],
"source": [
"#Get the data\n",
"header_lst, seq_lst = PB.read_several_fasta([\"demo2_tmp/psi_md_traj_all.PB.fasta\"])\n",
"substitution_mat = PB.load_substitution_matrix(PB.SUBSTITUTION_MATRIX_NAME)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Building distance matrix\n",
"100%\n",
"Building distance matrix\n",
"100%\n"
]
}
],
"source": [
"#Distances\n",
"distance_mat = PB.distance_matrix(seq_lst, substitution_mat)\n",
"\n",
"# Condensed version for scipy\n",
"# The diagonal of the square matrix has to be set to 0 first.\n",
"dist_mat = PB.distance_matrix(seq_lst, substitution_mat)\n",
"np.fill_diagonal(dist_mat, 0)\n",
"dist_mat_cond = ssd.squareform(dist_mat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ward method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For R versions > 3.0.3, the correct Ward method is `ward.D2` (see [stats.exchange](https://stats.stackexchange.com/questions/109949/what-algorithm-does-ward-d-in-hclust-implement-if-it-is-not-wards-criteria) and the corresponding [paper](http://arxiv.org/pdf/1111.6285.pdf)). For R versions <= 3.0.3, the only available method is `ward`.\n",
"\n",
"To obtain the same results with both, the distance matrix has to be squared before using the `ward` method. See my [example notebook](http://nbviewer.ipython.org/gist/HubLot/7f0824c33d0b200bb10b).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### R"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 3: 71 sequences (26%)\n",
"cluster 5: 58 sequences (21%)\n",
"cluster 4: 32 sequences (12%)\n",
"cluster 2: 19 sequences ( 7%)\n",
"Index of medoids: [65, 94, 164, 180, 267]\n",
"\n"
]
}
],
"source": [
"nclusters = 5\n",
"\n",
"#R way\n",
"clusters_R_D2, medoid_R_D2 = PB.hclust(distance_mat, nclusters=nclusters, method='ward.D2')\n",
"\n",
"print(\"R clustering\")\n",
"PBclust.display_clust_report(clusters_R_D2)\n",
"print(\"Index of medoids: {0}\".format(medoid_R_D2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scipy"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "ValueError",
"evalue": "Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'.",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-60-7ab9ead058a3>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# Scipy way condensed\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mdata_link_cond\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscipyh\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mward\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdist_mat_cond\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[0mclusters_scipy_cond\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscipyh\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfcluster\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata_link_cond\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnclusters\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcriterion\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'maxclust'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mmedoid_scipy_cond\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmedoids\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdist_array\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mclusters_scipy_cond\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/home/temple/santuz/.virtualenvs/pbxplore/local/lib/python2.7/site-packages/scipy/cluster/hierarchy.pyc\u001b[0m in \u001b[0;36mward\u001b[1;34m(y)\u001b[0m\n\u001b[0;32m 463\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 464\u001b[0m \"\"\"\n\u001b[1;32m--> 465\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mlinkage\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'ward'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmetric\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'euclidean'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 466\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 467\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/home/temple/santuz/.virtualenvs/pbxplore/local/lib/python2.7/site-packages/scipy/cluster/hierarchy.pyc\u001b[0m in \u001b[0;36mlinkage\u001b[1;34m(y, method, metric)\u001b[0m\n\u001b[0;32m 625\u001b[0m \u001b[0md\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdistance\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mnum_obs_y\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 626\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0m_cpy_non_euclid_methods\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 627\u001b[1;33m raise ValueError(\"Valid methods when the raw observations are \"\n\u001b[0m\u001b[0;32m 628\u001b[0m \u001b[1;34m\"omitted are 'single', 'complete', 'weighted', \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 629\u001b[0m \"and 'average'.\")\n",
"\u001b[1;31mValueError\u001b[0m: Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'."
]
}
],
"source": [
"# Scipy way condensed\n",
"data_link_cond = scipyh.ward(dist_mat_cond)\n",
"clusters_scipy_cond = scipyh.fcluster(data_link_cond, nclusters, criterion='maxclust')\n",
"medoid_scipy_cond = medoids(dist_array, clusters_scipy_cond)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You get a nice exception when using a condensed matrix with the Ward method...\n",
"\n",
"Indeed, for now, it's not possible to combine a condensed distance matrix with the Ward method in scipy ([issue on github](https://github.com/scipy/scipy/issues/1798))."
]
},
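{
"cell_type": "markdown",
"metadata": {},
"source": [
"A workaround sketch: give scipy the raw observations instead of a precomputed matrix, so it computes the Euclidean distances itself (random data here, purely illustrative):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import scipy.cluster.hierarchy as scipyh\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.rand(10, 3)          # 10 observations, 3 features\n",
"link = scipyh.ward(X)        # raw observations are accepted\n",
"labels = scipyh.fcluster(link, 2, criterion='maxclust')\n",
"assert link.shape == (9, 4)  # n-1 merge steps\n",
"```\n",
"\n",
"This sidesteps the limitation, but it is only an option when the underlying observations (not just pairwise distances) are available, which is not the case for the sequence distances used here."
]
},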
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### scikit-learn"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "ValueError",
"evalue": "precomputed was provided as affinity. Ward can only work with euclidean distances.",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-113-40de67e77546>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m#The input for scikit-learn is a matrix of observations\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;31m#Since we used a distance matrix, we need to set affinity=\"precomputed\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mward\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mAgglomerativeClustering\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mn_clusters\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlinkage\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'ward'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maffinity\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"precomputed\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdistance_mat\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m/home/temple/santuz/.virtualenvs/pbxplore/local/lib/python2.7/site-packages/sklearn/cluster/hierarchical.pyc\u001b[0m in \u001b[0;36mfit\u001b[1;34m(self, X, y)\u001b[0m\n\u001b[0;32m 717\u001b[0m raise ValueError(\"%s was provided as affinity. Ward can only \"\n\u001b[0;32m 718\u001b[0m \u001b[1;34m\"work with euclidean distances.\"\u001b[0m \u001b[1;33m%\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 719\u001b[1;33m (self.affinity, ))\n\u001b[0m\u001b[0;32m 720\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 721\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlinkage\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0m_TREE_BUILDERS\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mValueError\u001b[0m: precomputed was provided as affinity. Ward can only work with euclidean distances."
]
}
],
"source": [
"#The input for scikit-learn is a matrix of observations\n",
"#Since we used a distance matrix, we need to set affinity=\"precomputed\"\n",
"ward = AgglomerativeClustering(n_clusters=5, linkage='ward', affinity=\"precomputed\").fit(distance_mat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's not possible with scikit-learn either, as the `ward` linkage only works with `affinity` set to \"euclidean\"."
]
},
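{
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, a sketch of what scikit-learn's Ward linkage does accept: raw observations with the default Euclidean affinity (random data, illustrative only):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.rand(10, 3)  # observations, not a distance matrix\n",
"ward = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)\n",
"assert len(set(ward.labels_)) == 2\n",
"```"
]
},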
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Average method"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"nclusters = 5\n",
"\n",
"#R way\n",
"clusters_R_ave, medoid_R_ave = PB.hclust(distance_mat, nclusters=nclusters, method='average')\n",
"\n",
"# Scipy way\n",
"data_link_ave = scipyh.linkage(dist_mat_cond, method='average')\n",
"clusters_scipy_ave = scipyh.fcluster(data_link_ave, nclusters, criterion='maxclust')\n",
"medoid_scipy_ave = medoids(distance_mat, clusters_scipy_ave)\n",
"\n",
"#scikit-learn\n",
"average = AgglomerativeClustering(n_clusters=5, linkage='average', affinity=\"precomputed\").fit(distance_mat)\n",
"clusters_sklearn_ave = list(average.labels_)\n",
"medoid_sklearn_ave = medoids(distance_mat, clusters_sklearn_ave)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 3: 69 sequences (26%)\n",
"cluster 4: 57 sequences (21%)\n",
"cluster 2: 53 sequences (20%)\n",
"cluster 5: 1 sequences ( 0%)\n",
"Index of medoids: [65, 180, 164, 267, 230]\n",
"\n",
"scipy clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 2: 69 sequences (26%)\n",
"cluster 4: 57 sequences (21%)\n",
"cluster 3: 53 sequences (20%)\n",
"cluster 5: 1 sequences ( 0%)\n",
"Index of medoids: [65, 164, 180, 267, 230]\n",
"\n",
"scikit-learn clustering\n",
"cluster 3: 90 sequences (33%)\n",
"cluster 2: 69 sequences (26%)\n",
"cluster 0: 57 sequences (21%)\n",
"cluster 1: 53 sequences (20%)\n",
"cluster 4: 1 sequences ( 0%)\n",
"Index of medoids: [267, 180, 164, 65, 230]\n"
]
}
],
"source": [
"#Sum up the results\n",
"\n",
"print(\"R clustering\")\n",
"PBclust.display_clust_report(clusters_R_ave)\n",
"print(\"Index of medoids: {0}\".format(medoid_R_ave))\n",
"print\n",
"\n",
"print(\"scipy clustering\")\n",
"PBclust.display_clust_report(clusters_scipy_ave)\n",
"print(\"Index of medoids: {0}\".format(medoid_scipy_ave))\n",
"print\n",
"\n",
"print(\"scikit-learn clustering\")\n",
"PBclust.display_clust_report(clusters_sklearn_ave)\n",
"print(\"Index of medoids: {0}\".format(medoid_sklearn_ave))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Great, same results!"
]
},
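{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that \"same results\" means identical partitions, not identical labels: R numbers clusters from 1, scikit-learn from 0, and the ordering is arbitrary. A small helper (hypothetical label vectors, not the ones above) can check equality up to relabeling:\n",
"\n",
"```python\n",
"def same_partition(a, b):\n",
"    # Identical partitions iff the label-to-label mapping is one-to-one.\n",
"    pairs = set(zip(a, b))\n",
"    return len(pairs) == len(set(a)) == len(set(b))\n",
"\n",
"assert same_partition([1, 1, 2, 3], [0, 0, 2, 1])  # same split\n",
"assert not same_partition([1, 1, 2], [0, 1, 0])    # different split\n",
"```"
]
},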
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Complete method"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"nclusters = 5\n",
"\n",
"#R way\n",
"clusters_R_comp, medoid_R_comp = PB.hclust(distance_mat, nclusters=nclusters, method=\"complete\")\n",
"\n",
"# Scipy way\n",
"data_link_comp = scipyh.linkage(dist_mat_cond, method=\"complete\")\n",
"clusters_scipy_comp = scipyh.fcluster(data_link_comp, nclusters, criterion='maxclust')\n",
"medoid_scipy_comp = medoids(distance_mat, clusters_scipy_comp)\n",
"\n",
"#scikit-learn\n",
"complete = AgglomerativeClustering(n_clusters=5, linkage='complete', affinity=\"precomputed\").fit(distance_mat)\n",
"clusters_sklearn_comp = list(complete.labels_)\n",
"medoid_sklearn_comp = medoids(distance_mat, clusters_sklearn_comp)"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R clustering\n",
"cluster 1: 90 sequences (33%)\n",
"cluster 3: 62 sequences (23%)\n",
"cluster 5: 54 sequences (20%)\n",
"cluster 4: 36 sequences (13%)\n",
"cluster 2: 28 sequences (10%)\n",
"Index of medoids: [65, 94, 164, 180, 267]\n",
"\n",
"scipy clustering\n",
"cluster 3: 90 sequences (33%)\n",
"cluster 4: 62 sequences (23%)\n",
"cluster 2: 54 sequences (20%)\n",
"cluster 1: 36 sequences (13%)\n",
"cluster 5: 28 sequences (10%)\n",
"Index of medoids: [180, 267, 65, 164, 94]\n",
"\n",
"scikit-learn clustering\n",
"cluster 2: 90 sequences (33%)\n",
"cluster 4: 62 sequences (23%)\n",
"cluster 0: 54 sequences (20%)\n",
"cluster 1: 36 sequences (13%)\n",
"cluster 3: 28 sequences (10%)\n",
"Index of medoids: [267, 180, 65, 94, 164]\n"
]
}
],
"source": [
"#Sum up the results\n",
"\n",
"print(\"R clustering\")\n",
"PBclust.display_clust_report(clusters_R_comp)\n",
"print(\"Index of medoids: {0}\".format(medoid_R_comp))\n",
"print\n",
"\n",
"print(\"scipy clustering\")\n",
"PBclust.display_clust_report(clusters_scipy_comp)\n",
"print(\"Index of medoids: {0}\".format(medoid_scipy_comp))\n",
"print\n",
"\n",
"print(\"scikit-learn clustering\")\n",
"PBclust.display_clust_report(clusters_sklearn_comp)\n",
"print(\"Index of medoids: {0}\".format(medoid_sklearn_comp))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Great, same results!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}