@jbarnoud
Last active September 28, 2015 09:48
Here is a notebook that explores K-means in the context of PBxplore.
@alexdb27

@jbarnoud, About clustering reproducibility

I am not sure I understood everything you asked. I carried out the clustering of the same 270 PB sequences from an MD trajectory 100 times, and we can observe that the succession of clusters along the trajectory is not always the same. The figure is difficult to analyze; I will try to come up with something more quantitative and more readable.

-> OK, I've cleared up our own confusion point. You cannot take cluster i in run S(t) and check whether it matches cluster i in run S(t+1). What you need to do is: (a) build a confusion table based only on the data assigned to each cluster, i.e. count the number of data points found in each pair of clusters from run S(t) and run S(t+1); (b) take the max of each row (or column), which gives you the correspondence between a cluster of S(t) and a cluster of S(t+1); (c) sum it all up and you have the true agreement. Do that run after run, and you will see whether the clustering is reproducible.
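
As a rough sketch of that procedure (assuming each run simply returns one integer cluster label per frame, which is not necessarily how PBxplore exposes its results), the agreement between two runs could be computed like this:

```python
import numpy as np

def clustering_agreement(labels_a, labels_b, n_clusters):
    """Fraction of frames on which two clustering runs agree, up to relabelling."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    # confusion[i, j]: number of frames in cluster i of run A and cluster j of run B
    confusion = np.zeros((n_clusters, n_clusters), dtype=int)
    for i, j in zip(labels_a, labels_b):
        confusion[i, j] += 1
    # for each cluster of run A, keep its best-matching cluster of run B, then sum
    matched = confusion.max(axis=1).sum()
    return matched / len(labels_a)

# Two runs on 10 frames with 3 clusters: the same partition, relabelled
run_t  = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
run_t1 = [2, 2, 0, 0, 1, 1, 2, 0, 1, 2]
print(clustering_agreement(run_t, run_t1, n_clusters=3))  # -> 1.0
```

An agreement of 1.0 means the two runs produced the same partition, just with the cluster labels permuted.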

@alexdb27

@jbarnoud, On testing the method

I would like to test the pertinence of the clustering based on structure similarity within the clusters. What do you usually use to compute GDT TS and TM-scores?

It is mainly RMSD. GDT TS and TM-score will not be very sensitive for such highly similar structures. :-)
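
For reference, a bare-bones RMSD after optimal superposition can be done with plain NumPy (Kabsch algorithm); this is only a sketch assuming two (N, 3) coordinate arrays for the same atoms in the same order:

```python
import numpy as np

def kabsch_rmsd(coords_p, coords_q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    # centre both structures on their geometric centre
    p = coords_p - coords_p.mean(axis=0)
    q = coords_q - coords_q.mean(axis=0)
    # Kabsch: SVD of the covariance matrix gives the optimal rotation
    u, _, vt = np.linalg.svd(p.T @ q)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T  # avoid reflections
    diff = p @ rotation.T - q
    return np.sqrt((diff ** 2).sum() / len(p))
```

Averaging this over all pairs of frames within a cluster would give one number per cluster to compare.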

@pierrepo

Impressive indeed and very nice.
About scipy, I was just wondering if the built-in k-means clustering implemented in scipy would be easier to use or quicker.
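
What I have in mind is roughly this (assuming the PB sequences are first turned into numeric feature vectors, since scipy's k-means only knows about points in Euclidean space):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical input: one numeric feature vector per frame of the trajectory,
# e.g. some encoding of the 270 PB sequences.
rng = np.random.default_rng(0)
data = rng.normal(size=(270, 16))

# kmeans2 returns the centroids and one cluster label per row of `data`
centroids, labels = kmeans2(data, 4, minit='points')
print(labels[:10])
```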

@HubLot

HubLot commented Sep 28, 2015

Great job Jonathan!
RMSD could be a nice measure for the different clusters. The issue with a regular MD run (the 270 sequences you tested) is knowing the right number of clusters. Maybe '4' is not a good one, hence the reproducibility is hard to assess.
The issue, I think, with the built-in k-means is that it is really difficult to use a custom distance metric and a custom representation of the centroids.
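
To illustrate that point, here is a very rough sketch (not the PBxplore implementation) of a k-means loop that accepts an arbitrary distance function and an arbitrary centroid builder, so it can work directly on PB strings; the Hamming distance and per-position consensus below are only placeholders:

```python
import random

def kmeans(items, k, distance, make_centroid, n_iter=100):
    """Toy k-means parameterized by a distance function and a centroid builder.

    `distance(a, b)` returns a dissimilarity between two items and
    `make_centroid(members)` builds a representative item from a list of
    cluster members, so the loop can run directly on PB sequences.
    """
    centroids = random.sample(items, k)
    labels = None
    for _ in range(n_iter):
        # assignment step: attach each item to its closest centroid
        new_labels = [
            min(range(k), key=lambda c: distance(item, centroids[c]))
            for item in items
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # update step: rebuild each centroid from its members
        for c in range(k):
            members = [item for item, lab in zip(items, labels) if lab == c]
            if members:
                centroids[c] = make_centroid(members)
    return labels, centroids

def hamming(seq_a, seq_b):
    """Number of positions where two PB sequences differ (placeholder metric)."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def consensus(members):
    """Most frequent block at each position (placeholder centroid)."""
    return ''.join(max(set(column), key=column.count)
                   for column in (list(c) for c in zip(*members)))

# labels, centroids = kmeans(sequences, 4, distance=hamming, make_centroid=consensus)
```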
