@jbarnoud, a point about empty clusters. It was a problem in an early R implementation: if you have in fact x clusters but you ask for 10x, at some point some of the 10x clusters remain unused.
I suppose it will not happen with this kind of data, to be honest, but it would be nice to handle it just in case.
Concerning the comparison between clusterings, I'm not sure I've understood everything.
In the example, I would like to know the number of frames; we need a view with (a) a large number of snapshots (perhaps even from different MDs, for the same protein of course) and (b) a fairly large number of clusters (at least 10+).
In my mind, if I run the k-means 10 times, I would like to know the number of snapshots that are always found together (not only with two simulations).
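Something like a co-clustering count would capture that: run the k-means several times, and for every pair of snapshots count in how many runs they end up in the same cluster. A minimal sketch (the label arrays here are toy data, not anything from the notebook):

```python
import numpy as np

def co_clustering_counts(all_labels):
    """Count, for every pair of snapshots, in how many k-means runs
    they end up in the same cluster."""
    all_labels = [np.asarray(labels) for labels in all_labels]
    n_snapshots = len(all_labels[0])
    counts = np.zeros((n_snapshots, n_snapshots), dtype=int)
    for labels in all_labels:
        counts += labels[:, None] == labels[None, :]
    return counts

# Toy data: three runs over four snapshots
runs = [np.array([0, 0, 1, 1]),
        np.array([1, 1, 0, 0]),   # same partition, labels swapped
        np.array([0, 1, 1, 0])]   # a different partition
counts = co_clustering_counts(runs)
always_together = counts == len(runs)  # pairs grouped together in every run
print(always_together)
```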
@jbarnoud, there already is a maximum number of iterations, and that number must be set by the user.
OK, does 1 iteration mean a complete pass over all the data?
A default must be provided (=10).
What could be nice for the user is a plot of the "non-change" (NC).
Let me explain with x = 10 clusters and N = 10000 PB sequences:
- iteration 1 -> random initialisation of the x centers (profiles), first association of the N sequences with the x clusters, update of the profiles
- iteration 2 -> association of the N sequences with the x clusters; NC is the number of sequences associated with the same cluster as at the previous iteration
- iteration i+1 -> association of the N sequences with the x clusters; NC should increase... I hope
Can it be done?
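For instance, NC between two consecutive assignments could be computed along these lines (a rough sketch with toy label arrays, not the notebook's code); recording one value per iteration would give the plot:

```python
import numpy as np

def count_unchanged(previous_labels, current_labels):
    """NC: number of sequences assigned to the same cluster as at the previous iteration."""
    return int(np.sum(np.asarray(previous_labels) == np.asarray(current_labels)))

# Toy illustration: 10 clusters, 10000 sequences, ~5% of them change cluster
rng = np.random.default_rng(42)
labels_iter1 = rng.integers(0, 10, size=10000)
labels_iter2 = labels_iter1.copy()
moved = rng.random(10000) < 0.05
labels_iter2[moved] = rng.integers(0, 10, size=int(moved.sum()))

print(count_unchanged(labels_iter1, labels_iter2))  # roughly 9500 or a bit above
```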
A lot of stuff from @alexdb27.
About the empty clusters
It is in theory possible to have empty clusters with the k-means algorithm. The workaround is easy, though: here I choose my initial centers among the records, and I make sure they are not redundant. I think this is enough to avoid empty clusters, since there will be at least one record per cluster at the beginning.
Therefore there is no test about empty clusters as they should not happen.
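Roughly the idea (an illustration only, not the notebook code; duplicates are dropped before sampling so the chosen centers cannot be redundant):

```python
import random

def choose_initial_centers(records, n_clusters, seed=None):
    """Pick n_clusters distinct records as initial centers, so every cluster
    starts with at least one member (its own center)."""
    unique_records = list(dict.fromkeys(records))  # drop duplicates, keep order
    if len(unique_records) < n_clusters:
        raise ValueError("not enough distinct records for that many clusters")
    return random.Random(seed).sample(unique_records, n_clusters)

sequences = ["mmmmnopacd", "mmmmnopacd", "mmmnopacdd", "nopacddddd", "mmmmmnopac"]
print(choose_initial_centers(sequences, 3, seed=1))
```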
About clustering reproducibility
I am not sure I understood everything you asked. I carried out the clustering of the same 270 PB sequences from an MD trajectory 100 times, and we can observe that the succession of clusters along the trajectory is not always the same. The figure is difficult to analyze; I will try to come up with something more quantitative and more readable.
About non-change plot
It's easy to get the number of sequences that change group at each iteration. I made a crude prototype, and indeed the number decreases at each iteration until it reaches 0. I'll have the plot available in a future version of the notebook. I may even use that as a criterion for convergence, as it is faster to compute than what I currently do.
About the user interface
This notebook is just a prototype to validate the algorithm. Once the method is validated, I will implement it in PBclust. At that point I will set a default value for the number of iterations, and I can make some plots available to the user.
I will come back to you all later to define which information and plots are the most pertinent to expose to the user. My feeling is that we should expose only what is most useful through the command line and keep the rest accessible through the API, which will mostly be used by advanced users.
On testing the method
I would like to test the pertinence of the clustering on structure similarity within the clusters. What do you usually use to compute GDT TS and TM-scores?
@jbarnoud, About clustering reproducibility
I am not sure I understood everything you asked. I carried out the clustering of the same 270 PB sequences from an MD trajectory 100 times, and we can observe that the succession of clusters along the trajectory is not always the same. The figure is difficult to analyze; I will try to come up with something more quantitative and more readable.
-> OK, I think I have cleared up our point of confusion. You cannot just look at cluster i in S(t) and check whether it matches cluster i in S(t+1). What you need is: (a) a confusion table based only on the data associated with each cluster, i.e. count how many data points are found both in a given cluster of S(t) and in a given cluster of S(t+1); (b) take the max of each line (or column), which gives you the correspondence between a cluster of S(t) and a cluster of S(t+1); (c) sum it all and you have the true agreement. Do that cycle (t) after cycle, and you will see whether the clustering is reproducible. See the sketch below.
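In code, the recipe could look like this (a sketch with toy label arrays; the max-per-line matching is the greedy version of the procedure above, and a stricter matching could use the Hungarian algorithm instead):

```python
import numpy as np

def clustering_agreement(labels_a, labels_b, n_clusters):
    """Confusion table between two clusterings of the same data, then the
    max per line to match clusters across the two runs; the sum of the
    matched counts measures how reproducible the partition is."""
    confusion = np.zeros((n_clusters, n_clusters), dtype=int)
    for a, b in zip(labels_a, labels_b):
        confusion[a, b] += 1
    matched = confusion.max(axis=1).sum()  # best match for each cluster of the first run
    return confusion, matched / len(labels_a)

# Toy data: the same partition with permuted labels, except for one frame
run_t  = [0, 0, 1, 1, 2, 2, 2, 0]
run_t1 = [2, 2, 0, 0, 1, 1, 0, 2]
table, agreement = clustering_agreement(run_t, run_t1, 3)
print(agreement)  # 0.875: 7 of the 8 frames fall into matching clusters
```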
@jbarnoud, On testing the method
I would like to test the pertinence of the clustering on structure similarity within the clusters. What do you usually use to compute GDT TS and TM-scores?
It is mainly RMSD. GDT TS and TM-score would not be very sensitive for such highly similar structures. :-)
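For reference, a pairwise RMSD after optimal superposition can be computed with the Kabsch algorithm; here is a self-contained numpy sketch, assuming (N, 3) coordinate arrays have already been extracted for each frame (in practice a library such as MDAnalysis could provide this):

```python
import numpy as np

def rmsd_after_superposition(coords_a, coords_b):
    """RMSD between two (N, 3) coordinate sets after optimal superposition
    (Kabsch algorithm), e.g. to compare two frames within one cluster."""
    a = coords_a - coords_a.mean(axis=0)
    b = coords_b - coords_b.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = a @ rotation.T - b
    return np.sqrt((diff ** 2).sum() / len(a))

# Sanity check: a rotated and translated copy gives an RMSD close to zero
frame1 = np.random.default_rng(0).random((50, 3))
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
frame2 = frame1 @ rot_z.T + 2.0
print(rmsd_after_superposition(frame1, frame2))  # ~1e-15
```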
Impressive indeed and very nice.
About scipy, I was just wondering whether the built-in k-means clustering implemented in scipy would be easier to use or quicker.
Great job Jonathan!
RMSD could be a nice measure for the different clusters. The issue with a regular MD (the 270 sequences you tested) is knowing the right number of clusters. Maybe 4 is not a good one, hence the reproducibility is hard to assess.
The issue, I think, with the built-in k-means is that it is really difficult to use a custom distance metric and a custom representation of the centroids.
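To illustrate the point, here is what a bare-bones k-means loop with a pluggable distance and centroid update could look like; the Hamming distance and per-position consensus below are toy stand-ins, not what PBclust actually uses:

```python
from collections import Counter

def kmeans_custom(records, initial_centers, distance, update_center, max_iter=10):
    """Bare-bones k-means where both the record/center distance and the way a
    center is rebuilt from its members are supplied by the caller (which is
    what the scipy built-in does not allow)."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda k: distance(record, centers[k]))
                      for record in records]
        if new_labels == labels:  # nothing moved: converged
            break
        labels = new_labels
        centers = [update_center([r for r, l in zip(records, labels) if l == k])
                   for k in range(len(centers))]
    return labels, centers

# Toy stand-ins for the real PB metric and profile update
def hamming(seq_a, seq_b):
    return sum(x != y for x, y in zip(seq_a, seq_b))

def consensus(members):
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*members))

sequences = ["mmmnop", "mmmnoa", "ghiacd", "ghiacb"]
labels, centers = kmeans_custom(sequences, ["mmmnop", "ghiacd"], hamming, consensus)
print(labels, centers)  # [0, 0, 1, 1] and the two consensus sequences
```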
Impressive.