@jbarnoud, a point about empty clusters. It was a problem in an early R implementation: if you have in fact x clusters but you ask for 10x, at some point some of the 10x clusters remain unused.
I suppose it will not happen with this kind of data, to be honest, but it would be nice to handle it just in case.
Concerning the comparison between clusterings, I'm not sure I've understood everything.
In the example, I would like to know the number of frames; we need a view with (a) a large number of snapshots (perhaps even from different MDs, for the same protein of course) and (b) a fairly large number of clusters (at least 10+).
In my mind, if I run the k-means 10 times, I would like to know the number of snapshots that are always found together (not only with two simulations).
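Something like a co-clustering count would capture that: run the k-means several times, and for every pair of snapshots count in how many runs they end up in the same cluster. A minimal sketch (the label arrays here are toy data, not anything from the notebook):

```python
import numpy as np

def co_clustering_counts(all_labels):
    """Count, for every pair of snapshots, in how many k-means runs
    they end up in the same cluster."""
    all_labels = [np.asarray(labels) for labels in all_labels]
    n_snapshots = len(all_labels[0])
    counts = np.zeros((n_snapshots, n_snapshots), dtype=int)
    for labels in all_labels:
        counts += labels[:, None] == labels[None, :]
    return counts

# Toy data: three runs over four snapshots
runs = [np.array([0, 0, 1, 1]),
        np.array([1, 1, 0, 0]),   # same partition, labels swapped
        np.array([0, 1, 1, 0])]   # a different partition
counts = co_clustering_counts(runs)
always_together = counts == len(runs)  # pairs grouped together in every run
print(always_together)
```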
@jbarnoud, there already is a maximum number of iterations, and that number must be set by the user.
OK, does 1 iteration mean a complete pass over all the data?
A default must be provided (=10).
What could be nice for the user is a plot of the "non-change" (NC).
Let me explain with x = 10 clusters and N = 10000 PB sequences:
- iteration 1 -> random initialisation of the x centers (profiles), first association of the N sequences with the x clusters, update of the profiles
- iteration 2 -> association of the N sequences with the x clusters; NC is the number of sequences associated with the same cluster as at the previous iteration
- iteration i+1 -> association of the N sequences with the x clusters; NC should increase... I hope
Can it be done?
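For instance, NC between two consecutive assignments could be computed along these lines (a rough sketch with toy label arrays, not the notebook's code); recording one value per iteration would give the plot:

```python
import numpy as np

def count_unchanged(previous_labels, current_labels):
    """NC: number of sequences assigned to the same cluster as at the previous iteration."""
    return int(np.sum(np.asarray(previous_labels) == np.asarray(current_labels)))

# Toy illustration: 10 clusters, 10000 sequences, ~5% of them change cluster
rng = np.random.default_rng(42)
labels_iter1 = rng.integers(0, 10, size=10000)
labels_iter2 = labels_iter1.copy()
moved = rng.random(10000) < 0.05
labels_iter2[moved] = rng.integers(0, 10, size=int(moved.sum()))

print(count_unchanged(labels_iter1, labels_iter2))  # roughly 9500 or a bit above
```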
A lot of stuff from @alexdb27.
About the empty clusters
It is in theory possible to have empty clusters with the k-means algorithm. The workaround is easy, though: here I choose my initial centers among the records, and I make sure they are not redundant. I think this is enough to avoid empty clusters, since there will be at least one record per cluster at the beginning.
Therefore there is no test about empty clusters as they should not happen.
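Roughly the idea (an illustration only, not the notebook code; duplicates are dropped before sampling so the chosen centers cannot be redundant):

```python
import random

def choose_initial_centers(records, n_clusters, seed=None):
    """Pick n_clusters distinct records as initial centers, so every cluster
    starts with at least one member (its own center)."""
    unique_records = list(dict.fromkeys(records))  # drop duplicates, keep order
    if len(unique_records) < n_clusters:
        raise ValueError("not enough distinct records for that many clusters")
    return random.Random(seed).sample(unique_records, n_clusters)

sequences = ["mmmmnopacd", "mmmmnopacd", "mmmnopacdd", "nopacddddd", "mmmmmnopac"]
print(choose_initial_centers(sequences, 3, seed=1))
```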
About clustering reproducibility
I am not sure I understood everything you asked. I carried out the clustering of the same 270 PB sequences from an MD trajectory 100 times, and we can observe that the succession of clusters along the trajectory is not always the same. The figure is difficult to analyze; I will try to come up with something more quantitative and more readable.
About non-change plot
It's easy to get the number of sequences that change group at each iteration. I made a crude prototype, and indeed the number decreases at each iteration until it reaches 0. I'll have the plot available in a future version of the notebook. I may even use that as a criterion for convergence, as it is faster to compute than what I currently do.
About the user interface
This notebook is just a prototype to validate the algorithm. Once the method is validated, I will implement it in PBclust. At that point I will set a default value for the number of iterations, and I can make some plots available to the user.
I will come back to you all later to define which information and plots are the most pertinent to expose to the user. My feeling is that we should expose only what is most useful through the command line and keep the rest accessible through the API, which will mostly be used by advanced users.
On testing the method
I would like to test the pertinence of the clustering on structure similarity within the clusters. What do you usually use to compute GDT TS and TM-scores?
@jbarnoud, About clustering reproducibility
I am not sure I understood everything you asked. I carried out the clustering of the same 270 PB sequences from an MD trajectory 100 times, and we can observe that the succession of clusters along the trajectory is not always the same. The figure is difficult to analyze; I will try to come up with something more quantitative and more readable.
-> OK, I think I have cleared up our point of confusion. You cannot just look at cluster i in S(t) and check whether it matches cluster i in S(t+1). What you need is: (a) a confusion table based only on the data associated with each cluster, i.e. count how many data points are found both in a given cluster of S(t) and in a given cluster of S(t+1); (b) take the max of each line (or column), which gives you the correspondence between a cluster of S(t) and a cluster of S(t+1); (c) sum it all and you have the true agreement. Do that cycle (t) after cycle, and you will see whether the clustering is reproducible. See the sketch below.
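In code, the recipe could look like this (a sketch with toy label arrays; the max-per-line matching is the greedy version of the procedure above, and a stricter matching could use the Hungarian algorithm instead):

```python
import numpy as np

def clustering_agreement(labels_a, labels_b, n_clusters):
    """Confusion table between two clusterings of the same data, then the
    max per line to match clusters across the two runs; the sum of the
    matched counts measures how reproducible the partition is."""
    confusion = np.zeros((n_clusters, n_clusters), dtype=int)
    for a, b in zip(labels_a, labels_b):
        confusion[a, b] += 1
    matched = confusion.max(axis=1).sum()  # best match for each cluster of the first run
    return confusion, matched / len(labels_a)

# Toy data: the same partition with permuted labels, except for one frame
run_t  = [0, 0, 1, 1, 2, 2, 2, 0]
run_t1 = [2, 2, 0, 0, 1, 1, 0, 2]
table, agreement = clustering_agreement(run_t, run_t1, 3)
print(agreement)  # 0.875: 7 of the 8 frames fall into matching clusters
```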
@jbarnoud, On testing the method
I would like to test the pertinence of the clustering on structure similarity within the clusters. What do you usually use to compute GDT TS and TM-scores?
It is mainly RMSD. GDT TS and TM-score would not be very sensitive for such highly similar structures. :-)
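For reference, a pairwise RMSD after optimal superposition can be computed with the Kabsch algorithm; here is a self-contained numpy sketch, assuming (N, 3) coordinate arrays have already been extracted for each frame (in practice a library such as MDAnalysis could provide this):

```python
import numpy as np

def rmsd_after_superposition(coords_a, coords_b):
    """RMSD between two (N, 3) coordinate sets after optimal superposition
    (Kabsch algorithm), e.g. to compare two frames within one cluster."""
    a = coords_a - coords_a.mean(axis=0)
    b = coords_b - coords_b.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = a @ rotation.T - b
    return np.sqrt((diff ** 2).sum() / len(a))

# Sanity check: a rotated and translated copy gives an RMSD close to zero
frame1 = np.random.default_rng(0).random((50, 3))
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
frame2 = frame1 @ rot_z.T + 2.0
print(rmsd_after_superposition(frame1, frame2))  # ~1e-15
```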
Impressive indeed and very nice.
About scipy, I was just wondering whether the built-in k-means clustering implemented in scipy would be easier to use or quicker.
Great job Jonathan!
RMSD could be a nice measure for the different clusters. The issue with a regular MD (the 270 sequences you tested) is knowing the right number of clusters. Maybe 4 is not a good one, hence the reproducibility is hard to assess.
The issue, I think, with the built-in k-means is that it is really difficult to use a custom distance metric and a custom representation of the centroids.
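To illustrate the point, here is what a bare-bones k-means loop with a pluggable distance and centroid update could look like; the Hamming distance and per-position consensus below are toy stand-ins, not what PBclust actually uses:

```python
from collections import Counter

def kmeans_custom(records, initial_centers, distance, update_center, max_iter=10):
    """Bare-bones k-means where both the record/center distance and the way a
    center is rebuilt from its members are supplied by the caller (which is
    what the scipy built-in does not allow)."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda k: distance(record, centers[k]))
                      for record in records]
        if new_labels == labels:  # nothing moved: converged
            break
        labels = new_labels
        centers = [update_center([r for r, l in zip(records, labels) if l == k])
                   for k in range(len(centers))]
    return labels, centers

# Toy stand-ins for the real PB metric and profile update
def hamming(seq_a, seq_b):
    return sum(x != y for x, y in zip(seq_a, seq_b))

def consensus(members):
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*members))

sequences = ["mmmnop", "mmmnoa", "ghiacd", "ghiacb"]
labels, centers = kmeans_custom(sequences, ["mmmnop", "ghiacd"], hamming, consensus)
print(labels, centers)  # [0, 0, 1, 1] and the two consensus sequences
```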
Impressive.