import numpy as np


def smart_procrustes_align_gensim(base_embed, other_embed, words=None):
    """
    Original script: https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf
    Procrustes align two gensim word2vec models (to allow for comparison between same word across models).
    Code ported from HistWords <https://github.com/williamleif/histwords> by William Hamilton.

    First, intersect the vocabularies (see `intersection_align_gensim` documentation).
    Then do the alignment on the other_embed model.
    Replace the other_embed model's `wv.vectors` numpy matrix with the aligned version.
    Return other_embed (the same object, modified in place).

    If `words` is set, intersect the two models' vocabulary with the vocabulary in words (see `intersection_align_gensim` documentation).
    """

    # patch by Richard So (https://twitter.com/richardjeanso) (thanks!) to update this code for new version of gensim
    # base_embed.init_sims(replace=True)
    # other_embed.init_sims(replace=True)

    # make sure vocabulary and indices are aligned
    in_base_embed, in_other_embed = intersection_align_gensim(base_embed, other_embed, words=words)

    # get the (normalized) embedding matrices
    base_vecs = in_base_embed.wv.get_normed_vectors()
    other_vecs = in_other_embed.wv.get_normed_vectors()

    # matrix product of the two embedding matrices (dimension x dimension)
    m = other_vecs.T.dot(base_vecs)
    # SVD method from numpy (note: `v` is returned already transposed)
    u, _, v = np.linalg.svd(m)
    # orthogonal matrix that best maps other_vecs onto base_vecs
    ortho = u.dot(v)
    # Replace original array with modified one, i.e. multiplying the embedding matrix by "ortho"
    other_embed.wv.vectors = (other_embed.wv.vectors).dot(ortho)

    return other_embed

def intersection_align_gensim(m1, m2, words=None):
    """
    Intersect two gensim word2vec models, m1 and m2.
    Only the shared vocabulary between them is kept.
    If 'words' is set (as list or set), then the vocabulary is intersected with this list as well.
    Indices are re-organized from 0..N in order of descending frequency (=sum of counts from both m1 and m2).
    These indices correspond to the new `wv.vectors` matrices in both gensim models:
        -- so that Row 0 of m1.wv.vectors will be for the same word as Row 0 of m2.wv.vectors
        -- you can find the index of any word in the key_to_index dict: model.wv.key_to_index[word] => 2
    The key_to_index dictionary and index_to_key list are rebuilt for each model to match the new row order.
    """

    # Get the vocab for each model
    vocab_m1 = set(m1.wv.index_to_key)
    vocab_m2 = set(m2.wv.index_to_key)

    # Find the common vocabulary
    common_vocab = vocab_m1 & vocab_m2
    if words:
        common_vocab &= set(words)

    # If no alignment necessary because vocab is identical...
    if not vocab_m1 - common_vocab and not vocab_m2 - common_vocab:
        return (m1, m2)

    # Otherwise sort by frequency (summed for both)
    common_vocab = list(common_vocab)
    common_vocab.sort(key=lambda w: m1.wv.get_vecattr(w, "count") + m2.wv.get_vecattr(w, "count"), reverse=True)
    # print(len(common_vocab))

    # Then for each model...
    for m in [m1, m2]:
        # Replace old vectors array with new one (with common vocab)
        indices = [m.wv.key_to_index[w] for w in common_vocab]
        old_arr = m.wv.vectors
        new_arr = np.array([old_arr[index] for index in indices])
        m.wv.vectors = new_arr

        # Replace old key_to_index dictionary with new one (with common vocab)
        # and old index_to_key with new one
        new_key_to_index = {}
        new_index_to_key = []
        for new_index, key in enumerate(common_vocab):
            new_key_to_index[key] = new_index
            new_index_to_key.append(key)
        m.wv.key_to_index = new_key_to_index
        m.wv.index_to_key = new_index_to_key

        print(len(m.wv.key_to_index), len(m.wv.vectors))

    return (m1, m2)

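For anyone wondering what the SVD step in `smart_procrustes_align_gensim` is doing, here is a minimal numpy-only sketch on toy data. It is not part of the gist: the helper name `procrustes_rotation` and the random matrices are made up for illustration. It hides a known orthogonal matrix in `other` and checks that the same `u.dot(v)` recipe recovers it.

```python
import numpy as np


def procrustes_rotation(base_vecs, other_vecs):
    """Orthogonal matrix R that minimizes ||other_vecs @ R - base_vecs|| (Frobenius norm)."""
    m = other_vecs.T @ base_vecs
    u, _, vt = np.linalg.svd(m)  # numpy returns V already transposed
    return u @ vt


rng = np.random.default_rng(0)
base = rng.normal(size=(50, 5))  # 50 toy "words", 5 dimensions

# Build a known orthogonal matrix via QR and use it to distort `base`.
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
other = base @ q.T  # so that other @ q == base

ortho = procrustes_rotation(base, other)
aligned = other @ ortho

print(np.allclose(aligned, base))  # the distortion is undone
```

With real embeddings the two spaces are never exact rotations of each other, so `aligned` only approximates `base` for the shared words.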
I'm trying to run the code, but the output of smart_procrustes_align_gensim is just "other_embed" (i.e. "other_embed == aligned_embed" returns True). Is anyone else having this issue?
Here is the code I was using (I'm not super familiar with gensim so I might have an issue elsewhere in the code):
```python
import pandas as pd
import gensim
from gensim.models import Word2Vec
import numpy as np

base_model = gensim.models.Word2Vec(list_of_tokens_1)
other_model = gensim.models.Word2Vec(list_of_tokens_2)
aligned_mod = smart_procrustes_align_gensim(base_model, other_model)

# The line below returns True
aligned_mod == other_model
```
> I'm trying to run the code but the output from smart_procrustes_align_gensim is the "other_embed" (ie. "other_embed == aligned_embed" returns True). Is anyone else having this issue?
Hey @ajalvero, I have the same issue. Did you find a solution?
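@krkryger The line `other_embed.wv.vectors = (other_embed.wv.vectors).dot(ortho)` modifies `other_embed` in place, so the function returns the very object you passed in. Returning a deep copy instead works for me (this needs `import copy` at the top of the file):
other_embed_copy = copy.deepcopy(other_embed)
other_embed_copy.wv.vectors = (other_embed.wv.vectors).dot(ortho)
return other_embed_copy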
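To make the difference concrete, here is a small sketch that does not use gensim at all. `ToyModel` is a hypothetical stand-in that only holds a matrix, and it assumes (as seems to be the case for the gensim model) that the class defines no custom equality, so `==` falls back to object identity.

```python
import copy

import numpy as np


class ToyModel:
    """Hypothetical stand-in for a gensim model: it only holds a matrix."""

    def __init__(self, vectors):
        self.vectors = vectors


def align_in_place(model, ortho):
    # Same pattern as the gist: overwrite the vectors and return the input object.
    model.vectors = model.vectors.dot(ortho)
    return model


def align_copy(model, ortho):
    # Pattern from the comment above: leave the input untouched, return a copy.
    aligned = copy.deepcopy(model)
    aligned.vectors = model.vectors.dot(ortho)
    return aligned


ortho = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy matrix that swaps the two columns

model_a = ToyModel(np.array([[1.0, 2.0], [3.0, 4.0]]))
result_a = align_in_place(model_a, ortho)
print(result_a is model_a)  # True: same object, which is why == returns True

model_b = ToyModel(np.array([[1.0, 2.0], [3.0, 4.0]]))
result_b = align_copy(model_b, ortho)
print(result_b is model_b)  # False: the original keeps its old vectors
```

Note that in the in-place version the vectors are still rotated; the `True` only says that both names point to the same, already aligned, model.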
How can we convert the aligned models to the format required by the visualization scripts? Scripts such as 'closest_over_time_with_anns.py' plot a given word across different time spans, and they load embeddings that need to be in a specific format, e.g. '1910-w.npy' and '1910-vocab.pkl'. Any suggestions?
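One possible starting point, as a sketch only: going by the file names, this assumes `<prefix>-w.npy` holds the embedding matrix and `<prefix>-vocab.pkl` holds a pickled list of words in the same row order. That layout, the helper name `save_histwords_style`, and the choice of pickle protocol 2 (for Python 2 readers) are assumptions, so please check them against the loader in the HistWords repository before relying on this.

```python
import os
import pickle
import tempfile

import numpy as np


def save_histwords_style(vectors, words, prefix):
    """Write `<prefix>-w.npy` and `<prefix>-vocab.pkl` (assumed layout, see note above)."""
    vectors = np.asarray(vectors)
    words = list(words)
    if len(words) != vectors.shape[0]:
        raise ValueError("need exactly one word per row of the matrix")
    np.save(prefix + "-w.npy", vectors)
    with open(prefix + "-vocab.pkl", "wb") as f:
        pickle.dump(words, f, protocol=2)  # protocol 2 is an assumption


# Toy demonstration in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_prefix = os.path.join(demo_dir, "1910")
save_histwords_style(np.eye(3), ["a", "b", "c"], demo_prefix)
print(sorted(os.listdir(demo_dir)))
```

For an aligned gensim 4 model, the inputs would be `model.wv.vectors` and `model.wv.index_to_key`.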
Which gensim version was this designed for? It does not work for me with 4.3, and 4.0 doesn't work with Python 3.10.
# re-filling the normed vectors
in_base_embed.wv.fill_norms(force=True)
in_other_embed.wv.fill_norms(force=True)
# get the (normalized) embedding matrices
base_vecs = in_base_embed.wv.get_normed_vectors()
other_vecs = in_other_embed.wv.get_normed_vectors()
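My guess (an assumption, not something verified against the gensim source) is that forcing `fill_norms` matters because cached norms can be out of date once `wv.vectors` has been replaced. If you would rather not depend on the cache at all, row-wise L2 normalization is easy to do by hand in numpy. The helper name below is made up for this sketch.

```python
import numpy as np


def l2_normalize_rows(vectors):
    """Scale every row to unit length; all-zero rows are left as zeros."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return vectors / norms


vecs = np.array([[3.0, 4.0], [0.0, 2.0]])
normed = l2_normalize_rows(vecs)
print(normed)  # rows become [0.6, 0.8] and [0.0, 1.0]
```

In the alignment function this would be applied to `in_base_embed.wv.vectors` and `in_other_embed.wv.vectors` in place of the two `get_normed_vectors()` calls.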
I ran into the same issue. Any guidance would be appreciated.
I ended up using https://github.com/theochem/procrustes instead. Something like this:
import numpy as np
from gensim.models import KeyedVectors
from procrustes import rotational

# CURRENT_WORDS, base_words, base_embeddings and current_embeddings are defined elsewhere
common_words = sorted(CURRENT_WORDS.intersection(base_words))
print(f" Common words: {len(common_words)}")
common_words_embeddings_base = np.array([base_embeddings[word] for word in common_words])
common_words_embeddings_current = np.array([current_embeddings[word] for word in common_words])
# find the rotation matrix using orthogonal procrustes
result = rotational(common_words_embeddings_base, common_words_embeddings_current)
# apply the rotation matrix (result.t) to the base embeddings
base_words_embeddings_rotated = result.new_a.dot(result.t)
rotated_model = KeyedVectors(300)  # 300 = vector size of these embeddings
rotated_model.add_vectors(common_words, base_words_embeddings_rotated)
rotated_model.save("aligned.kv")
# Now release the memory and load the aligned vectors again

@estebarb Thank you so much for sharing!
I hit the same error. Totally agree with your solution. :D