@mogumogu2333
Last active June 1, 2017 18:08
Given a list of documents, return the keywords of each document, ranked by TF-IDF score.
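from sklearn.feature_extraction.text import TfidfVectorizer


def get_tfidf_features(docs):
    # Ignore terms that appear in more than 80% of the documents.
    tf = TfidfVectorizer(min_df=1, max_df=0.8)
    tfidf_matrix = tf.fit_transform(docs)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onwards.
    feature_names = tf.get_feature_names_out()
    keywords_list = []
    for doc in range(len(docs)):
        # Column indices of the terms that occur in this document.
        feature_index = tfidf_matrix[doc, :].nonzero()[1]
        # zip() returns an iterator in Python 3, so use sorted() instead of .sort().
        tfidf_scores = sorted(
            zip(feature_index, [tfidf_matrix[doc, x] for x in feature_index]),
            key=lambda t: t[1], reverse=True)
        data = []
        for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
            # Scores are in descending order, so stop at the first one below the cutoff.
            if s < 0.1:
                break
            # print(w, s)
            data.append(w)
        # One comma-separated keyword string per document.
        keywords_list.append(','.join(data))
    return keywords_list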
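
To make the 0.1 cutoff easier to interpret, here is a rough pure-Python sketch of how the scores are computed, assuming the TfidfVectorizer defaults (smoothed idf, L2 normalization). The function name tfidf_keywords and its parameters are hypothetical and not part of the gist, and the whitespace tokenization is a simplification of scikit-learn's tokenizer, so results on real text may differ.

```python
import math
from collections import Counter


def tfidf_keywords(docs, threshold=0.1, max_df=0.8):
    """Hypothetical pure-Python approximation of the keyword selection."""
    # Simplified tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Drop terms that occur in more than max_df of the documents.
    vocab = {t for t, c in df.items() if c <= max_df * n_docs}

    keywords_list = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1.
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        # L2 norm of the document's weight vector (1.0 if the document is empty).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        # Highest score first; keep only terms at or above the threshold.
        ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        keywords_list.append(
            ','.join(t for t, w in ranked if w / norm >= threshold))
    return keywords_list
```

Because each document's scores are L2-normalized, they lie between 0 and 1, and a document with two equally weighted terms gives each a score of about 0.707. A cutoff of 0.1 therefore only removes terms that contribute little relative to the rest of the document.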