Last active September 2, 2017 11:16
ngrams.py [moved to language.ngrams - pip install language / https://github.com/SignalN/language/]
def __ngrams(s, n=3):
    # Raw n-grams on sequences.
    # If given a string, it will return char-level n-grams.
    # If given a list of words, it will return word-level n-grams.
    return list(zip(*[s[i:] for i in range(n)]))

def ngrams(s, n=3):
    # Does not take n-grams across word boundaries (' ').
    # If a word is shorter than n, the n-gram is the word.
    unpack = lambda l: sum(l, [])
    return unpack([__ngrams(w, n=min(len(w), n)) for w in s.split()])

def matching_ngrams(s1, s2, n=5):
    # See also: SequenceMatcher.get_matching_blocks
    ngrams1, ngrams2 = set(ngrams(s1, n=n)), set(ngrams(s2, n=n))
    return ngrams1.intersection(ngrams2)

def diff_ngrams(s1, s2, n=5):
    # Despite the name, this returns a similarity in [0, 1]:
    # the Dice coefficient of the two n-gram sets.
    ngrams1, ngrams2 = set(ngrams(s1, n=n)), set(ngrams(s2, n=n))
    matches = ngrams1.intersection(ngrams2)
    total = len(ngrams1) + len(ngrams2)
    # Guard against ZeroDivisionError when neither string contains a word.
    return 2 * len(matches) / total if total else 0.0
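A short usage sketch may help show what these functions return. The block below carries lightly adapted copies of the gist's functions so that it runs on its own. `raw_ngrams` is a stand-in name for `__ngrams`, and returning 0.0 for empty input is an assumption of this sketch, not something the gist specifies.

```python
def raw_ngrams(seq, n=3):
    # Stand-in for the gist's __ngrams: zip n shifted copies of the sequence.
    return list(zip(*[seq[i:] for i in range(n)]))

def ngrams(s, n=3):
    # Per-word n-grams; a word shorter than n yields itself as one tuple.
    out = []
    for w in s.split():
        out.extend(raw_ngrams(w, n=min(len(w), n)))
    return out

def matching_ngrams(s1, s2, n=5):
    return set(ngrams(s1, n=n)) & set(ngrams(s2, n=n))

def diff_ngrams(s1, s2, n=5):
    a, b = set(ngrams(s1, n=n)), set(ngrams(s2, n=n))
    total = len(a) + len(b)
    # Assumption: define the score as 0.0 when there are no n-grams at all.
    return 2 * len(a & b) / total if total else 0.0

# Char-level trigrams of a string.
print(raw_ngrams("hello", n=3))
# Word-level bigrams of a list of words.
print(raw_ngrams(["a", "b", "c", "d"], n=2))
# "hi" is shorter than 3, so it becomes the single tuple ('h', 'i').
print(ngrams("hi there", n=3))
# Shared trigrams, and the Dice score 2 * 3 / (3 + 4).
print(matching_ngrams("night", "nights", n=3))
print(diff_ngrams("night", "nights", n=3))
```

Because the score is computed on sets, repeated n-grams count once, and a higher value means the two strings are more similar.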
Author
This is great -- thank you! :-D
I added your commented text and additions to 'ngrams.py' and edited these two lines to get the script working (Python 3.6):
#def __test(s1, s2, n=n):
def __test(s1, s2, n):
#def test(n=n):
def test(n):
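One likely reason the original signatures failed: Python evaluates a default such as `n=n` once, when the `def` statement runs, so it needs a name `n` that is already bound in the enclosing scope. The test code itself is not shown here, so the sketch below uses hypothetical function names to illustrate the behavior.

```python
error = None
try:
    # The default expression `n` is looked up right now, at definition time.
    # No `n` is bound in this scope, so the def statement itself fails.
    def broken_test(n=n):
        return n
except NameError as exc:
    error = exc

print(type(error).__name__)

# Dropping the default, as in the fix above, defers the value to the caller.
def fixed_test(n):
    return n

print(fixed_test(3))
```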
Author
@victoriastuart My pleasure, and thank you for the fix. I've updated it now.
We will be releasing this as a small repo and package on pip soon.