@bittlingmayer
Last active September 2, 2017 11:16
ngrams.py [moved to language.ngrams - pip install language / https://github.com/SignalN/language/]
def __ngrams(s, n=3):
    # Raw n-grams on sequences.
    # If given a string, it will return char-level n-grams.
    # If given a list of words, it will return word-level n-grams.
    return list(zip(*[s[i:] for i in range(n)]))


def ngrams(s, n=3):
    # Does not take n-grams across word boundaries (' ').
    # If a word is shorter than n, the n-gram is the word.
    unpack = lambda l: sum(l, [])
    return unpack([__ngrams(w, n=min(len(w), n)) for w in s.split()])


def matching_ngrams(s1, s2, n=5):
    # See also: SequenceMatcher.get_matching_blocks
    ngrams1, ngrams2 = set(ngrams(s1, n=n)), set(ngrams(s2, n=n))
    return ngrams1.intersection(ngrams2)


def diff_ngrams(s1, s2, n=5):
    # Despite the name, this is a similarity score:
    # 2 * |shared| / (|ngrams1| + |ngrams2|), from 0.0 (nothing shared)
    # to 1.0 (identical n-gram sets).
    # Raises ZeroDivisionError if neither string contains any words.
    ngrams1, ngrams2 = set(ngrams(s1, n=n)), set(ngrams(s2, n=n))
    matches = ngrams1.intersection(ngrams2)
    return 2 * len(matches) / (len(ngrams1) + len(ngrams2))
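
To make the comments in the functions above concrete, here is a minimal sketch of what the helpers return. It restates the two n-gram functions under the hypothetical names `raw_ngrams` and `word_ngrams` (renamed here only so the block is self-contained; they are not part of the gist), and it also calls `difflib.SequenceMatcher.get_matching_blocks`, the standard-library method the "See also" comment points to.

```python
from difflib import SequenceMatcher


def raw_ngrams(s, n=3):
    # Same zip-of-shifted-slices trick as __ngrams (hypothetical name).
    return list(zip(*[s[i:] for i in range(n)]))


def word_ngrams(s, n=3):
    # Same behavior as ngrams: split on whitespace, never cross a word boundary.
    grams = []
    for w in s.split():
        grams.extend(raw_ngrams(w, n=min(len(w), n)))
    return grams


# A string yields character-level n-grams.
char_level = raw_ngrams("test", n=3)

# A list of words yields word-level n-grams.
word_level = raw_ngrams(["go", "to", "youtube"], n=2)

# Words shorter than n come back whole, as a single tuple each.
bounded = word_ngrams("Go to", n=3)

# For comparison: SequenceMatcher reports contiguous matching blocks
# as (a, b, size) triples, ending with a zero-length sentinel.
blocks = SequenceMatcher(None, "Happy birthday.", "Happy birthday").get_matching_blocks()

print(char_level)
print(word_level)
print(bounded)
print(blocks[0])
```

Note that the n-grams are tuples of characters (or words), not strings, which is why they can be put into sets directly in `matching_ngrams` and `diff_ngrams`.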

bittlingmayer commented Jun 24, 2017

def __test(s1, s2, n=3):
    print(s1)
    print(s2)
    print(diff_ngrams(s1, s2, n=n))

def test(n):
    __test("This is a test.", "Eto ne test.", n=n)
    __test("Go to youtube.com", "Idi na youtube.com", n=n)
    __test("Microsoft is a company.", "Microsoft - компания.", n=n)
    __test("Microsoft is a company.", "Microsoft- ը ընկերություն է:", n=n)
    __test("Microsoft is a company.", "Majkrosoft je kompanija.", n=n)
    __test("Happy birthday.", "Happy birthday", n=n)
    __test("Happy birthday.", "Happy birthday.", n=n)
    __test("Happy birthday.", "С днем рождения!", n=n)

test(5)

Prints:

This is a test.
Eto ne test.
0.2857142857142857
Go to youtube.com
Idi na youtube.com
0.7777777777777778
Microsoft is a company.
Microsoft - компания.
0.45454545454545453
Microsoft is a company.
Microsoft- ը ընկերություն է:
0.37037037037037035
Microsoft is a company.
Majkrosoft je kompanija.
0.25
Happy birthday.
Happy birthday
0.9090909090909091
Happy birthday.
Happy birthday.
1.0
Happy birthday.
С днем рождения!
0.0
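
The scores above can be checked by hand. With n=5, "This is a test." gives four n-grams (the short words "This", "is" and "a" each count as one) and "Eto ne test." gives three; only the 5-gram for "test." is shared, so the score is 2 * 1 / (4 + 3) ≈ 0.2857. The sketch below repeats that arithmetic with a hypothetical helper, `word_ngrams`, that is assumed to mirror the gist's `ngrams`; it assumes Python 3, where `/` is true division.

```python
def word_ngrams(s, n):
    # Hypothetical stand-in for the gist's ngrams(): per-word character
    # n-grams, with words shorter than n kept whole.
    grams = []
    for w in s.split():
        k = min(len(w), n)
        grams.extend(zip(*[w[i:] for i in range(k)]))
    return grams


def dice(s1, s2, n):
    # Same formula as diff_ngrams: 2 * |shared| / (|A| + |B|).
    a, b = set(word_ngrams(s1, n)), set(word_ngrams(s2, n))
    return len(a), len(b), len(a & b), 2 * len(a & b) / (len(a) + len(b))


first = dice("This is a test.", "Eto ne test.", 5)
sixth = dice("Happy birthday.", "Happy birthday", 5)

print(first)  # set sizes, shared count, score for the first pair
print(sixth)  # the trailing "." adds one unshared 5-gram to the left side
```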

@victoriastuart

This is great -- thank you! :-D

I added the test code from your comment to your 'ngrams.py' and edited these two lines to get the script working (Python 3.6):

#def __test(s1, s2, n=n):
def __test(s1, s2, n):

#def test(n=n):
def test(n):
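
The thread does not say what error the original lines produced. A likely cause, assuming no module-level `n` existed, is that Python evaluates default values once, when the `def` statement runs, so `n=n` looks up a global `n` at definition time. A minimal sketch of that behavior:

```python
# Default values are evaluated when the function is defined, not when it
# is called. With no global n in scope, the def statement itself fails.
try:
    exec("def test(n=n):\n    return n", {})  # fresh, empty namespace
    error = None
except NameError as exc:
    error = str(exc)

# Dropping the default (as in the fix above) defines without complaint.
namespace = {}
exec("def test(n):\n    return n", namespace)
result = namespace["test"](5)

print(error)
print(result)
```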

@bittlingmayer

@victoriastuart My pleasure, and thank you for the fix. I've updated the gist now.

We will be releasing this as a small repo and package on pip soon.
