This is a very rough set of notes taken in 2022 to help engineers understand the logic as well as the nontrivial Haskell concepts (e.g. GHC extensions) embodied in that logic.
It is not an official guide from the Duckling authors, and there are likely at least a few errors.
These notes only cover the core Duckling engine, not any logic specific to a specific Dimension. Most dimensions are relatively straightforward, but some are quite complicated - particularly Time - and these notes do not cover those.
Let's trace a request in debug mode, something like
debug <some_locale> "some text" [Seal <SomeDimension>]
--
-- locale input targets
--
-- ^ these are the arg names, sort of (they change unnecessarily as we
-- go down the stack, I'm planning to put a commit out unifying them where
-- I can soon)
debugis reallydebugCustom, and it- converts the Seal Dimension list into a hash set
- calls
analyze sentence context options <that_hash_set> - calls
debugTokens sentence <output_of_analyze>
debugTokensisn't critical to understand for core duckling- it maps
formatToken sentenceover the tokens from analyze - and it calls
mapM_ (ptree sentence)over that (in an IO)ptreeis really just recursivepnode, an easy skim
- it further returns the formatted tokens to the caller
- it maps
analyzefromApi.hsis the real entrypoint- type:
Text -> Context -> Options -> HashSet (Seal Dimension) -> [ResolvedToken]
- type:
- quick note about some other endpoints that call
analyze:- the
parsefunction is similar todebugCustom; like it it takes the Seal Dimension collection as a list not a set and callsformatTokento convert theResolvedTokensintoEntitys - the webserver from
exe/ExampleMain.hscallsparsefrom its handler
- the
- back to
analyze, here's what it does:- calls
rulesFor locale targetsto get a[Rule]list - calls
parseAndResolve <that_rule_list> inputto get a[ResolvedToken] - filters that list to just appropriate dimensions
- calls
rankon the results, which performs naive Bayes ranking
- calls
Some of Duckling's logic involves types that aren't vanilla Haskell - even if you're used to Sigma-flavor code they could be nontrivial to understand - so let's walk through them.
Dimensionis a GADT that wraps custom dimensions (which have to implement theCustomDimensiontypeclass declaring dependent dimensions, how to findRules, etc). It's hashable, although interestingly I think the way this is done may only allow one CustomDimension at a time!- There's also a
GEqinstance, which I think is a generalized equality and uses some interesting type magic... not critical to understand now
- There's also a
Tokenbinds together aDimension aand ana- the
ahere will be a "data" type for a dim, e.g.TimeData,DurationData - it has to implement the
Resolvetypeclass inResolve.hs, which defines- defines a
ResolvedValue atype (this is usingTypeFamilies; it's not really a polymorphic type, it's a type-level function, e.g.ResolvedValue AmountOfMondeyData = AmountOfMoneyValue) resove :: Context -> Options -> a -> Maybe (ResolvedValue a, Bool)- here the
Boolis an is_latent flag, see TODO above
- here the
- defines a
- the
ResolvedValbinds together (in a type which itself is not polymorphic) a(Dimension a)and a(ResolvedValue a)... this is a nice example of whereforallis helpful; theEqinstance is also instructive- Note:
ResolvedValue atype isn't a vanilla polymorphic type, it's a type family type defined as part of theResolvetypeclass going on here with a language extension I don't know about
- Note:
- a
Nodeis aRange,Token,children :: [Node], andrule :: Maybe Text - a
ResolvedTokenis aRange,Node,ResolvedValandisLatent :: bool- it's basically a
NodewithTokenconverted toResolvedVal, but it still keeps track of the originalNode
- it's basically a
- a
Candidateis a tripleCandidate ResolvedToken Double Bool- this is used for ranking, the Double is a score and the Bool indicates whether this is a dimension explicitly requested by the caller
- a
Range Int Intis ... well, a range - the types that go into rules. Thse are familiar if you've done "user-land" Duckling:
type Production = [Token] -> Maybe Tokentype Predicate = [Token] -> Booldata PatternItem = Regex PCRE.Regex | Predicate Predicatetype Pattern = [PatternItem]- a
Ruleis a name, Pattern, and Production
- an
Entityis basically just a serializableResolvedVal...- the
ResolvedValis actually still nested inside it - but all the important fields are extracted and flattened
- and we also convert the range into the actual matching text
- see the
formatTokenfunction inApi.hsthat we already looked at - I think this is what's actually exposed via json but it's not used by the core engine
- the
- two of the
PatternItem-producing helper functions live here- the
regexhelper function that returns aRegexvalue - the
dimensionhelper function makes aPredicatefor whether aTokenmatches aDimension
- the
Let's recap the flow concisely, because having this in mind makes the Engine.hs code much easier to follow:
- a
Tokenis the raw output of aProduction, it has a dimension and a value of the "data" type - a
Nodewraps aTokenin metadata, including potentially children (soNodes can form a tree) - a
ResolvedTokencomes from converting aNodeto aResolvedVal - a
Candidatewraps aResolvedTokenwith some classifier-related metadata for ranking - an
Entityis a nicely-serializable value computed from aResolvedToken
Optional further reading on the not-so-vanilla-Haskell things here:
- For the
GADTandTypeFamiliesmagic going on inDimensionandResolve:- wikipedia has a quick intro, and the ghc wiki has painful details
- Here's a concise step-by-step walkthroughl of several related GHC extensions, and various ways to use type families
- In Duckling using the simple "type synonym" flavor of type families described here
- Here's a longer guide with more motivation, but I haven't read it fully yet
- NFData, which almost all our basic token/node/etc values either implement or derive,
is all about controlling strictness. You don't need to fully understand this to follow the Duckling core logic, but it's helps to have a vague idea what the
seqanddeepseqandforcefunctions randomly popping up in our code do.- FP Complete put out a lengthy blog about laziness, strictness, and NFData that aims to be beginner friendly
Unlike the ones discussed above, these are all internal to the Engine code (they aren't exposed when writing new rules) and they're all pretty vanilla:
Stash defined in Duckling.Types.Stash.hs, newtypes a Data.IntMap.Strict.IntMap (HashSet Node)
- each key is a start position
- the nodes form a HashSet because multiple matches can happen from the same position
Document defined in Duckling.Types.Document.hs, converts Text to bytes and wraps with metadata
Duckling a = Identity a
- also
runDuckling ma = runIdentity ma - you can think of a
Duckling aas just anabut with alternative syntax via a dummy monad
type Match = (Rule, Int, [Node]) This is a partially-matched rule
- what is in the
Rule?- recall that a
Ruleconsists of aname, pattern, production - Inside the engine, what we do is we match one
PatternItemat a time, and we make a newRulewhosepatternis one item shorter. The rule only ever produces a match (in which case theproduction) gets called - thats what the
Rulein aMatchis: it's a partially-matchedRulefrom the original DSL, whith only the trailing so-far-unmatchedPatternItems still left inpattern
- recall that a
- what is the
Int?- It's the end position of the text matched so far (by the
Nodes that we've already matched from the front ofpattern) - Remember that the
StashindexesNodes by start position. This means we can line up partial matches against potential next tokens by lining up end position and start position
- It's the end position of the text matched so far (by the
- what is the
[Node]?- It's the tokens matched by whatever
[PatternItems]we've already popped off the front ofRule.pattern. We build it back to front, so it's in reverse order. We'll need this so that we can extractTokenvalues in order to run theproductionwhenever aRuleis fully-matched.
- It's the tokens matched by whatever
The core of Engine is the parseString1 function that gets called inside a recursive loop to build parse trees. There are three high-level plumbing functions that create the recursive loop to drive it.
parseAndResolve : (rules: [Rule]) (input: Text) context options -> [ResolvedToken]
- calls
Document.fromText inputto get aDocument. - calls
parseString rules <that_document>to get aDuckling Stash- remember
Duckling = Identity, so really this is just aStash ~= IntMap (HashSet Node)
- remember
- calls
runDucklingon the resultingStash, which just extracts theStashinside - calls
toPosOrderedListto convert theStashto a[Node], and callsforcewhich is anNFData+ lazy thing that probably matters to the performance but not business logic - calls
mapMaybe (resolveNode context options)on the resulting datamapMaybeis becauseresolveNodereturns an option, it can decline to resolve aNode- this is where the
NodetoResolvedToken(hence also theTokentoResolvedVal) happens; it's where theTimecallbacks get called and latent flags get extracted, etc - Note that we don't use
contextandoptionsuntil this stage:Rules are context-free.
- the end result is a
[ResolvedToken]of nodes that resolved correctly. Recall that this list can have irrelevant values in it, we filter by dimension in theApi.hslayer
parseString :: [Rule] -> Document -> Duckling Stash builds up a stash of Nodes
- it begins with a call to
parseString1to get anew :: StashandpartialMatches :: [Match] - if that's empty it aborts, otherwise it calls
saturateParseStringwith theRules that start with a Predicate to recursively build a stash of all nested parse trees
saturateParseString :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling Stash the recursive loop
- this calls
parseString1 rules sentence stash new matchesto get(new', matches') - Base case: if
new'is null, we returnstash - Recursive case: tail call with
stash' = Stash.union stash new',new',matches' - ... Some notes on the invariants and why we need each argument:
- In the initial call from
parseAndResolvewe use the initialparseString1output for bothstashandnew, but in general they are actually quite different:stashis everything so far,newis just the stuff from the last iteration. - Why do we need both? Because in order to efficiently compute new
matches we shoudl only check where
newis relevant, not all ofstash
- In the initial call from
I suggest reviewing the discussion of type Match = (Rule, Int, [Node]) above real quick before you read this. Here's the signature:
parseString1 :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling (Stash, [Match])
Here's a quick rundown on the non-obvious arguments:
- the first Stash is stash and contains all of the Nodes we've computes so far
- the second Stash is new and contains just newly-added Nodes from last iteration
- the [Match] list is a monotonically-growing list of partial matches which are waiting
for a new Node to match the next Predicate (PatternItem) in the rule
- the output will be a Stash of new fully-matched Nodes and a possibly-larger
collection matches :: [Match] of partial matches
How does it work?
- At each stage, what we have is
- a list of
Matchvalues - partially matched Rules from the previous iteration - a bag of brand-new fully-matched
Nodes (which containTokens) from the previous iteration
- a list of
- So, there are two ways to get new partial matches:
- we might be able to advance a
Matchfrom before by one more step. Note that in this case thePatternItemwe match against will be contained within theMatch. This is whatmatchFirstdoes. - or, we might be able to start a new
Matchby matching the very firstPatternItemin one of the original DSLRules, due to a newly-availableTokencoming from aNodethat was fully-matched in the previous iteration. This is whatmatchFirstAnywheredoes.
- we might be able to advance a
- Then, there are some efficiency issues:
- we want to only use
newNodes when checking for the above - so, once we've found all matches that can advance using the above logic we want to "run them to completion" against the existing stash (and evaluating Regex PatternItems we come across as we go) before we move on to the next cycle, that way we can check only new nodes rather than the full stash in the next cycle
- moreover, anything that hits a
RegexPatternItemthat fails to match can be discarded - newNodes can't possibly turn it into a complete match - this is what the
matchAllfunction does
- we want to only use
- Once all that is done, we have a final set of partial
Matchvalues using theNodes computed thus far. We're ready to compute newNodes and iterate:- we check for any partial
Matchthat is now complete, and runrule.productionon it; if aTokenis returned we convert it to aNode - all the rest of our matches get added to the pool of potential matches. Note that we
cannot discard the old matches because, even if they didn't get advanced this cycle,
they might in the next cycle due to new
Nodes we just computed.
- we check for any partial
Our classifier is a naive Bayes classifier.
What this means is that
- we precompute a log-likelihood from the corpus for each Rule - that is, a rule that
can produce a single
Token - to produce a log likelihood for a full
ResolvedToken, we:- compute "features" from the associated
Node; there are two potential features- the concatenation of the rule names of all direct children
- the concatenation of all grains of all direct children (for some specific dimension types)
- ... note that the top level
Rulename is actually not a feature, this surprised me but I think it's because it's actually the mapping from children to parent we want to score
- sum up the log-likelihood for each feature
- sum up the result from doing ^ for all recursive
childrenof theNode
- compute "features" from the associated
The final result of this is a Candidate, and then we take all the top-ranked Candidate
(according to the Ord instance defined on Candidate
The logic for this lives in
Extraction.hs, which has the feature extraction codeRank.hs, which does the actual scoring and the recursion, as well as the final ranking