clj-duckling.util.learn

corpus->dataset

(corpus->dataset {:keys [context tests], :as corpus} rules feature-extractor logger)
Takes a corpus and a feature extractor and builds a dataset (phase 1.a. in clj-duckling.md).

extract-route-features

(extract-route-features token)
Extracts names of previous routes used to produce this route token.
This is the feature extractor we use.
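
A minimal sketch of the idea, not the library's implementation. It assumes a hypothetical token shape in which each token carries its producing rule under `:rule` and its direct children under `:route`; the name `route-rule-names` is illustrative only.

```clojure
;; Assumed token shape (hypothetical, for illustration):
;;   {:rule {:name "..."} :route [child-token ...]}
(defn route-rule-names
  "Returns the rule names of the direct children in the token's :route,
  skipping children that have no rule name (e.g. raw text tokens)."
  [token]
  (->> (:route token)
       (map #(get-in % [:rule :name]))
       (remove nil?)
       vec))

(def example-route-token
  {:rule  {:name "in <duration>"}
   :route [{:text "in"}
           {:rule  {:name "<integer> <unit-of-duration>"}
            :route [{:rule {:name "integer (numeric)"}}
                    {:rule {:name "minute (grain)"}}]}]})

(route-rule-names example-route-token)
```

Only the direct children are inspected here; deeper levels of the tree are left to whatever extracts features for those child tokens.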

judge-ml

(judge-ml stash classifiers)
Chooses the winning token using the classifiers.
Computes the probability of each rule according to its routes.
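
The selection step can be sketched as follows, assuming each candidate token already carries a score. The `:log-prob` key and the helper name `pick-winner` are assumptions made for this example, not part of the documented API.

```clojure
;; Hypothetical helper: picks the best candidate given a scoring function.
(defn pick-winner
  "Returns the candidate with the highest score according to score-fn,
  or nil when there are no candidates."
  [candidates score-fn]
  (when (seq candidates)
    (apply max-key score-fn candidates)))

(def example-candidates
  [{:text "tomorrow"      :log-prob -2.0}
   {:text "tomorrow at 5" :log-prob -0.5}])

(pick-winner example-candidates :log-prob)
```

`max-key` returns the last of several equally scored candidates, so a real implementation would probably need an explicit tie-breaking rule.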

print-dataset

(print-dataset dataset)
Prints the dataset to STDOUT.

route-prob

(route-prob route classifiers)
Computes the _log_ probability for a route.
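
Working with log probabilities lets the scores of nested rules be added instead of multiplied, which avoids underflow on deep routes. The sketch below illustrates that recursion under two assumptions: tokens have the hypothetical shape `{:rule {:name ...} :route [...]}`, and `log-probs` is a plain map from rule name to log probability, standing in for whatever the real classifiers return.

```clojure
;; Hypothetical stand-in for route-prob: sums the log probabilities of the
;; rules used in a token tree. Tokens without a :route are treated as basic
;; tokens and contribute 0.0; rules missing from log-probs also contribute 0.0.
(defn route-log-prob
  [token log-probs]
  (if (seq (:route token))
    (+ (get log-probs (get-in token [:rule :name]) 0.0)
       (reduce + 0.0 (map #(route-log-prob % log-probs) (:route token))))
    0.0))

(def example-scored-token
  {:rule  {:name "a"}
   :route [{:rule {:name "b"} :route [{:text "5"}]}
           {:text "pm"}]})

(route-log-prob example-scored-token {"a" -0.5 "b" -0.25})
```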

sentence->dataset

(sentence->dataset s context check rules feature-extractor dataset logger)
Enriches the dataset with examples derived from the sentence.

Args:
  s (string): a sentence
  context (map): the context
  check (func): fn that determines whether a winner is valid
  rules (map): the rules to apply to the sentence
  feature-extractor (func): fn that takes a token and returns its features
  dataset (vector): the existing dataset

Returns:
  vector: an enriched dataset [{<rule-name> [features, output]}]
        Output is true if the rule contributed successfully,
        false otherwise
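
To make the return shape concrete, here is a small sketch that follows one literal reading of `[{<rule-name> [features, output]}]`: a vector of single-entry maps. The helper `add-example` and the feature maps are hypothetical; the real function may accumulate examples differently.

```clojure
;; Hypothetical helper: appends one labelled example to the dataset.
;; features is whatever the feature extractor returned for the token;
;; output is true when the rule contributed successfully, false otherwise.
(defn add-example
  [dataset rule-name features output]
  (conj dataset {rule-name [features output]}))

(def example-dataset
  (-> []
      (add-example "in <duration>" {"<integer> <unit-of-duration>" 1} true)
      (add-example "year" {"integer (numeric)" 1} false)))
```

Because the existing dataset is passed in and a new one is returned, a caller can thread a single dataset through every sentence of a corpus with `reduce`.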

simple-feature-extractor

(simple-feature-extractor token)
A very simple feature extractor, to check that the pipeline works. Not used for now.
Takes a token, returns a vector of features
(can be anything as long as the model understands it).

subtokens

(subtokens token)
Returns the set of all tokens in the tree that eventually produced the given token
(including the token itself).
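
One way to collect such a set is a `tree-seq` walk. This is a sketch that assumes child tokens are stored under a `:route` key; the name `all-subtokens` is chosen for the example only.

```clojure
;; Walks the token tree depth-first and collects every node into a set.
;; Assumes (hypothetically) that a token's children live under :route.
(defn all-subtokens
  [token]
  (set (tree-seq #(seq (:route %)) :route token)))

(def leaf-in  {:text "in"})
(def leaf-num {:text "5"})
(def mid-tok  {:text "5 minutes" :route [leaf-num]})
(def root-tok {:text "in 5 minutes" :route [leaf-in mid-tok]})

(all-subtokens root-tok)
```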

train-classifiers

(train-classifiers corpus rules fextractor logger)
Given a corpus and a set of rules, trains one classifier per rule.
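
The "one classifier per rule" step can be sketched as grouping the dataset by rule name and handing each group to a training function. Everything below is an assumption made for illustration: the dataset is read as a vector of `{rule-name [features output]}` maps, and `train-fn` is a placeholder for the actual learner, which this page does not describe.

```clojure
;; Groups [features output] examples by rule name.
(defn examples-by-rule
  [dataset]
  (reduce (fn [acc entry]
            (reduce-kv (fn [m rule example]
                         (update m rule (fnil conj []) example))
                       acc
                       entry))
          {}
          dataset))

;; Applies train-fn (hypothetical) to the examples of each rule.
(defn train-per-rule
  [dataset train-fn]
  (into {}
        (map (fn [[rule examples]] [rule (train-fn examples)]))
        (examples-by-rule dataset)))

;; Toy "learner": the fraction of positive examples for a rule.
(defn positive-rate
  [examples]
  (/ (count (filter second examples)) (count examples)))

(def toy-dataset
  [{"r1" [{} true]} {"r1" [{} false]} {"r2" [{} true]}])

(train-per-rule toy-dataset positive-rate)
```

Keeping the learner behind a plain function argument means the grouping logic can be exercised without depending on any particular classifier.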