# Evaluation
We are in the process of writing a GF implementation of a CNL which is already precisely defined in another format, namely the AceWiki (OWL-compatible) subset of ACE defined by the Codeco grammar. This allows us to be equally precise in our evaluation, which we divide into several aspects:
## How much of the target language does our grammar cover (accept as input)?
This question is answered by parsing each sentence in the supplied test set and counting how many of them are parsed by the GF grammar. The target is that all test set sentences are assigned a parse tree, i.e. 100% coverage.
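A rough sketch of this check in the GF shell; the module name `AceEng.gf`, the start category `Text` and the file name `testset.txt` are illustrative placeholders, not the actual names used in this project:

```
> i AceEng.gf
> rf -file=testset.txt -lines | p -cat=Text
```

Each line of the file is parsed separately; sentences outside the grammar produce a parser error message instead of a tree, so coverage is the fraction of input lines that yield at least one tree.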
Note that this evaluation lets us measure how many sentences are parsed, not necessarily parsed correctly. E.g. we cannot say confidently that in the GF parse tree of the sentence

`it is false that John likes Mary and Mary likes John.`

the clause `Mary likes John` is not under negation (which is the correct ACE parse of this sentence).

To measure parsing correctness we would need an "official" mapping of ACE-in-GF abstract trees to APE (or Codeco) syntax trees. Without this mapping we can only approach the problem analytically, e.g. look at the `it is false` function and conclude that all of its argument sentences are prefixed by `that`, because in ACE one needs to write

`it is false that John likes Mary and that Mary likes John.`

to get everything under negation.
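With the grammar loaded in the GF shell, the two variants can be parsed side by side and their trees compared by hand to see where the conjunction attaches relative to the negation function (the start category `Text` and the tokenization are again placeholder assumptions):

```
> p -cat=Text "it is false that John likes Mary and Mary likes John ."
> p -cat=Text "it is false that John likes Mary and that Mary likes John ."
```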
## What percentage of all accepted sentences result in only one parse tree?
This question is answered by counting the number of parse trees returned by GF for each successful parse. The target is that every valid sentence returns only a single parse tree, i.e. 0% ambiguity.
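In the GF shell, `p` prints every tree it finds for a sentence, one per line, so the degree of ambiguity can be measured by counting output lines (the sentence and category below are illustrative):

```
> p -cat=Text "John likes Mary and Mary likes John ."
```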
## Can the grammar accept sentences which are not valid in the target language (does it over-generate)?
This can be answered by generating random trees with the GF grammar, linearizing each of them, and checking whether the resulting sentence is accepted by the parser for the target language. The target is that all sentences generated by the GF grammar are accepted by the target parser, i.e. 100% precision.

Problem: GF's `generate_random` function is not guaranteed to cover the grammar in any sort of complete way. We can only approach this question by generating a large number of random sentences and estimating the precision statistically.
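A sketch of this sampling step in the GF shell: generate a large random sample of trees, linearize them, and write the sentences to a file that can then be fed to the reference ACE parser (APE). The sample size, category and file name are illustrative:

```
> gr -number=1000 -cat=Text | l | wf -file=random_sample.txt
```

Precision is then the fraction of lines in `random_sample.txt` that APE accepts.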
## What percentage of all generated trees can be linearized into a single sentence?
The grammar can decide to treat certain constructs as syntactic sugar (GF variants), i.e. on semantic grounds (e.g. Attempto DRS equivalence) let different strings correspond to the same function. Examples:
- `does not` vs `doesn't`
- active vs passive
- dative shift vs `to`-PP
This type of evaluation lets us detect where variants are used and which variant is picked first when linearizing.
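Both can be observed in the GF shell: plain `l` shows the variant that is picked first, while `l -all` lists all variants of a tree (the example sentence is illustrative):

```
> p -cat=Text "John does not like Mary ." | l -all
```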
## How accurately can ACE be translated into other concrete syntaxes?

Having a set of additional concrete syntaxes that correspond to the ACE abstract grammar, we can measure the accuracy of translating from ACE to these other syntaxes. This can be done in stages:
- percentage of fully translated sentences, i.e. every function in the tree is linearized (this can be evaluated automatically)
- percentage of syntactically correct translations, i.e. the result is syntactically acceptable in the target language (needs human evaluation)
- percentage of semantically correct translations, i.e. the result reflects the ACE meaning (needs human evaluation)
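The translation itself is GF's standard parse-then-linearize pipeline; the module names `AceEng.gf` and `AceGer.gf` below are placeholders for actual concrete syntaxes of the ACE abstract grammar:

```
> i AceEng.gf
> i AceGer.gf
> p -lang=AceEng -cat=Text "every man likes a woman ." | l -lang=AceGer
```

The output can then be checked automatically for completeness and by human judges for syntactic and semantic correctness.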