Handling of language tags in KGCL #60

Open
cmungall opened this issue Mar 6, 2024 · 5 comments

cmungall commented Mar 6, 2024

Currently handling of language tags is under-specified in KGCL, both in terms of

  • matching (e.g. change label from X[@en] to Y)
  • applying (e.g. change label from X to Y[@en])

Recall also that most OBO ontologies use a mixture of uncommitted literals, xsd:string, and @en to denote English-language labels.

As a general principle, the KGCL DSL is intended to be user-friendly. The user shouldn't need detailed implementation knowledge of each ontology. In fact, it is very hard for them to find out these details. As a case in point, for the following two terms in ENVO it's impossible to know from OLS that the first uses an explicit @en and the second does not:

At the most recent OMO meeting there was heated discussion about whether we should expect cardinality=1 of rdfs:label given that some ontologies may want to be international. It's not up to KGCL to adjudicate here. However, we can make things easy for users:

  1. matching should be liberal; if a language tag is not specified this should not be interpreted as "must match untyped literal", it should instead be interpreted as "match this at the string level"
  2. application should be configurable at the ontology level
    • if the user does not specify a language tag, and the ontology is configured to always use language tags then the configured default language should be applied
    • if the user does specify a language tag then this should be used (it is up to the ontology to configure GH actions to reject any or all language tags if their policy is always untyped literals)

Point 2 does place more of a burden on implementors, as there needs to be some configuration mechanism, but having this default to untyped literals will work for pretty much all OBO ontologies for now.
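A minimal sketch of the "application" rule in point 2, assuming a hypothetical per-ontology setting called `default_language` (the name and the `OntologyConfig` class are illustrative and not part of KGCL). Leaving it unset corresponds to the untyped-literal default mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyConfig:
    # Hypothetical per-ontology setting; None means "write untagged literals".
    default_language: Optional[str] = None


def language_to_apply(requested: Optional[str], config: OntologyConfig) -> Optional[str]:
    """Pick the language tag for a newly written annotation value."""
    if requested is not None:
        # A tag given explicitly in the KGCL command is used as-is.
        return requested
    # Otherwise fall back to whatever the ontology is configured to use.
    return config.default_language
```

Rejecting unwanted tags is deliberately left out of this sketch, since the proposal delegates that to the ontology's own checks.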

gouttegd commented Mar 6, 2024

> At the most recent OMO meeting there was heated discussion about whether we should expect cardinality=1 of rdfs:label given that some ontologies may want to be international. It's not up to KGCL to adjudicate here.

Actually, even if we decided that there can only be one label (and so, that we can ignore all cases where there is more than one label as being invalid and “not-our-problem”), that wouldn’t solve the general issue: KGCL supports modifying annotation properties other than rdfs:label, including properties for which there is no doubt (or at least I hope there is no doubt!) that it is perfectly legitimate to have more than one annotation per term. All properties pertaining to synonyms, for example.

> matching should be liberal; if a language tag is not specified this should not be interpreted as "must match untyped literal", it should instead be interpreted as "match this at the string level"

Given a case like this:

AnnotationAssertion(rdfs:label EX:0001 "the label")
AnnotationAssertion(rdfs:label EX:0001 "the label"@en)

What should be the behaviour of rename EX:0001 from "the label" to "the new label" ? Should it rename both the language-neutral label and the English label? What if I want to specifically edit the language-neutral label?

How about:

  • No language tag means that we look for a value that does not have a language tag (so, "the label" would match a tag-less label only);
  • We accept a @* language tag that would mean “any language tag (including no language tag) will do” (so, "the label"@* would match any literal value that is exactly "the label", regardless of any language tag).

This way the decision to ignore the language tags when matching would be an explicit decision. (Of course we could also do the opposite: no language tag means “ignore the language tags when matching”, and a @NONE or similar would mean “only match literals that do not have a language tag” – though that would seem much less natural to me.)

Then at the level of the Ontobot, individual ontologies can configure the bot to pass to the KGCL engine a --default-old-language-tag option, that would be used when no language tag is explicitly given in the KGCL command(s). By setting this option to @*, this would give the same behaviour as the one you propose, the difference being that this behaviour would not be hardcoded in the KGCL engines.
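To make the two rules above concrete, here is a rough sketch of the matching step. The `Literal` type, the `matches` function and the `default_old_lang` parameter are illustrative names only; the parameter stands in for the proposed `--default-old-language-tag` option and `"*"` for the proposed `@*` tag.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    text: str
    lang: Optional[str] = None  # None means "no language tag"


ANY = "*"  # stands for the proposed "@*" wildcard


def matches(existing: Literal, old_text: str, old_lang: Optional[str] = None,
            default_old_lang: Optional[str] = None) -> bool:
    """Decide whether an existing literal is matched by the old value of a command."""
    # The bot-level default is only consulted when the command gave no tag.
    effective = old_lang if old_lang is not None else default_old_lang
    if existing.text != old_text:
        return False
    if effective == ANY:
        return True  # explicit request to ignore language tags
    return existing.lang == effective  # strict: tag (or absence of tag) must agree
```

With no default configured, `"the label"` only matches the tag-less label; passing `default_old_lang="*"` reproduces the liberal behaviour without hardcoding it in the engine.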

cmungall commented Mar 6, 2024

Thanks, I think this makes sense, but I'd like to reverse it a little

  • No language tag ("...") means that any language tag or no language tag matches
  • A specific language tag ("..."@en) must match a literal with that language tag
  • If a user wants to match plain literals, and plain literals only, they say "..."^^rdf:PlainLiteral

Here there is a slight impedance mismatch with semantic web standards, where there is always a literal type commitment (people get caught by this all the time with SPARQL queries: a string match "works" on one ontology but not on another unless you add an inefficient coercion to strings). However, there is less of an impedance mismatch with user expectations.
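The reversed rules could look roughly like this. It is only a sketch: `Literal`, `matches_reversed` and the `PLAIN_ONLY` marker are made-up names, with `PLAIN_ONLY` standing in for the `"..."^^rdf:PlainLiteral` form.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    text: str
    lang: Optional[str] = None  # None means "no language tag"


# Hypothetical marker for an explicit "plain literals only" request.
PLAIN_ONLY = "rdf:PlainLiteral"


def matches_reversed(existing: Literal, old_text: str,
                     qualifier: Optional[str] = None) -> bool:
    """Liberal by default: an unqualified old value matches at the string level."""
    if existing.text != old_text:
        return False
    if qualifier is None:
        return True  # any language tag, or none at all
    if qualifier == PLAIN_ONLY:
        return existing.lang is None  # only untagged literals
    return existing.lang == qualifier  # a specific tag must match exactly
```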

balhoff commented Mar 7, 2024

I wouldn't hinge anything on rdf:PlainLiteral, which is dropped for RDF 1.1. In the newer standard a simple literal is short for a literal with datatype xsd:string: https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

You'll see this behavior when RDF passes through Jena, although confusingly the OWL spec was not updated to keep up.

gouttegd commented Mar 7, 2024

> No language tag ("...") means that any language tag or no language tag matches

OK.

Regardless of whether ignoring the language tags is the default behaviour (your proposition) or must be explicitly asked for (mine, @*), we must also decide what the expected behaviour is when more than one label matches, as in my example:

AnnotationAssertion(rdfs:label EX:0001 "the label")
AnnotationAssertion(rdfs:label EX:0001 "the label"@en)

If the command is rename EX:0001 from "the label" to "the new label", I don’t think there is any problem. It seems clear to me that the result should be:

AnnotationAssertion(rdfs:label EX:0001 "the new label")
AnnotationAssertion(rdfs:label EX:0001 "the new label"@en)

Because:

  • No language tag on the old value, so we match both existing labels.
  • No language tag on the new value, so we preserve the existing tags (including the absence of a language tag).

This is also, I believe, what most users would expect, so this is fine.

But what if the command is rename EX:0001 from "the label" to "the new label"@en – that is, with a language tag on the new value (whether it has been explicitly specified by the user, or automatically added because the ontology is configured to do so – as envisioned in your first message)?

The “logical” (but not necessarily sensible) output would be:

AnnotationAssertion(rdfs:label EX:0001 "the new label"@en)
AnnotationAssertion(rdfs:label EX:0001 "the new label"@en)

Because:

  • No language tag on the old value, so we match both labels;
  • Language tag on the new value, so we set the language tag as specified.

Here I don’t think this is a desirable behaviour.
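The two outcomes above can be reproduced with a deliberately naive rule: match on the string alone, and overwrite the language tag only when the new value carries one. `Literal` and `naive_rename` are illustrative names, not part of any KGCL implementation.

```python
from typing import List, NamedTuple, Optional


class Literal(NamedTuple):
    text: str
    lang: Optional[str] = None  # None means "no language tag"


def naive_rename(labels: List[Literal], old_text: str, new_text: str,
                 new_lang: Optional[str] = None) -> List[Literal]:
    """Rename every label whose string matches, ignoring its language tag."""
    result = []
    for label in labels:
        if label.text == old_text:
            # Keep the existing tag unless the new value specifies one.
            lang = new_lang if new_lang is not None else label.lang
            result.append(Literal(new_text, lang))
        else:
            result.append(label)
    return result


labels = [Literal("the label"), Literal("the label", "en")]
untagged = naive_rename(labels, "the label", "the new label")
tagged = naive_rename(labels, "the label", "the new label", "en")
```

Here `untagged` keeps one tag-less and one English label, whereas `tagged` ends up with two identical English labels, which collapse into a single value if the annotations are held in a set.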

Even worse, let’s imagine a term that has labels in several languages, and that in two languages the labels are actually the same string (this won’t be frequent but it may happen; words that are identical across languages are not unheard of). For example, say we have:

AnnotationAssertion(rdfs:label EX:0001 "lion"@en)
AnnotationAssertion(rdfs:label EX:0001 "lion"@fr)

A command like rename EX:0001 from "lion" to "sea lion"@en should not rename the French label to "sea lion"@en! (Arguably ontologies that have multi-language labels should simply forbid the use of untagged values in KGCL commands, and force users to always be explicit.)

I propose something like:

If the new value has a language tag and the old value does not, then we do not look blindly for any matching label regardless of the language tag (as we do in the general case). Instead, we first look for a matching label with the same language tag as the new value, and then (if we don’t find such a label) we look for a matching label without a language tag. We never look for a matching label with a different language tag.

Admittedly this is a bit complicated, but I think that should cover all cases reasonably. For example, given the command rename EX:0001 from "lion" to "sea lion"@en:

  • if the term has only a matching English label ("lion"@en), it would be renamed into another English label ("sea lion"@en);
  • if the term has only a matching language-neutral label ("lion"), it would both be renamed and given an English language tag ("sea lion"@en);
  • if the term has both a matching English label and a matching language-neutral label, only the English label would be renamed into another English-tagged label;
  • if the term has both a matching English label and another matching label in another language, likewise: only the English label would be renamed.

Overall this should work just fine for ontologies that have a mixture of untagged and tagged labels.
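For what it is worth, the selection rule proposed above can be sketched in a few lines. All names here (`Literal`, `select_targets`, `rename`) are illustrative, and the handling of an explicit tag on the old value (exact match on the tag) is an assumption carried over from earlier in the thread.

```python
from typing import List, NamedTuple, Optional


class Literal(NamedTuple):
    text: str
    lang: Optional[str] = None  # None means "no language tag"


def select_targets(labels: List[Literal], old_text: str,
                   old_lang: Optional[str], new_lang: Optional[str]) -> List[Literal]:
    """Pick the labels a rename should touch."""
    same_text = [l for l in labels if l.text == old_text]
    if old_lang is not None:
        # Assumption: an explicit tag on the old value must match exactly.
        return [l for l in same_text if l.lang == old_lang]
    if new_lang is None:
        # General case: no tags in the command, so match regardless of tag.
        return same_text
    # Old value untagged, new value tagged: prefer the same language,
    # then fall back to untagged labels, and never touch other languages.
    same_lang = [l for l in same_text if l.lang == new_lang]
    if same_lang:
        return same_lang
    return [l for l in same_text if l.lang is None]


def rename(labels: List[Literal], old_text: str, new_text: str,
           old_lang: Optional[str] = None,
           new_lang: Optional[str] = None) -> List[Literal]:
    targets = select_targets(labels, old_text, old_lang, new_lang)
    result = []
    for label in labels:
        if label in targets:
            # Preserve the existing tag unless the new value specifies one.
            lang = new_lang if new_lang is not None else label.lang
            result.append(Literal(new_text, lang))
        else:
            result.append(label)
    return result
```

Calling `rename(labels, "lion", "sea lion", new_lang="en")` on each of the four situations listed above gives the described results; a term with only `"lion"@fr` is left untouched.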

Aside:

> If a user wants to match plain literals or plain literals only, they say "..."^^rdf:PlainLiteral

Note that with recent versions of the OWL API, a literal without an explicit datatype (as in "the label") is interpreted as an xsd:string and so is still a typed string, not an rdf:PlainLiteral.

gouttegd added a commit to gouttegd/kgcl-java that referenced this issue Mar 10, 2024
As discussed in INCATools/kgcl#60, when trying
to find an annotation value (e.g. to rename a class, we need to find the
annotation corresponding to the old label), language tags should be
compared in a relaxed fashion. We should not fail to find an annotation
just because the annotation has a language tag and the KGCL command did
not specify any language tag at all.

This necessitates some pretty important refactoring, because this means,
among other things, that more than one annotation value may match (if
several annotations have the same literal value but different language
tags, or one has a language tag and another does not).

This is still a work in progress. For now, this is implemented
specifically for the NodeRename operation. After more testing, this will
be generalized to all other operations that involve finding an existing
annotation value (e.g. RemoveDefinition, ChangeDefinition,
SynonymReplacement, etc.).
@cmungall

Sorry for the delay. Thank you for the analysis. I agree with your proposed solution.
