Krippendorff's Alpha for position - implausible results #5305

Open
bgittel opened this issue Feb 21, 2025 · 3 comments
Labels: 🐛 Bug, Triage

Comments


bgittel commented Feb 21, 2025

Describe the bug
I tried to understand how Krippendorff's Alpha unitizing for position is implemented, so I had two annotators annotate a test document. One annotator has 5 annotations, the other just one. If there is 1 span with an exact match, the calculated score is 0.4; if there is one span with an overlap match (the span differs by one token), I get 0.42. How is this possible?
More generally, I would like to understand better how KA is implemented, especially how the agreement matrix is calculated, because I observed implausible results for other documents in my corpus as well. I would also like to know whether it would be possible to implement another metric (e.g. Gamma) that seems better suited to dealing with overlapping spans.

Please complete the following information:

  • Version and build ID: 35.2 (2025-02-04 07:13:24, build 18f5fdc)
  • OS: Win
  • Browser: Chrome

Thanks!


reckart commented Feb 21, 2025

INCEpTION uses DKPro Agreement.

There is a paper and a couple of presentations about it for an introduction:

The implementation is here:

https://github.com/dkpro/dkpro-statistics/tree/main/dkpro-statistics-agreement/src/main/java/org/dkpro/statistics/agreement/unitizing

If you want to understand it, maybe start looking at that. If you get the correct numbers there, then there might be a bug in the way that INCEpTION calls DKPro Agreement. However, if you already get unexpected numbers in DKPro Agreement, then it might have a bug itself.
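
A minimal way to check the numbers directly against DKPro Agreement (without going through INCEpTION) could look roughly like the sketch below. It mirrors the scenario from the issue description (one annotator with five spans, the other with a single one); the offsets, the continuum length and the exact UnitizingAnnotationStudy / addUnit / KrippendorffAlphaUnitizingAgreement signatures are assumptions based on the linked repository and should be verified there.

import org.dkpro.statistics.agreement.unitizing.KrippendorffAlphaUnitizingAgreement;
import org.dkpro.statistics.agreement.unitizing.UnitizingAnnotationStudy;

public class UnitizingAlphaSketch
{
    public static void main(String[] args)
    {
        // Two raters over a continuum of 100 characters (made-up document length)
        var study = new UnitizingAnnotationStudy(2, 100);

        // Rater 0: five spans with label "X", as in the issue description
        study.addUnit(0, 5, 0, "X");
        study.addUnit(10, 5, 0, "X");
        study.addUnit(20, 5, 0, "X");
        study.addUnit(30, 5, 0, "X");
        study.addUnit(40, 5, 0, "X");

        // Rater 1: a single span - here an exact match on the first span;
        // change it to e.g. addUnit(0, 6, 1, "X") to simulate an overlap match
        study.addUnit(0, 5, 1, "X");

        // Krippendorff's unitizing alpha over the whole study
        var alpha = new KrippendorffAlphaUnitizingAgreement(study);
        System.out.println(alpha.calculateAgreement());
    }
}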

I have also tried doing a port of TextGamma to DKPro Agreement here:

dkpro/dkpro-statistics#39

However, so far this port is lacking qualified review and testing to say whether it produces the expected results.
Personally, I find Gamma quite strange. In particular, it uses randomly generated deviations to calculate the expected disagreement. Since these deviations are random, the expected disagreement is also random - meaning the agreement score is random. Of course, there are some statistical effects which constrain the randomness of the final result. However, it seems strange to me to accept that an agreement score will fluctuate (even a little) every time it is calculated.

If you look at the Krippendorff's Alpha implementation in DKPro Agreement and/or the Gamma branch, it is best to open issues or comments in that repo.

If you find everything to be in order in DKPro Agreement and suspect that INCEpTION is calling it the wrong way, it is best to comment here again.


reckart commented Feb 21, 2025

What may also help you is the diff export that you can get from the agreement page.
For pairwise agreement, use the export that you get from clicking on a cell in the pairwise agreement table.
For document-wise agreement, you can use the diff export in the sidebar.
The table that is produced here is more-or-less a dump of the data that INCEpTION passes (or not) to DKPro Agreement.

In particular, you can find the offsets of the positions that are passed to the agreement measure. Also look out for the USED flag, which indicates whether a data point has been passed on to the measure. The measure does not see any lines that are not marked with this flag.


reckart commented Feb 22, 2025

I did a little experiment in INCEpTION in a unit test. Notation: [x-y] = offsets, (a) = label.

Setup 1:

  • User 1: [0-4](a) [8-9](a)
  • User 2: [0-4](a) [8-9](a)
  • Agreement: 0.9454

Setup 2:

  • User 1: [0-7](a) [8-9](a)
  • User 2: [0-4](a) [8-9](a)
  • Agreement: 0.6833

At least in this little experiment, the agreement degrades when there is an overlap match instead of an exact match.

Code (adjust offsets of user 1 manually to test)

@Test
void test() throws Exception
{
    // Define a span layer with a multi-value string feature
    // (project, layers, features, sut and traits are fixture fields of the test class)
    var layer = new AnnotationLayer(MULTI_VALUE_SPAN_TYPE, MULTI_VALUE_SPAN_TYPE,
            SpanLayerSupport.TYPE, project, false, SINGLE_TOKEN, NO_OVERLAP);
    layer.setId(1L);
    layers.add(layer);

    var feature = new AnnotationFeature(project, layer, "values", "values",
            TYPE_NAME_STRING_ARRAY);
    feature.setId(1L);
    feature.setLinkMode(NONE);
    feature.setMode(ARRAY);
    features.add(feature);

    // User 1: [0-7](a) and [8-9](a)
    var user1 = createCas(createMultiValueStringTestTypeSystem());
    user1.setDocumentText("This is a test.");
    buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
            .at(0, 7) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
            .at(8, 9) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    // User 2: [0-4](a)
    var user2 = createCas(createMultiValueStringTestTypeSystem());
    user2.setDocumentText("This is a test.");
    buildAnnotation(user2, MULTI_VALUE_SPAN_TYPE) //
            .at(0, 4) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    // Compute the agreement with the measure under test and print it
    var measure = sut.createMeasure(feature, traits);

    var result = measure.getAgreement(Map.of( //
            "user1", user1, //
            "user2", user2));

    System.out.println(result.getAgreement());
}
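
For comparison, one could also feed the same two setups to DKPro Agreement directly, bypassing INCEpTION. The sketch below is only an illustration under a few assumptions: the continuum length of 15 characters corresponds to "This is a test.", and the UnitizingAnnotationStudy / addUnit / KrippendorffAlphaUnitizingAgreement signatures are assumed from the repository linked in the earlier comment and should be verified there. If the values printed here match the 0.9454 / 0.6833 above, DKPro Agreement itself behaves consistently and any remaining discrepancy would lie in how INCEpTION passes the data.

import org.dkpro.statistics.agreement.unitizing.KrippendorffAlphaUnitizingAgreement;
import org.dkpro.statistics.agreement.unitizing.UnitizingAnnotationStudy;

public class SetupComparisonSketch
{
    public static void main(String[] args)
    {
        // Setup 1: exact match on both spans (2 raters, 15-character continuum)
        var setup1 = new UnitizingAnnotationStudy(2, 15);
        setup1.addUnit(0, 4, 0, "a"); // user 1: [0-4](a)
        setup1.addUnit(8, 1, 0, "a"); // user 1: [8-9](a)
        setup1.addUnit(0, 4, 1, "a"); // user 2: [0-4](a)
        setup1.addUnit(8, 1, 1, "a"); // user 2: [8-9](a)

        // Setup 2: overlap match on the first span
        var setup2 = new UnitizingAnnotationStudy(2, 15);
        setup2.addUnit(0, 7, 0, "a"); // user 1: [0-7](a)
        setup2.addUnit(8, 1, 0, "a"); // user 1: [8-9](a)
        setup2.addUnit(0, 4, 1, "a"); // user 2: [0-4](a)
        setup2.addUnit(8, 1, 1, "a"); // user 2: [8-9](a)

        // Print Krippendorff's unitizing alpha for both setups
        System.out.println(new KrippendorffAlphaUnitizingAgreement(setup1).calculateAgreement());
        System.out.println(new KrippendorffAlphaUnitizingAgreement(setup2).calculateAgreement());
    }
}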
