Krippendorff's Alpha for position - implausible results #5305

Open
bgittel opened this issue Feb 21, 2025 · 3 comments
Labels: 🐛 Bug, Triage

Comments


bgittel commented Feb 21, 2025

Describe the bug
I tried to understand how Krippendorff's Alpha unitizing for position is implemented, so I had two annotators annotate a test document. One annotator has 5 annotations, the other just one. If there is 1 span with an exact match, the calculated score is 0.4; if there is one span with an overlap match (the span differs by one token), I get 0.42. How is this possible?
More generally, I would like to understand better how KA is implemented, especially how the agreement matrix is calculated, because I observed implausible results for other documents in my corpus as well. I would also like to know whether it would be possible to implement another metric (e.g. Gamma) that seems better suited to dealing with overlapping spans.

Please complete the following information:

  • Version and build ID: 35.2 (2025-02-04 07:13:24, build 18f5fdc)
  • OS: Win
  • Browser: Chrome

Thanks!


reckart commented Feb 21, 2025

INCEpTION uses DKPro Agreement.

There is a paper and a couple of presentations about it for an introduction:

The implementation is here:

https://github.com/dkpro/dkpro-statistics/tree/main/dkpro-statistics-agreement/src/main/java/org/dkpro/statistics/agreement/unitizing

If you want to understand it, maybe start looking at that. If you get the correct numbers there, then there might be a bug in the way that INCEpTION calls DKPro Agreement. However, if you already get unexpected numbers in DKPro Agreement, then it might have a bug itself.
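
A minimal way to check the numbers directly against DKPro Agreement (without going through INCEpTION) could look roughly like the sketch below. It mirrors the scenario from the issue description (one annotator with five spans, the other with a single one); the offsets, the continuum length and the exact UnitizingAnnotationStudy / addUnit / KrippendorffAlphaUnitizingAgreement signatures are assumptions based on the linked repository and should be verified there.

import org.dkpro.statistics.agreement.unitizing.KrippendorffAlphaUnitizingAgreement;
import org.dkpro.statistics.agreement.unitizing.UnitizingAnnotationStudy;

public class UnitizingAlphaSketch
{
    public static void main(String[] args)
    {
        // Two raters over a continuum of 100 characters (made-up document length)
        var study = new UnitizingAnnotationStudy(2, 100);

        // Rater 0: five spans with label "X", as in the issue description
        study.addUnit(0, 5, 0, "X");
        study.addUnit(10, 5, 0, "X");
        study.addUnit(20, 5, 0, "X");
        study.addUnit(30, 5, 0, "X");
        study.addUnit(40, 5, 0, "X");

        // Rater 1: a single span - here an exact match on the first span;
        // change it to e.g. addUnit(0, 6, 1, "X") to simulate an overlap match
        study.addUnit(0, 5, 1, "X");

        // Krippendorff's unitizing alpha over the whole study
        var alpha = new KrippendorffAlphaUnitizingAgreement(study);
        System.out.println(alpha.calculateAgreement());
    }
}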

I have also tried doing a port of TextGamma to DKPro Agreement here:

dkpro/dkpro-statistics#39

However, so far this port is lacking qualified review and testing to say whether it produces the expected results.
Personally, I find Gamma quite strange. In particular, it uses randomly generated deviations to calculate the expected disagreement. Since these deviations are random, the expected disagreement is also random - meaning the agreement score is random. Of course, there are some statistical effects which constrain the randomness of the final result. However, it seems strange to me to accept that an agreement score will fluctuate (even a little) every time it is calculated.

If you look at the Krippendorff's Alpha implementation in DKPro Agreement and/or the Gamma branch, it is best to open issues or comments in that repo.

If you find everything to be in order in DKPro Agreement and suspect that INCEpTION is calling it the wrong way, it is best to comment here again.


reckart commented Feb 21, 2025

What may also help you is the diff export that you can get from the agreement page.
For pairwise agreement, use the export that you get from clicking on a cell in the pairwise agreement table.
For document-wise agreement, you can use the diff export in the sidebar.
The table that is produced here is more-or-less a dump of the data that INCEpTION passes (or not) to DKPro Agreement.

In particular, you can find the offsets of the positions that are passed to the agreement measure. Also look out for the USED flag, which indicates whether a data point has been passed on to the measure. The measure does not see any lines that are not marked with this flag.


reckart commented Feb 22, 2025

I did a little experiment in INCEpTION in a unit test. Notation: [x-y] = offsets, (a) = label.

Setup 1:

  • User 1: [0-4](a) [8-9](a)
  • User 2: [0-4](a) [8-9](a)
  • Agreement: 0.9454

Setup 2:

  • User 1: [0-7](a) [8-9](a)
  • User 2: [0-4](a) [8-9](a)
  • Agreement: 0.6833

At least in this little experiment, the agreement degrades when there is an overlap match instead of an exact match.

Code (adjust offsets of user 1 manually to test)

@Test
void test() throws Exception
{
    // Define a span layer with a multi-value string feature
    // (project, layers, features, sut and traits are fixture fields of the test class)
    var layer = new AnnotationLayer(MULTI_VALUE_SPAN_TYPE, MULTI_VALUE_SPAN_TYPE,
            SpanLayerSupport.TYPE, project, false, SINGLE_TOKEN, NO_OVERLAP);
    layer.setId(1L);
    layers.add(layer);

    var feature = new AnnotationFeature(project, layer, "values", "values",
            TYPE_NAME_STRING_ARRAY);
    feature.setId(1L);
    feature.setLinkMode(NONE);
    feature.setMode(ARRAY);
    features.add(feature);

    // User 1: [0-7](a) and [8-9](a)
    var user1 = createCas(createMultiValueStringTestTypeSystem());
    user1.setDocumentText("This is a test.");
    buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
            .at(0, 7) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    buildAnnotation(user1, MULTI_VALUE_SPAN_TYPE) //
            .at(8, 9) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    // User 2: [0-4](a)
    var user2 = createCas(createMultiValueStringTestTypeSystem());
    user2.setDocumentText("This is a test.");
    buildAnnotation(user2, MULTI_VALUE_SPAN_TYPE) //
            .at(0, 4) //
            .withFeature("values", asList("a")) //
            .buildAndAddToIndexes();

    // Compute the agreement with the measure under test and print it
    var measure = sut.createMeasure(feature, traits);

    var result = measure.getAgreement(Map.of( //
            "user1", user1, //
            "user2", user2));

    System.out.println(result.getAgreement());
}
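
For comparison, one could also feed the same two setups to DKPro Agreement directly, bypassing INCEpTION. The sketch below is only an illustration under a few assumptions: the continuum length of 15 characters corresponds to "This is a test.", and the UnitizingAnnotationStudy / addUnit / KrippendorffAlphaUnitizingAgreement signatures are assumed from the repository linked in the earlier comment and should be verified there. If the values printed here match the 0.9454 / 0.6833 above, DKPro Agreement itself behaves consistently and any remaining discrepancy would lie in how INCEpTION passes the data.

import org.dkpro.statistics.agreement.unitizing.KrippendorffAlphaUnitizingAgreement;
import org.dkpro.statistics.agreement.unitizing.UnitizingAnnotationStudy;

public class SetupComparisonSketch
{
    public static void main(String[] args)
    {
        // Setup 1: exact match on both spans (2 raters, 15-character continuum)
        var setup1 = new UnitizingAnnotationStudy(2, 15);
        setup1.addUnit(0, 4, 0, "a"); // user 1: [0-4](a)
        setup1.addUnit(8, 1, 0, "a"); // user 1: [8-9](a)
        setup1.addUnit(0, 4, 1, "a"); // user 2: [0-4](a)
        setup1.addUnit(8, 1, 1, "a"); // user 2: [8-9](a)

        // Setup 2: overlap match on the first span
        var setup2 = new UnitizingAnnotationStudy(2, 15);
        setup2.addUnit(0, 7, 0, "a"); // user 1: [0-7](a)
        setup2.addUnit(8, 1, 0, "a"); // user 1: [8-9](a)
        setup2.addUnit(0, 4, 1, "a"); // user 2: [0-4](a)
        setup2.addUnit(8, 1, 1, "a"); // user 2: [8-9](a)

        // Print Krippendorff's unitizing alpha for both setups
        System.out.println(new KrippendorffAlphaUnitizingAgreement(setup1).calculateAgreement());
        System.out.println(new KrippendorffAlphaUnitizingAgreement(setup2).calculateAgreement());
    }
}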
