MaximallyInflectedVerbSynthesis - variables lost in CLDF #51

Closed
annagraff opened this issue Mar 9, 2023 · 25 comments

Comments

@annagraff

The following autotyp variables are not available in the CLDF:

[1] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasBipartiteStem"
[2] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounIncorporation"
[3] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasVerbIncorporation"
[4] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionExponenceType"
[5] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeCount"
[6] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsPhonologicallyCoherent"
[7] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsSyntacticallyCoherent"
[8] "autotyp_MaximallyInflectedVerbSynthesis_VerbAgreement"
[9] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasAnyIncorporation"
[10] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounOrVerbIncorporation"
[11] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategoryCount"
[12] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategorySansAgreementCount"
[13] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeSansAgreementCount"
[14] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsProsodicallyCoherent"
[15] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbAgreementSurveyComplete"
[16] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbInflectionSurveyComplete"
[17] "autotyp_MaximallyInflectedVerbSynthesis_VerbIncorporation"
[18] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionCategories"

@annagraff
Author

The only variable available in the dataset "MaximallyInflectedVerbSynthesis" seems to be called MaximallyInflectedVerbSynthesis as well, and it is faulty.

@tzakharko
Contributor

@annagraff Some data in AUTOTYP uses nested tables, and MaximallyInflectedVerbSynthesis is one such dataset. Since CLDF structured datasets do not directly support nesting, we export them as JSON. See this comment by Robert for more information: #2 (comment)

If you use Python, I'd recommend using the JSON export instead of CLDF for the complex datasets, as it gives you direct access to the structure and is more straightforward to work with.

I hope this addresses your issue. Please reopen it if I missed something.

@nataliacp

I am reopening this with a proposal to increase data reusability. Right now, the variables listed in the first comment in this thread are stored as JSON under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these variables, though, are simple binary per-language variables that could be incorporated straightforwardly into the CLDF format. The only problem is that their values can be trusted only for languages that are TRUE for both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete). What do you think about this proposal, @tzakharko and @xrotwang?
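
For illustration, the trust rule described above could be sketched in Python roughly as follows. This is only a sketch: it assumes each umbrella value is a JSON object keyed by the variable names listed in this thread, and the function name and sample records are hypothetical, not taken from the AUTOTYP export.

```python
import json

# Housekeeping variables named in this thread.
HOUSEKEEPING = ("IsVerbAgreementSurveyComplete", "IsVerbInflectionSurveyComplete")


def trusted_record(raw_value):
    """Parse one JSON value and return it only if both surveys are complete.

    Hypothetical helper: returns None when either housekeeping flag is
    missing, null, or false.
    """
    record = json.loads(raw_value)
    # `== 1` accepts both JSON true and a numeric 1 (an assumption about encoding).
    if all(record.get(flag) == 1 for flag in HOUSEKEEPING):
        return record
    return None


# Invented sample values, for illustration only.
complete = json.dumps({
    "VerbHasBipartiteStem": True,
    "IsVerbAgreementSurveyComplete": True,
    "IsVerbInflectionSurveyComplete": True,
})
partial = json.dumps({
    "VerbHasBipartiteStem": False,
    "IsVerbAgreementSurveyComplete": True,
    "IsVerbInflectionSurveyComplete": None,
})

print(trusted_record(complete))
print(trusted_record(partial))
```

Returning None for incomplete surveys keeps untrusted values out of any per-language table built from these records.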

@tzakharko
Contributor

Hi @nataliacp, I have reopened the issue per your request.

If I understand the problem correctly, it should indeed be possible to treat each variable separately and thus encode only the truly nested variables as JSON in the CLDF export. This would, however, require a substantial rewrite of the CLDF export pipeline. It is definitely something I am interested in tackling, but unfortunately I cannot assign high priority to it.

@xrotwang

@nataliacp @tzakharko Technically, this sounds doable. One could turn the list of sets with complex data (https://github.com/cldf-datasets/autotypcldf/blob/aa0801e6c442c7b4a0607955d81b872e8890b739/cldfbench_autotypcldf.py#L205-L225) into a dict, mapping set IDs to callables, which return additional, synthetic parameter values (late-late aggregation so to speak :) ).
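
A minimal Python sketch of that idea, for illustration only: a dict maps set IDs to callables, and each callable turns one parsed JSON record into extra (parameter ID, value) pairs. All names here (`COMPLEX_SETS`, `expand_verb_synthesis`, `synthetic_values`) and the sample record are hypothetical. This is not the code in cldfbench_autotypcldf.py, and the shape of the list-valued entry is invented.

```python
import json


def expand_verb_synthesis(record):
    """Yield (parameter ID, value) pairs for the scalar entries of one record."""
    for name, value in record.items():
        # List-valued (nested) entries stay in the JSON value; nulls are skipped.
        if value is None or isinstance(value, (list, dict)):
            continue
        yield f"MaximallyInflectedVerbSynthesis_{name}", value


# Hypothetical registry: set ID -> callable producing synthetic parameter values.
COMPLEX_SETS = {"MaximallyInflectedVerbSynthesis": expand_verb_synthesis}


def synthetic_values(set_id, raw_value):
    """Return the synthetic values for one JSON value, or [] for unknown sets."""
    expand = COMPLEX_SETS.get(set_id)
    if expand is None:
        return []
    return list(expand(json.loads(raw_value)))


# Invented sample record, for illustration only.
example = json.dumps({
    "VerbHasBipartiteStem": True,
    "VerbInflectionMaxCategoryCount": 4,
    "VerbAgreement": [{"hypothetical_field": "x"}],
    "VerbIsPhonologicallyCoherent": None,
})
print(synthetic_values("MaximallyInflectedVerbSynthesis", example))
```

The sketch does not filter on the housekeeping variables; a real callable would presumably also check those before emitting values.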

@nataliacp

Thank you for the quick responses! Do you think that this could be done easily and relatively soon? @annagraff and I are using these variables, and we are basing everything on the CLDF export. So either we extract those variables for internal use in Anna's pipeline, or we wait for this to be implemented in autotyp so that things are more straightforward and transparent.

@tzakharko
Contributor

@xrotwang The particular dataset in question is per-language if I remember correctly, so one probably won't need a new level of parameters — just rewrite the entire thing to process variables instead of tables. Of course, it won't be as simple for datasets that describe constructions instead of languages. If a general solution is required, one would probably need to do specialised mapping for each complex dataset to preserve the data semantics.

@nataliacp I won't be able to tackle this in the near future unfortunately. Maybe you or @annagraff could submit a PR? Alternatively you can try using the JSON export directly.

@annagraff
Author

Unfortunately, my pipeline is entirely in R, so my local fix for this would not be easily convertible to a pull request for the full CLDF generation pipeline. However, if @xrotwang thinks that a full solution is desirable, I could wait for it to be implemented, provided this happens within the next month or so.

@tzakharko
Contributor

If your pipeline is in R, why don't you use the R export directly? That's the cleanest data representation you have at your disposal.

@annagraff
Author

annagraff commented May 24, 2023

I am working with 5 databases, one of which is autotyp. The others are all available in CLDF but not in R format, and it would make my code much messier to write a specific pipeline for autotyp, especially since CLDF is also available here. The only autotyp module I am having problems with is this Synthesis module, because all other modules I care about are identical between the two data structures, CLDF and RData.

@tzakharko
Contributor

Ok, this makes sense, I understand your predicament.

What would be a good way to move forward here? I won't be able to address this in a reasonable time frame. Unless someone submits a PR that takes care of this change, it might be the least painful thing for you @annagraff to implement a workaround and extract these variables from the embedded JSON.

@xrotwang

I could have a closer look sometime this week - and if things are as expected, I might be able to provide a PR later next week.

(An analysis using multiple databases and profiting from all being available in CLDF is too good an advertisement :) )

@nataliacp

thank you Robert! I hope it goes smoothly!

@xrotwang

I don't know how you access the CLDF data from R. But if you happen to use the "CLDF-via-SQLite" approach (see https://github.com/cldf/cookbook/tree/master/recipes/cldf_r#working-with-cldf-via-sqlite), you could exploit the pretty good JSON support built into SQLite:

sqlite> select 
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem') 
from
    valuetable as v, languagetable as l 
where 
    v.cldf_languageReference = l.cldf_id and v.cldf_parameterReference = 68
limit 20;
Language                  Value
Belhare                   1
Hungarian                 0
English                   0
Warlpiri                  0
Fijian (Boumaa)           0
Thai                      0
Mandarin                  0
Ingush                    1
Songhai (Koyra Chiini)    0
Yoruba                    1
Persian                   1
Ainu
Rama                      1
Russian                   0
Georgian                  0
Karen (Sgaw)
Cree (Plains)             1
Lakhota                   1
Adyghe (West Circassian)
Zapotec (Isthmus)         0

@xrotwang

SQLite's json_extract function shouldn't be too hard to use via dbplyr, see https://dbplyr.tidyverse.org/articles/translation-function.html#unknown-functions

@xrotwang

For completeness:

select
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem') 
from 
    valuetable as v, languagetable as l 
where 
    v.cldf_languageReference = l.cldf_id and 
    v.cldf_parameterReference = 68 and 
    json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1 and 
    json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1;

seems to be what we want here.
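
For anyone who wants to try the filtered query without the full dataset, here is a rough sketch using Python's built-in sqlite3 module against a toy in-memory database. The table and column names follow the queries in this thread, but the rows and language names are invented, the helper `complete_values` is hypothetical, and the parameter ID 68 is taken from the query above and may differ in other releases. It assumes the SQLite build behind Python includes the JSON functions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languagetable (cldf_id TEXT PRIMARY KEY, cldf_name TEXT);
CREATE TABLE valuetable (
    cldf_id TEXT PRIMARY KEY,
    cldf_languageReference TEXT,
    cldf_parameterReference TEXT,
    cldf_value TEXT
);
""")

# Invented toy rows: complete survey, incomplete survey, complete but variable missing.
records = {
    "A": {"VerbHasBipartiteStem": True,
          "IsVerbInflectionSurveyComplete": True, "IsVerbAgreementSurveyComplete": True},
    "B": {"VerbHasBipartiteStem": False,
          "IsVerbInflectionSurveyComplete": True, "IsVerbAgreementSurveyComplete": False},
    "C": {"IsVerbInflectionSurveyComplete": True, "IsVerbAgreementSurveyComplete": True},
}
for key, record in records.items():
    conn.execute("INSERT INTO languagetable VALUES (?, ?)", (key, f"Toy language {key}"))
    conn.execute("INSERT INTO valuetable VALUES (?, ?, ?, ?)",
                 (f"v{key}", key, "68", json.dumps(record)))


def complete_values(connection, parameter_id, variable):
    """Return (language name, value) pairs where both surveys are complete."""
    query = """
        SELECT l.cldf_name, json_extract(v.cldf_value, ?)
        FROM valuetable AS v
        JOIN languagetable AS l ON v.cldf_languageReference = l.cldf_id
        WHERE v.cldf_parameterReference = ?
          AND json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1
          AND json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1
        ORDER BY l.cldf_name
    """
    return connection.execute(query, ("$." + variable, parameter_id)).fetchall()


result = complete_values(conn, "68", "VerbHasBipartiteStem")
print(result)
```

Languages whose JSON value lacks the requested variable still appear in the result, with a NULL (None) value, so a further "is not null" condition may be needed depending on the analysis.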

@xrotwang

In @annagraff's list above, there are three (complex) list-valued variables:

  • VerbInflectionCategories
  • VerbIncorporation
  • VerbAgreement

I guess these shouldn't be available as regular parameters, correct @nataliacp @annagraff ?

@annagraff
Author

Yes, this is correct! @xrotwang

@xrotwang

@annagraff does this look like what you'd expect:

select 
    p.cldf_name, count(v.cldf_id) 
from 
    parametertable as p, valuetable as v 
where 
    v.cldf_parameterreference = p.cldf_id and p.cldf_name like '%Maximally%'
group by p.cldf_name;
Name                                                                         Langs
MaximallyInflectedVerbSynthesis                                              451
MaximallyInflectedVerbSynthesis_VerbHasAnyIncorporation                      235
MaximallyInflectedVerbSynthesis_VerbHasBipartiteStem                         137
MaximallyInflectedVerbSynthesis_VerbHasNounIncorporation                     235
MaximallyInflectedVerbSynthesis_VerbHasNounOrVerbIncorporation               235
MaximallyInflectedVerbSynthesis_VerbHasVerbIncorporation                     235
MaximallyInflectedVerbSynthesis_VerbInflectionExponenceType                  220
MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategoryCount               235
MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategorySansAgreementCount  235
MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeCount              220
MaximallyInflectedVerbSynthesis_VerbIsPhonologicallyCoherent                 88
MaximallyInflectedVerbSynthesis_VerbIsProsodicallyCoherent                   131
MaximallyInflectedVerbSynthesis_VerbIsSyntacticallyCoherent                  131
MaximallyInflectedVerbSynthesis_VerbProsodicCoherencyNotes                   48

@annagraff
Author

It does! Except the first "MaximallyInflectedVerbSynthesis" is no longer necessary, since it is the original entry with JSON syntax, right?

@xrotwang

> It does! Except the first "MaximallyInflectedVerbSynthesis" is no longer necessary, since it is the original entry with JSON syntax, right?

It also includes the list-valued data. So for completeness, it should still be there, I think.

@xrotwang

@tzakharko Here are the changes needed to make this happen: https://github.com/autotyp/autotyp-cldf-scripts/pull/1/files

@tzakharko
Contributor

@xrotwang Thanks for the patch!

@annagraff Can you please check that your pipeline works as expected with version 1.1.1 (https://github.com/autotyp/autotyp-data/tree/version-1.1.1)? If everything is fine, I will push it as a minor release.

@annagraff
Author

Many many thanks to both of you, @tzakharko @xrotwang!

It works and looks exactly the way it should. You can push this as a minor release, @tzakharko

Have a nice weekend!

@tzakharko
Contributor

New version has been released and pushed to Zenodo. Thanks everyone!

4 participants