MaximallyInflectedVerbSynthesis - variables lost in CLDF #51

annagraff · 2023-03-09T18:48:41Z

The following autotyp variables are not available in the CLDF:

[1] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasBipartiteStem"
[2] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounIncorporation"
[3] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasVerbIncorporation"
[4] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionExponenceType"
[5] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeCount"
[6] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsPhonologicallyCoherent"
[7] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsSyntacticallyCoherent"
[8] "autotyp_MaximallyInflectedVerbSynthesis_VerbAgreement"
[9] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasAnyIncorporation"
[10] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounOrVerbIncorporation"
[11] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategoryCount"
[12] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategorySansAgreementCount"
[13] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeSansAgreementCount"
[14] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsProsodicallyCoherent"
[15] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbAgreementSurveyComplete"
[16] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbInflectionSurveyComplete"
[17] "autotyp_MaximallyInflectedVerbSynthesis_VerbIncorporation"
[18] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionCategories"

annagraff · 2023-03-09T18:56:26Z

The only variable available in the dataset "MaximallyInflectedVerbSynthesis" seems to be called MaximallyInflectedVerbSynthesis as well, and it is faulty.

tzakharko · 2023-03-22T12:39:21Z

@annagraff Some data in AUTOTYP uses nested tables, and MaximallyInflectedVerbSynthesis is one such dataset. Since CLDF structured datasets do not directly support nesting, we export them as JSON. See this comment by Robert for more information: #2 (comment)

If you use Python, I'd recommend using the JSON export instead of CLDF for the complex datasets, as it gives you direct access to the structure and is more straightforward to work with.

I hope this addresses your issue. Please reopen it if I missed something.
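
As a minimal sketch of what working with the nested structure directly could look like in Python: the record below is made up for illustration. Only the field names come from the variable list at the top of this thread; the values and the layout of the list-valued entry (including the `Category` key) are assumptions, not actual AUTOTYP data.

```python
import json

# Hypothetical JSON value for one language. Field names follow the variable
# list in this thread; values and the list-entry layout are invented.
raw = """
{
  "VerbHasBipartiteStem": true,
  "VerbHasNounIncorporation": false,
  "IsVerbAgreementSurveyComplete": true,
  "IsVerbInflectionSurveyComplete": true,
  "VerbInflectionCategories": [
    {"Category": "tense"},
    {"Category": "aspect"}
  ]
}
"""

record = json.loads(raw)

# Scalar fields can be read directly from the parsed dict.
has_bipartite_stem = record["VerbHasBipartiteStem"]

# Nested (list-valued) fields stay available as Python lists of dicts.
categories = [entry["Category"] for entry in record["VerbInflectionCategories"]]

print(has_bipartite_stem, categories)
```

The point is only that the nesting survives parsing, so scalar and list-valued variables can be handled side by side.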

nataliacp · 2023-05-24T08:38:26Z

I am reopening this with a proposal to increase data reusability. Right now, the variables listed in the first comment in this thread are stored as JSON under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these variables, though, are simple binary per-language variables, and they could be incorporated straightforwardly in the CLDF format. The only problem is that the values for these variables can be trusted only for the languages that are TRUE for both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete). What do you think about this proposal, @tzakharko and @xrotwang?

tzakharko · 2023-05-24T08:50:20Z

Hi @nataliacp, I have reopened the issue per your request.

If I understand the problem correctly, it should indeed be possible to treat each variable separately and thus encode only the nested variables as JSON for the CLDF export. This would, however, require a substantial rewrite of the CLDF export pipeline. Definitely something I am interested in tackling, but unfortunately I cannot assign high priority to this.

xrotwang · 2023-05-24T09:06:26Z

@nataliacp @tzakharko Technically, this sounds doable. One could turn the list of sets with complex data (https://github.com/cldf-datasets/autotypcldf/blob/aa0801e6c442c7b4a0607955d81b872e8890b739/cldfbench_autotypcldf.py#L205-L225) into a dict, mapping set IDs to callables, which return additional, synthetic parameter values (late-late aggregation so to speak :) ).
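
A rough sketch of that idea, under stated assumptions: `COMPLEX_SETS`, `simple_values`, `no_extra_values`, `synthetic_parameters` and the record layout are hypothetical names chosen for illustration and are not taken from cldfbench_autotypcldf.py. The `<set ID>_<variable>` naming is likewise just one possible convention.

```python
from typing import Any, Callable, Dict, Iterator, Tuple

Record = Dict[str, Any]
Extractor = Callable[[Record], Iterator[Tuple[str, Any]]]


def simple_values(record: Record) -> Iterator[Tuple[str, Any]]:
    """Yield (variable, value) pairs for the non-nested fields of a record."""
    for name, value in record.items():
        if value is not None and not isinstance(value, (list, dict)):
            yield name, value


def no_extra_values(record: Record) -> Iterator[Tuple[str, Any]]:
    """Default for sets that should only be exported as JSON."""
    return iter(())


# Hypothetical mapping from set ID to a callable producing synthetic values.
COMPLEX_SETS: Dict[str, Extractor] = {
    "MaximallyInflectedVerbSynthesis": simple_values,
    "SomeConstructionLevelSet": no_extra_values,  # made-up set ID
}


def synthetic_parameters(set_id: str, record: Record) -> Dict[str, Any]:
    """Return extra parameter values keyed as '<set_id>_<variable>'."""
    extract = COMPLEX_SETS.get(set_id, no_extra_values)
    return {f"{set_id}_{name}": value for name, value in extract(record)}


# Invented record: one scalar field and one list-valued field.
example = {"VerbHasBipartiteStem": True, "VerbAgreement": [{"Role": "A"}]}
print(synthetic_parameters("MaximallyInflectedVerbSynthesis", example))
```

Keeping the per-set logic in callables means a construction-level set could later get its own specialised extractor without touching the others.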

nataliacp · 2023-05-24T09:10:59Z

Thank you for the quick responses! Do you think that this could be done easily and relatively soon? @annagraff and I are using these variables and we are basing everything on the CLDF export. So, either we extract those variables to be used internally in Anna's pipeline, or we can wait for this to be implemented in autotyp so things are more straightforward and transparent.

tzakharko · 2023-05-24T09:21:01Z

@xrotwang The particular dataset in question is per-language if I remember correctly, so one probably won't need a new level of parameters — just rewrite the entire thing to process variables instead of tables. Of course, it won't be as simple for datasets that describe constructions instead of languages. If a general solution is required, one would probably need to do specialised mapping for each complex dataset to preserve the data semantics.

@nataliacp I won't be able to tackle this in the near future unfortunately. Maybe you or @annagraff could submit a PR? Alternatively you can try using the JSON export directly.

annagraff · 2023-05-24T09:27:42Z

Unfortunately, my pipeline is entirely in R, so my local fix for this would not be easily convertible to a pull request for the full CLDF generation pipeline. However, if @xrotwang thinks that a full solution is desirable, I could wait for it to be implemented, if this happens within the next month or so.

tzakharko · 2023-05-24T09:28:51Z

If your pipeline is in R, why don't you use the R export directly? That's the cleanest data representation you have at your disposal.

annagraff · 2023-05-24T09:32:51Z

I am working with 5 databases, one of which is autotyp. The others are all available in CLDF but not in R format, and it would make my code much messier to write a specific pipeline for autotyp, especially if CLDF is also available here. The only autotyp module I am having problems with is this Synthesis module, because all other modules I care about are identical between the two data structures, CLDF and RData.

tzakharko · 2023-05-24T09:36:19Z

Ok, this makes sense, I understand your predicament.

What would be a good way to move forward here? I won't be able to address this in a reasonable time frame. Unless someone submits a PR that takes care of this change, it might be the least painful thing for you @annagraff to implement a workaround and extract these variables from the embedded JSON.

xrotwang · 2023-05-24T09:43:49Z

I could have a closer look sometime this week - and if things are as expected, I might be able to provide a PR later next week.

(An analysis using multiple databases and profiting from all being available in CLDF is too good an advertisement :) )

nataliacp · 2023-05-24T09:46:52Z

Thank you, Robert! I hope it goes smoothly!

xrotwang · 2023-05-24T10:15:13Z

I don't know how you access the CLDF data from R. But if you happen to use the "CLDF-via-SQLite" approach (see https://github.com/cldf/cookbook/tree/master/recipes/cldf_r#working-with-cldf-via-sqlite), you could exploit the pretty good JSON support built into SQLite:

sqlite> select 
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem') 
from
    valuetable as v, languagetable as l 
where 
    v.cldf_languageReference = l.cldf_id and v.cldf_parameterReference = 68
limit 20;

Language	Value
Belhare	1
Hungarian	0
English	0
Warlpiri	0
Fijian (Boumaa)	0
Thai	0
Mandarin	0
Ingush	1
Songhai (Koyra Chiini)	0
Yoruba	1
Persian	1
Ainu
Rama	1
Russian	0
Georgian	0
Karen (Sgaw)
Cree (Plains)	1
Lakhota	1
Adyghe (West Circassian)
Zapotec (Isthmus)	0

xrotwang · 2023-05-24T10:25:06Z

SQLite's json_extract function shouldn't be too hard to use via dbplyr, see https://dbplyr.tidyverse.org/articles/translation-function.html#unknown-functions

xrotwang · 2023-05-24T10:33:24Z

For completeness:

select
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem') 
from 
    valuetable as v, languagetable as l 
where 
    v.cldf_languageReference = l.cldf_id and 
    v.cldf_parameterReference = 68 and 
    json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1 and 
    json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1;

seems to be what we want here.
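
For anyone who wants to try the query without the real database, here is a small self-contained sketch using Python's built-in sqlite3 module. The table and column names mirror the query above; the three languages, their values, and the use of JSON true/false for the flags are invented for illustration. It also assumes the SQLite library bundled with Python includes the JSON functions, which should hold for recent builds.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table languagetable (cldf_id text primary key, cldf_name text);
create table valuetable (
    cldf_id text primary key,
    cldf_languageReference text,
    cldf_parameterReference integer,
    cldf_value text
);
""")

# Made-up toy data, not actual AUTOTYP values.
languages = [("l1", "Language A"), ("l2", "Language B"), ("l3", "Language C")]
values = [
    ("v1", "l1", {"VerbHasBipartiteStem": True,
                  "IsVerbInflectionSurveyComplete": True,
                  "IsVerbAgreementSurveyComplete": True}),
    ("v2", "l2", {"VerbHasBipartiteStem": True,
                  "IsVerbInflectionSurveyComplete": True,
                  "IsVerbAgreementSurveyComplete": False}),
    ("v3", "l3", {"VerbHasBipartiteStem": False,
                  "IsVerbInflectionSurveyComplete": True,
                  "IsVerbAgreementSurveyComplete": True}),
]
conn.executemany("insert into languagetable values (?, ?)", languages)
conn.executemany(
    "insert into valuetable values (?, ?, 68, ?)",
    [(vid, lid, json.dumps(val)) for vid, lid, val in values],
)

# Same shape as the query above, plus an ORDER BY for stable output.
query = """
select l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem')
from valuetable as v, languagetable as l
where
    v.cldf_languageReference = l.cldf_id and
    v.cldf_parameterReference = 68 and
    json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1 and
    json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1
order by l.cldf_name
"""
rows = conn.execute(query).fetchall()
print(rows)
```

In this toy data, "Language B" is dropped because one of its survey flags is false, which is the filtering behaviour the two housekeeping variables are meant to provide.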

xrotwang · 2023-05-25T10:10:59Z

On @annagraff 's list above there are three (complex) list-valued variables:

VerbInflectionCategories
VerbIncorporation
VerbAgreement

I guess these shouldn't be available as regular parameters, correct @nataliacp @annagraff ?

annagraff · 2023-05-25T10:11:49Z

Yes, this is correct! @xrotwang

xrotwang · 2023-05-25T11:44:01Z

@annagraff does this look like what you'd expect:

select 
    p.cldf_name, count(v.cldf_id) 
from 
    parametertable as p, valuetable as v 
where 
    v.cldf_parameterreference = p.cldf_id and p.cldf_name like '%Maximally%'
group by p.cldf_name;

Name	Langs
MaximallyInflectedVerbSynthesis	451
MaximallyInflectedVerbSynthesis_VerbHasAnyIncorporation	235
MaximallyInflectedVerbSynthesis_VerbHasBipartiteStem	137
MaximallyInflectedVerbSynthesis_VerbHasNounIncorporation	235
MaximallyInflectedVerbSynthesis_VerbHasNounOrVerbIncorporation	235
MaximallyInflectedVerbSynthesis_VerbHasVerbIncorporation	235
MaximallyInflectedVerbSynthesis_VerbInflectionExponenceType	220
MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategoryCount	235
MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategorySansAgreementCount	235
MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeCount	220
MaximallyInflectedVerbSynthesis_VerbIsPhonologicallyCoherent	88
MaximallyInflectedVerbSynthesis_VerbIsProsodicallyCoherent	131
MaximallyInflectedVerbSynthesis_VerbIsSyntacticallyCoherent	131
MaximallyInflectedVerbSynthesis_VerbProsodicCoherencyNotes	48

annagraff · 2023-05-25T11:48:08Z

It does! Except the first "MaximallyInflectedVerbSynthesis" is no longer necessary, since it is the original entry with JSON syntax, right?

xrotwang · 2023-05-25T11:49:19Z

> It does! Except the first "MaximallyInflectedVerbSynthesis" is no longer necessary, since it is the original entry with JSON syntax, right?

It also includes the list-valued data. So for completeness, it should still be there, I think.

xrotwang · 2023-05-25T11:52:18Z

@tzakharko Here are the changes needed to make this happen: https://github.com/autotyp/autotyp-cldf-scripts/pull/1/files

tzakharko · 2023-05-26T11:45:38Z

@xrotwang Thanks for the patch!

@annagraff Can you please check that your pipeline works as expected with version 1.1.1 (https://github.com/autotyp/autotyp-data/tree/version-1.1.1)? If everything is fine, I will push it as a minor release.

annagraff · 2023-05-26T14:12:52Z

Many many thanks to both of you, @tzakharko @xrotwang!

It works and looks exactly the way it should. You can push this as a minor release, @tzakharko

Have a nice weekend!

tzakharko · 2023-05-27T07:30:29Z

New version has been released and pushed to Zenodo. Thanks everyone!

tzakharko closed this as completed Mar 22, 2023

nataliacp mentioned this issue May 24, 2023

Provide data in CLDF format #2

Open

tzakharko reopened this May 24, 2023

xrotwang mentioned this issue May 25, 2023

extract simple variables from unitset autotyp/autotyp-cldf-scripts#1

Merged

tzakharko closed this as completed May 27, 2023
