MaximallyInflectedVerbSynthesis - variables lost in CLDF #51
the only variable available in the dataset "MaximallyInflectedVerbSynthesis" seems to be called MaximallyInflectedVerbSynthesis as well and is faulty |
@annagraff Some data in AUTOTYP uses nested tables. If you use Python, I'd recommend using the JSON export instead of CLDF for the complex datasets, as it gives you direct access to the structure and is more straightforward to work with. I hope this addresses your issue. Please reopen it if I missed something. |
I am reopening this with a proposal to increase data reusability. Right now, the variables listed in the first comment in this thread are within a JSON format under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these variables, though, are simple binary per-language variables, and they could be incorporated straightforwardly in the CLDF format. The only problem is that the only values for these variables that can be trusted are for the languages that are TRUE for both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete). What do you think about this proposal @tzakharko and @xrotwang? |
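To make the proposal concrete, here is a minimal Python sketch of the flattening step. It is not the AUTOTYP export code: the language IDs and JSON values are invented, and it assumes the housekeeping flags are stored as JSON booleans (or 0/1) inside the umbrella value.

```python
import json

# Hypothetical rows: (language ID, JSON-encoded value of the umbrella variable).
# The IDs and values are invented for illustration only.
rows = [
    ("lang1", json.dumps({
        "VerbHasBipartiteStem": False,
        "VerbInflectionCategories": ["hypothetical-category"],
        "IsVerbAgreementSurveyComplete": True,
        "IsVerbInflectionSurveyComplete": True,
    })),
    ("lang2", json.dumps({
        "VerbHasBipartiteStem": True,
        "IsVerbAgreementSurveyComplete": False,
        "IsVerbInflectionSurveyComplete": True,
    })),
]

HOUSEKEEPING = ("IsVerbAgreementSurveyComplete", "IsVerbInflectionSurveyComplete")


def flatten(rows):
    """Yield (language, variable, value) for scalar variables of trusted languages."""
    for lang, raw in rows:
        record = json.loads(raw)
        # Keep a language only if both housekeeping flags are true.
        if not all(record.get(flag) in (True, 1) for flag in HOUSEKEEPING):
            continue
        for name, value in record.items():
            # List-valued (complex) variables are skipped here; they would stay
            # available only through the umbrella variable.
            if isinstance(value, (list, dict)):
                continue
            yield lang, name, value


flat = list(flatten(rows))
for row in flat:
    print(row)
```

Only lang1 survives the filter in this toy input, and its list-valued entry is left out of the flattened rows.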
Hi @nataliacp, I have reopened the issue per your request. If I understand the problem correctly, it should indeed be possible to treat each variable separately and thus only encode nested variables for the CLDF export. This would however require a substantial rewrite of the CLDF export pipeline. Definitely something I am interested in tackling, but unfortunately I cannot assign high priority to this. |
@nataliacp @tzakharko Technically, this sounds doable. One could turn the list of sets with complex data (https://github.com/cldf-datasets/autotypcldf/blob/aa0801e6c442c7b4a0607955d81b872e8890b739/cldfbench_autotypcldf.py#L205-L225) into a |
Thank you for the quick responses! Do you think that this could be done easily and relatively soon? @annagraff and I are using these variables and we base everything on the CLDF export. So either we extract those variables to be used internally in Anna's pipeline, or we wait for this to be implemented in autotyp so things are more straightforward and transparent. |
@xrotwang The particular dataset in question is per-language if I remember correctly, so one probably won't need a new level of parameters — just rewrite the entire thing to process variables instead of tables. Of course, it won't be as simple for datasets that describe constructions instead of languages. If a general solution is required, one would probably need to do specialised mapping for each complex dataset to preserve the data semantics. @nataliacp I won't be able to tackle this in the near future unfortunately. Maybe you or @annagraff could submit a PR? Alternatively you can try using the JSON export directly. |
Unfortunately, my pipeline is entirely in R, so my local fix for this would not be easily convertible to a pull request for the full CLDF generation pipeline. However, if @xrotwang thinks that a full solution is desirable, I could wait for it to be implemented, if this happens within the next month or so. |
If your pipeline is in R, why don't you use the R export directly? That's the cleanest data representation you have at your disposal. |
I am working with 5 databases, one of which is autotyp. The others are all available in CLDF but not R format, and it would make my code much messier to write a specific pipeline for autotyp, especially if CLDF is also available here. The only autotyp module I am having problems with is this Synthesis module, because all other modules I care about are identical between the two data structures CLDF and RData |
Ok, this makes sense, I understand your predicament. What would be a good way to move forward here? I won't be able to address this in a reasonable time frame. Unless someone submits a PR that takes care of this change, it might be the least painful thing for you @annagraff to implement a workaround and extract these variables from the embedded JSON. |
I could have a closer look sometime this week - and if things are as expected, I might be able to provide a PR later next week. (An analysis using multiple databases and profiting from all being available in CLDF is too good an advertisement :) ) |
thank you Robert! I hope it goes smoothly! |
I don't know how you access the CLDF data from R. But if you happen to use the "CLDF-via-SQLite" approach (see https://github.com/cldf/cookbook/tree/master/recipes/cldf_r#working-with-cldf-via-sqlite), you could exploit the pretty good JSON support built into SQLite:
sqlite> select
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem')
from
    valuetable as v, languagetable as l
where
    v.cldf_languageReference = l.cldf_id and v.cldf_parameterReference = 68
limit 20;
|
SQLite's |
For completeness:
select
    l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem')
from
    valuetable as v, languagetable as l
where
    v.cldf_languageReference = l.cldf_id and
    v.cldf_parameterReference = 68 and
    json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1 and
    json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1;
seems to be what we want here. |
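For anyone who wants to try the filtered query without the full dataset, here is a small sketch using Python's built-in sqlite3 module on a toy in-memory database. The table and column names follow the query above; the language names, value IDs, and JSON contents are invented, and it assumes the SQLite build bundled with Python includes the JSON functions (the default in recent versions).

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
# Toy tables holding only the columns the query touches.
con.executescript("""
    CREATE TABLE languagetable (cldf_id TEXT PRIMARY KEY, cldf_name TEXT);
    CREATE TABLE valuetable (
        cldf_id TEXT PRIMARY KEY,
        cldf_languageReference TEXT,
        cldf_parameterReference TEXT,
        cldf_value TEXT
    );
""")
con.executemany(
    "INSERT INTO languagetable VALUES (?, ?)",
    [("l1", "Language A"), ("l2", "Language B")],  # invented languages
)
values = [
    ("v1", "l1", "68", {"VerbHasBipartiteStem": False,
                        "IsVerbInflectionSurveyComplete": True,
                        "IsVerbAgreementSurveyComplete": True}),
    ("v2", "l2", "68", {"VerbHasBipartiteStem": True,
                        "IsVerbInflectionSurveyComplete": True,
                        "IsVerbAgreementSurveyComplete": False}),
]
con.executemany(
    "INSERT INTO valuetable VALUES (?, ?, ?, ?)",
    [(vid, lang, param, json.dumps(val)) for vid, lang, param, val in values],
)

QUERY = """
SELECT l.cldf_name, json_extract(v.cldf_value, '$.VerbHasBipartiteStem')
FROM valuetable AS v
JOIN languagetable AS l ON v.cldf_languageReference = l.cldf_id
WHERE v.cldf_parameterReference = ?
  AND json_extract(v.cldf_value, '$.IsVerbInflectionSurveyComplete') = 1
  AND json_extract(v.cldf_value, '$.IsVerbAgreementSurveyComplete') = 1
"""
# json_extract maps JSON true/false to the integers 1/0.
result = con.execute(QUERY, ("68",)).fetchall()
print(result)
```

Only the language with both survey flags set is returned; the other row is dropped by the two json_extract conditions.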
On @annagraff 's list above there are three (complex) list-valued variables:
I guess these shouldn't be available as regular parameters, correct @nataliacp @annagraff ? |
Yes, this is correct! @xrotwang |
@annagraff does this look like what you'd expect:
select
    p.cldf_name, count(v.cldf_id)
from
    parametertable as p, valuetable as v
where
    v.cldf_parameterreference = p.cldf_id and p.cldf_name like '%Maximally%'
group by p.cldf_name;
|
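The same per-parameter count can be reproduced from Python for a quick sanity check. This is only a sketch on a toy in-memory database: the parameter names and IDs below are hypothetical and do not reflect how the released CLDF dataset actually names the new parameters.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy tables; parameter names and IDs are hypothetical.
con.executescript("""
    CREATE TABLE parametertable (cldf_id TEXT PRIMARY KEY, cldf_name TEXT);
    CREATE TABLE valuetable (cldf_id TEXT PRIMARY KEY, cldf_parameterReference TEXT);
    INSERT INTO parametertable VALUES
        ('p1', 'MaximallyInflectedVerbSynthesis'),
        ('p2', 'MaximallyInflectedVerbSynthesis.VerbHasBipartiteStem'),
        ('p3', 'Unrelated');
    INSERT INTO valuetable VALUES
        ('v1', 'p1'), ('v2', 'p1'), ('v3', 'p2'), ('v4', 'p3');
""")

# Count values per parameter whose name contains "Maximally".
counts = dict(con.execute("""
    SELECT p.cldf_name, count(v.cldf_id)
    FROM parametertable AS p
    JOIN valuetable AS v ON v.cldf_parameterReference = p.cldf_id
    WHERE p.cldf_name LIKE '%Maximally%'
    GROUP BY p.cldf_name
"""))
print(counts)
```

Against the real SQLite export, the keys of `counts` could be compared with the list of expected variable names to spot any that are still missing.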
It does! Except the first "MaximallyInflectedVerbSynthesis" is no longer necessary, since it is the original entry with JSON syntax, right? |
It also includes the list-valued data. So for completeness, it should still be there, I think. |
@tzakharko Here's the changes needed to make this happen: https://github.com/autotyp/autotyp-cldf-scripts/pull/1/files |
@xrotwang Thanks for the patch! @annagraff Can you please check that your pipeline works as expected with https://github.com/autotyp/autotyp-data/tree/version-1.1.1 If everything is fine, I will push it as a minor release. |
Many many thanks to both of you, @tzakharko @xrotwang! It works and looks exactly the way it should. You can push this as a minor release, @tzakharko Have a nice weekend! |
New version has been released and pushed to Zenodo. Thanks everyone! |
The following autotyp variables are not available in the CLDF:
[1] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasBipartiteStem"
[2] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounIncorporation"
[3] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasVerbIncorporation"
[4] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionExponenceType"
[5] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeCount"
[6] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsPhonologicallyCoherent"
[7] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsSyntacticallyCoherent"
[8] "autotyp_MaximallyInflectedVerbSynthesis_VerbAgreement"
[9] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasAnyIncorporation"
[10] "autotyp_MaximallyInflectedVerbSynthesis_VerbHasNounOrVerbIncorporation"
[11] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategoryCount"
[12] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxCategorySansAgreementCount"
[13] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionMaxFormativeSansAgreementCount"
[14] "autotyp_MaximallyInflectedVerbSynthesis_VerbIsProsodicallyCoherent"
[15] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbAgreementSurveyComplete"
[16] "autotyp_MaximallyInflectedVerbSynthesis_IsVerbInflectionSurveyComplete"
[17] "autotyp_MaximallyInflectedVerbSynthesis_VerbIncorporation"
[18] "autotyp_MaximallyInflectedVerbSynthesis_VerbInflectionCategories"