Supporting No Data in derived multiple_response for create_categorical(..., multiple=True) #286

jamesrkg · 2018-08-22T19:48:29Z

Following lots of conversations about this (see #228 and #196) it looks like the following specific changes are needed to support the creation of derived multiple_response variables where the expected base is less than the total number of rows in the dataset.

The following two changes are required:

Firstly, the default categories given to a new multiple_response using this method should be:

CategoryList(
    [
        (1, Category(numeric_value=None, selected=True, id=1, missing=False, name=Selected)), 
        (2, Category(numeric_value=None, selected=False, id=2, missing=False, name=Not selected)),
        (-1, Category(numeric_value=None, selected=False, id=-1, missing=True, name=No Data))
    ]
)

Secondly, add a new (optional) key to each object in the list given to categories (named "base" in the example below): an expression describing which cases are allowed to have something other than No Data.

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'gen == 1 and age == 1', 'base': 'region == 1'},
        {'id': 2, 'name': 'Sub2', 'case': 'gen == 1 and age == 2', 'base': 'region == 1'},
        {'id': 3, 'name': 'Sub3', 'case': 'gen == 2 and age == 1', 'base': 'region == 1'},
        {'id': 4, 'name': 'Sub4', 'case': 'gen == 2 and age == 2', 'base': 'region == 1'}
    ], 
    multiple=True
)

These expressions would be evaluated as:

case is True and base is True: 1
case is False and base is True: 2
case is True and base is False: -1
case is False and base is False: -1
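
The truth table above can be sketched as a small function. This is only an illustration of the proposed semantics; `resolve_category` is a hypothetical name, not something that exists in scrunch.

```python
def resolve_category(case, base):
    """Return the category id for one row of one subvariable.

    Ids follow the proposal above:
    1 = Selected, 2 = Not selected, -1 = No Data.
    """
    if not base:
        # Outside the base: always No Data, whatever the case result is.
        return -1
    return 1 if case else 2


# Print the four combinations listed above.
for case in (True, False):
    for base in (True, False):
        print(case, base, resolve_category(case, base))
```

Checking base first keeps the rule "outside the base means No Data" in one place.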

A possible extension of the above: where the base expression is the same for every subvariable, add support for a new base argument that is automatically applied to each:

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'gen == 1 and age == 1'},
        {'id': 2, 'name': 'Sub2', 'case': 'gen == 1 and age == 2'},
        {'id': 3, 'name': 'Sub3', 'case': 'gen == 2 and age == 1'},
        {'id': 4, 'name': 'Sub4', 'case': 'gen == 2 and age == 2'}
    ], 
    multiple=True,
    base='region == 1'
)
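
One way the shared base could be applied is sketched below. `apply_default_base` is a hypothetical helper, not an existing scrunch function, and letting a per-category base take precedence over the global one is an assumption of this sketch.

```python
def apply_default_base(categories, base=None):
    """Return copies of the category dicts with 'base' filled in.

    Assumption: a 'base' already present on a category wins over the
    global one. The input list is left unmodified.
    """
    if base is None:
        return [dict(cat) for cat in categories]
    return [dict(cat, base=cat.get('base', base)) for cat in categories]


categories = [
    {'id': 1, 'name': 'Sub1', 'case': 'gen == 1 and age == 1'},
    {'id': 2, 'name': 'Sub2', 'case': 'gen == 1 and age == 2'},
]
for cat in apply_default_base(categories, base='region == 1'):
    print(cat['name'], cat['base'])
```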


jjdelc · 2018-08-30T01:25:38Z

Here's a test using pycrunch (the pycrunch_dataset is a helper that imports the CSV to the dataset).

This example creates a multiple response variable using 3 categories: Yes/No/Missing. Note that all subvariables need to specify the same categories (but each can have different conditions).

CSV_CATS = '''Gender,Age
M,12
M,13
F,14
M,15
F,50
F,32
M,44'''


    def test_case_variable(self):
        ds = self.pycrunch_dataset(csv=CSV_CATS)
        varcat = ds.variables
        gender = varcat.by('alias')['Gender']
        age = varcat.by('alias')['Age']

        array_expr = {
            'function': 'array',
            'args': [{
                'function': 'select',
                'args': [{
                    'map': {
                        '01': {
                            "function": "case",
                            "args": [{
                                "column": [1, 2, 3],
                                "type": {
                                    "value": {
                                        "class": "categorical",
                                        "categories": [
                                            {"id": 1, "name": "Yes", "missing": False, 'selected': True},
                                            {"id": 2, "name": "No", "missing": False},
                                            {"id": 3, "name": "Missing", "missing": True}
                                        ],
                                        "ordinal": False
                                    }
                                }
                            }, {
                                "function": "<=",
                                "args": [
                                    {"variable": age.entity_url},
                                    {"value": 20}
                                ]
                            }, {
                                "function": "between",
                                "args": [
                                    {"variable": age.entity_url},
                                    {"value": 20},
                                    {"value": 40}
                                ]
                            }, {
                                "function": ">",
                                "args": [
                                    {"variable": age.entity_url},
                                    {"value": 40}
                                ]
                            }],
                            'references': {
                                'name': 'subvar1',
                                'alias': 'subvar1',
                            }
                        },
                        '02': {
                            "function": "case",
                            "args": [{
                                "column": [1, 2, 3],
                                "type": {
                                    "value": {
                                        "class": "categorical",
                                        "categories": [
                                            {"id": 1, "name": "Yes", "missing": False, 'selected': True},
                                            {"id": 2, "name": "No", "missing": False},
                                            {"id": 3, "name": "Missing", "missing": True}
                                        ],
                                        "ordinal": False
                                    }
                                }
                            }, {
                                "function": "in",
                                "args": [
                                    {"variable": gender.entity_url},
                                    {"value": [1]}
                                ]
                            }, {
                                "function": "in",
                                "args": [
                                    {"variable": gender.entity_url},
                                    {"value": []}
                                ]
                            }, {
                                "function": "in",
                                "args": [
                                    {"variable": gender.entity_url},
                                    {"value": [2]}
                                ]
                            }],
                            'references': {
                                'name': 'subvar2',
                                'alias': 'subvar2',
                            }
                        }
                    }
                }]
            }]
        }
        arrayvar = ds.variables.create(as_entity({
            'name': 'casevar',
            'derivation': array_expr
        })).refresh()
        self.assertTrue(arrayvar.body.derived)
        self.assertEqual(arrayvar.body.type, 'multiple_response')
        self.assertEqual(arrayvar.body.categories, [
            {'id': 1, 'missing': False, 'name': 'Yes', 'selected': True},
            {'id': 2, 'missing': False, 'name': 'No'},
            {'id': 3, 'missing': True, 'name': 'Missing'}
        ])
        data = ds.follow('table', 'limit=10').data
        self.assertEqual(data[arrayvar.body.id], [
            [1, 1],
            [1, 1],
            [{u'?': 3}, 1],
            [1, 1],
            [{u'?': 3}, {u'?': 3}],
            [{u'?': 3}, 2],
            [1, {u'?': 3}]
        ])
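
As a hand check of the expected table data, the sketch below recomputes the asserted rows in plain Python, without pycrunch. It rests on assumptions inferred from the data rather than confirmed in this thread: Gender is coded M -> 1 and F -> 2, the first matching case condition wins, and each row lists the gender-based subvariable before the age-based one. The helper names are made up for illustration.

```python
# Rows from CSV_CATS above, as (Gender, Age).
ROWS = [('M', 12), ('M', 13), ('F', 14), ('M', 15),
        ('F', 50), ('F', 32), ('M', 44)]

# Assumption: the importer codes Gender as M -> 1, F -> 2.
GENDER_CODES = {'M': 1, 'F': 2}
MISSING_ID = 3  # id of the "Missing" category in the test


def age_case(age):
    """Mirror the three age conditions; the first match wins (assumed)."""
    if age <= 20:
        return 1
    if 20 <= age <= 40:
        return 2
    return 3  # age > 40


def gender_case(gender):
    """Mirror the three `in` conditions on the gender code."""
    code = GENDER_CODES[gender]
    if code in [1]:
        return 1
    if code in []:
        return 2  # never matches: the value list is empty
    return 3  # code in [2]


def render(category_id):
    """Show the missing category the way the table data does."""
    return {'?': category_id} if category_id == MISSING_ID else category_id


table = [[render(gender_case(g)), render(age_case(a))] for g, a in ROWS]
for row in table:
    print(row)
```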

xbito · 2018-08-30T02:23:19Z

@mathiasbc, following Jj's payload example, my intention is to do this in 2 steps:

First, we need a new method to build this kind of derived multiple response: one that can have any number of categories (with exactly one marked as selected), can mark some categories as missing and others not, and accepts the appropriate expressions for generating each subvariable.

Second, once that new helper is in place, modify create_categorical to enable the common use case of generating 3 categories: 1 Selected, 2 Not Selected, 3 Missing. It should be possible to specify the missing case per subvariable or globally for the entire variable (that's the parameter Jamie has named base; I would prefer something more like missing_case, though I'm open to suggestions).

mathiasbc · 2018-09-04T16:08:42Z

The first step will be to add a flexible method that allows deriving multiple responses in this proposed format:

desireable_kwargs = {
    'name': 'derived1',
    'alias': 'derived1',
    'description': 'Multiple response derived',
    # categories must have one and only 1 as selected=True
    'categories': [
        {'id': 1, 'name': 'Yes', 'missing': False, 'selected': True},
        {'id': 2, 'name': 'No', 'missing': False},
        {'id': 3, 'name': 'Maybe', 'missing': False},
        {'id': 4, 'name': 'Missing', 'missing': True}
    ],
    'responses': [
        {
            'name': 'Subvar 1',
            'id': 1,
            'cases': ['var_1 < 20', 'var_1 == 20', 'var_1 == 30', 'var_1 > 30']
        },
        {
            'name': 'Subvar 2',
            'id': 2,
            'cases': ['var_2 < 2', 'var_2 == 2', 'var_2 == 3', 'var_2 > 3']
        },
        {
            'name': 'Subvar 3',
            'id': 3,
            'cases': ['var_3 in [1]', 'var_3 in [2]', 'var_3 in [3]', 'var_3 in [4]']
        }
        # ... Define as many subvariables as needed
    ]
}

ds.derive_multiple_response(**desireable_kwargs)

The cases argument of each entry in responses is a list of expressions that must have the same length as categories, so that each expression maps to the category at the same position.
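
The positional pairing could be checked along these lines. This is a minimal sketch, and `map_cases_to_categories` is a hypothetical helper name, not part of the proposed method.

```python
def map_cases_to_categories(categories, cases):
    """Pair each expression with the category at the same position.

    Raises ValueError when the two lists differ in length, since the
    pairing would then be ambiguous.
    """
    if len(cases) != len(categories):
        raise ValueError(
            'expected %d cases, got %d' % (len(categories), len(cases)))
    return {cat['id']: expr for cat, expr in zip(categories, cases)}


categories = [
    {'id': 1, 'name': 'Yes', 'missing': False, 'selected': True},
    {'id': 2, 'name': 'No', 'missing': False},
    {'id': 3, 'name': 'Maybe', 'missing': False},
    {'id': 4, 'name': 'Missing', 'missing': True},
]
cases = ['var_1 < 20', 'var_1 == 20', 'var_1 == 30', 'var_1 > 30']
print(map_cases_to_categories(categories, cases))
```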

Let me know if I'm on the right track or if you have a better approach.

jamesrkg · 2018-09-04T16:41:55Z

Some thoughts/suggestions:

Make sure notes is supported as well.
In specifying the categories, if we can assume 'selected': False unless stated otherwise, can we do the same for 'missing': False? This will make it somewhat easier to read.
Rename responses to subvariables.
In specifying the subvariables, remove id (since choosing a sub/variable id is not supported by the API) and explicitly choose the alias instead (where a convention for X_# can take over).
Make specifying the cases more explicit by giving a dict mapping category id to case. Even though this is more verbose it will be much more manageable especially if there are a lot of categories to keep track of.

Given the above, I've adapted the example to:

desireable_kwargs = {
    'name': 'derived1',
    'alias': 'derived1',
    'description': 'Multiple response derived',
    'notes': 'Special variable',
    # categories must have one and only 1 as selected=True
    'categories': [
        {'id': 1, 'name': 'Yes', 'selected': True},
        {'id': 2, 'name': 'No'},
        {'id': 3, 'name': 'Maybe'},
        {'id': 4, 'name': 'Missing', 'missing': True}
    ],
    'subvariables': [
        {
            'alias': 'Subvar_1',
            'name': 'Subvar 1',
            'cases': {
                1: 'var_1 < 20',
                2: 'var_1 == 20',
                3: 'var_1 == 30',
                4: 'var_1 > 30'
            }
        },
        {
            'alias': 'Subvar_2',
            'name': 'Subvar 2',
            'cases': {
                1: 'var_2 < 2',
                2: 'var_2 == 2',
                3: 'var_2 == 3',
                4: 'var_2 > 3'
            }
        },
        {
            'alias': 'Subvar_3',
            'name': 'Subvar 3',
            'cases': {
                1: 'var_3 in [1]',
                2: 'var_3 in [2]',
                3: 'var_3 in [3]',
                4: 'var_3 in [4]'
            }
        }
        # ... Define as many subvariables as needed
    ]
}
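
The suggested defaults ('selected': False and 'missing': False unless stated) together with the "one and only one selected" rule could be enforced roughly as follows. `normalize_categories` is a hypothetical name used only for this sketch.

```python
def normalize_categories(categories):
    """Fill in default flags and check that exactly one category is selected."""
    normalized = []
    for cat in categories:
        full = {'selected': False, 'missing': False}
        full.update(cat)  # explicit values override the defaults
        normalized.append(full)

    selected = [cat['id'] for cat in normalized if cat['selected']]
    if len(selected) != 1:
        raise ValueError(
            'exactly one category must be selected, got ids %r' % selected)
    return normalized


categories = [
    {'id': 1, 'name': 'Yes', 'selected': True},
    {'id': 2, 'name': 'No'},
    {'id': 3, 'name': 'Maybe'},
    {'id': 4, 'name': 'Missing', 'missing': True},
]
for cat in normalize_categories(categories):
    print(cat)
```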

xbito · 2018-09-04T19:16:10Z

On the list that Jamie made:

Agreed on supporting notes
selected=False and missing=False should be default values
I like calling it subvariables, it matches the api that way
On the alias front, I'm a bit torn. In other parts of scrunch, when we deal with subvariables, we treat what the user provides as a suffix and use the variable name as a prefix, right? Same goes for Gryphon variables. Passing an "alias" may confuse people by breaking their expectations (those being either to have that exact alias respected, or to have the variable name as prefix). So, should we make alias optional? And support id and translate that to the suffix of var_name+suffix? Shall we use something other than id or alias, like subvar_id?
Agreed on the dict for the case statements (though in the example Jamie has provided a list? It may even be in a format that is not Python compatible). Anyway, I think you understand what he meant.

jamesrkg · 2018-09-04T19:21:00Z

Thanks @xbito.

I'm fine with this if you like. I was wondering if you instead wanted this particular method to be truer to the underlying API. If you're after consistency then by all means go back to having an id for subvariables.
I've fixed the braces used for mapping the cases; that was my typo.

mathiasbc · 2018-09-04T23:12:08Z

@jamesrkg I added a Pull Request with the code: #290

Please pull that branch and test that you get what you are expecting so I can write some tests and have it ready to merge.

An example:

desireable_kwargs = {
    'name': 'derived1',
    'alias': 'derived1',
    'description': 'Multiple response derived',
    'notes': 'Special variable',
    'categories': [
        {'id': 1, 'name': 'Yes', 'selected': True},
        {'id': 2, 'name': 'No'},
        {'id': 3, 'name': 'Missing', 'missing': True}
    ],
    'subvariables': [
        {
            'id': 1,
            'name': 'Subvar 1',
            'cases': {
                1: 'Q3bp2_14 in [1]', 
                2: 'Q3bp2_14 in [2]', 
                3: 'Q3bp2_14 in [3]', 
            }
        },
        {
            'id': 2,
            'name': 'Subvar 2',
            'cases': {
                1: 'Q5a1 in [1]',
                2: 'Q5a1 in [2]',
                3: 'Q5a1 in [3]', 
            }
        },
        {
            'id': 3,
            'name': 'Subvar 3',
            'cases': {
                1: 'Q3bp1_21 in [1]', 
                2: 'Q3bp1_21 in [2]', 
                3: 'Q3bp1_21 in [3]', 
            }
        }
    ]
}

ds.derive_multiple_response(**desireable_kwargs)

I noticed that the created derived variable adds a category with -1: No Data, which leaves me with some doubt.

I replaced alias with id in subvariables to match all other similar methods. The alias will be constructed from the variable alias + the subvariable id.
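
A rough illustration of that alias construction is below. The underscore separator and the helper name `subvariable_alias` are assumptions made for this sketch; the actual format is whatever PR #290 implements.

```python
def subvariable_alias(variable_alias, subvar_id, sep='_'):
    """Build a subvariable alias from the parent alias and the subvariable id.

    Assumption: the two parts are joined with an underscore.
    """
    return '{}{}{}'.format(variable_alias, sep, subvar_id)


for subvar_id in (1, 2, 3):
    print(subvariable_alias('derived1', subvar_id))
```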

mathiasbc · 2018-09-16T16:44:35Z

There is a new PR that integrates the derive_multiple_response method with create_categorical: #293.

jamesrkg · 2018-09-17T20:09:49Z

I can't quite get this to produce the result I'm after.

One problem I think I have already commented on here:

https://github.com/Crunch-io/scrunch/pull/293/files#r218187826

But I'll also clarify the requirements, because at the top of this ticket I proposed a base argument (describing included cases), whereas the more appropriate solution is a missing argument instead (the PR is already doing that).

Three use cases need to be catered for:

1. User only wants to give the selected cases, there are no missing cases (this is how it worked originally).
2. User wants to give the selected cases but the same missing case applies for all subvariables and should only be given once.
3. User needs to give explicit selected and missing cases for each subvariable separately.

1.

User only wants to give the selected cases, there are no missing cases.

test_m = ds.create_categorical(
    alias='drinks',
    name='Preferred drinks',
    description='Which drinks do you prefer?',
    multiple=True,
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'q1 in [1]'},
        {'id': 2, 'name': 'Sub2', 'case': 'q1 in [2]'},
        {'id': 3, 'name': 'Sub3', 'case': 'q1 in [95]'},
        {'id': 4, 'name': 'Sub4', 'case': 'q1 in [99]'},
        {'id': 5, 'name': 'Sub5', 'case': 'q1 in [1,2]'}
    ]
)

Which should yield the following subvariables for derive_multiple_response:

    subvariables=[
        {
            'id': 1, 
            'name': 'Sub1', 
            'cases': {
                1: 'q1 in [1]', 
                2: 'not q1 in [1]'
            }
        },
        {
            'id': 2, 
            'name': 'Sub2', 
            'cases': {
                1: 'q1 in [2]', 
                2: 'not q1 in [2]'
            }
        },
        {
            'id': 3, 
            'name': 'Sub3', 
            'cases': {
                1: 'q1 in [95]', 
                2: 'not q1 in [95]'
            }
        },
        {
            'id': 4, 
            'name': 'Sub4', 
            'cases': {
                1: 'q1 in [99]', 
                2: 'not q1 in [99]'
            }
        },
        {
            'id': 5, 
            'name': 'Sub5', 
            'cases': {
                1: 'q1 in [1,2]', 
                2: 'not q1 in [1,2]'
            }
        }
    ]

2.

User wants to give the selected cases but the same missing case applies for all subvariables and should only be given once.

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    multiple=True,
    missing='missing(q1)',
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'q1 in [1]'},
        {'id': 2, 'name': 'Sub2', 'case': 'q1 in [2]'},
        {'id': 3, 'name': 'Sub3', 'case': 'q1 in [95]'},
        {'id': 4, 'name': 'Sub4', 'case': 'q1 in [99]'},
        {'id': 5, 'name': 'Sub5', 'case': 'q1 in [1,2]'}
    ]
)

Which should yield the following subvariables for derive_multiple_response:

    subvariables=[
        {
            'id': 1, 
            'name': 'Sub1', 
            'cases': {
                1: 'q1 in [1]', 
                2: 'not q1 in [1]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 2, 
            'name': 'Sub2', 
            'cases': {
                1: 'q1 in [2]', 
                2: 'not q1 in [2]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 3, 
            'name': 'Sub3', 
            'cases': {
                1: 'q1 in [95]', 
                2: 'not q1 in [95]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 4, 
            'name': 'Sub4', 
            'cases': {
                1: 'q1 in [99]', 
                2: 'not q1 in [99]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 5, 
            'name': 'Sub5', 
            'cases': {
                1: 'q1 in [1,2]', 
                2: 'not q1 in [1,2]',
                3: 'missing(q1)'
            }
        }
    ]

3.

User needs to give explicit selected and missing cases for each subvariable separately.

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    multiple=True,
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'q1 in [1,2]', 'missing': 'missing(q1)'},
        {'id': 2, 'name': 'Sub2', 'case': 'q2 in [1,2]', 'missing': 'missing(q2)'},
        {'id': 3, 'name': 'Sub3', 'case': 'q3 in [1,2]', 'missing': 'missing(q3)'},
        {'id': 4, 'name': 'Sub4', 'case': 'q4 in [1,2]', 'missing': 'missing(q4)'},
        {'id': 5, 'name': 'Sub5', 'case': 'q5 in [1,2]', 'missing': 'missing(q5)'}
    ]
)

Which should yield the following subvariables for derive_multiple_response:

    subvariables=[
        {
            'id': 1, 
            'name': 'Sub1', 
            'cases': {
                1: 'q1 in [1,2]', 
                2: 'not q1 in [1,2]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 2, 
            'name': 'Sub2', 
            'cases': {
                1: 'q2 in [1,2]', 
                2: 'not q2 in [1,2]',
                3: 'missing(q2)'
            }
        },
        {
            'id': 3, 
            'name': 'Sub3', 
            'cases': {
                1: 'q3 in [1,2]', 
                2: 'not q3 in [1,2]',
                3: 'missing(q3)'
            }
        },
        {
            'id': 4, 
            'name': 'Sub4', 
            'cases': {
                1: 'q4 in [1,2]', 
                2: 'not q4 in [1,2]',
                3: 'missing(q4)'
            }
        },
        {
            'id': 5, 
            'name': 'Sub5', 
            'cases': {
                1: 'q5 in [1,2]', 
                2: 'not q5 in [1,2]',
                3: 'missing(q5)'
            }
        }
    ]
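
The three translations above follow one pattern, which could be sketched as a single function. `build_subvariables` is a hypothetical helper, not the code in PR #293, and letting a per-category missing expression override the global one is an assumption of this sketch.

```python
def build_subvariables(categories, missing=None):
    """Expand create_categorical categories into subvariable definitions.

    Category 1 is the selected case, 2 is its negation, and 3 is added
    only when a missing expression applies to the subvariable.
    """
    subvariables = []
    for cat in categories:
        cases = {
            1: cat['case'],
            2: 'not {}'.format(cat['case']),
        }
        # Assumption: a per-category 'missing' beats the global argument.
        missing_case = cat.get('missing', missing)
        if missing_case is not None:
            cases[3] = missing_case
        subvariables.append(
            {'id': cat['id'], 'name': cat['name'], 'cases': cases})
    return subvariables


# Use case 2: one missing expression shared by every subvariable.
shared = build_subvariables(
    [{'id': 1, 'name': 'Sub1', 'case': 'q1 in [1]'},
     {'id': 2, 'name': 'Sub2', 'case': 'q1 in [2]'}],
    missing='missing(q1)',
)
for subvar in shared:
    print(subvar)
```

Calling it without missing covers use case 1, and putting a 'missing' key on each category covers use case 3.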

mathiasbc · 2018-09-17T23:33:13Z

@jamesrkg: I added the Not Selected case and made some changes: https://github.com/Crunch-io/scrunch/pull/293/files#diff-10a14081413b0535e4d0097c2ad71a58R1498. Let me know if that works better for you.

I also added the missing argument to create_categorical, which allows a generic missing_case declaration.

jamesrkg added for consideration urgent labels Aug 22, 2018

jamesrkg added this to the Wishlist milestone Aug 22, 2018

jamesrkg changed the title ~~Supporting No Data in derived for create_categorical(..., multiple=True)~~ Supporting No Data in derived multiple_response for create_categorical(..., multiple=True) Aug 22, 2018

xbito assigned mathiasbc Aug 30, 2018

mathiasbc closed this as completed Sep 21, 2018

This was referenced Sep 26, 2018

Creating derived multiple_response with control over base #228

Closed

(derived) Support "else" case for new categoricals #63

Closed
