Supporting No Data in derived multiple_response for create_categorical(..., multiple=True) #286

Closed
jamesrkg opened this issue Aug 22, 2018 · 10 comments

@jamesrkg

jamesrkg commented Aug 22, 2018

Following lots of conversations about this (see #228 and #196) it looks like the following specific changes are needed to support the creation of derived multiple_response variables where the expected base is less than the total number of rows in the dataset.

The following two changes are required:

Firstly, the default categories given to a new multiple_response using this method should be:

CategoryList(
    [
        (1, Category(numeric_value=None, selected=True, id=1, missing=False, name=Selected)), 
        (2, Category(numeric_value=None, selected=False, id=2, missing=False, name=Not selected)),
        (-1, Category(numeric_value=None, selected=False, id=-1, missing=True, name=No Data))
    ]
) 

Secondly, add a new optional key to the objects in the list given to categories (named "base" in the example below): an expression describing which cases are allowed to have something other than No Data.

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'gen == 1 and age == 1', 'base': 'region == 1'},
        {'id': 2, 'name': 'Sub2', 'case': 'gen == 1 and age == 2', 'base': 'region == 1'},
        {'id': 3, 'name': 'Sub3', 'case': 'gen == 2 and age == 1', 'base': 'region == 1'},
        {'id': 4, 'name': 'Sub4', 'case': 'gen == 2 and age == 2', 'base': 'region == 1'}
    ], 
    multiple=True
)

These expressions would be evaluated as:

  1. case is True and base is True: 1
  2. case is False and base is True: 2
  3. case is True and base is False: -1
  4. case is False and base is False: -1
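The four rules above reduce to checking base first and case second. A minimal sketch in plain Python, assuming both expressions have already been evaluated to booleans for a given row; the function name `resolve_category` is hypothetical and not part of scrunch:

```python
def resolve_category(case, base):
    """Return the category id for one row of one subvariable.

    Ids follow the default categories proposed above:
    1 = Selected, 2 = Not selected, -1 = No Data.
    Hypothetical helper, for illustration only.
    """
    if not base:
        # Rows outside the base are No Data, whatever `case` says.
        return -1
    return 1 if case else 2
```

Only rows inside the base can ever be counted as Selected or Not selected, which is what keeps the base below the total number of rows.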

A possible extension of the above: where the base expression is the same for every subvariable, add support for a new argument, base, that is automatically applied to each:

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'gen == 1 and age == 1'},
        {'id': 2, 'name': 'Sub2', 'case': 'gen == 1 and age == 2'},
        {'id': 3, 'name': 'Sub3', 'case': 'gen == 2 and age == 1'},
        {'id': 4, 'name': 'Sub4', 'case': 'gen == 2 and age == 2'}
    ], 
    multiple=True,
    base='region == 1'
)
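One way the shared base argument could be expanded into the per-category form shown earlier. This is only a sketch: `apply_global_base` is a hypothetical name, and letting a per-category 'base' override the shared one is an assumption rather than something decided in this issue.

```python
def apply_global_base(categories, base=None):
    """Return copies of the category dicts with a 'base' key filled in.

    Hypothetical helper. Assumption: a 'base' given on an individual
    category takes precedence over the shared `base` argument.
    """
    if base is None:
        # Nothing to apply; still return copies so the input is untouched.
        return [dict(cat) for cat in categories]
    return [dict(cat, base=cat.get('base', base)) for cat in categories]


categories = [
    {'id': 1, 'name': 'Sub1', 'case': 'gen == 1 and age == 1'},
    {'id': 2, 'name': 'Sub2', 'case': 'gen == 1 and age == 2'},
]
expanded = apply_global_base(categories, base='region == 1')
```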
@jamesrkg jamesrkg added this to the Wishlist milestone Aug 22, 2018
@jamesrkg jamesrkg changed the title Supporting No Data in derived for create_categorical(..., multiple=True) Supporting No Data in derived multiple_response for create_categorical(..., multiple=True) Aug 22, 2018
@jjdelc
Contributor

jjdelc commented Aug 30, 2018

Here's a test using pycrunch (the pycrunch_dataset is a helper that imports the CSV to the dataset).

This example creates a multiple response variable using 3 categories: Yes/No/Missing. Note that all subvariables need to specify the same categories (but each could have different conditions).

CSV_CATS = '''Gender,Age
M,12
M,13
F,14
M,15
F,50
F,32
M,44'''


    def test_case_variable(self):
        ds = self.pycrunch_dataset(csv=CSV_CATS)
        varcat = ds.variables
        gender = varcat.by('alias')['Gender']
        age = varcat.by('alias')['Age']

        array_expr = {
            'function': 'array',
            'args': [{
                'function': 'select',
                'args': [{
                    'map': {
                        '01': {
                            "function": "case",
                            "args": [{
                                "column": [1, 2, 3],
                                "type": {
                                    "value": {
                                        "class": "categorical",
                                        "categories": [
                                            {"id": 1, "name": "Yes", "missing": False, 'selected': True},
                                            {"id": 2, "name": "No", "missing": False},
                                            {"id": 3, "name": "Missing", "missing": True}
                                        ],
                                        "ordinal": False
                                    }
                                }
                            }, {
                                "function": "<=",
                                "args": [
                                    {"variable": age.entity_url},
                                    {"value": 20}
                                ]
                            }, {
                                "function": "between",
                                "args": [
                                    {"variable": age.entity_url},
                                    {"value": 20},
                                    {"value": 40}
                                ]
                            }, {
                                "function": ">",
                                "args": [
                                    {"variable": age.entity_url},
                                    {"value": 40}
                                ]
                            }],
                            'references': {
                                'name': 'subvar1',
                                'alias': 'subvar1',
                            }
                        },
                        '02': {
                                "function": "case",
                                "args": [{
                                    "column": [1, 2, 3],
                                    "type": {
                                        "value": {
                                            "class": "categorical",
                                            "categories": [
                                                {"id": 1, "name": "Yes", "missing": False, 'selected': True},
                                                {"id": 2, "name": "No", "missing": False},
                                                {"id": 3, "name": "Missing", "missing": True}
                                            ],
                                            "ordinal": False
                                        }
                                    }
                                }, {
                                    "function": "in",
                                    "args": [
                                        {"variable": gender.entity_url},
                                        {"value": [1]}
                                    ]
                                }, {
                                    "function": "in",
                                    "args": [
                                        {"variable": gender.entity_url},
                                        {"value": []}
                                    ]
                                }, {
                                    "function": "in",
                                    "args": [
                                        {"variable": gender.entity_url},
                                        {"value": [2]}
                                    ]
                                }],
                                'references': {
                                    'name': 'subvar2',
                                    'alias': 'subvar2',
                                }
                            }
                    }
                }]
            }]
        }
        arrayvar = ds.variables.create(as_entity({
            'name': 'casevar',
            'derivation': array_expr
        })).refresh()
        self.assertTrue(arrayvar.body.derived)
        self.assertEqual(arrayvar.body.type, 'multiple_response')
        self.assertEqual(arrayvar.body.categories, [
            {'id': 1, 'missing': False, 'name': 'Yes', 'selected': True},
            {'id': 2, 'missing': False, 'name': 'No'},
            {'id': 3, 'missing': True, 'name': 'Missing'}
        ])
        data = ds.follow('table', 'limit=10').data
        self.assertEqual(data[arrayvar.body.id], [
            [1, 1],
            [1, 1],
            [{u'?': 3}, 1],
            [1, 1],
            [{u'?': 3}, {u'?': 3}],
            [{u'?': 3}, 2],
            [1, {u'?': 3}]
        ])

@xbito
Contributor

xbito commented Aug 30, 2018

@mathiasbc following Jj payload example, my intention is to do this in 2 steps:

First, we need a new method to build this kind of derived multiple response. It should accept any number of categories (exactly one of them marked as selected), allow some categories to be marked as missing, and take, for each subvariable, the expressions needed to generate it.

Second, once that new helper is in place, modify create_categorical to enable the common use case of generating 3 categories: 1 Selected, 2 Not Selected, 3 Missing. It should be possible to specify the missing case per subvariable or globally for the entire variable (that's the parameter that Jamie has named base; I would prefer something more like missing_case, though I'm open to suggestions).

@mathiasbc
Contributor

mathiasbc commented Sep 4, 2018

The first step will be to add a flexible method that allows deriving multiple responses in this proposed format:

desireable_kwargs = {
    'name': 'derived1',
    'alias': 'derived1',
    'description': 'Multiple response derived',
    # categories must have one and only 1 as selected=True
    'categories': [
        {'id': 1, 'name': 'Yes', 'missing': False, 'selected': True},
        {'id': 2, 'name': 'No', 'missing': False},
        {'id': 3, 'name': 'Maybe', 'missing': False},
        {'id': 4, 'name': 'Missing', 'missing': True}
    ],
    'responses': [
        {
            'name': 'Subvar 1',
            'id': 1,
            'cases': ['var_1 < 20', 'var_1 == 20', 'var_1 == 30', 'var_1 > 30']
        },
        {
            'name': 'Subvar 2',
            'id': 2,
            'cases': ['var_2 < 2', 'var_2 == 2', 'var_2 == 3', 'var_2 > 3']
        },
        {
            'name': 'Subvar 3',
            'id': 3,
            'cases': ['var_3 in [1]', 'var_3 in [2]', 'var_3 in [3]', 'var_3 in [4]']
        }
        # ... Define as many subvariables as needed
    ]
}

ds.derive_multiple_response(**desireable_kwargs)

The cases argument of each entry in responses is a list of expressions that must have the same length as categories. This way each expression maps to a category by position.
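The positional mapping could be checked and made explicit along these lines. This is a sketch under assumptions, not the code in the PR; `map_cases_to_categories` is a hypothetical name.

```python
def map_cases_to_categories(categories, cases):
    """Pair each case expression with the category id at the same position.

    Hypothetical helper: raises if the two lists differ in length,
    since the mapping is purely positional.
    """
    if len(cases) != len(categories):
        raise ValueError(
            'expected %d cases, got %d' % (len(categories), len(cases))
        )
    return {cat['id']: expr for cat, expr in zip(categories, cases)}


categories = [
    {'id': 1, 'name': 'Yes', 'missing': False, 'selected': True},
    {'id': 2, 'name': 'No', 'missing': False},
    {'id': 3, 'name': 'Maybe', 'missing': False},
    {'id': 4, 'name': 'Missing', 'missing': True},
]
cases = ['var_1 < 20', 'var_1 == 20', 'var_1 == 30', 'var_1 > 30']
mapping = map_cases_to_categories(categories, cases)
```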

Let me know if I'm on the right track or if you have a better approach.

@jamesrkg
Author

jamesrkg commented Sep 4, 2018

Some thoughts/suggestions:

  1. Make sure notes is supported as well.
  2. In specifying the categories, if we can assume 'selected': False unless stated otherwise, can we do the same for 'missing': False? This will make it somewhat easier to read.
  3. Rename responses to subvariables.
  4. In specifying the subvariables, remove id (since choosing a sub/variable id is not supported by the API) and explicitly choose the alias instead (where a convention for X_# can take over).
  5. Make specifying the cases more explicit by giving a dict mapping category id to case. Even though this is more verbose it will be much more manageable especially if there are a lot of categories to keep track of.

Given the above I've adapted the example above to:

desireable_kwargs = {
    'name': 'derived1',
    'alias': 'derived1',
    'description': 'Multiple response derived',
    'notes': 'Special variable',
    # categories must have one and only 1 as selected=True
    'categories': [
        {'id': 1, 'name': 'Yes', 'selected': True},
        {'id': 2, 'name': 'No'},
        {'id': 3, 'name': 'Maybe'},
        {'id': 4, 'name': 'Missing', 'missing': True}
    ],
    'subvariables': [
        {
            'alias': 'Subvar_1',
            'name': 'Subvar 1',
            'cases': {
                1: 'var_1 < 20',
                2: 'var_1 == 20',
                3: 'var_1 == 30',
                4: 'var_1 > 30'
            }
        },
        {
            'alias': 'Subvar_2',
            'name': 'Subvar 2',
            'cases': {
                1: 'var_2 < 2',
                2: 'var_2 == 2',
                3: 'var_2 == 3',
                4: 'var_2 > 3'
            }
        },
        {
            'alias': 'Subvar_3',
            'name': 'Subvar 3',
            'cases': {
                1: 'var_3 in [1]',
                2: 'var_3 in [2]',
                3: 'var_3 in [3]',
                4: 'var_3 in [4]'
            }
        }
        # ... Define as many subvariables as needed
    ]
}
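The defaults from point 2 and the "one and only 1 as selected" rule in the comment above could be enforced with something like the following. A rough sketch only; `normalize_categories` is a hypothetical name and not an existing scrunch function.

```python
def normalize_categories(categories):
    """Fill in 'selected' and 'missing' defaults and validate the result.

    Hypothetical helper. Both flags default to False, and exactly one
    category must end up with selected=True.
    """
    normalized = []
    for cat in categories:
        full = {'selected': False, 'missing': False}
        full.update(cat)  # explicit values override the defaults
        normalized.append(full)

    selected_ids = [cat['id'] for cat in normalized if cat['selected']]
    if len(selected_ids) != 1:
        raise ValueError(
            'exactly one category must be selected, got %d' % len(selected_ids)
        )
    return normalized


normalized = normalize_categories([
    {'id': 1, 'name': 'Yes', 'selected': True},
    {'id': 2, 'name': 'No'},
    {'id': 3, 'name': 'Maybe'},
    {'id': 4, 'name': 'Missing', 'missing': True},
])
```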

@xbito
Contributor

xbito commented Sep 4, 2018

On the list that Jamie made:

  1. Agreed on supporting notes
  2. selected=False and missing=False should be default values
  3. I like calling it subvariables, it matches the api that way
  4. On the alias front, I'm a bit torn. In other parts of scrunch, when we deal with subvariables, we treat what the user provides as a suffix and use the variable name as a prefix, right? The same goes for Gryphon variables. Passing an "alias" may confuse people by breaking their expectations (either that the exact alias is respected, or that the variable name is used as a prefix). So, should we make alias optional, and support id and translate that into the suffix of var_name+suffix? Or shall we use something other than id or alias, like subvar_id?
  5. Agreed on the dict for the case statements (though in the example Jamie has provided a list? It may even be in a format that is not Python compatible). Anyway, I think you understand what he meant.

@jamesrkg
Author

jamesrkg commented Sep 4, 2018

Thanks @xbito.

  4. I'm fine with this if you like. I was wondering whether you wanted this particular method to be truer to the underlying API instead. If you're after consistency then by all means go back to having an id for subvariables.
  5. I've fixed the braces used for mapping the cases; that was my typo.

@mathiasbc
Contributor

mathiasbc commented Sep 4, 2018

@jamesrkg I added a Pull Request with the code: #290

Please pull that branch and test that you get what you are expecting so I can write some tests and have it ready to merge.

An example:

desireable_kwargs = {
    'name': 'derived1',
    'alias': 'derived1',
    'description': 'Multiple response derived',
    'notes': 'Special variable',
    'categories': [
        {'id': 1, 'name': 'Yes', 'selected': True},
        {'id': 2, 'name': 'No'},
        {'id': 3, 'name': 'Missing', 'missing': True}
    ],
    'subvariables': [
        {
            'id': 1,
            'name': 'Subvar 1',
            'cases': {
                1: 'Q3bp2_14 in [1]', 
                2: 'Q3bp2_14 in [2]', 
                3: 'Q3bp2_14 in [3]', 
            }
        },
        {
            'id': 2,
            'name': 'Subvar 2',
            'cases': {
                1: 'Q5a1 in [1]',
                2: 'Q5a1 in [2]',
                3: 'Q5a1 in [3]', 
            }
        },
        {
            'id': 3,
            'name': 'Subvar 3',
            'cases': {
                1: 'Q3bp1_21 in [1]', 
                2: 'Q3bp1_21 in [2]', 
                3: 'Q3bp1_21 in [3]', 
            }
        }
    ]
}

ds.derive_multiple_response(**desireable_kwargs)

I noticed that the derived variable it creates also gets a -1: No Data category, which I'm not sure is expected.

I replaced alias with id in subvariables to match the other similar methods. The alias will be constructed from the variable alias plus the subvariable id.
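For illustration, the alias construction might look like this. The underscore separator is an assumption based on the X_# convention mentioned earlier in this thread, and `subvariable_alias` is a hypothetical name; the actual format used in the PR may differ.

```python
def subvariable_alias(variable_alias, subvariable_id):
    """Build a subvariable alias from the parent alias and the subvariable id.

    Hypothetical helper; the '_' separator is assumed, not confirmed.
    """
    return '%s_%d' % (variable_alias, subvariable_id)


aliases = [subvariable_alias('derived1', sv_id) for sv_id in (1, 2, 3)]
```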

@mathiasbc
Contributor

There is a new PR that integrates the derive_multiple_response method with create_categorical: #293.

@jamesrkg
Author

jamesrkg commented Sep 17, 2018

I can't quite get this to produce the result I'm after.

One problem I think I have already commented on here:

https://github.com/Crunch-io/scrunch/pull/293/files#r218187826

But I'll also clarify the requirements because at the top of this ticket I proposed a base argument (describing included cases) whereas the more appropriate solution would be to give a missing argument instead (the PR is already doing that).

Three use cases need to be catered for:

  1. User only wants to give the selected cases, there are no missing cases (this is how it worked originally).
  2. User wants to give the selected cases but the same missing case applies for all subvariables and should only be given once.
  3. User needs to give explicit selected and missing cases for each subvariable separately.

1.

User only wants to give the selected cases, there are no missing cases.

test_m = ds.create_categorical(
    alias='drinks',
    name='Preferred drinks',
    description='Which drinks do you prefer?',
    multiple=True,
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'q1 in [1]'},
        {'id': 2, 'name': 'Sub2', 'case': 'q1 in [2]'},
        {'id': 3, 'name': 'Sub3', 'case': 'q1 in [95]'},
        {'id': 4, 'name': 'Sub4', 'case': 'q1 in [99]'},
        {'id': 5, 'name': 'Sub5', 'case': 'q1 in [1,2]'}
    ]
)

Which should yield the following subvariables for derive_multiple_response:

    subvariables=[
        {
            'id': 1, 
            'name': 'Sub1', 
            'cases': {
                1: 'q1 in [1]', 
                2: 'not q1 in [1]'
            }
        },
        {
            'id': 2, 
            'name': 'Sub2', 
            'cases': {
                1: 'q1 in [2]', 
                2: 'not q1 in [2]'
            }
        },
        {
            'id': 3, 
            'name': 'Sub3', 
            'cases': {
                1: 'q1 in [95]', 
                2: 'not q1 in [95]'
            }
        },
        {
            'id': 4, 
            'name': 'Sub4', 
            'cases': {
                1: 'q1 in [99]', 
                2: 'not q1 in [99]'
            }
        },
        {
            'id': 5, 
            'name': 'Sub5', 
            'cases': {
                1: 'q1 in [1,2]', 
                2: 'not q1 in [1,2]'
            }
        }
    ]

2.

User wants to give the selected cases but the same missing case applies for all subvariables and should only be given once.

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    multiple=True,
    missing='missing(q1)',
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'q1 in [1]'},
        {'id': 2, 'name': 'Sub2', 'case': 'q1 in [2]'},
        {'id': 3, 'name': 'Sub3', 'case': 'q1 in [95]'},
        {'id': 4, 'name': 'Sub4', 'case': 'q1 in [99]'},
        {'id': 5, 'name': 'Sub5', 'case': 'q1 in [1,2]'}
    ]
)

Which should yield the following subvariables for derive_multiple_response:

    subvariables=[
        {
            'id': 1, 
            'name': 'Sub1', 
            'cases': {
                1: 'q1 in [1]', 
                2: 'not q1 in [1]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 2, 
            'name': 'Sub2', 
            'cases': {
                1: 'q1 in [2]', 
                2: 'not q1 in [2]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 3, 
            'name': 'Sub3', 
            'cases': {
                1: 'q1 in [95]', 
                2: 'not q1 in [95]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 4, 
            'name': 'Sub4', 
            'cases': {
                1: 'q1 in [99]', 
                2: 'not q1 in [99]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 5, 
            'name': 'Sub5', 
            'cases': {
                1: 'q1 in [1,2]', 
                2: 'not q1 in [1,2]',
                3: 'missing(q1)'
            }
        }
    ]

3.

User needs to give explicit selected and missing cases for each subvariable separately.

test_m = ds.create_categorical(
    alias='test_multi', 
    name='test-multi',
    multiple=True,
    categories=[
        {'id': 1, 'name': 'Sub1', 'case': 'q1 in [1,2]', 'missing': 'missing(q1)'},
        {'id': 2, 'name': 'Sub2', 'case': 'q2 in [1,2]', 'missing': 'missing(q2)'},
        {'id': 3, 'name': 'Sub3', 'case': 'q3 in [1,2]', 'missing': 'missing(q3)'},
        {'id': 4, 'name': 'Sub4', 'case': 'q4 in [1,2]', 'missing': 'missing(q4)'},
        {'id': 5, 'name': 'Sub5', 'case': 'q5 in [1,2]', 'missing': 'missing(q5)'}
    ]
)

Which should yield the following subvariables for derive_multiple_response:

    subvariables=[
        {
            'id': 1, 
            'name': 'Sub1', 
            'cases': {
                1: 'q1 in [1,2]', 
                2: 'not q1 in [1,2]',
                3: 'missing(q1)'
            }
        },
        {
            'id': 2, 
            'name': 'Sub2', 
            'cases': {
                1: 'q2 in [1,2]', 
                2: 'not q2 in [1,2]',
                3: 'missing(q2)'
            }
        },
        {
            'id': 3, 
            'name': 'Sub3', 
            'cases': {
                1: 'q3 in [1,2]', 
                2: 'not q3 in [1,2]',
                3: 'missing(q3)'
            }
        },
        {
            'id': 4, 
            'name': 'Sub4', 
            'cases': {
                1: 'q4 in [1,2]', 
                2: 'not q4 in [1,2]',
                3: 'missing(q4)'
            }
        },
        {
            'id': 5, 
            'name': 'Sub5', 
            'cases': {
                1: 'q5 in [1,2]', 
                2: 'not q5 in [1,2]',
                3: 'missing(q5)'
            }
        }
    ]
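All three use cases follow the same translation, so they could share one routine. The sketch below is an illustration of the requirement, not the code in #293; `build_subvariables` is a hypothetical name, and it assumes category ids 1 (selected), 2 (not selected) and 3 (missing) as in the examples above.

```python
def build_subvariables(categories, missing=None):
    """Translate create_categorical categories into subvariable definitions.

    Hypothetical helper. A per-category 'missing' expression wins over
    the shared `missing` argument; with neither, no missing case is added.
    """
    subvariables = []
    for cat in categories:
        cases = {
            1: cat['case'],               # selected
            2: 'not %s' % cat['case'],    # not selected
        }
        missing_case = cat.get('missing', missing)
        if missing_case is not None:
            cases[3] = missing_case       # missing
        subvariables.append({
            'id': cat['id'],
            'name': cat['name'],
            'cases': cases,
        })
    return subvariables


# Use case 2: one shared missing expression for every subvariable.
shared = build_subvariables(
    [{'id': 1, 'name': 'Sub1', 'case': 'q1 in [1]'}],
    missing='missing(q1)',
)
```

Prefixing the case with not mirrors the strings in the examples above. A compound case (for instance one joined with and) would need explicit grouping before being negated, which this sketch does not attempt.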

@mathiasbc
Contributor

@jamesrkg: I added the Not Selected case and made some changes: https://github.com/Crunch-io/scrunch/pull/293/files#diff-10a14081413b0535e4d0097c2ad71a58R1498. Let me know if that works better for you.

I also added the missing argument to create_categorical, which allows a single generic missing case to be declared for all subvariables.
