Correct error(s) in data #14
@emilycantrell I noticed this error too, and there is already a line in the code that fixes the problem.
Amazing, thank you!!
I looked at the values of the following three questions for all years, and didn't see anything else that is concerning:
I realized there might be errors of the same kind in the test set that we can't check. So rather than specifically recoding the value 2025, maybe we should recode any value greater than, say, 2000, so that calendar-year answers are adjusted automatically. I'm not sure how likely it is that this same error occurs in the test set, but if time permits, I'll edit the code to make a more generalizable correction of this specific type of error.
I did this in a special branch that I'm working on locally, within submission.R. Later I'll merge it into whatever branch we plan to submit.
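For reference, a minimal sketch of what such a generalized recode could look like in base R. This is not the code from the branch mentioned above: the helper name `recode_year_as_offset`, the toy data frame, and the choice to subtract a survey year of 2020 are all assumptions made for illustration, with the cutoff of 2000 taken from the suggestion in this thread.

```r
# Hypothetical helper: treat answers above `cutoff` as calendar years and
# convert them to "years from the survey year".
# Assumptions: cf20m130 comes from the 2020 wave; cutoff of 2000 as suggested above.
recode_year_as_offset <- function(x, survey_year = 2020, cutoff = 2000) {
  is_calendar_year <- !is.na(x) & x > cutoff
  x[is_calendar_year] <- x[is_calendar_year] - survey_year
  x
}

# Toy data for illustration only; not taken from the real survey file.
toy <- data.frame(cf20m130 = c(1, 3, NA, 2025, 10))
toy$cf20m130 <- recode_year_as_offset(toy$cf20m130)
print(toy$cf20m130)
```

Because the rule keys on the size of the value rather than on the literal 2025, it would also catch answers such as 2023 or 2030 if they appear in the test set, while leaving missing values and ordinary answers unchanged.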
In a very important feature, cf20m130 (Within how many years do you hope to have your [first/next] child?), there is a response that is almost certainly an error: someone answered "2025". I take this to mean that they expect to have a child by 2025, i.e., their entry should be "5". I debated whether "2025" means they expect a child before 1/1/2025 or by 12/31/2025, since that determines whether we should recode the response as 4 or 5. My instinct is to assume they mean by 12/31/2025, so let's make it "5". In 2019, the same person answered "3" to this question, which is extra confirmation that "5" is a reasonable value; it just seems that their timeline was pushed back a couple of years, perhaps because of the pandemic or perhaps because that's just how life goes.
I think we should manually recode this value as "5" by just putting a line for it into the code. @HanzhangRen is submission.R the most appropriate file in which to make this edit? I can add the code for this edit there or wherever you recommend.
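As a rough sketch of the kind of one-line manual recode proposed here, in base R. The data frame name `train` and its toy contents are assumptions for illustration; this is not necessarily the line that exists in the project code.

```r
# Toy stand-in for the training data; the real column lives in the survey file.
train <- data.frame(cf20m130 = c(2, 2025, NA, 4))

# Recode the single implausible answer "2025" to 5 (child expected by end of 2025).
# which() drops NA comparisons, so missing responses are left untouched.
train$cf20m130[which(train$cf20m130 == 2025)] <- 5
```

Wrapping the comparison in `which()` avoids the error R raises when a logical index containing `NA` is used in a subassignment.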
Note: if we ever do automated feature selection, this change should happen BEFORE automated feature selection, as the "2025" value makes the linear correlation between the outcome and feature very weak.
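To illustrate why the order matters, here is a small sketch with made-up numbers showing how one calendar-year entry can distort a linear correlation. The vectors `years_to_child` and `had_child` are synthetic and are not drawn from the survey data.

```r
# Synthetic values for illustration only; not real survey responses.
years_to_child <- c(1, 2, 2, 3, 4, 5, 5, 6, 8, 10)
had_child      <- c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0)

# Same data, but one respondent's answer of 5 is recorded as the year 2025.
with_error <- years_to_child
with_error[7] <- 2025

cor_clean <- cor(years_to_child, had_child)  # correlation without the error
cor_error <- cor(with_error, had_child)      # correlation dominated by the outlier

print(c(clean = cor_clean, with_error = cor_error))
```

In this toy example the single extreme value dominates the variance of the feature, so the correlation mostly reflects that one respondent's outcome rather than the overall relationship. A correlation-based feature filter run before the recode could therefore drop a feature that is actually informative.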
If we end up having a lot of edits to the data like this, then we can consider setting up a more systematic way to make the edits, like we did for Million Monkeys.
Given the quick turnaround on this project, I don't think we should spend much time actively looking for errors like this in the data. However, I do want to check on other variables from the set of questions about expectations for having children, to make sure everything in those key features is in a reasonable range.
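A quick range check along these lines could be sketched as follows. The helper `find_out_of_range`, the bounds of 0 to 50 years, and the second variable name `cf19l130` (assumed here to be the 2019 counterpart of cf20m130) are illustrative assumptions, not names confirmed by the project.

```r
# Hypothetical helper: for each variable, list the distinct values that fall
# outside a plausible range. The default bounds are an assumption.
find_out_of_range <- function(data, vars, lower = 0, upper = 50) {
  flagged <- lapply(vars, function(v) {
    x <- data[[v]]
    sort(unique(x[!is.na(x) & (x < lower | x > upper)]))
  })
  names(flagged) <- vars
  flagged
}

# Toy data for illustration; the variable names are assumed, see the note above.
toy <- data.frame(cf20m130 = c(1, 2025, NA, 4), cf19l130 = c(3, 2, 5, NA))
out_of_range <- find_out_of_range(toy, c("cf20m130", "cf19l130"))
print(out_of_range)
```

An empty vector for a variable means every non-missing value fell inside the bounds, so only the variables with flagged values need a manual look.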
After I check the other related variables to make sure they are in range, I'll also email Lisa and Gert about this to see if they want to fix "2025" for all participants, or if that's something they want to treat as part of the challenge.