Selecting the right features #10
@HanzhangRen Excellent, thank you!!!
Good idea. I briefly scrolled through your spreadsheet just to make sure I understood the gist of what you did, but I did not look at the details of what you chose, so that when I review the variables, it will be an "independent" review. I think it will be interesting to compare the automated feature selection to the manual feature selection to see how much our choices overlap. When we time-shift earlier data forward, my ideal would be to time-shift ALL features. However, I learned last year that some features change in the structure of their name, which makes the time-shifting more difficult (since it's harder to identify what the corresponding earlier feature was). I'm going to spend some time looking into whether we can time-shift all features forward; if it turns out to be too difficult, an alternative is to time-shift only the features that we actually plan to feed into the model, after selecting them through either manual or automated feature selection.
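A rough sketch of the renaming this would involve, assuming (as an illustration, not a confirmed fact about the codebook) that variable names embed a survey code, a two-digit year, and a wave letter (e.g. cf17j128 vs. cf20m128) with a stable item number; the wave codes below are placeholders:

```r
# Illustrative only: rename an earlier wave's columns to the current wave's
# names by swapping the embedded year + wave code. Names whose structure
# changed between waves won't match and would need a manual lookup instead.
shift_wave_names <- function(var_names, from_wave = "17j", to_wave = "20m") {
  sub(from_wave, to_wave, var_names, fixed = TRUE)
}

# shift_wave_names(c("cf17j128", "ch17j004"))
# #> "cf20m128" "ch20m004"
```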
I love that you made scales!!! And the other data pre-processing also looks great. I'll think about what other data prep we might want to do.
I agree with everything you listed here. This week, I am finally back to a normal schedule, so I'm going to start working on creating time-shifted outcome data, and then time-shifted feature data.
The following features that we are currently using don't exist in the time-shifted data. For now, I'm creating columns that are all NA for the time-shifted data and then imputing them; I'll return to handle this properly later.
- Income questions that were only asked in even-numbered years: we can adjust the code to use corresponding questions from other years for the time-shifted data.
- Religiosity question: this was only asked in 2019-2020. We can still use it; it will just have to be imputed for time-shifted cases.
- Traditional fertility variables, which were only asked in 2008, 2009, and 2010: we can't create variables like this for the time-shifted data. We can still use them; they will just have to be imputed for time-shifted cases.
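A minimal sketch of the placeholder-column approach (the feature names are illustrative, not the real variable names, and the actual code in the repo may differ):

```r
# Add all-NA placeholder columns for predictors that have no counterpart in
# the time-shifted waves, so the time-shifted rows share the same schema and
# the downstream imputation step can fill them in.
add_missing_feature_columns <- function(shifted_data, expected_features) {
  absent <- setdiff(expected_features, names(shifted_data))
  for (col in absent) {
    shifted_data[[col]] <- NA
  }
  shifted_data
}

# e.g. shifted <- add_missing_feature_columns(shifted, c("religiosity_2019", "traditional_fertility"))
```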
One benefit of the time shift is that, now that we have more data, some previously tiny categories in categorical variables have become larger, and I have added dummy versions of these categories because I think they are potentially helpful. I am using 50 as the minimum category size for including a dummy. Previously, I was only able to include employees, freelancers, students, and homemakers as occupation status categories. Now I can also include categories for people who lost their jobs but are currently job-seeking, as well as people with work disabilities.
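For reference, a hedged sketch of the threshold idea using {recipes} (object names like `train` and `new_child` are placeholders; in recent versions of recipes, a threshold of 1 or more is treated as a raw count rather than a proportion):

```r
library(recipes)

# Pool categories seen fewer than 50 times in the training set into an "other"
# level before dummy coding, so only categories above the size threshold get
# their own indicator column.
rec <- recipe(new_child ~ ., data = train) %>%
  step_other(all_nominal_predictors(), threshold = 50, other = "other_small") %>%
  step_dummy(all_nominal_predictors())
```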
I added some other features that I thought might be useful. Some of these features have values that only appear a few dozen times (or fewer) in the original data, but we get a larger sample size in the time-shifted data, which might make them worthwhile. I also added features from the partner survey (see #22). In addition, I included an indicator of whether the partner survey data is non-missing, in case that helps the model differentiate between imputed and non-imputed values. However, I didn't include an imputation indicator for other variables. We can consider adding that in the future, but I suspect it won't make a difference.
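A small sketch of the non-missingness indicator (the column names are illustrative, not the real partner-survey variable names):

```r
library(dplyr)

# Flag whether any partner-survey value is observed for a respondent, so the
# model can distinguish genuinely observed partner data from imputed values.
train <- train %>%
  mutate(partner_survey_present = as.integer(!if_all(starts_with("partner_"), is.na)))
```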
I made the following additions:
The F-1 score for the latest model version is 0.7963717. If I were to remove these additions, the F-1 score would be 0.7867618. We seem to observe an improvement of about 0.01, but luck may well be involved in all this. I also removed some variables, for at least one of the following reasons: 1) they may be redundant, and getting rid of them speeds up the code, which is getting quite slow; 2) they seem less helpful for predicting the outcome, and dropping them may allow the algorithm to focus on more informative features. Here are the features I removed (the total number of features went from 141 to 104):
If I were to revert these subtractions, the F-1 score would be 0.7958853. Having fewer variables seems to result in an improvement of 0.0005, but I think the larger lesson is that we can likely get rid of predictors without compromising performance. This raises a question: instead of removing variables by hand, can we rely on an algorithm to do feature selection, resulting in a more succinct and perhaps more powerful model? I've attached some code that I think allows us to do feature selection. I wrote a custom step function for tidymodels recipes called step_feature_selection(). Prepping the recipe means that we take the training set, fit an xgboost model, and get the top k features as defined by "Gain" in the feature importance. Baking the recipe means that we select these top k features, as calculated from the training set, for both the training and the test set. It appears that this feature selection algorithm underperforms manual feature selection. Recall that I was able to keep the F-1 score stable (going from 0.7958853 to 0.7963717) while reducing the 141 features we had to 104. Overall, automatic feature selection appears to be making things worse. There could be a bug in my code, or perhaps my understanding of what these variables mean helped with prediction; I'm not sure which is the case. Perhaps some algorithm other than xgboost is better suited for feature selection?
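Since the full step_feature_selection() code is attached separately, here is only a simplified sketch of the underlying idea, not the actual custom recipe step (`train`, `test`, `predictor_cols`, and `new_child` are placeholders):

```r
library(xgboost)

# Fit an xgboost model on the training set, rank features by the "Gain"
# importance measure, and return the names of the top k.
select_top_k_features <- function(train_x, train_y, k = 50, nrounds = 100) {
  dtrain <- xgb.DMatrix(data = as.matrix(train_x), label = train_y)
  fit <- xgb.train(
    params = list(objective = "binary:logistic"),
    data = dtrain, nrounds = nrounds, verbose = 0
  )
  imp <- xgb.importance(model = fit)   # columns: Feature, Gain, Cover, Frequency
  head(imp$Feature[order(-imp$Gain)], k)
}

# "Prep": learn the selection on the training set only
top_features <- select_top_k_features(train[predictor_cols], train$new_child, k = 50)

# "Bake": apply the same column selection to both the training and test sets
train_selected <- train[, top_features, drop = FALSE]
test_selected  <- test[,  top_features, drop = FALSE]
```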
I also made some additional changes, e.g.
I included 61 variables in the algorithm I submitted to the first leaderboard, before cleaning the variables to form 36 predictors. I chose them mostly based on intuition.
PreFer_codebook.xlsx
The Excel file contains 31,667 variables. I scrolled through all of them and selected 351 variables, marked mostly in yellow, that 1) feel possibly relevant to our outcome AND 2) represent the latest version of the relevant concept.
Then, out of the 351 variables, I further picked 61 variables that I feel would be a pity not to include in the algorithm. These variables are marked in red.
There are 4 variables, marked in green, that I realized after submitting the code probably should have been included among the 61.
There are 2 variables marked in grey that I thought would be helpful but unfortunately had zero variance in the training set.
@emilycantrell, do you think it would be a useful exercise for you to download a fresh version of the codebook, do a similar run-through of all the variables, and then compare our results?
In addition to the two of us, we could also try to get ideas about what variables to select from other sources.
Another thing is that there are probably better ways to preprocess these variables. I did the following for preprocessing:
If we have the time, we might want to explore the following: