Selecting the right features #10

Open
HanzhangRen opened this issue Apr 22, 2024 · 6 comments

@HanzhangRen
Collaborator

HanzhangRen commented Apr 22, 2024

I included 61 variables in the algorithm I submitted to the first leaderboard, before cleaning them to form 36 predictors. I chose them mostly based on intuition.

PreFer_codebook.xlsx

The Excel file contains 31,667 variables. I scrolled through all of them and selected 351 variables, marked mostly in yellow, that 1) seem possibly relevant to our outcome AND 2) represent the latest version of the relevant concept.

Then, out of the 351, I further picked 61 variables that I felt would be a pity not to include in the algorithm. These are marked in red.

There are 4 variables, marked in green, that I realized after submitting the code probably should have been among the 61.

There are 2 variables, marked in grey, that I thought would be helpful but unfortunately had zero variance in the training set.

@emilycantrell, do you think it would be a useful exercise for you to download a fresh version of the codebook, do a similar run-through of all the variables, and then compare our results?
In addition to the two of us, we could also try to get ideas about what variables to select from other sources:

  1. Let's run a feature selection algorithm on all features to see which ones stand out.
  2. We can ask some professors in our departments about what they would look for if they were to do the prediction challenge.
  3. Ask ChatGPT :)

Another thing is that there are probably better ways to preprocess these variables. Here is what I did for preprocessing (a code sketch follows the list):

  1. Some basic logical imputation (e.g., one cannot be married to a partner if one does not have a partner)
  2. Dummy-encoded categorical variables
  3. Did some scale calculations for variables that are grouped together on the LISS website (see an example of a group of variables here)
  4. Mean-imputed all the remaining missingness
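A minimal sketch of these four steps with dplyr and tidymodels recipes; the column names (partner_present, married_with_partner, the item_* scale items) are hypothetical stand-ins for the real codebook variables:

library(dplyr)
library(recipes)

train_clean <- train %>%
  # 1. Logical imputation: no partner implies not married to a partner
  mutate(married_with_partner = if_else(partner_present == 0, 0, married_with_partner)) %>%
  # 3. Scale calculation: average a group of related LISS items into one score
  mutate(scale_score = rowMeans(across(starts_with("item_")), na.rm = TRUE))

rec <- recipe(outcome ~ ., data = train_clean) %>%
  step_dummy(all_nominal_predictors()) %>%    # 2. dummy-encode categorical variables
  step_impute_mean(all_numeric_predictors())  # 4. mean-impute remaining missingness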

If we have the time, we might want to explore the following:

  1. Can we reduce missingness by combining past and present versions of the same construct?
  2. Combine very small categories within categorical variables.
  3. Check Cronbach's alpha to see if the groups of variables that I currently treat as scales do indeed make sense as scales (see the sketch after this list).
  4. Do something other than mean imputation.
  5. Decide on how much missingness is too much for us to include a variable as a predictor.
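For item 3, a quick Cronbach's alpha check with the psych package (the scale item names are placeholders for one candidate group of variables):

library(psych)
scale_items <- train[, c("item_1", "item_2", "item_3")]  # one candidate scale
alpha_result <- psych::alpha(scale_items)
alpha_result$total$raw_alpha  # common rule of thumb: >= 0.7 suggests a coherent scale
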
@emilycantrell
Collaborator

@HanzhangRen Excellent, thank you!!!

@emilycantrell, do you think it would be a useful exercise for you to download a fresh version of the codebook, do a similar run-through of all the variables, and then compare our results?

Good idea. I briefly scrolled through your spreadsheet just to make sure I understood the gist of what you did, but I did not look at the details of what you chose, so that when I review the variables, it will be an "independent" review.

I think it will be interesting to compare the automated feature selection to the manual feature selection to see how much our choices overlap.

When we time-shift earlier data forward, my ideal would be to time-shift ALL features. However, I learned last year that some features change in the structure of their name, which makes the time-shifting more difficult (since it's harder to identify what the corresponding earlier feature was). I'm going to spend some time looking into whether we can time-shift all features forward, but if it turns out to be too difficult, an alternative is to time-shift only the features that we actually plan to feed into the model, after selecting them through either manual or automated feature selection.
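For the easy cases, the renaming can be mechanical. A sketch under the simplest assumption: names follow the pattern <survey><yy><wave letter><item> (e.g. cf17j128 -> cf20m128) and the survey was fielded every year, so the wave letter advances one letter per year. Surveys that skipped years or changed their name structure would still need a manual mapping.

shift_name <- function(name, years = 3) {
  m <- regmatches(name, regexec("^([a-z]{2})(\\d{2})([a-z])(\\d+)$", name))[[1]]
  if (length(m) == 0) return(NA_character_)  # name structure doesn't match the pattern
  new_year <- as.integer(m[3]) + years
  new_wave <- letters[match(m[4], letters) + years]  # assumes one wave per year
  sprintf("%s%02d%s%s", m[2], new_year, new_wave, m[5])
}
shift_name("cf17j128")  # "cf20m128"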

Did some scale calculations for variables that are grouped together on the LISS website (see an example of a group of variables here)

I love that you made scales!!! And the other data pre-processing also looks great. I'll think about what other data prep we might want to do.

If we have the time, we might want to explore the following

I agree with everything you listed here.

This week, I am finally back to a normal schedule, so I'm going to start working on creating time-shifted outcome data, and then time-shifted feature data.

@emilycantrell
Collaborator

The following features that we are currently using don't exist in the time-shifted data. For now, I'm creating columns that are all NA for the time-shifted data and then imputing them; I'll return to handle this properly later.

Income questions that were only asked in even-numbered years

We can adjust the code to use corresponding questions from other years for the time-shifted data.
ca20g012
ca20g013
ca20g078

Religiosity question

This was only asked in 2019-2020. We can still use this; it will just have to be imputed for time-shifted cases.
cr20m162

Traditional fertility variables, which were only asked in 2008, 2009, and 2010.

We can't create variables like this for the time-shifted data. We can still use these; they will just have to be imputed for time-shifted cases.
cv10c135
cv10c136
cv10c137
cv10c138
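A sketch of this stop-gap, with shifted_data standing in for the time-shifted data frame:

# Add all-NA placeholder columns for features with no time-shifted counterpart,
# so the time-shifted data has the same column set before imputation
missing_features <- c("ca20g012", "ca20g013", "ca20g078",              # income (even years)
                      "cr20m162",                                      # religiosity
                      "cv10c135", "cv10c136", "cv10c137", "cv10c138")  # traditional fertility
for (f in setdiff(missing_features, names(shifted_data))) {
  shifted_data[[f]] <- NA_real_
}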

@HanzhangRen
Collaborator Author

One benefit of the time shift is that, now that we have more data, some previously tiny categories in categorical variables have become larger, and I have added dummy versions of these categories because I think they are potentially helpful. I am using 50 as the minimum category size for including a dummy.

Previously, I was only able to include employees, freelancers, students, and homemakers as occupation status categories. Now I can also include categories for people who lost their jobs but are currently seeking work, as well as people with work disabilities.
Previously, I was only able to distinguish between people with Western and non-Western backgrounds. Now, among people with non-Western backgrounds, I can distinguish between first- and second-generation immigrants. In addition, I noticed a bug that prevented this immigration-background variable from being used at all in our last submission, and I fixed it.
Previously, I had to merge the lowest 4 categories of education into 1. Now I only need to merge them into 3.
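In recipes terms, the threshold amounts to pooling rare levels before dummy encoding; a sketch, assuming step_other() reads a threshold >= 1 as a minimum count:

library(recipes)

rec <- recipe(outcome ~ ., data = train) %>%
  step_other(all_nominal_predictors(), threshold = 50, other = "rare_category") %>%
  step_dummy(all_nominal_predictors())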

@emilycantrell
Collaborator

emilycantrell commented Jun 3, 2024

I added some other features that I thought might be useful:
# Gender of first, second, and third child
"cf20m068", "cf20m069", "cf20m070",
# Type of parent to first, second, and third child (bio, step, adoptive, foster)
"cf20m098", "cf20m099", "cf20m100",
# Current partner is biological parent of first, second, third child
"cf20m113", "cf20m114", "cf20m115",
# Satisfaction with relationship
"cf19l180", "cf20m180",
# Satisfaction with family life
"cf19l181", "cf20m181"

Some of these features have values that appear only a few dozen times (or fewer) in the original data, but we get a larger sample size in the time-shifted data, which might make them worthwhile.

I added these features from the partner survey (see #22):
# Partner survey: fertility expectations in 2020
"cf20m128_PartnerSurvey", "cf20m129_PartnerSurvey", "cf20m130_PartnerSurvey",
# Partner survey: fertility expectations in 2019
"cf19l128_PartnerSurvey", "cf19l129_PartnerSurvey", "cf19l130_PartnerSurvey",
# Partner survey: whether ever had kids
"cf19l454_PartnerSurvey", "cf20m454_PartnerSurvey",
# Partner survey: Number of kids reported in 2019 and 2020
"cf19l455_PartnerSurvey", "cf20m455_PartnerSurvey"

I also included an indicator of whether the partner survey data is non-missing, in case that helps the model differentiate between imputed and non-imputed values. However, I didn't include an imputation indicator for other variables. We could add that in the future, but I suspect it won't make a difference.
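The indicator itself is cheap to compute from the _PartnerSurvey suffix; a sketch, with data standing in for our merged data frame before imputation:

partner_cols <- grep("_PartnerSurvey$", names(data), value = TRUE)
# 1 if the case has any non-missing partner-survey value, 0 otherwise
data$partner_survey_observed <- as.integer(rowSums(!is.na(data[partner_cols])) > 0)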

emilycantrell added a commit that referenced this issue Jun 3, 2024
@HanzhangRen
Collaborator Author

I made the following additions:

  1. I constructed variables for when the respondent's most recent child was born. I included this for the respondents themselves in 2018, 2019, and 2020. I also added this for the partners in 2019 and 2020.
  2. I added a variable about whether the partner has visited a gynecologist.
  3. I augmented the variable about partner birthyear, as reported by the respondent (cf20m026), with background data from the partners themselves.
  4. I added variables about personal income for both the respondent and the partner. If we just include information about household income, we may not be able to capture women's personal economic calculations with regard to foregone wages due to childrearing.
  5. I calculated household income per capita (items 3 and 5 are sketched in code after this list).
  6. I included "partner birth year" and "year relationship began" information from 2018 and 2019, because we have included three years of info for other basic information variables about partners.
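A sketch of items 3 and 5, where partner_birthyear_bg, hh_income, and hh_size are hypothetical names for the partner's own background birth year, household income, and household size:

library(dplyr)

data <- data %>%
  mutate(
    # 3. Fall back on the partner's own background data when cf20m026 is missing
    partner_birthyear = coalesce(cf20m026, partner_birthyear_bg),
    # 5. Household income per capita
    hh_income_per_capita = hh_income / hh_size
  )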

The F-1 score for the latest model version is 0.7963717. If I were to remove these additions, the F-1 score would be 0.7867618. That looks like an improvement of about 0.01, but luck may well be involved.

I also removed some variables for at least one of the following reasons: 1) they may be repetitive, and getting rid of them speeds up the code, which is getting quite slow; 2) they seem less helpful for predicting the outcome, and not having them may allow the algorithm to focus on more informative features. Here are the features I removed; the total number of features went from 141 to 104:

  1. Data about partner gender
  2. Past fertility intention data up to 2017
  3. Data about child gender
  4. Type of parent (self, biological) for respondents and partner

If I were to revert these removals, the F-1 score would be 0.7958853. Having fewer variables seems to yield an improvement of about 0.0005, but I think the larger lesson is that we can likely get rid of predictors without compromising performance.

This raises a question: Instead of removing variables by hand, can we rely on an algorithm to do feature selection to result in a more succinct and perhaps more powerful algorithm?

feature_selection.txt

I've attached some code that I think allows us to do feature selection. I customized a step function for tidymodels recipes called step_feature_selection(). Prepping the recipe means that we take the training set, fit an xgboost model, and get the top k features as defined by "Gain" in feature importance. Baking the recipe means that we select these top k features, as calculated from the training set, for both the training and test sets.
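The gist of the prep/bake logic, stripped of the recipes scaffolding in feature_selection.txt (x_train, y_train, and x_test are placeholders; nrounds and the other xgboost settings are illustrative):

library(xgboost)

# "Prep": fit xgboost on the training set and rank features by Gain
top_k_features <- function(x_train, y_train, k) {
  fit <- xgboost(data = as.matrix(x_train), label = y_train,
                 objective = "binary:logistic", nrounds = 50, verbose = 0)
  imp <- xgb.importance(model = fit)  # data.table with Feature, Gain, ...
  head(imp$Feature[order(-imp$Gain)], k)
}

# "Bake": keep the same top-k columns in both the training and test sets
keep <- top_k_features(x_train, y_train, k = 104)
x_train_sel <- x_train[, keep, drop = FALSE]
x_test_sel  <- x_test[, keep, drop = FALSE]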

It appears that this feature selection algorithm underperforms manual feature selection.

Recall that the F-1 score held stable (0.7958853 to 0.7963717) as I reduced our 141 features to the 104 I hand-picked.
If I were instead to select 104 variables using the code above, the F-1 score would decrease to 0.7704959.
If I were to start from the 104 features that I hand-picked and use the algorithm to further reduce the number of features to 90, the F-1 score would decrease to 0.7710458.
If I were to further reduce it to k=25, the F-1 score would decrease to 0.7698078.
If I were to further reduce it to k=10, the F-1 score would be 0.7751638 (the second-best pipeline has an F-1 of 0.7613969).

Overall, automatic feature selection appears to make things worse. There could be a bug in my code, or perhaps my understanding of what these variables mean helped with prediction. I'm not sure which it is.

Perhaps some algorithm other than xgboost is better suited for the purpose of feature selection?

@HanzhangRen
Collaborator Author

I also made some additional changes, e.g.:

  1. I made sure that all year variables are shifted by 3 years (see the sketch below).
  2. I got rid of some extremely rare categories of categorical variables (cell size under 50).
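A sketch of item 1, with shifted_data and year_vars as placeholders for the time-shifted data frame and the list of year-valued columns:

# Shift every year-valued variable forward by 3 so the time-shifted cases
# line up with the real observation window
year_vars <- c("partner_birthyear", "year_relationship_began")  # placeholder names
shifted_data[year_vars] <- shifted_data[year_vars] + 3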
