How to utilize the time-shifted data in our workflow #16
Comments
@emilycantrell All this aligns with my understanding!
@HanzhangRen A first draft of the time-shifted data is almost ready! Tomorrow I'll set up the cross-validation code to work with it. I'm unclear on where to read in the data files. It looks like I should indicate the path to my data files here, but I'm confused about where to read in PreFer_train_outcome.csv. Is DATA_FILE supposed to be PreFer_train_outcome.csv, and is BACKGROUND_DATA supposed to be train_data.csv? If so, does that mean PreFer_train_background_data.csv isn't used? Or is there a different spot to read in the outcome data? (P.S. I know to do this locally, of course, since we can't post the data files to GitHub. I'll share the time-shifted data CSV files I create with you so that you can also do it locally.)
data_file is supposed to be PreFer_train_data.csv/PreFer_fake_data.csv, and background_data is supposed to be PreFer_train_background_data.csv/PreFer_fake_background_data.csv. Background data refers to an extended dataset that we did not really use in our code. See these two sections in the dataset guide: My understanding is that "Rscript run.R PreFer_fake_data.csv PreFer_fake_background_data.csv" does not use the outcome data; it only applies model.rds to the fake data to produce predictions. In other words, running this code does not train the model. The trained model.rds object is something we need to create ourselves using training.py. Some other code would then compare these predictions to the actual outcomes.
Got it. Thanks!
Update: submission.R actually required a small change. This commit creates an indicator for whether the data was time-shifted. In our training data, the "original" data has a value of 0 for time_shift, and the data associated with the 2018-2020 outcome period has a value of 1 for time_shift. In the holdout data, everyone should have a value of 0 for time_shift. So the code in the commit checks whether a time_shift column already exists; if it does not, it generates a time_shift column with a value of 0 for everybody, which is what will need to happen on the holdout data.
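For anyone following along, here is a rough Python sketch of that check. It is not the code from the commit (that change lives in submission.R); it only illustrates the same logic with the standard library. The function names and the "id"/"age" columns are made up for the example; only the time_shift column name and its 0/1 meaning come from the description above.

```python
import csv
import io


def ensure_time_shift(fieldnames, rows, column="time_shift", default=0):
    """Add `column` filled with `default` only when the table lacks it."""
    if column in fieldnames:
        # Training-style data already carries the indicator
        # (0 = original, 1 = time-shifted), so leave it untouched.
        return list(fieldnames), [dict(row) for row in rows]
    # Holdout-style data has no indicator yet, so everyone gets 0.
    return list(fieldnames) + [column], [{**row, column: default} for row in rows]


def read_with_time_shift(handle):
    """Read a CSV handle and guarantee a time_shift column (hypothetical helper)."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return ensure_time_shift(reader.fieldnames or [], rows)


# Hypothetical holdout-style input with no time_shift column.
holdout = io.StringIO("id,age\n1,30\n2,41\n")
fields, rows = read_with_time_shift(holdout)
print(fields)
print(rows)
```

Note that values read by csv.DictReader are strings, so an existing time_shift column would hold "0"/"1" rather than integers. A real pipeline would probably convert types before modeling.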
We've successfully integrated the time-shifted data into the code. This issue is ready to close!
@HanzhangRen I'm thinking about the logistics of how to insert the time-shifted data into our workflow. I think the following will work:
The following don't require any special changes:
Does all of this align with your understanding?