Description
State of first draft data obfuscation:
- We have a logging obfuscation function that simulates how often patients log their meals:
  - All meals - keep all meals for now
  - Multiple meals per day (1-2 largest meals) - find a threshold that yields an average of 1.8 meals logged per day
  - Once per day (largest meal) - keep only the largest meal of each day
  - A few times per week - find a threshold that yields an average of 3 meals logged per week
  - Never - wipe all data
- We have a logging timing habit function that simulates when patients actually log their meals:
  - Temporally right-skewed -> forgetful loggers - right-skewed gamma distribution. Fixed-value distribution with minor randomness.
  - Temporally left-skewed -> hasty loggers - left-skewed gamma distribution (less skewed, because a patient probably won't log a meal too early most of the time). Fixed-value distribution with minor randomness.
  - Normal distribution - Gaussian distribution with a fixed spread
  - Unchanged
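
The threshold-based categories of the logging obfuscation function could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `(day, size)` tuple layout, the use of a numeric "size" to rank meals, and the function names are all assumptions made for the example.

```python
def threshold_for_rate(meals, target_per_day):
    """Pick a size threshold so that keeping meals with size >= threshold
    gives roughly `target_per_day` logged meals per day on average.

    `meals` is a list of (day, size) tuples (hypothetical layout).
    """
    n_days = len({day for day, _ in meals})
    # Number of meals to keep overall, clamped to a valid range.
    k = max(1, min(round(target_per_day * n_days), len(meals)))
    sizes = sorted((size for _, size in meals), reverse=True)
    return sizes[k - 1]  # size of the k-th largest meal


def keep_above(meals, threshold):
    """'Multiple meals per day' / 'a few times per week': keep large meals."""
    return [m for m in meals if m[1] >= threshold]


def largest_per_day(meals):
    """'Once per day': keep only the largest meal of each day."""
    best = {}
    for day, size in meals:
        if day not in best or size > best[day]:
            best[day] = size
    return sorted(best.items())


# Toy data: 5 days with 3 meals each (sizes are made up).
meals = [(d, s) for d in range(5) for s in (20 + d, 45 + 2 * d, 70 + 3 * d)]

threshold = threshold_for_rate(meals, target_per_day=1.8)
multi = keep_above(meals, threshold)
once = largest_per_day(meals)
print(len(multi) / 5, len(once))
```

The weekly category can reuse the same helper with `target_per_day=3 / 7`. Ties in meal size can push the kept count slightly above the target, so the average is only approximate on real data.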
Data flow:
`data/raw/sim` -> logging obfuscation function to create `msg_type_log` -> logging timing habit function to create `msg_type_log_shifted` from `msg_type_log` -> `data/raw/obfuscated`
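
The timing step of this flow, which shifts each log time relative to the true meal time, might look like the sketch below. The habit labels, the function name, and every distribution parameter are placeholders chosen for illustration, not the values used in the project. Note that a gamma distribution is right-skewed by construction, so the sketch obtains a left-skewed offset by negating a gamma sample, which is one possible approach.

```python
import random


def logging_offset_minutes(habit, rng):
    """Return minutes to add to the true meal time (negative = logged early).

    All parameters below are hypothetical placeholders.
    """
    if habit == "forgetful":
        # Right-skewed: most logs slightly late, with a long late tail.
        return rng.gammavariate(2.0, 10.0)  # shape=2, scale=10, mean 20 min
    if habit == "hasty":
        # Left-skewed: mirrored gamma with a smaller scale (shorter tail).
        return -rng.gammavariate(2.0, 5.0)  # mean 10 min early
    if habit == "normal":
        # Symmetric around the true meal time with a fixed spread.
        return rng.gauss(0.0, 5.0)
    return 0.0  # "unchanged": log time equals meal time


rng = random.Random(0)  # seeded so the simulation is reproducible
forgetful = [logging_offset_minutes("forgetful", rng) for _ in range(10_000)]
hasty = [logging_offset_minutes("hasty", rng) for _ in range(10_000)]
print(sum(forgetful) / len(forgetful), sum(hasty) / len(hasty))
```

Passing the random generator in explicitly keeps the draws reproducible, which helps when comparing parameter choices.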
Improvement:
- Find the right split between user types for both functions. For example, loggers who log all of their meals might make up 25% of users rather than 30%.
- Fine-tune the default distributions (we need better parameters for the gamma distribution to reflect patients' true behaviour) or find a better distribution.
- The left- and right-skewed distributions should differ. Hasty loggers might log their meals 10 minutes early on average and probably not much earlier than that, whereas forgetful loggers may log more than 40 minutes late.
- Remove the original CSV file when generating a new file name (bug)
- Investigate newline characters at the end of some files (bug?)
- Clean up columns from the `simulation_data_generation` script. We have an `Unnamed: 0` column; maybe we should have dropped it.
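
For the last item, an `Unnamed: 0` column usually appears when a DataFrame is saved with its index and then read back. The sketch below reproduces that and shows two common ways to avoid it, assuming the script uses pandas; the column names and values are made up.

```python
import io

import pandas as pd

# Hypothetical stand-in for the simulated data.
df = pd.DataFrame({"msg_type": ["meal", "bolus"], "value": [45, 3]})

# Writing with the default index=True stores the index as a nameless
# first column, which read_csv then labels "Unnamed: 0".
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)

# Fix 1: drop the stray column from data that already has it.
cleaned = reloaded.drop(columns=["Unnamed: 0"], errors="ignore")

# Fix 2: never write the index in the first place.
buf2 = io.StringIO()
cleaned.to_csv(buf2, index=False)
header = buf2.getvalue().splitlines()[0]
print(list(reloaded.columns), list(cleaned.columns), header)
```

Using `errors="ignore"` makes the drop safe to run on files that do not contain the column.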