Add obfuscator - Second Draft #202

@Tony911029

State of first draft data obfuscation:

  • We have a logging obfuscation function that simulates how patients log their meals:
  1. All meals - keep all meals for now

  2. Multiple meals per day (1-2 largest meals) - Find a threshold so that we have an average of 1.8 meals logged per day

  3. Once per day (largest meal) - Keep only the largest meal of each day

  4. A few times per week - Find a threshold so that we have an average of 3 meals logged per week

  5. Never - Wipe all data
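
A minimal sketch of how the threshold-based profiles above could work, using only the standard library. The `(day, size)` tuple layout, the function names, and the sample data are hypothetical and not taken from the project code; "size" stands in for whatever quantity ranks a meal as large.

```python
def size_threshold(meals, target_per_day):
    """Return the meal size at or above which meals are kept, so that the
    average number of kept meals per day is close to target_per_day.

    meals: list of (day, size) tuples (hypothetical layout).
    """
    n_days = len({day for day, _ in meals})
    # Number of meals to keep overall, clamped to what is available.
    keep = min(len(meals), max(1, round(target_per_day * n_days)))
    sizes = sorted((size for _, size in meals), reverse=True)
    return sizes[keep - 1]


def largest_meal_per_day(meals):
    """Return {day: size of the largest meal logged that day}."""
    best = {}
    for day, size in meals:
        if day not in best or size > best[day]:
            best[day] = size
    return best


# Illustrative data only: five days, three meals per day.
meals = [
    (1, 30), (1, 60), (1, 45),
    (2, 20), (2, 80), (2, 55),
    (3, 35), (3, 70), (3, 50),
    (4, 25), (4, 65), (4, 40),
    (5, 15), (5, 75), (5, 90),
]

threshold = size_threshold(meals, target_per_day=1.8)
kept = [(day, size) for day, size in meals if size >= threshold]
average_per_day = len(kept) / 5
```

The weekly profile could reuse the same helper with `target_per_day=3 / 7`. Ties in meal size can push the kept count slightly above the target, so the resulting average is approximate.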

  • We have a logging timing habit function that simulates when patients actually log their meals:
  1. Temporally right-skewed -> forgetful loggers - Right-skewed gamma distribution with fixed parameters and minor randomness.

  2. Temporally left-skewed -> hasty loggers - Left-skewed gamma distribution (less skewed, because a patient probably won't log a meal too early most of the time) with fixed parameters and minor randomness.

  3. Normal distribution - Gaussian distribution with a fixed spread

  4. Unchanged
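
The timing habits above could be sketched as follows, using only the standard library. The habit names, shape/scale/spread values, and function names are assumptions chosen for illustration, not the project's actual parameters. A gamma distribution is always right-skewed, so this sketch produces the left-skewed (early) case by negating a gamma draw.

```python
import random
from datetime import datetime, timedelta


def shift_minutes(habit, rng):
    """Return the offset in minutes between eating and logging.

    Positive means the meal is logged late, negative means early.
    All parameter values below are placeholders.
    """
    if habit == "forgetful":
        # Right-skewed delay: gamma(shape=2, scale=10) has mean 20 minutes.
        return rng.gammavariate(2.0, 10.0)
    if habit == "hasty":
        # Mirrored gamma gives a left-skewed, smaller early offset (mean 5).
        return -rng.gammavariate(2.0, 2.5)
    if habit == "normal":
        # Symmetric jitter around the true meal time.
        return rng.gauss(0.0, 5.0)
    if habit == "unchanged":
        return 0.0
    raise ValueError(f"unknown habit: {habit!r}")


def shifted_log_time(meal_time, habit, rng):
    """Return the simulated logging timestamp for one meal."""
    return meal_time + timedelta(minutes=shift_minutes(habit, rng))


rng = random.Random(0)  # seeded so a run is reproducible
meal_time = datetime(2023, 1, 1, 12, 0)
late = [shift_minutes("forgetful", rng) for _ in range(1000)]
early = [shift_minutes("hasty", rng) for _ in range(1000)]
same = shifted_log_time(meal_time, "unchanged", rng)
```

Passing an explicit `random.Random` instance keeps the shifts reproducible across runs, which helps when comparing obfuscated outputs.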

Data flow:

`data/raw/sim` -> logging obfuscation function to create `msg_type_log` -> logging timing habit function to create `msg_type_log_shifted` from `msg_type_log` -> `data/raw/obfuscated`

Improvement:

  1. Find the right proportion of each user type for both functions. For example, patients who log all of their meals might make up 25% of users rather than 30%.

  2. Fine-tune the default distributions (we need better gamma parameters to reflect the true behaviour of patients) or find a better distribution.

  3. The left- and right-skewed distributions should differ. Hasty loggers might log their meals about 10 minutes early on average and probably not much earlier than that, whereas forgetful loggers may log more than 40 minutes late.

  4. Remove the original csv file when generating a new file name (bug)

  5. Investigate newline characters at the end of some files (bug?)

  6. Clean up the columns produced by the simulation_data_generation script. There is an `Unnamed: 0` column that we probably should have dropped.
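
For item 6, an `Unnamed: 0` column typically appears when a pandas DataFrame is written with its index and the file is read back; writing with `DataFrame.to_csv(..., index=False)` avoids creating it. For files that already contain the column, a minimal standard-library sketch of the cleanup might look like this. The function name and sample data are hypothetical.

```python
import csv
import io


def drop_columns(csv_text, unwanted=("", "Unnamed: 0")):
    """Return csv_text with the unwanted columns removed.

    The empty name covers an index column written without a header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if name not in unwanted]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # keys outside `kept` are ignored
    return out.getvalue()


# Illustrative data only.
sample = "Unnamed: 0,time,msg_type\n0,2023-01-01 08:00,meal\n"
cleaned = drop_columns(sample)
```

Setting `lineterminator="\n"` also keeps line endings consistent, which may be relevant to the trailing newline question in item 5.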

Labels: enhancement (New feature or request)
