Description
Problem:
Casting the experiments / drug response table for the ctrpv2 dataset to wide format (i.e. dataset.format('experiments', shape='wide')) fails.
Cause:
The internal pivot inside the format function fails because multiple entries share the same "index". This can be traced back to the fact that the experiments data table for ctrpv2 contains duplicate rows.
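The failure mode can be shown in isolation with plain pandas. The sketch below uses a small made-up table (the column names follow the experiments table, the values are invented for illustration) in which one sample/drug/metric combination appears twice:

```python
import pandas as pd

# Toy stand-in for the experiments table; values are invented for illustration.
toy = pd.DataFrame({
    "improve_sample_id": [1, 1, 1],
    "improve_drug_id": ["SMI_1", "SMI_1", "SMI_1"],
    # "auc" appears twice for the same sample/drug pair
    "dose_response_metric": ["auc", "auc", "dss"],
    "dose_response_value": [0.5, 0.5, 0.1],
})

try:
    toy.pivot(
        index=["improve_sample_id", "improve_drug_id"],
        columns="dose_response_metric",
        values="dose_response_value",
    )
    error_message = None
except ValueError as err:
    # pivot cannot place two values into the same (row, column) cell
    error_message = str(err)

print(error_message)
```

pivot needs every (index, columns) combination to be unique, so a single repeated row is enough to trigger the error.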
How to reproduce:
import coderdata as cd
data = cd.load(name='ctrpv2')
exp_f = data.format(
    data_type='experiments',
    shape='wide',
    metrics=[
        'fit_auc',
        'fit_ic50',
        'fit_r2',
        'fit_ec50se',
        'fit_einf',
        'fit_hs',
        'aac',
        'auc',
        'dss',
    ],
)
The code above raises the error. The block below isolates the step at which the format function actually fails:
from copy import deepcopy

# Work on a copy so that data.experiments itself stays untouched
tmp = deepcopy(data.experiments)
ret = tmp.pivot(
    index=[
        'source',
        'improve_sample_id',
        'improve_drug_id',
        'study',
        'time',
        'time_unit',
    ],
    columns='dose_response_metric',
    values='dose_response_value',
)
The error is ValueError("Index contains duplicate entries, cannot reshape"). The code block below reveals that there are duplicate entries in data.experiments, which is why the pivot function raises the ValueError.
>>> exp = data.experiments
>>> exp.shape
(3097840, 8)
>>> exp.drop_duplicates().shape
(3094010, 8)
As can be seen, the number of rows drops from 3,097,840 to 3,094,010 after duplicate removal, i.e. a reduction of 3,830 rows.
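Note that drop_duplicates() without arguments only removes rows that are identical in every column. To inspect the offending rows, and to check whether any of them share the pivot key but carry different values, something along the lines of the sketch below could be used. It runs on a made-up toy table with a shortened key; for the real data, key_cols would be the six index columns from the pivot call plus 'dose_response_metric'. Whether ctrpv2 contains such conflicting rows has not been checked here.

```python
import pandas as pd

# Toy stand-in for data.experiments; values are invented for illustration.
toy = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 2, 2],
    "improve_drug_id": ["SMI_1"] * 5,
    "dose_response_metric": ["auc", "auc", "dss", "auc", "auc"],
    "dose_response_value": [0.5, 0.5, 0.1, 0.7, 0.9],
})

# Columns that must be unique together for the pivot to succeed
key_cols = ["improve_sample_id", "improve_drug_id", "dose_response_metric"]

# Rows that are identical in every column (what drop_duplicates() collapses)
exact_dupes = toy[toy.duplicated(keep=False)]

# Rows that share the pivot key, regardless of the value
key_dupes = toy[toy.duplicated(subset=key_cols, keep=False)]

# Same key but different value: these survive drop_duplicates()
# and would still break the pivot
conflicts = key_dupes.drop(exact_dupes.index)

print(len(exact_dupes), len(key_dupes), len(conflicts))
```

If conflicts is empty for the real table, the duplicates are pure repeats and can be dropped safely; otherwise the build process needs to decide which value is correct.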
Duplicate rows in the experiment file should not exist in the first place. My best guess is that something went wrong during the build process.
Proposed solution:
Fix the dataset.
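Until the dataset is rebuilt, a possible user-side workaround is to drop the exact duplicates before pivoting. This is only a sketch on a made-up toy table with a shortened index, and it assumes all duplicates are exact repeats. It bypasses the format function and calls pandas directly; on the real data the index would be the six columns listed in the pivot call above.

```python
import pandas as pd

# Toy stand-in for data.experiments; values are invented for illustration.
toy = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 2],
    "improve_drug_id": ["SMI_1", "SMI_1", "SMI_1", "SMI_1"],
    "dose_response_metric": ["auc", "auc", "dss", "auc"],
    "dose_response_value": [0.5, 0.5, 0.1, 0.7],
})

index_cols = ["improve_sample_id", "improve_drug_id"]

# Remove rows that are identical in every column, then reshape to wide format
deduped = toy.drop_duplicates()
wide = deduped.pivot(
    index=index_cols,
    columns="dose_response_metric",
    values="dose_response_value",
)

print(wide)
```

Metrics that are missing for a sample/drug pair show up as NaN in the wide table.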