ctrpv2 drug response contains duplicate entries #455

@ymahlich

Problem:

Casting the experiments (drug response) table of the ctrpv2 dataset to wide format, i.e. dataset.format('experiments', shape='wide'), fails.

Cause:

The internal pivot inside the format function fails because multiple entries share the same "index". This can be traced back to duplicate rows in the experiments data table for ctrpv2.
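
For illustration, the same failure mode can be reproduced without coderdata. The sketch below is a minimal example on a made-up toy table that reuses a few column names from the experiments table. The helper `try_pivot` is hypothetical and all values are invented.

```python
import pandas as pd

# Toy stand-in for the experiments table; the values are invented.
# Rows 0 and 1 are exact duplicates of each other.
toy = pd.DataFrame({
    'improve_sample_id': [1, 1, 1],
    'improve_drug_id': ['d1', 'd1', 'd1'],
    'dose_response_metric': ['auc', 'auc', 'dss'],
    'dose_response_value': [0.5, 0.5, 0.7],
})


def try_pivot(df):
    """Hypothetical helper: return the wide table, or the error message if pivot fails."""
    try:
        # A list-valued `index` requires pandas >= 1.1.
        return df.pivot(
            index=['improve_sample_id', 'improve_drug_id'],
            columns='dose_response_metric',
            values='dose_response_value',
        )
    except ValueError as err:
        return str(err)


failed = try_pivot(toy)                 # pivot fails on the duplicated row
ok = try_pivot(toy.drop_duplicates())   # pivot succeeds once the duplicate is gone
```

On the toy table, removing the exact duplicate is enough for the pivot to succeed. Whether that holds for the full ctrpv2 table depends on whether all of its duplicates are exact copies.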

How to reproduce:

import coderdata as cd
data = cd.load(name='ctrpv2')
exp_f = data.format(
    data_type='experiments',
    shape='wide',
    metrics=[
        'fit_auc',
        'fit_ic50',
        'fit_r2',
        'fit_ec50se',
        'fit_einf',
        'fit_hs',
        'aac',
        'auc',
        'dss',
    ],
)

The code above raises the error. The code block below isolates the step inside the format function that actually fails:

from copy import deepcopy

# Work on a copy so that data.experiments itself is left untouched.
tmp = deepcopy(data.experiments)
ret = tmp.pivot(
    index=[
        'source',
        'improve_sample_id',
        'improve_drug_id',
        'study',
        'time',
        'time_unit',
    ],
    columns='dose_response_metric',
    values='dose_response_value',
)

The pivot raises ValueError("Index contains duplicate entries, cannot reshape"). The code block below shows that data.experiments contains duplicate rows, which is why the pivot function raises the ValueError.

>>> exp = data.experiments
>>> exp.shape
(3097840, 8)
>>> exp.drop_duplicates().shape
(3094010, 8)

As can be seen, the number of rows drops from 3,097,840 to 3,094,010 after duplicate removal, i.e. a reduction of 3,830 rows.
Duplicate rows should not exist in the experiments file in the first place. My best guess is that something went wrong during the build process.
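
To narrow this down further, it may help to separate exact duplicate rows from rows that share the same key but carry different values, since only the former are removed by drop_duplicates. The following is a rough sketch on invented toy data. It assumes the eight columns of the experiments table are the seven key columns used in the pivot plus dose_response_value, and the helper `summarize_duplicates` is hypothetical.

```python
import pandas as pd

# Assumed key: the pivot index columns plus the metric name.
KEY_COLS = [
    'source', 'improve_sample_id', 'improve_drug_id',
    'study', 'time', 'time_unit', 'dose_response_metric',
]


def summarize_duplicates(exp, key_cols=KEY_COLS):
    """Hypothetical helper: count exact duplicates and collect conflicting rows."""
    # Rows that are identical in every column to an earlier row.
    n_exact = int(exp.duplicated().sum())
    # After removing exact copies, any remaining key collision means
    # the same key carries more than one distinct value.
    deduped = exp.drop_duplicates()
    conflicting = deduped[deduped.duplicated(subset=key_cols, keep=False)]
    return n_exact, conflicting


# Invented toy rows: one exact duplicate (auc) and one conflicting pair (dss).
toy = pd.DataFrame(
    [
        ('CTRPv2', 1, 'd1', 's1', 72, 'hours', 'auc', 0.5),
        ('CTRPv2', 1, 'd1', 's1', 72, 'hours', 'auc', 0.5),
        ('CTRPv2', 1, 'd1', 's1', 72, 'hours', 'dss', 0.7),
        ('CTRPv2', 1, 'd1', 's1', 72, 'hours', 'dss', 0.9),
    ],
    columns=KEY_COLS + ['dose_response_value'],
)

n_exact, conflicting = summarize_duplicates(toy)
```

Running `summarize_duplicates(data.experiments)` on the real table would show whether the 3,830 removed rows account for every key collision, or whether conflicting values remain after exact duplicates are dropped.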

Proposed solution:

Fix the dataset so that the experiments table no longer contains duplicate rows.
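
Until the dataset itself is fixed, a user-side workaround could look like the sketch below. It is not part of coderdata. The helper `dedupe_experiments` is hypothetical, the toy values are invented, and it assumes that exact duplicate rows can be dropped safely, while conflicting values for the same key should be reported rather than silently resolved.

```python
import pandas as pd


def dedupe_experiments(exp, key_cols):
    """Hypothetical helper: drop exact duplicates, refuse to guess on conflicts."""
    deduped = exp.drop_duplicates().reset_index(drop=True)
    conflicts = deduped.duplicated(subset=key_cols, keep=False)
    if conflicts.any():
        raise ValueError(
            f"{int(conflicts.sum())} rows share a key but differ in value"
        )
    return deduped


# Invented toy table with one exact duplicate row.
keys = ['improve_sample_id', 'improve_drug_id', 'dose_response_metric']
toy = pd.DataFrame({
    'improve_sample_id': [1, 1, 1],
    'improve_drug_id': ['d1', 'd1', 'd1'],
    'dose_response_metric': ['auc', 'auc', 'dss'],
    'dose_response_value': [0.5, 0.5, 0.7],
})

clean = dedupe_experiments(toy, keys)
# With unique keys, the pivot no longer raises.
wide = clean.pivot(
    index=['improve_sample_id', 'improve_drug_id'],
    columns='dose_response_metric',
    values='dose_response_value',
)
```

A similar uniqueness check could in principle run at the end of the build process so that duplicate rows are caught before a release, although where such a check belongs is a decision for the maintainers.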

Labels: bug