ctrpv2 drug response contains duplicate entries #455

## Problem:

Casting the experiments (drug response) table of the ctrpv2 dataset to wide format, i.e. `dataset.format('experiments', shape='wide')`, fails.

## Cause:

The internal pivot inside the `format` function fails because multiple entries share the same "index". This traces back to duplicate rows in the experiments data table for ctrpv2.
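To illustrate the failure mode in isolation, here is a minimal sketch on a made-up toy table (not the real ctrpv2 data, and with only a subset of the columns). It assumes pandas 1.1 or later, where `DataFrame.pivot` accepts a list for `index`.

```python
import pandas as pd

# Made-up toy rows; the first two rows are exact duplicates.
toy = pd.DataFrame({
    "improve_sample_id": [1, 1, 1],
    "improve_drug_id": ["SMI_1", "SMI_1", "SMI_1"],
    "dose_response_metric": ["auc", "auc", "dss"],
    "dose_response_value": [0.5, 0.5, 0.2],
})

try:
    toy.pivot(
        index=["improve_sample_id", "improve_drug_id"],
        columns="dose_response_metric",
        values="dose_response_value",
    )
    pivot_error = None
except ValueError as err:
    # pivot cannot decide which of the two 'auc' values belongs in the cell.
    pivot_error = str(err)

print(pivot_error)
```

Two rows that map to the same (index, column) cell are enough to make `pivot` refuse to reshape, even when their values are identical.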

## How to reproduce:

```python
import coderdata as cd
data = cd.load(name='ctrpv2')
exp_f = data.format(
    data_type='experiments',
    shape='wide',
    metrics=[
        'fit_auc',
        'fit_ic50',
        'fit_r2',
        'fit_ec50se',
        'fit_einf',
        'fit_hs',
        'aac',
        'auc',
        'dss',
    ],
)
```

The code above raises the error. The code block below replicates the pivot step and shows where the `format` function actually fails:

```python
from copy import deepcopy

# `data` is the dataset loaded in the previous block.
tmp = deepcopy(data.experiments)
ret = tmp.pivot(
    index=[
        'source',
        'improve_sample_id',
        'improve_drug_id',
        'study',
        'time',
        'time_unit',
    ],
    columns='dose_response_metric',
    values='dose_response_value',
)
```

The error is a `ValueError("Index contains duplicate entries, cannot reshape")`. The code block below will reveal that there are duplicate entries in `data.experiments`, which is the reason the `pivot` function raises the `ValueError`.

```python
>>> exp = data.experiments
>>> exp.shape
(3097840, 8)
>>> exp.drop_duplicates().shape
(3094010, 8)
```

As can be seen, the number of rows drops from 3,097,840 to 3,094,010 after duplicate removal, i.e. a reduction of 3,830 rows.
Duplicate rows should not exist in the experiments file in the first place. My best guess is that something went wrong during the build process.
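To narrow down which rows are affected, something like the following sketch could be used. The helper names `exact_duplicates` and `conflicting_keys` are hypothetical (not part of `coderdata`), and the toy table is made up; only the column names are taken from the pivot call above. On the real data one would pass `data.experiments` instead of `toy`.

```python
import pandas as pd

# Pivot index columns plus the metric column that becomes the wide columns.
KEY_COLS = [
    "source", "improve_sample_id", "improve_drug_id",
    "study", "time", "time_unit", "dose_response_metric",
]

def exact_duplicates(experiments: pd.DataFrame) -> pd.DataFrame:
    """Return every row that has at least one exact copy in the table."""
    return experiments[experiments.duplicated(keep=False)]

def conflicting_keys(experiments: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose key still collides after exact duplicates are dropped,
    i.e. same key but a different dose_response_value."""
    deduped = experiments.drop_duplicates()
    return deduped[deduped.duplicated(subset=KEY_COLS, keep=False)]

# Made-up toy rows; rows 0 and 1 are exact duplicates.
toy = pd.DataFrame({
    "source": ["src"] * 4,
    "improve_sample_id": [1, 1, 1, 2],
    "improve_drug_id": ["SMI_1"] * 4,
    "study": ["toy_study"] * 4,
    "time": [72] * 4,
    "time_unit": ["hours"] * 4,
    "dose_response_metric": ["auc", "auc", "dss", "auc"],
    "dose_response_value": [0.5, 0.5, 0.2, 0.7],
})

dupes = exact_duplicates(toy)
conflicts = conflicting_keys(toy)
print(len(dupes), len(conflicts))
```

If `conflicting_keys` returned rows on the real table, dropping exact duplicates alone would not be enough for the pivot to succeed, which would be worth knowing before rebuilding the dataset.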

## Proposed solution:

Fix the dataset so that the ctrpv2 experiments table no longer contains duplicate rows.
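Until the dataset is rebuilt, a possible stopgap on the user side is to drop exact duplicates before pivoting. This is only a sketch: `pivot_without_duplicates` is a hypothetical helper, the toy rows are made up, and it assumes that the remaining keys are unique once exact duplicates are removed, which has not been verified for ctrpv2.

```python
import pandas as pd

INDEX_COLS = [
    "source", "improve_sample_id", "improve_drug_id",
    "study", "time", "time_unit",
]

def pivot_without_duplicates(experiments: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then pivot to one column per metric."""
    deduped = experiments.drop_duplicates()
    return deduped.pivot(
        index=INDEX_COLS,
        columns="dose_response_metric",
        values="dose_response_value",
    )

# Made-up toy rows; the first two rows are exact duplicates.
toy = pd.DataFrame({
    "source": ["src"] * 3,
    "improve_sample_id": [1, 1, 1],
    "improve_drug_id": ["SMI_1"] * 3,
    "study": ["toy_study"] * 3,
    "time": [72] * 3,
    "time_unit": ["hours"] * 3,
    "dose_response_metric": ["auc", "auc", "dss"],
    "dose_response_value": [0.5, 0.5, 0.2],
})

wide = pivot_without_duplicates(toy)
print(wide)
```

This does not replace fixing the build; it only lets the wide cast go through while the duplicates are still present.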
