Design formula affects design matrix with PyDEseq2 #690

emdann · 2024-12-19T00:13:14Z

Report

Something is going wrong in the construction of the design matrix for DE analysis, where the order of covariates in the design formula determines whether the right name will be assigned to the right column in the design matrix.

Reproducible example from the DE tutorial

import scanpy as sc
import numpy as np

adata = pt.dt.zhang_2021()
adata = adata[adata.obs["Origin"] == "t", :].copy()
adata = adata[:, 0:100].copy()
ps = pt.tl.PseudobulkSpace()
pdata = ps.compute(adata, target_col="Patient", groups_col="Cluster", mode="sum", min_cells=10, min_counts=10)

# Add a continuous covariate (easier to visualize issue)
pdata.obs['cont_covariate'] = np.random.randint(1, 101, size=pdata.n_obs)

Running DE analysis with PyDESeq2

pds2 = pt.tl.PyDESeq2(adata=pdata, design="~cont_covariate+Treatment")
pds2.fit()
res_df = pds2.test_contrasts(["Treatment", "Chemo", "Anti-PD-L1+Chemo"])
print(res_df.head())

variable    baseMean    log_fc     lfcSE      stat       p_value  \
0  AL139246.5    7.935286  0.749880  0.122076  6.142734  8.111325e-10   
1      INTS11   17.534659  0.245957  0.047331  5.196517  2.030573e-07   
2        HES4   25.928723  0.761584  0.153050  4.976039  6.489851e-07   
3       ISG15  171.110284  0.606480  0.125045  4.850094  1.234032e-06   
4       RPL22  520.041771 -0.177769  0.039758 -4.471257  7.776129e-06   

    adj_p_value contrast  
0  5.921267e-08     None  
1  7.411592e-06     None  
2  1.579197e-05     None  
3  2.252109e-05     None  
4  1.135315e-04     None

Here the design matrix looks as expected

print(pds2.design.head())

                   Intercept  cont_covariate  Treatment[T.Chemo]
P002_Mix                 1.0              86                   0
P012_t_Bfoc-MKI67        1.0              99                   0
P019_t_Bfoc-MKI67        1.0              84                   0
P022_t_Bfoc-MKI67        1.0              89                   1
P023_t_Bfoc-MKI67        1.0              75                   1

If I swap the covariates in the formula:

pds2 = pt.tl.PyDESeq2(adata=pdata, design="~Treatment+cont_covariate")
pds2.fit()
res_df = pds2.test_contrasts(["Treatment", "Chemo", "Anti-PD-L1+Chemo"])
print(res_df.head())

The results look different:

variable   baseMean    log_fc     lfcSE      stat   p_value  adj_p_value  \
0  PLEKHG5   0.515634  0.014820  0.004918  3.013508  0.002582     0.258246   
1   INTS11  17.534659  0.001781  0.000798  2.230971  0.025683     0.704902   
2    CCNL2  13.397763  0.003030  0.001403  2.159899  0.030781     0.704902   
3     SDF4  42.932654  0.002131  0.001001  2.129164  0.033241     0.704902   
4    MEGF6   0.863331  0.007630  0.003637  2.097969  0.035908     0.704902   

  contrast  
0     None  
1     None  
2     None  
3     None  
4     None

And column names appear to be swapped in the design matrix:

print(pds2.design.head())

                  Intercept  cont_covariate  Treatment[T.Chemo]
P002_Mix                 1.0               0                  86
P012_t_Bfoc-MKI67        1.0               0                  99
P019_t_Bfoc-MKI67        1.0               0                  84
P022_t_Bfoc-MKI67        1.0               1                  89
P023_t_Bfoc-MKI67        1.0               1                  75

This is likely caused by this code.

Version information

Version info

-----
anndata             0.10.8
decoupler           1.8.0
numpy               1.26.4
pandas              2.2.3
pertpy              0.9.4
pydeseq2            0.5.0pre2
scanpy              1.10.4
scipy               1.14.1
session_info        1.0.0
-----
PIL                 11.0.0
absl                NA
adjustText          1.3.0
asttokens           NA
attr                24.2.0
blitzgsea           NA
certifi             2024.08.30
cffi                1.17.1
charset_normalizer  3.4.0
chex                0.1.87
comm                0.2.2
cycler              0.12.1
cython_runtime      NA
dateutil            2.9.0.post0
...
Python 3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:43:22) [GCC 12.3.0]
Linux-6.1.0-25-amd64-x86_64-with-glibc2.36
-----
Session information updated at 2024-12-18 16:10

The text was updated successfully, but these errors were encountered:

grst · 2024-12-19T11:04:39Z

Hi Emma,

all of this should be fixed with pydeseq v0.5 which finally switches to formulaic and design matrices natively. Please try out their release candidate!

I still need to find the time to update pertpy accordingly, see also #610

emdann · 2024-12-19T16:52:21Z

Hi Gregor, thanks for clarifying. I am already using the pydeseq release candidate. I've proposed changes to the formulaic branch here #691

emdann added the bug Something isn't working label Dec 19, 2024

Zethson assigned grst Dec 19, 2024

emdann mentioned this issue Dec 19, 2024

Removing design matrix workaround #691

Merged

3 tasks

Zethson closed this as completed Jan 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design formula affects design matrix with PyDEseq2 #690

Design formula affects design matrix with PyDEseq2 #690

emdann commented Dec 19, 2024 •

edited

Loading

grst commented Dec 19, 2024

emdann commented Dec 19, 2024

Design formula affects design matrix with PyDEseq2 #690

Design formula affects design matrix with PyDEseq2 #690

Comments

emdann commented Dec 19, 2024 • edited Loading

Report

Version information

grst commented Dec 19, 2024

emdann commented Dec 19, 2024

emdann commented Dec 19, 2024 •

edited

Loading