Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design formula affects design matrix with PyDEseq2 #690

Closed
emdann opened this issue Dec 19, 2024 · 2 comments
Closed

Design formula affects design matrix with PyDEseq2 #690

emdann opened this issue Dec 19, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@emdann
Copy link
Member

emdann commented Dec 19, 2024

Report

Something is going wrong in the construction of the design matrix for DE analysis, where the order of covariates in the design formula determines whether the right name will be assigned to the right column in the design matrix.

Reproducible example from the DE tutorial

import scanpy as sc
import numpy as np

adata = pt.dt.zhang_2021()
adata = adata[adata.obs["Origin"] == "t", :].copy()
adata = adata[:, 0:100].copy()
ps = pt.tl.PseudobulkSpace()
pdata = ps.compute(adata, target_col="Patient", groups_col="Cluster", mode="sum", min_cells=10, min_counts=10)

# Add a continuous covariate (easier to visualize issue)
pdata.obs['cont_covariate'] = np.random.randint(1, 101, size=pdata.n_obs)

Running DE analysis with PyDESeq2

pds2 = pt.tl.PyDESeq2(adata=pdata, design="~cont_covariate+Treatment")
pds2.fit()
res_df = pds2.test_contrasts(["Treatment", "Chemo", "Anti-PD-L1+Chemo"])
print(res_df.head())
variable    baseMean    log_fc     lfcSE      stat       p_value  \
0  AL139246.5    7.935286  0.749880  0.122076  6.142734  8.111325e-10   
1      INTS11   17.534659  0.245957  0.047331  5.196517  2.030573e-07   
2        HES4   25.928723  0.761584  0.153050  4.976039  6.489851e-07   
3       ISG15  171.110284  0.606480  0.125045  4.850094  1.234032e-06   
4       RPL22  520.041771 -0.177769  0.039758 -4.471257  7.776129e-06   

    adj_p_value contrast  
0  5.921267e-08     None  
1  7.411592e-06     None  
2  1.579197e-05     None  
3  2.252109e-05     None  
4  1.135315e-04     None  

Here the design matrix looks as expected

print(pds2.design.head())
                   Intercept  cont_covariate  Treatment[T.Chemo]
P002_Mix                 1.0              86                   0
P012_t_Bfoc-MKI67        1.0              99                   0
P019_t_Bfoc-MKI67        1.0              84                   0
P022_t_Bfoc-MKI67        1.0              89                   1
P023_t_Bfoc-MKI67        1.0              75                   1

If I swap the covariates in the formula:

pds2 = pt.tl.PyDESeq2(adata=pdata, design="~Treatment+cont_covariate")
pds2.fit()
res_df = pds2.test_contrasts(["Treatment", "Chemo", "Anti-PD-L1+Chemo"])
print(res_df.head())

The results look different:

variable   baseMean    log_fc     lfcSE      stat   p_value  adj_p_value  \
0  PLEKHG5   0.515634  0.014820  0.004918  3.013508  0.002582     0.258246   
1   INTS11  17.534659  0.001781  0.000798  2.230971  0.025683     0.704902   
2    CCNL2  13.397763  0.003030  0.001403  2.159899  0.030781     0.704902   
3     SDF4  42.932654  0.002131  0.001001  2.129164  0.033241     0.704902   
4    MEGF6   0.863331  0.007630  0.003637  2.097969  0.035908     0.704902   

  contrast  
0     None  
1     None  
2     None  
3     None  
4     None  

And column names appear to be swapped in the design matrix:

print(pds2.design.head())
                  Intercept  cont_covariate  Treatment[T.Chemo]
P002_Mix                 1.0               0                  86
P012_t_Bfoc-MKI67        1.0               0                  99
P019_t_Bfoc-MKI67        1.0               0                  84
P022_t_Bfoc-MKI67        1.0               1                  89
P023_t_Bfoc-MKI67        1.0               1                  75

This is likely caused by this code.

Version information

Version info
-----
anndata             0.10.8
decoupler           1.8.0
numpy               1.26.4
pandas              2.2.3
pertpy              0.9.4
pydeseq2            0.5.0pre2
scanpy              1.10.4
scipy               1.14.1
session_info        1.0.0
-----
PIL                 11.0.0
absl                NA
adjustText          1.3.0
asttokens           NA
attr                24.2.0
blitzgsea           NA
certifi             2024.08.30
cffi                1.17.1
charset_normalizer  3.4.0
chex                0.1.87
comm                0.2.2
cycler              0.12.1
cython_runtime      NA
dateutil            2.9.0.post0
...
Python 3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:43:22) [GCC 12.3.0]
Linux-6.1.0-25-amd64-x86_64-with-glibc2.36
-----
Session information updated at 2024-12-18 16:10
@emdann emdann added the bug Something isn't working label Dec 19, 2024
@grst
Copy link
Collaborator

grst commented Dec 19, 2024

Hi Emma,

all of this should be fixed with pydeseq v0.5 which finally switches to formulaic and design matrices natively. Please try out their release candidate!

I still need to find the time to update pertpy accordingly, see also #610

@emdann
Copy link
Member Author

emdann commented Dec 19, 2024

Hi Gregor, thanks for clarifying. I am already using the pydeseq release candidate. I've proposed changes to the formulaic branch here #691

@Zethson Zethson closed this as completed Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants