Large mismatches between unique TCC counts and estimated counts for transcripts with very large abundances in lr-kallisto #482

Open
dnwissel opened this issue Mar 9, 2025 · 1 comment

dnwissel commented Mar 9, 2025

Hi,

first, thanks a lot for developing kallisto and lr-kallisto!

We've recently been benchmarking different quantification methods on PacBio spike-in (SIRV) data and noticed an odd mismatch for lr-kallisto: the estimated abundance for some spike-ins was much smaller than expected. Looking at this in more detail, the unique TCC corresponding to these spike-ins appears to be ignored, since the final estimated count is far smaller than the TCC count that is unique to that isoform. Here is an example for two spike-in isoforms from SIRV4 and SIRV5:

transcript_id  estimated_counts  tcc_count  delta_est_tcc
SIRV410                   33.00    1937240       -1937207
SIRV508                 1573.18    2075800       -2074227

In addition, we noticed that this issue wasn't present in some downsampling experiments that we ran; it seems to happen only for isoforms with very large unique TCC counts (roughly > 1e6). No other isoforms had a negative delta between estimated counts and unique TCC counts, which makes sense, since the unique TCC count should be a lower bound on the estimated counts, I suppose.
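To spell out the lower-bound intuition: a sketch, assuming quantification follows the usual EM update over equivalence classes (whether the `--long` code path modifies this is not verified here). The symbols are introduced only for this illustration: c_e is the TCC count of equivalence class e, alpha_t the current abundance of transcript t, and ell_t its effective length. The class containing only t assigns all of its count to t, and every other class contributes a non-negative amount:

```latex
\hat{N}_t \;=\; \sum_{e \ni t} c_e \,
  \frac{\alpha_t / \ell_t}{\sum_{s \in e} \alpha_s / \ell_s}
\;\ge\; c_{\{t\}} \cdot \frac{\alpha_t / \ell_t}{\alpha_t / \ell_t}
\;=\; c_{\{t\}} \qquad (\alpha_t > 0)
```

Under these assumptions, an estimated count of 33.00 against a unique TCC count of 1937240 should not be reachable.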

Estimated counts were taken from matrix.abundance.mtx and unique TCC counts from count.mtx (filtered to contain only unique TCCs). We've accounted for the different offsets. As mentioned, these problems disappear completely when downsampling, which is how we noticed the issue in the first place: at full depth, performance on downstream tasks such as DTE (differential transcript expression) degrades drastically compared to the downsampled data.
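For reference, a comparison of this kind can be sketched as follows. This is a minimal illustration, not the exact script behind the table above. It assumes that both .mtx files are MatrixMarket coordinate files with 1-based indices, that count.ec.txt has one `<ec_id><TAB><comma-separated 0-based transcript indices>` line per equivalence class, that the columns of count.mtx follow the EC ids, and that the columns of matrix.abundance.mtx follow the order of transcripts.txt. The function names are hypothetical.

```python
import io


def read_mtx(handle):
    """Parse a MatrixMarket coordinate file into {(row, col): value}, 0-based."""
    entries = {}
    size_line_seen = False
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines, the banner, and comments
        if not size_line_seen:
            size_line_seen = True  # first data line is "n_rows n_cols n_entries"
            continue
        row, col, value = line.split()[:3]
        # MatrixMarket indices are 1-based; shift to 0-based (assumed offset).
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    return entries


def read_ec(handle):
    """Parse EC lines of the assumed form '<ec_id>\\t<tx_idx>,<tx_idx>,...'."""
    ecs = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        ec_id, members = line.split()
        ecs[int(ec_id)] = [int(t) for t in members.split(",")]
    return ecs


def unique_tcc_vs_estimate(tcc, ecs, est, names, sample=0):
    """Return (name, estimated, unique_tcc, delta) per transcript with a unique EC."""
    unique = {}
    for (row, ec_id), count in tcc.items():
        if row != sample:
            continue
        members = ecs[ec_id]
        if len(members) == 1:  # EC maps to exactly one transcript
            unique[members[0]] = unique.get(members[0], 0.0) + count
    rows = []
    for tx, unique_count in sorted(unique.items()):
        estimated = est.get((sample, tx), 0.0)
        rows.append((names[tx], estimated, unique_count, estimated - unique_count))
    return rows


# Tiny in-memory demo with made-up numbers (not the SIRV data).
demo_tcc = read_mtx(io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n1 3 3\n1 1 100\n1 2 50\n1 3 7\n"))
demo_ecs = read_ec(io.StringIO("0\t0\n1\t1\n2\t0,1\n"))
demo_est = read_mtx(io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n1 2 2\n1 1 33.0\n1 2 55.5\n"))
demo_rows = unique_tcc_vs_estimate(demo_tcc, demo_ecs, demo_est, ["txA", "txB"])
```

With real output, the three handles would be opened from count.mtx, count.ec.txt and matrix.abundance.mtx, and `names` read from transcripts.txt. A negative delta in the last column flags a transcript whose estimate fell below its unique TCC count.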

Could you think of a possible explanation for this (including user error on our side) or is this a bug?

This is on kallisto 0.51.1 and bustools 0.44.1 on Ubuntu 22.04. The full commands are below, with enough Snakemake removed to make them easy to parse (hopefully).

Indexing:

  kallisto index \
      -k 63 -t 12 -i {output.index} \
      results/prepare/extract_transcriptomes/sirv_transcriptome.fa &> {log}

Quantification:

  kallisto bus -t 12 --long --threshold 0.8 \
      -x bulk -i {input.index} \
      -o {params.outdir} {input.reads} &>> {log};
  bustools sort -t 12 {params.outdir}/output.bus \
      -o {params.outdir}/sorted.bus &>> {log};
  bustools count {params.outdir}/sorted.bus \
      -t {params.outdir}/transcripts.txt \
      -e {params.outdir}/matrix.ec \
      -g {input.sirv_four_transcriptome_gmap} \
      -o {params.outdir}/count --cm -m \
      &>> {log};
  kallisto quant-tcc -t 12 \
      --long -P PacBio -f {params.outdir}/flens.txt \
      {params.outdir}/count.mtx -i {input.index} \
      -e {params.outdir}/count.ec.txt \
      -o {params.outdir} &>> {log}

Happy to provide a full reprex or any other information/details, although I would have to share full BAM files, given that this only seems to happen at sufficient depth.

Thanks a lot!

Best
David

Collaborator

Thanks for submitting an issue for this! I’ll look into it. Would you be able to share the output folder with me? My email is: [email protected]
