Explanation of txClassDescriptions and error message for large bam file. #402

apsteinberg · 2023-11-09T22:41:04Z

Hi there,

Thank you again for helping me the other day with the issues I was having. I've been able to run bambu successfully on some of my samples, and I have a couple of follow-up questions (I hope you don't mind me asking here, happy to correspond via email if this is easier):

I would like to now tap into bambu's really awesome isoform analysis tools. I saw that you have in the output a "txClassDescription", which describes what is novel about the isoform. I wanted to do an analysis of full length isoforms for my samples as your team has done in Figure 5a, d, and e of the following paper: https://www.biorxiv.org/content/10.1101/2021.04.21.440736v1
And I was wondering how the txClassDescription codes relate to some of the events you've described here, if at all? For example, does a txClassDescription code of "new First Junction" correspond to what you describe as a "alternative 5' end" in the paper? And further, are exon skipping events encompassed in these class codes? If not, it would be wonderful if you could point me to where I may be able to classify my transcripts in this way.
I am encountering an issue with one of my bam files through bambu, and I am not entirely sure why. I am getting the following error:

--- Start generating read class files ---
Error: BiocParallel errors
1 remote errors, element index: 1
0 unevaluated and other errors
first remote error:
Error in h(simpleError(msg, call)): error in evaluating the argument 'y' in selecting a method for function 'intersect': error in eval
uating the argument 'x' in selecting a method for function 'ranges': Subsetting operation on CompressedGRangesList object 'x' produces
a
result that is too big to be represented as a CompressedList object.
Please try to coerce 'x' to a SimpleList object first (with 'as(x,
"SimpleList")').
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

For #2, any clues as to how to resolve this?

Thanks in advance for your time and help!

Cheers,
Asher

Sefi196 · 2023-11-13T02:18:00Z

Hi Asher,
I am having the same issue with a subset of my files.

I had a feeling it had something to do with the size of the bam files.

Can i ask if the bam files your inputting are large (>100GB)?

apsteinberg · 2023-11-13T15:42:15Z

Hi there,

Gotcha, yeah mine isn't that large (it's 55 GB) -- so perhaps it is related to something else? The reason I thought it had something to do with the size though was the part of the error message that read:

CompressedGRangesList object 'x' produces
a
result that is too big to be represented as a CompressedList object.

I also see on the README that it says version 3.25 resolved "issues with large files" -- I'm wondering if maybe this might resolve the issue. I'm trying to update now to this version (Bioconducter version I believe is 3.24).

Please let me know if you figure it out on your end, and I'll let you know if I make progress as well.

Cheers,
Asher

andredsim · 2023-11-14T01:44:30Z

Hi both,

Lets start with 2. first. So that I can understand the issue let me confirm some details, this issue only happens for some of your bam files, and others run successfully? Is there anything different between the samples that work and the ones that do not? Are they all aligned to the same genome?
Could you comment with the code you used to run bambu, set verbose = TRUE, and after it fails running use warnings() and provide that here too.

andredsim · 2023-11-14T02:06:25Z

Regarding the first question, I assume the link you posted is the SG-nex preprint (unfortuantely BioRxiv is down at the moment so I couldn't check).
You can find the description of the different txClassDescriptions here (https://github.com/GoekeLab/bambu#output-description). new First Junction does not necessarily refer an alternative 5' end depending on how you define that, it could share a transcription start site with other transcripts but it does mean a new first exon as the first junction is at a novel position.

We do not output descriptions of alternative splicing events as they are relative to what you are comparing them too. We do have a function compareTranscripts(granges1, granges2) which is unfortunately not yet documented, but you could play around with it and see if the output is useful for you.
It takes 2 granges lists which must be the same length. Each annotation in the first list is compared to the corresponding annotation in the second list and returns a table indicating on splicing differences (ie, first entry in list one is compared to the first entry in list two, then second entries are compared against each other and so on). For example you could set the first list being all the novel transcripts and the second being the major isoform (most expressed) from the corresponding gene. Unfortunately because we have not documented it and do not support this function yet I won't be able to support you much further with the use of this function however, nevertheless I hope it will be of use.

Kind Regards,
Andre Sim

Sefi196 · 2023-11-14T03:34:53Z

Hi Andre,
Thanks for help on this issue,

Yes it appears some BAM files generate this error while others do not. The BAMs were all generated using the same minimap command and mapped to the same reference file. There is nothing 'special' about these files as far as i can tell.

I ran the following command
bambuAnnotations <- bambu::prepareAnnotations(annotation)
bambu_out <- withr::with_package("GenomeInfoDb",
bambu::bambu(
reads = c("C3_Day80_matched_reads_dedup_align2genome.bam"),
annotations = bambuAnnotations,
genome = "/data/scratch/users/yairp/old_discoAnt/discoAnt/ref_hg38/GRCh38.primary_assembly.genome.fa",
quant = FALSE,
discovery = TRUE,
lowMemory = TRUE,
verbose = TRUE,
ncore = 4)
)

bambu::writeBambuOutput(bambu_out, path = "/out/all_together")

in verbose mode bambu generates alot of messages but here are the final few lines

reads count for all annotated junctions: 0 (0%) reads count for all annotated junctions after correction to reference junction: 0 (0%) Finished correcting junction based on set of high confidence junctions in 0 mins. Finished creating transcript models (read classes) for reads with spliced junctions in 0 mins. Finished creating junction list with splice motif in 0 mins. before strand correction, annotated introns: NA (NA%) Junction correction with not enough data, precalculated model is used Model to predict true splice sites built in 0 mins. reads count for all annotated junctions: 0 (0%) reads count for all annotated junctions after correction to reference junction: 0 (0%) Finished correcting junction based on set of high confidence junctions in 0 mins. Finished creating transcript models (read classes) for reads with spliced junctions in 0 mins. Error: BiocParallel errors 1 remote errors, element index: 1 0 unevaluated and other errors first remote error: Error in h(simpleError(msg, call)): error in evaluating the argument 'y' in selecting a method for function 'intersect': error in evaluating the argument 'x' in selecting a method for function 'ranges': Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big to be represented as a CompressedList object. Please try to coerce 'x' to a SimpleList object first (with 'as(x, "SimpleList")'). In addition: There were 50 or more warnings (use warnings() to see the first 50) Execution halted

The version i am running is 3.2.6

hope this helps get to the bottom of this

Thanks for your help
Sefi

andredsim · 2023-11-14T04:00:25Z

Hi Sefi,

Thanks for sharing this. This output is concerning as it says there are no reads for any of the annotated junctions which should not occur normally for a human sample. How many reads does this bam file have? Do you expect there to only be unspliced reads? Do you still have the minimap2 command you used and could you share it? A common error is that minimap was not run with splice aware mode on.
Another cause for this could be the annotations do not line up with the genome, for example if the genome and bam have scaffolds named "chr1" and the annotations have "1". Could you share with me head(bambuAnnotations), there have been cases where certain gff3 formats have issues?
Let me know these first and if these seem normal we can delve deeper.

Kind Regards,
Andre Sim

Sefi196 · 2023-11-14T04:40:22Z

Hi Andre,
Yes that does seem very strange. I expect spliced and unspliced reads.
This bam file contains ~92M primary alignments
I have successful runs with some bam files using the same gtf and genome files so i doubt it has to do with the reference files i am using.

My minimap command:
minimap2 -ax splice -t 16 -k14 --secondary=no --seed 2023 --junc-bed out/D80/tmp_splice_anno.bed12 --junc-bonus 1 -o out/D80/C3_Day80_tmp_align.sam ref_hg38/GRCh38.primary_assembly.genome.fa out/D80/C3_Day80_matched_reads.fastq

head(bambuAnnotations)
GRangesList object of length 6:
$ENST00000000233.10
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | exon_rank exon_endRank
|
[1] chr7 127588411-127588565 + | 1 6
[2] chr7 127589083-127589163 + | 2 5
[3] chr7 127589485-127589594 + | 3 4
[4] chr7 127590066-127590137 + | 4 3
[5] chr7 127590963-127591088 + | 5 2
[6] chr7 127591213-127591700 + | 6 1

seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000000412.8
GRanges object with 7 ranges and 2 metadata columns:
seqnames ranges strand | exon_rank exon_endRank
|
[1] chr12 8940361-8941940 - | 7 1
[2] chr12 8942416-8942542 - | 6 2
[3] chr12 8943405-8943535 - | 5 3
[4] chr12 8943801-8943910 - | 4 4
[5] chr12 8945418-8945584 - | 3 5
[6] chr12 8946229-8946405 - | 2 6
[7] chr12 8949488-8949645 - | 1 7

seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000000442.11
GRanges object with 7 ranges and 2 metadata columns:
seqnames ranges strand | exon_rank exon_endRank
|
[1] chr11 64305524-64305736 + | 1 7
[2] chr11 64307168-64307504 + | 2 6
[3] chr11 64313951-64314067 + | 3 5
[4] chr11 64314239-64314367 + | 4 4
[5] chr11 64314741-64314911 + | 5 3
[6] chr11 64315001-64315270 + | 6 2
[7] chr11 64315707-64316743 + | 7 1

seqinfo: 25 sequences from an unspecified genome; no seqlengths

Hope this helps narrow down the issue

andredsim · 2023-11-14T05:43:29Z

Hi Sefi,

This all looks good, this definitely is narrowing down where the issue may be occurring. Do you happen to remember how many reads were listed for the samples that did work here "reads count for all annotated junctions" in the verbose output? Assuming the samples are of similar size I would expect around ~40M depending on how degraded the sample is.

I doubt this is the cause, as you said you used the same reference files for the samples that worked, but just one last sanity check. What is the output of (in bash) head ref_hg38/GRCh38.primary_assembly.genome.fa. Also have you tried looking at this bam file in a genome browser like IGV to confirm there are reads matching an annotated junction?

Assuming the above look normal, then the issue must lie in the early parts of the code where bambu is reading in the bam file. Are you able to subsample this bam file (perhaps only to only a region of the chromosome for example the area you looked at in the genome browser) and if the issue still occurs attach that sampled bam file here along with the annotations and genome so I investigate how bambu is handling these reads? With a minimal reproducible input set it will make it a lot easier for me to try solve this on myside :) .

Kind Regards,
Andre Sim

Sefi196 · 2023-11-14T06:47:53Z

Hi Andre,
Yes i agree an example would be useful.

I have subset the bam based on a single gene i know is in my data. I have viewed this in IGV and it looks clear to me that there are reads that map to known junctions of this gene. The mapping is a more messy than i expected but i think this should still be fine for bambu isoform discovery. here is what i can see.

Running this subset generates this error

--- Start generating read class files --- GRIA2 Finished creating junction list with splice motif in 0 mins. before strand correction, annotated introns: 35 (0.00982042648709315%) [17:29:25] WARNING: ../..//src/learner.cc:553: If you are loading a serialized model (like pickle in Python, RDS in R) generated by older XGBoost, please export the model by calling Booster.save_model` from that version
first, then load it back in current version. See:

https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html