Explanation of txClassDescriptions and error message for large bam file. #402
Hi Asher, I had a feeling it had something to do with the size of the BAM files. Can I ask if the BAM files you're inputting are large (>100 GB)?
Hi there, Gotcha, yeah mine isn't that large (it's 55 GB), so perhaps it is related to something else? The reason I thought it had something to do with the size, though, was the part of the error message that read: "Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big to be represented as a CompressedList object." I also see on the README that version 3.25 resolved "issues with large files", so I'm wondering if that might resolve the issue. I'm trying to update to this version now (my Bioconductor version, I believe, is 3.24). Please let me know if you figure it out on your end, and I'll let you know if I make progress as well. Cheers,
Hi both, Let's start with 2 first. So that I can understand the issue, let me confirm some details: does this issue only happen for some of your BAM files, while others run successfully? Is there anything different between the samples that work and the ones that do not? Are they all aligned to the same genome?
Regarding the first question, I assume the link you posted is the SG-NEx preprint (unfortunately bioRxiv is down at the moment, so I couldn't check). We do not output descriptions of alternative splicing events, as they are relative to what you are comparing them to. We do have a function for this. Kind Regards,
Hi Andre, Yes, it appears some BAM files generate this error while others do not. The BAMs were all generated using the same minimap2 command and mapped to the same reference file. There is nothing 'special' about these files as far as I can tell. I ran the following command: bambu::writeBambuOutput(bambu_out, path = "/out/all_together"). In verbose mode bambu generates a lot of messages, but here are the final few lines:
The version I am running is 3.2.6. Hope this helps get to the bottom of this. Thanks for your help!
Hi Sefi, Thanks for sharing this. This output is concerning, as it says there are no reads for any of the annotated junctions, which should not normally happen for a human sample. How many reads does this BAM file have? Do you expect there to be only unspliced reads? Do you still have the minimap2 command you used, and could you share it? A common error is that minimap2 was not run with splice-aware mode on. Kind Regards,
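For reference (this is not the command used in this thread), a typical splice-aware minimap2 invocation for long-read RNA data looks roughly like the sketch below; the file names are placeholders:

```shell
# Placeholder file names. '-ax splice' enables splice-aware alignment
# for Nanopore cDNA/direct-RNA reads; sort and index for downstream tools.
minimap2 -ax splice -t 8 reference.fa reads.fastq.gz \
  | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
```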
Hi Andre, My minimap command:
seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000000412.8
seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000000442.11
seqinfo: 25 sequences from an unspecified genome; no seqlengths

Hope this helps narrow down the issue.
Hi Sefi, This all looks good; this definitely narrows down where the issue may be occurring. Do you happen to remember how many reads were listed for the samples that did work, under "reads count for all annotated junctions" in the verbose output? Assuming the samples are of similar size, I would expect around ~40M, depending on how degraded the sample is. I doubt this is the cause, as you said you used the same reference files for the samples that worked, but just as one last sanity check: what is the output of (in bash)

Assuming the above looks normal, the issue must lie in the early parts of the code where bambu is reading in the BAM file. Are you able to subsample this BAM file (perhaps to only a region of a chromosome, for example the area you looked at in the genome browser), and if the issue still occurs, attach that sampled BAM file here along with the annotations and genome so I can investigate how bambu is handling these reads? A minimal reproducible input set will make it a lot easier for me to try to solve this on my side :). Kind Regards,
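The bash command referenced above did not survive the copy-paste. Sanity checks of the kind being suggested, sketched with placeholder file names, might look like:

```shell
# Placeholder file names. Check the header (reference names should match
# the annotation) and confirm that reads are present and mapped.
samtools view -H sample.bam | head
samtools flagstat sample.bam

# Subset to one region to build a minimal reproducible example,
# then index the result.
samtools view -b sample.bam chr1:1000000-2000000 > subset.bam
samtools index subset.bam
```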
Hi Andre, I have subset the BAM based on a single gene I know is in my data. I have viewed this in IGV, and it looks clear to me that there are reads mapping to known junctions of this gene. The mapping is messier than I expected, but I think this should still be fine for bambu isoform discovery. Here is what I can see. Running this subset generates this error:
[17:29:25] WARNING: ../..//src/learner.cc:553: ... for more details about differences between saving model and serializing.
[17:29:25] WARNING: ../..//src/learner.cc:553: ... for more details about differences between saving model and serializing.
[17:29:25] WARNING: ../..//src/learner.cc:553: ... for more details about differences between saving model and serializing.
Junction correction with not enough data, precalculated model is used
[17:29:37] WARNING: ../..//src/learner.cc:553: ... for more details about differences between saving model and serializing.
Sample does not have more than 50 of both read class labels

The BAM file is a little big to send here. Is there another way I can send it to you, or should I subset further? Thanks
Hi, This seems to be a different error than the first case; bambu seems to have worked here and finished transcript discovery. In this case it now says "reads count for all annotated junctions: 84917 (0.941200594090133%)" instead of 0 like it said previously. Was this subset taken from the same BAM file that caused the error previously? If so, it seems some alignments might be causing the issue, or maybe the BAM file is formatted in a way bambu doesn't expect. I notice you named the file .dedup; is it possible the deduplication step changed something? You could try splitting the BAM file by chromosome to divide and conquer, and find the problematic alignments that replicate the original error message. My suspicion is that an accessory scaffold is causing the issue. Let me know if you are able to replicate it, and then we can organize a way to transfer the BAM file so I can have a closer look. Kind Regards,
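The divide-and-conquer splitting suggested above could be sketched as follows (placeholder file names; assumes the BAM is coordinate-sorted and indexed):

```shell
# One output BAM per reference sequence listed in the header, so the
# problematic chromosome or scaffold can be isolated.
for chrom in $(samtools idxstats sample.bam | cut -f1 | grep -v '^\*$'); do
  samtools view -b sample.bam "$chrom" > "split_${chrom}.bam"
done
```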
Hi Andre, I'm glad I got it working. Thanks for your help!
Hi, I am glad to hear that worked. I will have to think about why Bambu would crash in the presence of other alignments, because theoretically it shouldn't happen; hopefully I can find the cause and solve it, or at least return an informative warning message. Thanks for providing all that information! @apsteinberg were you able to resolve your issue as well?
Hi Andre, Unfortunately I haven't been able to resolve the issue on my end yet, but I also haven't had adequate time to try some of the things you suggested. I was, however, able to run bambu in verbose mode as you suggested, and received a similar error to @Sefi196: bambu was also telling me that I had no read counts. I did check my BAM file, and there are in fact reads there, and the header is normal. I aligned the reads using minimap2 within the nanoseq pipeline, and I need to check whether there were any differences in the log messages from nanoseq for this particular BAM versus some of the others. Also, thank you for the details about the transcript descriptions! This is very helpful. I intend to work on this more later this week, but if you'd like to close the issue for now, I can re-open it once I've had a chance to home in on what the issues in my BAM file are. Thank you for your help! Best,
Hi Asher, Thanks for getting back. I will keep this issue open for now, as I want to either solve the issue, in case it's spurious alignments causing bambu to crash, or at least provide a graceful crash with an informative error message. Kind Regards,
Hi Andre, Sounds good! I will keep you posted as I progress on this. Thanks again for all your help. Cheers, |
Hi @andredsim, I also ran into this error, at:

generanges[subjectHits(ov)[multiHits]]
Error in METHOD(x, i) :

It seems this is due to the excessively large total number of 'grange' elements produced here (after unlisting):

a = as(generanges, "SimpleList")[subjectHits(ov)[multiHits]]
sum(lengths(a))

I suspect this number is too large for it to be stored as a compressed range list object, so I split the intersection and merged the results:

filteredMultiHits = split_intersection(grl, geneRanges, ov)

This problem also arises in another section, during quantification: calculateDistToAnnotation -> findspliceoverlapbyDist -> subject[subjectHits(olap)]. I've used a similar approach to split 'olap' and merge it, and now the code runs smoothly. I've also attempted to convert it into a SimpleList, but the speed becomes very slow. Hope this information can help :-)
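If useful, the chunked-subsetting workaround described above could be sketched in R along these lines; this is an illustrative assumption about the approach, not bambu's actual split_intersection code:

```r
## Sketch only: subset a large GRangesList in chunks so that no single
## subsetting call exceeds the CompressedList size limit, then recombine.
subsetInChunks <- function(grl, idx, chunkSize = 1e6) {
  groups <- split(idx, ceiling(seq_along(idx) / chunkSize))
  pieces <- lapply(groups, function(i) grl[i])
  do.call(c, unname(pieces))
}

## Chunked equivalent of generanges[subjectHits(ov)[multiHits]]:
## filteredMultiHits <- subsetInChunks(generanges, subjectHits(ov)[multiHits])
```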
Hi @andredsim, Apologies for the delay on my end. I finally got around to trying what @Sefi196 tried successfully here:
And it worked for me as well, which is very exciting! To answer your question about how many reads were aligned to scaffolds and accessory chromosomes: it was about 29 million reads, or roughly 30% of the aligned reads in my BAM. However, I still have a few questions. It is still unclear to me why these reads would cause Bambu to fail. The reads were aligned to a fasta file that included the scaffolds, and Bambu was then given this same reference fasta and GTF. Is the implication that these reads were poorly aligned? It would be good for me to understand, both to control for this source of error in the future and to offer an explanation to collaborators. I can see if it's possible to transfer a partial BAM if that would be useful. Thanks again for your time and help! Cheers,
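For completeness, restricting a BAM to the primary chromosomes (the step that resolved the issue here) can be sketched like this; the chromosome naming is an assumption about the reference used:

```shell
# Keep only alignments to chr1-chr22, chrX, chrY and chrM, dropping
# accessory scaffolds; requires an indexed BAM.
samtools view -b sample.bam $(seq 1 22 | sed 's/^/chr/') chrX chrY chrM \
  > primary_only.bam
samtools index primary_only.bam
```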
Hi Asher, I am glad you were able to get it to run. As to why it causes Bambu to fail, I cannot say for certain, as as long as those scaffolds were in the fasta during alignment it shouldn't cause a problem, at least as we intend it. From @HongYhong's amazing work, it could be that after removing 30% of the reads you moved below the threshold that triggers the GRangesList bug he mentioned; however, I have tested Bambu with far more reads than your particular file has, so it is hard to say. (By the way, Hong Yhong, we are looking at implementing this fix; we just need to find the time to test it!) If you could send me a partial BAM containing the reads mapped to these scaffolds that cause this issue, or host the full BAM somewhere so I can try to replicate your bug, I will gladly look into it and try to find an explanation for you. Kind Regards,
Hi Andre, Thanks! Got it, this makes sense. Thank you for offering to troubleshoot the BAM file further. I spoke with my manager, and unfortunately I think it would be difficult for us to share these data at this stage due to some stringent confidentiality policies at our institution (we work at a hospital). However, what I can do is try splitting the BAM file into one BAM per accessory scaffold and determine which one yields the error. I will send further updates if I am able to pinpoint the source of the issue (in case it's useful for further development of Bambu). I will also see if I can try to implement @HongYhong's fix. As you mentioned, though, I suspect something else may also be going on, since originally, along with the CompressedGRangesList error, I was also getting the same error as @Sefi196: Thank you again for all your help. Cheers,
Hi All,

Error: BiocParallel errors
1 remote errors, element index: 1
0 unevaluated and other errors
first remote error:
Error in h(simpleError(msg, call)): error in evaluating the argument 'y' in selecting a method for function 'intersect': error in evaluating the argument 'x' in selecting a method for function 'ranges': Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big to be represented as a CompressedList object.
Please try to coerce 'x' to a SimpleList object first (with 'as(x, "SimpleList")').

Is there any update on how to fix this issue? Thanks, Bambu team
Hi, Sorry for the delay in replying to this one. I also wanted to know whether, on these new files, you were still getting the following in your output: "reads count for all annotated junctions: 0 (0%)". Please let me know how this goes! Kind Regards,
Hi Andre, I am not getting the "reads count for all annotated junctions: 0 (0%)" message, probably because I am only mapping to the correct chromosomes this time. I am happy to share the BAM, but it is >100 GB. If you're interested in taking a look at it, let me know how I can send it over to you. Thanks again for all the help, Sefi
Hi Sefi, Glad to hear this works. I really need to find the time to update this branch further and get it through a pull request, then. Yes, I would really like to give your BAM file a go. Could you write to me at andre_sim[at]gis.a-star.edu.sg, and I can reply with details on where you can upload the large file. Kind Regards,
Hi @andredsim, Sorry for the late reply! I just gave @HongYhong's suggested fix a try and it worked great! Thank you both for helping with this solution; I'm really excited to have this resolved. Best wishes,
Hi there,
Thank you again for helping me the other day with the issues I was having. I've been able to run bambu successfully on some of my samples, and I have a couple of follow-up questions (I hope you don't mind me asking here, happy to correspond via email if this is easier):
I would like to now tap into bambu's really awesome isoform analysis tools. I saw that you have in the output a "txClassDescription", which describes what is novel about the isoform. I wanted to do an analysis of full length isoforms for my samples as your team has done in Figure 5a, d, and e of the following paper: https://www.biorxiv.org/content/10.1101/2021.04.21.440736v1
And I was wondering how the txClassDescription codes relate to some of the events you've described there, if at all? For example, does a txClassDescription code of "new First Junction" correspond to what you describe as an "alternative 5' end" in the paper? And further, are exon skipping events encompassed in these class codes? If not, it would be wonderful if you could point me to where I may be able to classify my transcripts in this way.
I am encountering an issue with one of my bam files through bambu, and I am not entirely sure why. I am getting the following error:
--- Start generating read class files ---
Error: BiocParallel errors
1 remote errors, element index: 1
0 unevaluated and other errors
first remote error:
Error in h(simpleError(msg, call)): error in evaluating the argument 'y' in selecting a method for function 'intersect': error in evaluating the argument 'x' in selecting a method for function 'ranges': Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big to be represented as a CompressedList object.
Please try to coerce 'x' to a SimpleList object first (with 'as(x, "SimpleList")').
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted
For #2, any clues as to how to resolve this?
Thanks in advance for your time and help!
Cheers,
Asher