Pandora index failed on prg output #358

Open
mthang opened this issue Dec 24, 2024 · 8 comments

@mthang

mthang commented Dec 24, 2024

I am experiencing the following error while running pandora index and have been unable to find a solution. At first I suspected a lack of RAM, but the error still appears after increasing the RAM to 24G. The input PRG FASTA file contains 15625 sequences, and I have little idea how much memory is required. After a closer look at my input file, I realized that the sequence(s) causing this issue contain numbers. I would like to know how pandora index handles these sequences.

ERROR
terminate called after throwing an instance of 'FatalRuntimeError'
what(): [FATAL ERROR]: In conversion from linear localPRG string to graph, splitting the string by the next var site resulted in the wrong number of intervals. Please check that site numbers are flanked by a space on either side. Or perhaps ordering of numbers in GFA is irregular?! Size of partition based on site 5 is 1

@iqbal-lab
Collaborator

Can you give a bit more detail? What exactly was the file it failed on? Would it be possible to share it, on this issue if it is small, or via some other means? Also, I'm afraid we are on vacation now until 6th January. Will reply ASAP afterwards.

@mthang
Author

mthang commented Dec 24, 2024

@iqbal-lab Thank you for the speedy response! I didn't expect to receive a response so quickly. I understand it's the holiday season, so this issue can wait. I ran pandora index on a file generated by the make_prg program with the from_msa parameter. My understanding is that make_prg generates a graph-based FASTA from a multiple sequence alignment file. What is interesting is that the output of make_prg contains numbers in the sequences. I expected pandora index to work with graph-based sequence files. This is also mentioned in this post https://cpang.netlify.app/post/testing-pandora-core/ . It would be great if you could have a look and let me know how to resolve this. The pandora program I used is the binary version (0.12.0). Here's the file I used:
Uploading prg.zip…

@iqbal-lab
Collaborator

The output of make_prg does contain numbers. The format, I now realise, is not described on the GitHub and we will fix that. But if you look at Figure 2 here

https://www.biorxiv.org/content/10.1101/059170v2.full

you'll get the idea. Will be in touch in the new year!

@iqbal-lab
Collaborator

Hi there @mthang (cc @LeahRoberts) - sorry, your upload of the pandora PRG has failed, could you try again? Or give it to Leah to give to me?

@iqbal-lab
Collaborator

Hi @mthang - apologies for the slow response on this - I am now engaged! The error message suggests that the PRG is mal-constructed. Could you share the input data that the PRG came from, the command you used, and the pandora version?

@iqbal-lab
Collaborator

This is interesting. The error is thrown here

"In conversion from linear localPRG string to graph, splitting the "

and the error, reported in the original bug report - top of page I mean - says

"Please check that site numbers are flanked by a space on either side"

I note that one of the genes in the fasta looks like this

>DMEEDF_00155.msa ATGCTTCCCCGTTTTGCCGACATTTTTCAGCAGGGAAACCGCTGGCTTAACTGGCTGGAG AAACAACCGGAAGGTTCAGTGCGTCCGGTAGTCATTGAGTCTGTGACAAAAATCATGGCC TGCGGGACCACGCTGATGGGGTACACACAGTGGTGCTGTTCATCTCCGGACTGCAGCCAC ATAAAAAAG5A6G5TCTGCTTCCGGTGTAAAAGTCGCTCCTGCCCGCACTGCGGAGTGAA GGCTGGCGCACAGTGGATACAGTATCTGCTGAGTCTGGTTCCCGACTGTCCGTGGCAGCA TATTGTGTTCACACTTCCCTGCCAGTACTGGTCCCTGGTGTTCCACAACCGG7T8A7GGT TACTGGCAGAGATGAGCCGCATTGCTGCGGATGTGATACAGGAAATCTGCCGCCAGGCAG ATGTGGTGCCGGGGA9G10T9ATTCACGGTGATCCACACATGGGGACGTGACCAG11C12 G11AGTGGCATCCGCACATTCACCTGTCGACAACGACCGGCGGCGTGACATCAGACCACA CCTGGAAAAACCTTCATTTTTACGCCCGTAAGGTGATGAGTATGTGGCGTTACCGGATAA CGCGGTTACTGTCACGGAAATATCCGGACCTGGTGATACCGGATGCGCTGGCAGCAGAAG GAAGCAGTAAACGGGACTGGAATCGCTTCCTGGACAGTCATTACCGGCGGGGCTGGAATG TCAACGTATCCCGGGTGATGGATAACGCCACACATGTGGCGGTGTACTTCGGCTCTTACC TGAAAAAACCGCCGGTGCCGATGAGCCGTCTGGAGCACTATGCTGGTCAGGATGAAATTG GTCTGCGTTACAACAGTCACCGGACAAAACGGGAAGAATACCTGGTGATGAGTGGTGATG AGTTTATGGAAAGGTTCTCCTGGCATGTGGCGGATAAGGGGTTCCGTATGGTGAGGTACT ACGGTTTCCTGAGTCCGGTGAAGCGCCGGTTACTGGAAGATGTTGTGTACGTCATAACGG AGACGGTGAGAAAGACGGCGATGCAAATCAGGTGGAGAGGGATG13A14T13ATCAGCGG TTACTGAAGGTTGACCCGCTGAAGTGCATCCTGTGCGGAGGTCAGATGCGTTTTACGGGG CTGAAGCGGGGCTACCGTCTGACAGAGCTGGTCCTGATGCATGAGCCACTGGCGCAACAG CGGGTGTG15CGGCTGA16TGGCTGA15

Note that it ends in the number 15. I need to check whether this is permitted/expected from make_prg.
I guess there is no alternative when it is given alleles which differ at the end. This business of needing a space at the end is something @leoisl wrote and I need to dig into why.
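
To illustrate the "flanked by a space on either side" hint in the error message, here is a minimal Python sketch. It is not Pandora's code: it assumes the parser splits the linear PRG string on the site number with a space on each side, and the helper names `partition_by_site` and `unflanked_markers` are made up for this example. The short test strings are modelled on the first site of the gene above.

```python
import re


def partition_by_site(prg: str, site: int) -> list[str]:
    """Split a linear PRG string on ' <site> ' (the number with a space on each side).

    Assumption for illustration only: a well-formed site gives 3 pieces
    (before the site, the alleles, after the site).
    """
    return prg.split(f" {site} ")


def unflanked_markers(prg: str) -> list[tuple[int, str]]:
    """Return (offset, digits) for every digit run lacking a space on either side."""
    bad = []
    for m in re.finditer(r"\d+", prg):
        before = prg[m.start() - 1] if m.start() > 0 else ""
        after = prg[m.end()] if m.end() < len(prg) else ""
        if before != " " or after != " ":
            bad.append((m.start(), m.group()))
    return bad


unspaced = "AAAAAAG5A6G5TCTGC"        # markers glued to the bases, as in the record above
spaced = "AAAAAAG 5 A 6 G 5 TCTGC"    # the same site with space-flanked markers

print(len(partition_by_site(unspaced, 5)))  # 1
print(len(partition_by_site(spaced, 5)))    # 3
print(unflanked_markers(unspaced))          # every marker is reported
print(unflanked_markers(spaced))            # []
```

Under that assumption, a marker written as `G5A` is never found by the split, so the partition has a single piece. A marker at the very end of a record is also reported by `unflanked_markers`, because nothing follows it.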

@iqbal-lab
Collaborator

Just for the record, look at this crazy gene

>EKIOKO_00070.msa ATGACGGT5T6G5ATTTCGAC7T8C7GCGATCGA9C10T9CGCGACAGCGA11CAGCTTC AAG12GACCTTCAAA11GCCAATGCCA13GCAAAAACAAAGCCCTG14TCAAGAACAAGG CGCTT13ATCGACGA15ACTCCACAATCGTTCGGCGAAA16GCTCAGCGAGCGCTCGGCA AAG15GCGCGCGAGGGCGG17ATCGCAGACAGCGCGCGAGCGCCATACCGGCAAAGGTAA GCTGCTGCCG18CGCGCAATCGGCCCGTGAACGTCACACAAGCAAGGGCAAGCTCCTGCC C17CGCGACCGCAT19CCAGCTTTTGC20TCAGCTGCTGA19TCGATGCCGGCAGCCC21 G22C21TTCCTGGAG23ATCGGCA24GTCGGCG23CGCTGGC25A26G25GCCAACGG27 C28A27ATGTATG29AT30GC29GACGAGGCGCC31C32G31GGCGCCGGCATCATATC3 3A34G33GGCATCGGCCGCGT35TG36AT35CCGGCCG37CGAGGTG38TGAGGTC37AT GATCGTCGCCAATGACGCGACGGT39G40A39AAGGGCGGCGCCTATTT41C42T41CCG ATGACGGTGAAGAAACATCTC43C44A43GGGCGCAGGA45G46A45ATCGCCATGCAGA ACCGG47T48C47TGCCCTG49TCTCTATCTTGTCGAT50CCTTTATCTGGTCGAC49AG CGGCGGCGCCAATCT51T52G51CCGCATCAGGCCGA53A54G53GTCTTTCCCGACCGC GA55C56T55CATTTCGGCGCGATCTTCTACAACCAGGC57C58G57CAGATGTCGGCCG AAGGCATT59CCGCAGATC60GCCCAGATT59GCCTGCGTCATGGGAAGCTGCACCGCCG G61TGGC62CGGT61GCCTATGT63GCCCGCC64TCCCGCG63ATGTCCGACGAAACGGT 65C66G65ATCGTGCGCAATCAGGGCACGATCTTCCTTGCCGGCCCGCCGCTGGT67GAA G68CAAA67GCCGCGACCGGCGA69GATCATTTCA70AATCATCTCG69GCCGAAGA71A 72G71CTCGGCGGCGCCGA73G74A73ACCCATGGCCGCCGCTCCGGCGTCGTCGATCAT GT75GGCCGAA76CGCCGAG75AACGACGAACATGCGCTG77TTGCTCGTT78CTTCTGG TG77CGCGATATCG79CAGCCAC80TCGCCAG79CCTCAACAGCGTGAA81ATCA82GTC G81GTCGATAT83CGACCTGCAGCCGCCA84AGACATTCAATCCCCC83CGGCCGC85TG AAACTCGATCCCGAGGATCTCTGCGGCCTCATC86CGAAGCTCCACCTCGAGGACCTTCA CGGCATCATT85CCGGAGGA87T88C87GTGCGCTCGCC89CTATGAT90GTATGAC89G TGCGCGAGGT91C92G91ATCGGCCG93G94A93ATCGTCGA95T96C95GGCTCGGAAC TGCA97C98T97GAGTTCAAGCC99GCTCTATGGCG100ACTCTACGGCA99CCACGCT1 01G102C101GTCTGCGGCTTCGCCCGCATCTGGGGCATGCC103C104T103GTCGCCG TCATCGCCAATAACGGCGTGCT105G106C105TTTTCCGAAAG107C108T107GCGCT GAAGGG109GGCG110CGCA109CATTTCAT111C112T111GAGCTCGCCTGCCAGCGC CGC113ATACCTTTGCTGTTT114GTGCCGCTGCTCTTC113CTGCAGAA115C116T11 5ATTTCCGG117C118G117TTCATGGTCGGCGGCCGCTA119T120C119GAGGCCGGC GGCATCGCCAAGGATGGGGC121A122G121AAGCTGGTGACGGCGGT123T124G123G 
CGACCGC125GAG126CAC125CGTGCCGAA127GGTCACG128AGTCACC127GTCATC ATCGGCGGCAGCTTCGGCGCCGGCAATTACGGCATGTGCGGCCGCGCCTATCGCCCGCGC TT129T130C129CTCTTCACCTGGCCGAACAGCCGCATCAGCGT131G132C131ATGG GCGGCGAACAGGCGGCCTC133AGTGCTCGCCACG134GGTGCTTGCCACC133ATCCGC CG135AGACTCCATGGAG136CGACGCGATGGAA135GCGCGCGG137TGAGA138CGAG G137ATTGGCC139TATTGAAGAGGAGGAGGCC140GGCCGCCGAGGAAGAGGCG139TT CAAGGCGCCGATCCG141CGCCGGT142TGCGGGC141TACGAGGCCGAGGGCAATCCCT ATT143ATGCCACGGCT144TTGCCACAGCC143CGCCTCTGGGACGACGGCATCATCGA 145T146C145CCGCGCCAGAC147GCGG148ACGC147GATGTGCTGGG149CCTT150 TCTC149GCCTTTTCCGCCTGCCTGAATGCGCCGATCCCGAAAGGGCCGCGCTTCGGCCT 151G152A151TTCAGGATGTGA

all the way up to 152!

@iqbal-lab
Collaborator

"Size of partition based on site 5 is 1" means it is complaining... actually that's weird, it's complaining at the first site.
@mthang, have you tried running your command on subsets of the PRG, to see which gene is problematic? It's tempting to break the PRG into 15625 different PRGs, one per gene, and run your command to see which ones break (obviously not the ones with no numbers, but I think it will be specific gene(s)).
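
A rough Python sketch of that per-gene split, in case it is useful. It assumes the PRG file is a plain multi-record FASTA with `>` header lines; the function names and the `.prg.fa` output suffix are made up for this example, and records without any digits are skipped since they have no variant sites.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence_lines) for each record of a multi-record FASTA."""
    header, lines = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line[1:], []
            elif header is not None and line:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_prg_fasta(prg_fasta, out_dir):
    """Write one file per record that contains site numbers; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for header, lines in read_fasta(prg_fasta):
        # Skip genes with no numbers: they have no variant sites to trip over.
        if not any(ch.isdigit() for line in lines for ch in line):
            continue
        # File name from the first word of the header; duplicate names would overwrite.
        name = (header.split() or ["unnamed"])[0].replace("/", "_")
        target = out / f"{name}.prg.fa"
        # Sequence lines are written back unchanged, including any spaces.
        target.write_text(f">{header}\n" + "\n".join(lines) + "\n")
        written.append(target)
    return written
```

The same pandora index command can then be run on each file in the output directory to see which ones fail.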
