-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent position #34
Comments
Not a transvar author, but chiming in anyways as it's an easy one for me:
At codon 175/176 of NM_001164277.1 there is a difference in the
transcript mapping between UCSC and NCBI. You can see an insertion in the genome
relative to the transcript. After this position, the numbering of the
codons changes between NCBI and UCSC. So this may possibly be due to
transvar using the UCSC
Refseq mapping and the other tools using the the NCBI mapping. I don't know
which mapping is correct or what's exactly happening at the basepair
level, it may be interesting for you to investigate the details manually, but
all variants after codon 175 will show this -1 difference between
different tools.
Here is a UCSC Session link that shows the various tracks (mostly the
"NCBI refseq > UCSC Refseq" subtrack) and zoomed around this position:
http://genome.ucsc.edu/s/Max/variantPosDiff%2Dtransvar
To the transvar authors: UCSC provides the NCBI mapping in the
identical format, so it should be very easy to change the
RefSeq/genome mapping in Transvar. The NCBI mapping is the
"ncbiRefseq" track, the UCSC mapping is the "refGene" track.
Both text files are available on hgdownload.soe.ucsc.edu. As I said, I
don't know which one is "correct", but the NCBI mapping is
more consistent with other tools.
|
Thank you for your detailed explanation. Now I'm trying to replace the original refGene by NCBI mapping |
You can tell me which table you downloaded and from where and/or tell me if
there is any difference in the file format. I believe the file format is
exactly identical but of course I could be wrong.
…On Wed, May 8, 2019 at 12:36 PM bitsyamagu ***@***.***> wrote:
Thank you for your detailed explanation.
I've got the circumstance.
Now I'm trying to replace the original refGene by NCBI mapping
on a local copy of TransVar which I installed to an local server.
But it doesn't work yet.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#34 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AACL4TJJKZU6WYJ4TRV6CVLPUKUJ3ANCNFSM4HLFIMTQ>
.
|
Actually, I was wrong, the screenshot I sent you shows that this is not a
problem of NCBI vs. UCSC mapping but rather a difference between UCSC genes
and RefSeq genes. Not sure what really happened here, it seems like the
transvar authors should comment on this case since only they know how the
original mapping was made and how they count the nucleotides in deletion
cases like this that result in a micro-intron with a split codon over it.
…On Wed, May 8, 2019 at 12:36 PM bitsyamagu ***@***.***> wrote:
Thank you for your detailed explanation.
I've got the circumstance.
Now I'm trying to replace the original refGene by NCBI mapping
on a local copy of TransVar which I installed to an local server.
But it doesn't work yet.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#34 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AACL4TJJKZU6WYJ4TRV6CVLPUKUJ3ANCNFSM4HLFIMTQ>
.
|
Sorry for being silent until now. I have taken a look at this case. It seems this is brought about by the Patch 13 of GRCh37/hg19 (happened in 2018), which includes a deletion in the reference genome upstream the SLC37A4 locus. See the following for the update info and screenshot https://genome.ucsc.edu/cgi-bin/hgc?hgsid=725242099_iR8o5gc4tCCAUvNxq78k2BYTpEwI&c=chr11&l=118898423&r=118898453&o=118849110&t=118957986&g=hg19Patch13Patches&i=jh159138.1In fact if you look at the original RefSeq annotation gtf, there is a Note that seems to pointing to this discrepancy. The problem is that there is no specification of the alignment to the patch included in RefSeq gene. I would suggest you either switch to hg38 or use Ensembl annotation which treats the gap as an intron. I am not sure how from the transvar perspective whether we can completely solve this problem since the refseq gene doesn't include this particular gap and patches aren't meant to be applied (https://www.biostars.org/p/45788/). But certainly let me know if you have better suggestion. |
In short, this is an extremely rare situation where the genome reference is patched to introduce a deletion in the exon sequence of a gene causing the gene to "frameshift". I have not previously encountered a situation like this before. |
There are two errors that many software in this space gets wrong.
You might be interested in UTA (Universal Transcript Archive), which contains transcripts from NCBI, UCSC, and Ensembl, and their alignments to multiple assemblies using coordinates from the source databases (I don't realign them myself). A public instance of UTA is available at uta.biocommons.org:5432 (postgresql) and as a docker image. Start here: https://hgvs.readthedocs.io/en/stable/installation.html#local-installation-of-uta. UTA is the only comprehensive source of this information that I know of (and it's about 9 months out of date :-( ). For this specific case, UTA shows the following exon coordinate differences:
ord is the exon ordinal number, 0-based Note that starting in exon 5 (ord 4) that the alignment starts going belly-up. Variants up to n.1283 (c.526, I think) should map consistently (1283=1138+145). When splign and BLAT disagree on exon coordinates, splign's alignment is nearly always more parsimonious with the genomic sequence than BLAT's and therefore more likely correct. Such is the case above: after the discrepancy in exon 5 above, splign's alignments are more plausible that BLAT's as judged by the cigar strings. P.S. VariantValidator is a web interface to the hgvs package, which provides most of the algorithmic support. hgvs uses UTA for exon structures. |
I don't think the fact that this location was later patched is relevant
here, we're looking at the un-patched genome sequence. I think it's enough
to say that it's a difference between the Refseq sequence and the genome
sequence. I may be wrong, but this is not extremely rare, we found on the
order of hundreds of transcripts with genomic indels in hg19 (Reece
probably knows the exact number by h(e)art... :-)
… |
@maximilianh @reece I agree with you both. I was not aware there are many cases with indel in the alignments. It has just dawned on me that it makes a lot of sense when there are polymorphism in the dna that affects the structure of the transcript. I will take a closer look at VariantValidator/hgvs/UTA. It seems there is a lot that can be learned. Unfortunately with the current implementation of TransVar, the coordinates are inferred from the gtf. Hence if there are unannotated indels (such as in the case of RefSeq), there is yet a way to capture. But (correct me if wrong) Ensembl annotation doesn't seem to be affected by this issue since they explicitly annotate the indels? Thanks for the insight from you both! |
Thanks everyone for your replies and considerations. I haven't been aware of these kind of cases, so I think our local repository which we are making also has similar problems, and should be checked and be improved. |
The gff files should work fine as a source. The cigar string appears at the end of the line only for imperfect matches, like this:
Ensembl isn't affected by this issue, but it has a different challenge: Because transcripts are derived from spans in the genomic sequence, it's harder to represent transcripts in which the genomic sequence contains a sequencing error. It's been a while since I looked at this, but I'm pretty sure that transcripts in NEFL and ALMS1 had this challenge. My information here is stale, so you should consult with Ensembl folks. |
Thanks again for the insight @reece. Makes a lot of sense! |
When I submitted a mutation:
NM_001164277:c.925_928delinsTC
to Transvar
with these options:
Reverse annotation cDNA, GRCh37, RefSeq
I got "p.G309Sfs*3" and "chr11:g.118896734_118896737delinsGA"
for this mutation.
Although I checked this mutation on the UCSC Genome Browser,
the position chr11:g.118896734_118896737delinsGA is 308 of
NM_001164277(attached png).
Other similar services, Mutalyzer and VariantVaridator, returns "309"
, and it is consistent with the genome browser.
The text was updated successfully, but these errors were encountered: