Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map transcript coordinates to genome coordinates #203

Open
dariober opened this issue Jan 5, 2023 · 3 comments
Open

Map transcript coordinates to genome coordinates #203

dariober opened this issue Jan 5, 2023 · 3 comments

Comments

@dariober
Copy link

dariober commented Jan 5, 2023

I have the coordinates of domains mapped to transcripts so that the coordinates are relative to transcripts. Now I would like to transfer these transcript-coordinates to genome-coordinates. I wonder if there is some magic in gffutils that makes this reasonably easy. Hopefully an example will clarify:

Say my genome gff file contains this transcript with 3 exons:

import gffutils

data = """\
chr1 . gene 11 30 . . . ID=g1
chr1 . mRNA 11 30 . . . ID=t1;Parent=g1
chr1 . exon 11 14 . . . ID=e1;Parent=t1
chr1 . exon 19 22 . . . ID=e2;Parent=t1
chr1 . exon 27 30 . . . ID=e3;Parent=t1"""

db = gffutils.create_db(data.replace(' ', '\t'), ":memory:", from_string=True)

Which would look like this as text ideogram:

          GGGGGGGG_g1_GGGGGGGG
          EEEE----EEEE----EEEE
1         11        21        31 

I have a domain that in transcript coordinates start at position 7 and ends at position 10. So it would look like this:

          GGGGGGGG_g1_GGGGGGGG
          EEEE----EEEE----EEEE
                    ||----||
1         11        21        31

I would like a function that given the transcript ID and the domain coordinates in transcript-space returns the coordinates in genome-space:

transcriptToGenome(db, txid='t1', tx_start=7, tx_end=10)
# returns something like:
chr1	.	dom	21	28	.	.	.	ID=d1;Parent=t1
chr1	.	dom	21	22	.	.	.	ID=d1.1;Parent=d1
chr1	.	dom	27	28	.	.	.	ID=d1.2;Parent=d1

Before I bang my head on it, I wonder if an easy solution already exists. Thanks!

@daler
Copy link
Owner

daler commented Jan 5, 2023

Interesting, I agree this would be useful, but such magic does not currently exist in this package!

I think the first missing piece would be to handle splicing properly, I don't really have data structures in place for that so it wouldn't be trivial. If you do end up getting something working and can provide some example code, it would be great to include in gffutils.

@dariober
Copy link
Author

dariober commented Jan 5, 2023

Thanks for getting in touch so quickly. For now I have this horrible implementation:

def transcriptToGenome(db, txid, tx_start, tx_end, featuretype='exon', child_featuretype='.', parent_featuretype='.'):
    features = list(db.children(txid, featuretype=featuretype, order_by='start'))
    if len(features) == 0:
        raise Exception(f'No feature of type "{featuretype}" found for ID "{txid}"')

    tx = db[txid]
    txToGenomeMap = {}
    out = []
    tx_pos = 1
    for feature in features:
        genome_pos = range(feature.start, feature.end + 1)
        for i in genome_pos:
            # We map tx coordinates to genome coordinates in a dictionary like:
            # {1:11, 2:12, 3:100, 4:101, ...}
            # That's horrible but easier later to query
            txToGenomeMap[tx_pos] = i
            tx_pos += 1
        
        if tx_start < tx_pos:
            if txToGenomeMap[tx_start] < feature.start:
                gstart = feature.start
            else:
                gstart = txToGenomeMap[tx_start] 
            
            if tx_end >= tx_pos:
                gend = feature.end
            else:
                gend = txToGenomeMap[tx_end]
            out.append(gffutils.Feature(tx.chrom, '.', child_featuretype, gstart, gend))
            if tx_end < tx_pos:
                break
    parent = gffutils.Feature(tx.chrom, '.', parent_featuretype, out[0].start, out[-1].end)
    out.insert(0, parent)
    return(out)

Using toy data from above:

out = transcriptToGenome(db, 't1', 7, 10, 'exon', child_featuretype='chunk', parent_featuretype='dom')
for f in out:
    print(f)

chr1	.	dom	21	28	.	.	.	
chr1	.	chunk	21	22	.	.	.	
chr1	.	chunk	27	28	.	.	.	

One would need to fill in the source column, the attributes, strand, etc but that's relatively easy and it depends on context.

@mdshw5
Copy link
Contributor

mdshw5 commented Jan 6, 2023

I'd be really glad to have a clean gffutils based solution for this as well! A while back I tried to put together a utility to convert transcript coordinate SAM files to genomic coordinates (https://github.com/mdshw5/transcoorder) but it's not really correct and doesn't handle reads spanning exon-exon junctions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants