create cigar-based read filter #588

Open
akiezun opened this issue Jun 22, 2015 · 6 comments

Comments

@akiezun
Contributor

akiezun commented Jun 22, 2015

feature request from @vdauwera (from #429)
"Feature request: add ability to recognize a cigar pattern (e.g., to select reads with insertions > 10 bases, or reads with soft-clips, etc.)."

@vdauwera please write an example command line you'd like to be able to use (or a list of all the patterns you want to be able to filter on). Assign to me when done.

@vdauwera
Contributor

Ticket in gsa-unstable: https://github.com/broadinstitute/gsa-unstable/issues/832

If it gets implemented there we'll be sure to fwd-port to Hellbender as well.

@droazen droazen added Engine and removed CLI labels Mar 23, 2017
lbergelson pushed a commit that referenced this issue May 31, 2017
Miscellaneous boy scout rule activities
@droazen droazen assigned jonn-smith and unassigned vdauwera Jun 7, 2017
@droazen
Contributor

droazen commented Jun 7, 2017

Re-assigning to @jonn-smith, as this might be a fun one.

@droazen droazen added this to the engine-4.0 milestone Jun 7, 2017
@jonn-smith
Collaborator

@vdauwera can you provide some more examples of what kinds of cases you'd like to have handled?

@vdauwera
Contributor

There were some good basic examples in the original ticket:

  • get all the contiguously aligned reads (e.g., xxM)
  • get reads with soft clipping (e.g., xSxxM, could be reads with partial adapter sequence still left after trimming)
  • get reads with insertions (e.g., xxMxxIxxM, could be spliced reads, e.g., reads spanning exon-exon, or intron-intron junction)
  • get reads with deletions (e.g., xxMxxDxxM, could point at SV)

Those would be the basic must-haves.
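
For illustration only, the four must-have cases above can be sketched as simple predicates over a parsed CIGAR string. This is a minimal Python sketch, not the filter's actual API: the function names (`parse_cigar`, `is_contiguously_aligned`, `has_soft_clip`, `has_insertion`, `has_deletion`) and the `min_length` parameter are hypothetical, and treating `=` and `X` as aligned alongside `M` is an assumption.

```python
import re

# One CIGAR element: a length followed by a SAM operator letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5S71M' into [(5, 'S'), (71, 'M')]."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def is_contiguously_aligned(cigar):
    """True if every element is an alignment match (assumption: M, = or X)."""
    return all(op in "M=X" for _, op in parse_cigar(cigar))


def has_soft_clip(cigar):
    """True if any element is a soft clip."""
    return any(op == "S" for _, op in parse_cigar(cigar))


def has_insertion(cigar, min_length=1):
    """True if any single insertion element is at least min_length bases long."""
    return any(op == "I" and length >= min_length
               for length, op in parse_cigar(cigar))


def has_deletion(cigar, min_length=1):
    """True if any single deletion element is at least min_length bases long."""
    return any(op == "D" and length >= min_length
               for length, op in parse_cigar(cigar))
```

Under these assumptions, the original request for "insertions > 10 bases" would correspond to `has_insertion(cigar, min_length=11)`.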

Then the next step of nice-to-haves would be to be able to find specific patterns like "D followed by I" or specific numbers of operators like "exactly five D in a row" or "five D in total, not necessarily in consecutive order".
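
As a rough sketch of how these nice-to-have patterns differ from one another, the three examples can be written as separate checks over the parsed CIGAR elements. The helper names are hypothetical, and the readings are assumptions: "D followed by I" is taken to mean a deletion element immediately followed by an insertion element, "exactly five D in a row" a single `5D` element, and "five D in total" the sum of all deletion lengths.

```python
import re

CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operator) tuples."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def has_adjacent_ops(cigar, first, second):
    """True if an element with operator `first` is immediately followed by `second`."""
    elements = parse_cigar(cigar)
    return any(a[1] == first and b[1] == second
               for a, b in zip(elements, elements[1:]))


def has_run_of(cigar, op, length):
    """True if some single element is exactly `length` bases of `op`."""
    return any(o == op and n == length for n, o in parse_cigar(cigar))


def total_bases(cigar, op):
    """Sum of the lengths of all elements with operator `op`."""
    return sum(n for n, o in parse_cigar(cigar) if o == op)
```

Note that `20M2D10M3D20M` has five deleted bases in total but no run of exactly five, which is the distinction between the last two examples.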

Do you need me to be more specific than that?

@jonn-smith
Collaborator

@vdauwera - I think that makes sense. We've been brainstorming ideas for how a user would actually input the filter strings and there seem to be a few options.

  • JEXL
    • it's already in use elsewhere and we can use JEXL functions at the command-line to specify "hasAtLeast(5,"D")" for simple filters, but it seems like it would get clunky with increasing filter complexity
  • Regular Expressions
    • they're fairly universal, but it would be hard to match numerical values and can be confusing/exhausting to write correctly
  • Modified regular expressions, for example:
    • ^D matches any number of deletions at the start
    • DMD matches any number of deletions followed by any number of (mis-)matches, followed by any number of deletions
    • ^<5SM>=4D$ matches fewer than 5 soft-clipped bases at the start of the cigar, followed by any number of (mis-)matches, followed by at least 4 deletions at the end of the cigar
  • Command-line options passed into the filter, for example:
    • --hasAtLeast 5 D --startsWith S --endsWith M
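
To make the modified-regex option more concrete, here is a minimal Python sketch of a matcher for the three example patterns above. It is only one possible reading, not an implementation from this project. Assumptions: each pattern token matches exactly one CIGAR element; a bare operator such as `M` matches that operator with any length; `<`, `<=`, `>` and `>=` constrain the element length; a bare number means an exact length; without `^` or `$` the pattern may match anywhere in the CIGAR. All names (`parse_pattern`, `matches`, and so on) are hypothetical.

```python
import operator
import re

CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
# Optional "<comparator><number>" prefix, then one operator letter.
# The "=" CIGAR operator is left out so it cannot be confused with a comparator.
PATTERN_TOKEN = re.compile(r"(?:(<=|>=|<|>)?(\d+))?([MIDNSHPX])")
COMPARATORS = {"<": operator.lt, "<=": operator.le,
               ">": operator.gt, ">=": operator.ge}


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operator) tuples."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]


def make_token(comparator, number, op):
    """Build a predicate over one (length, operator) CIGAR element."""
    if number is None:
        return lambda element: element[1] == op  # any length
    # Assumption: a number without a comparator means an exact length.
    compare = COMPARATORS.get(comparator, operator.eq)
    return lambda element: element[1] == op and compare(element[0], int(number))


def parse_pattern(pattern):
    """Return (anchored_at_start, anchored_at_end, list of element predicates)."""
    at_start = pattern.startswith("^")
    at_end = pattern.endswith("$")
    body = pattern[1:] if at_start else pattern
    body = body[:-1] if at_end else body
    tokens, pos = [], 0
    while pos < len(body):
        m = PATTERN_TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"cannot parse pattern at offset {pos}: {body!r}")
        tokens.append(make_token(*m.groups()))
        pos = m.end()
    if not tokens:
        raise ValueError("empty pattern")
    return at_start, at_end, tokens


def matches(pattern, cigar):
    """True if the pattern matches a contiguous run of CIGAR elements."""
    at_start, at_end, tokens = parse_pattern(pattern)
    elements = parse_cigar(cigar)
    last_start = len(elements) - len(tokens)
    if last_start < 0:
        return False
    if at_start and at_end:
        starts = [0] if last_start == 0 else []
    elif at_start:
        starts = [0]
    elif at_end:
        starts = [last_start]
    else:
        starts = range(last_start + 1)
    return any(all(tok(elements[s + i]) for i, tok in enumerate(tokens))
               for s in starts)
```

With these assumptions, `matches("^<5SM>=4D$", "3S50M6D")` is true and `matches("^<5SM>=4D$", "7S50M6D")` is false, because the second read has 7 soft-clipped bases at the start.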

We can also implement some combination of these. What do you think?
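
For comparison, the command-line-option style from the list above could be prototyped roughly as follows. This is a hedged Python sketch using `argparse`, not the tool's real argument handling: the flags are taken from the example above, while `build_parser`, `passes`, and the reading of `--hasAtLeast 5 D` as "at least 5 deleted bases in total" are assumptions.

```python
import argparse
import re

CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operator) tuples."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]


def build_parser():
    """Hypothetical filter options mirroring the example in the thread."""
    parser = argparse.ArgumentParser(description="CIGAR filter options (sketch)")
    parser.add_argument("--hasAtLeast", dest="has_at_least", nargs=2,
                        action="append", default=[], metavar=("COUNT", "OP"))
    parser.add_argument("--startsWith", dest="starts_with", metavar="OP")
    parser.add_argument("--endsWith", dest="ends_with", metavar="OP")
    return parser


def passes(args, cigar):
    """True if the CIGAR string satisfies every option that was supplied."""
    elements = parse_cigar(cigar)
    for count, op in args.has_at_least:
        # Assumption: COUNT is a minimum on the total number of bases of OP.
        if sum(n for n, o in elements if o == op) < int(count):
            return False
    if args.starts_with and elements[0][1] != args.starts_with:
        return False
    if args.ends_with and elements[-1][1] != args.ends_with:
        return False
    return True
```

Every new kind of constraint needs its own flag in this style, which is the trade-off against a single pattern string.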

@vdauwera
Contributor

I like the idea of the modified regexes, that seems like the best balance of usability and flexibility/power. I'd rather avoid having a slew of new special-cased arguments.

jonn-smith added a commit that referenced this issue Jun 27, 2017
jonn-smith added a commit that referenced this issue Jun 27, 2017
@droazen droazen removed this from the Engine-4.0 milestone Oct 17, 2017
4 participants