Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split barcode policy optimization #13

Merged
merged 11 commits into from
Feb 8, 2023
Merged

Conversation

mtomko
Copy link
Collaborator

@mtomko mtomko commented Feb 8, 2023

In the event that we are matching barcodes using a template that contains precisely 2 barcode regions, each preceded by a prefix, it may be faster to skip matching the whole template and instead match the first prefix, skip ahead to the next prefix location, match that with the second prefix, and extract the barcodes if necessary.

This is particularly useful for templates such as

cggtgNNNNNNNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngttccNNNNNNNNNNNNN

which has a very long fixed region whose contents we don't care about.

In practice, this has perhaps been a more modest improvement than I had hoped, but it is worth keeping the work, I think.

Also technically this should probably be released as PoolQ 3.7.0 but since we don't have a lot of downstream consumers and the renamed class is not really intended for public consumption, perhaps we can get away with calling it 3.6.1?

@mtomko mtomko requested a review from tmgreen February 8, 2023 00:54
@mtomko mtomko self-assigned this Feb 8, 2023
tmgreen
tmgreen previously approved these changes Feb 8, 2023
Copy link
Member

@tmgreen tmgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I'd suggest starting to get into the habit of jmh-benchmarking some of these algorithm optimizations. I had assumed PQ was more IO-bound in its performance than it appears to be. We may want to accelerate the notion of abandoning String in favor of a more purpose-built byte-array DNA class where you'd have more "unsafe" access to its innards without all the array copying that has to be done with String, and avoid all the unnecessary String sausage being made under the hood to deal with unicode etc. Hard to tell how much this would pay off given the new "compact" Latin1 strings and all the optimizations and intrinsics that the JVM may be including that are string-specific, but it would be very interesting to benchmark. 🧷 🚁 🧯

if (read.seq.length < patternEnd) None
else {
val p2Index = p1Index + expectedP2Offset
val rsca = read.seq.toCharArray()
Copy link
Member

@tmgreen tmgreen Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid this whole array copy if you use String#charAt() for your array access in containsP2() and then use String#getChars() in place of the subsequent array copies to populate dest. Hard to know how all that would affect performance but it would be interesting to benchmark the difference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I removed the toCharArray() in a later commit but I actually didn't know about String#charAt and I'm going to add a commit using that. More on benchmarking later.

@mtomko
Copy link
Collaborator Author

mtomko commented Feb 8, 2023

I have removed TemplatePolicy.copy in favor of String#getChars.

Regarding benchmarking, PoolQ used to have a jmh benchmark module from the original PoolQ2 to PoolQ3 migration, but it wasn't well maintained and never revealed much that was useful during the PoolQ3 implementation process. I ended up removing it in the process of releasing PoolQ 3.5 because it was out-of-date, unused, and made some aspects of the build harder to work with. I agree that having it would have been helpful for this project and probably others. I'll think about putting it back in some form.

Copy link
Member

@tmgreen tmgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mtomko mtomko merged commit c679c0d into main Feb 8, 2023
@mtomko mtomko deleted the split-barcode-policy-optimization branch February 8, 2023 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants