Split barcode policy optimization #13

mtomko · 2023-02-08T00:54:38Z

When we are matching barcodes using a template that contains exactly two barcode regions, each preceded by a prefix, it may be faster to skip matching the whole template. Instead, we can match the first prefix, skip ahead to the expected location of the second prefix, match the second prefix there, and extract the barcodes if necessary.

This is particularly useful for templates such as

cggtgNNNNNNNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngttccNNNNNNNNNNNNN

which has a very long fixed region whose contents we don't care about.
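
To illustrate the idea, here is a minimal sketch of prefix-then-skip matching. It is not the code in this PR: the object name `SplitPrefixSketch`, the method names, and the parameter list are all hypothetical, and it assumes prefixes are literal strings compared case-sensitively and that the first prefix is located with a plain `indexOf` search.

```scala
// Hypothetical sketch; names and signatures are illustrative, not PoolQ's API.
object SplitPrefixSketch {

  /** True if `prefix` occurs in `seq` starting exactly at index `at`. */
  def prefixAt(seq: String, prefix: String, at: Int): Boolean =
    at >= 0 && seq.regionMatches(at, prefix, 0, prefix.length)

  /** Finds prefix1, jumps a fixed distance to where prefix2 must start, and
    * returns the barcode following each prefix. The bases between the first
    * barcode and prefix2 (the long fixed region) are never inspected.
    */
  def findBarcodes(
    seq: String,
    prefix1: String,
    barcode1Length: Int,
    gapLength: Int,
    prefix2: String,
    barcode2Length: Int
  ): Option[(String, String)] = {
    // Distance from the start of prefix1 to the start of prefix2.
    val p2Offset = prefix1.length + barcode1Length + gapLength
    val patternLength = p2Offset + prefix2.length + barcode2Length

    @annotation.tailrec
    def loop(from: Int): Option[(String, String)] = {
      val p1 = seq.indexOf(prefix1, from)
      // No prefix1 left, or the read is too short to hold the whole pattern.
      if (p1 < 0 || p1 + patternLength > seq.length) None
      else {
        val p2 = p1 + p2Offset
        if (prefixAt(seq, prefix2, p2)) {
          val b1Start = p1 + prefix1.length
          val b2Start = p2 + prefix2.length
          val barcode1 = seq.substring(b1Start, b1Start + barcode1Length)
          val barcode2 = seq.substring(b2Start, b2Start + barcode2Length)
          Some((barcode1, barcode2))
        } else loop(p1 + 1) // prefix2 was not where expected; try a later prefix1
      }
    }

    loop(0)
  }
}
```

For the template above, `gapLength` would be the length of the long lowercase `n` run, so the cost per read depends on the two short prefixes rather than on the full template length.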

In practice, this has perhaps been a more modest improvement than I had hoped, but it is worth keeping the work, I think.

Also, technically this should probably be released as PoolQ 3.7.0, but since we don't have many downstream consumers and the renamed class is not really intended for public consumption, perhaps we can get away with calling it 3.6.1?

tmgreen

Looks good. I'd suggest starting to get into the habit of jmh-benchmarking some of these algorithm optimizations. I had assumed PQ was more IO-bound in its performance than it appears to be. We may want to accelerate the notion of abandoning String in favor of a more purpose-built byte-array DNA class where you'd have more "unsafe" access to its innards without all the array copying that has to be done with String, and avoid all the unnecessary String sausage being made under the hood to deal with unicode etc. Hard to tell how much this would pay off given the new "compact" Latin1 strings and all the optimizations and intrinsics that the JVM may be including that are string-specific, but it would be very interesting to benchmark. 🧷 🚁 🧯
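
As a rough illustration of the byte-array idea, a purpose-built sequence type might look something like the sketch below. Nothing here exists in PoolQ: `DnaSketch` and all of its methods are hypothetical, and it assumes reads are plain ASCII. Whether it would beat compact Latin-1 strings is exactly the open question, so it should be read as a starting point for a benchmark rather than a recommendation.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical sketch of a byte-backed DNA sequence; not part of PoolQ.
final class DnaSketch private (private val bases: Array[Byte]) {

  def length: Int = bases.length

  /** Direct indexed access to the underlying bytes; no copy is made. */
  def apply(i: Int): Byte = bases(i)

  /** True if `other` occurs in this sequence starting exactly at `at`. */
  def matchesAt(at: Int, other: DnaSketch): Boolean =
    if (at < 0 || at + other.length > length) false
    else {
      var i = 0
      while (i < other.length && bases(at + i) == other.bases(i)) i += 1
      i == other.length
    }

  /** Copies `len` bases starting at `start` into a caller-supplied buffer. */
  def copyTo(start: Int, len: Int, dest: Array[Byte], destPos: Int): Unit =
    System.arraycopy(bases, start, dest, destPos, len)

  override def toString: String = new String(bases, StandardCharsets.US_ASCII)
}

object DnaSketch {
  /** Assumes `s` contains only ASCII characters. */
  def fromString(s: String): DnaSketch =
    new DnaSketch(s.getBytes(StandardCharsets.US_ASCII))
}
```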

tmgreen · 2023-02-08T05:09:12Z

src/main/scala/org/broadinstitute/gpp/poolq3/barcode/BarcodePolicy.scala

+      if (read.seq.length < patternEnd) None
+      else {
+        val p2Index = p1Index + expectedP2Offset
+        val rsca = read.seq.toCharArray()


I think you can avoid this whole array copy if you use String#charAt() for your array access in containsP2() and then use String#getChars() in place of the subsequent array copies to populate dest. Hard to know how all that would affect performance but it would be interesting to benchmark the difference.
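
Roughly, the suggestion could look like the sketch below. The object name `CharAccessSketch` and both method signatures are hypothetical and are not the actual `containsP2()` or `dest` handling in `BarcodePolicy.scala`; only `String#charAt` and `String#getChars` are real JDK methods.

```scala
// Hypothetical sketch; method names and signatures are illustrative only.
object CharAccessSketch {

  /** Compares `prefix` against `seq` at `at` via charAt, so `seq` is never
    * copied into a temporary char array.
    */
  def containsAt(seq: String, prefix: String, at: Int): Boolean =
    if (at < 0 || at + prefix.length > seq.length) false
    else {
      var i = 0
      while (i < prefix.length && seq.charAt(at + i) == prefix.charAt(i)) i += 1
      i == prefix.length
    }

  /** Copies two barcode regions straight from `seq` into one destination
    * array using getChars(srcBegin, srcEnd, dst, dstBegin).
    */
  def extract(seq: String, b1Start: Int, b1Len: Int, b2Start: Int, b2Len: Int): Array[Char] = {
    val dest = new Array[Char](b1Len + b2Len)
    seq.getChars(b1Start, b1Start + b1Len, dest, 0)
    seq.getChars(b2Start, b2Start + b2Len, dest, b1Len)
    dest
  }
}
```

The only allocation left is the destination array itself; the whole-read copy made by `toCharArray()` goes away.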

Yes, I removed the toCharArray() in a later commit but I actually didn't know about String#charAt and I'm going to add a commit using that. More on benchmarking later.

mtomko · 2023-02-08T15:16:02Z

I have removed TemplatePolicy.copy in favor of String#getChars.

Regarding benchmarking, PoolQ used to have a jmh benchmark module from the original PoolQ2 to PoolQ3 migration, but it wasn't well maintained and never revealed much that was useful during the PoolQ3 implementation process. I ended up removing it in the process of releasing PoolQ 3.5 because it was out-of-date, unused, and made some aspects of the build harder to work with. I agree that having it would have been helpful for this project and probably others. I'll think about putting it back in some form.

tmgreen

👍

mtomko added 9 commits February 3, 2023 15:07

Use SBT 1.8.2

7e763d4

Update dependencies

7ff3610

Add SplitBarcodePolicy

73638d0

Includes bugfixes and code cleanups

Add test for SplitBarcodePolicy

e10df44

TemplatePolicy is now a trait

3619058

Loosen regex slightly

1adf144

Add tests

449cd53

Move template policy object

13871ac

No code changes

Bugfixes, tests, and optimizations to SplitBarcodePolicy

0b6a138

mtomko requested a review from tmgreen February 8, 2023 00:54

mtomko self-assigned this Feb 8, 2023

Fix formatting

3068474

tmgreen previously approved these changes Feb 8, 2023

tmgreen reviewed Feb 8, 2023

Use String#getChars

6d71538

mtomko dismissed tmgreen’s stale review via 6d71538 February 8, 2023 15:11

tmgreen approved these changes Feb 8, 2023

mtomko merged commit c679c0d into main Feb 8, 2023

mtomko deleted the split-barcode-policy-optimization branch February 8, 2023 15:39
