-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split barcode policy optimization #13
Conversation
Includes bugfixes and code cleanups
No code changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I'd suggest starting to get into the habit of jmh
-benchmarking some of these algorithm optimizations. I had assumed PQ was more IO-bound in its performance than it appears to be. We may want to accelerate the notion of abandoning String
in favor of a more purpose-built byte-array DNA class where you'd have more "unsafe" access to its innards without all the array copying that has to be done with String
, and avoid all the unnecessary String sausage being made under the hood to deal with unicode etc. Hard to tell how much this would pay off given the new "compact" Latin1 strings and all the optimizations and intrinsics that the JVM may be including that are string-specific, but it would be very interesting to benchmark. 🧷 🚁 🧯
if (read.seq.length < patternEnd) None | ||
else { | ||
val p2Index = p1Index + expectedP2Offset | ||
val rsca = read.seq.toCharArray() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can avoid this whole array copy if you use String#charAt()
for your array access in containsP2()
and then use String#getChars()
in place of the subsequent array copies to populate dest
. Hard to know how all that would affect performance but it would be interesting to benchmark the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I removed the toCharArray()
in a later commit but I actually didn't know about String#charAt
and I'm going to add a commit using that. More on benchmarking later.
I have removed Regarding benchmarking, PoolQ used to have a jmh benchmark module from the original PoolQ2 to PoolQ3 migration, but it wasn't well maintained and never revealed much that was useful during the PoolQ3 implementation process. I ended up removing it in the process of releasing PoolQ 3.5 because it was out-of-date, unused, and made some aspects of the build harder to work with. I agree that having it would have been helpful for this project and probably others. I'll think about putting it back in some form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
In the event that we are matching barcodes using a template that contains precisely 2 barcode regions, each preceded by a prefix, it may be faster to skip matching the whole template and instead match the first prefix, skip ahead to the next prefix location, match that with the second prefix, and extract the barcodes if necessary.
This is particularly useful for templates such as
which has a very long fixed region whose contents we don't care about.
In practice, this has perhaps been a more modest improvement than I had hoped, but it is worth keeping the work, I think.
Also technically this should probably be released as PoolQ 3.7.0 but since we don't have a lot of downstream consumers and the renamed class is not really intended for public consumption, perhaps we can get away with calling it 3.6.1?