sharkmer is a kmer counter and seeded de Bruijn graph assembler. sharkmer features include:
- in silico PCR (sPCR). You supply a fastq file of whole genome shotgun reads and sequences for primer pairs, and get a fasta file of the amplicons that would be produced by PCR on the genome. This is useful for assembling and isolating particular genes from raw genome skimming data.
- Incremental kmer counting. This allows you to run kmers on incrementally larger subsets of your data. Applications include assessing the robustness of genome size estimates to sequencing depth.
We are working on a paper that describes sharkmer. In the meantime, please cite the two papers below. We wrote the tool for these projects, though we don't describe it in detail in these prior publications.
For in silico PCR:
Church et al. (2025) Population genomics of a sailing siphonophore reveals genetic structure in the open ocean. Current Biology, 35(15), 3556-3569.e6. doi:10.1016/j.cub.2025.05.066
For incremental kmer counting:
N Ahuja, X Cao, DT Schultz, N Picciani, A Lord, S Shao, K Jia, DR Burdick, SH D Haddock, Y Li, CW Dunn (2024) Giants among Cnidaria: Large Nuclear Genomes and Rearranged Mitochondrial Genomes in Siphonophores. Genome Biology and Evolution, 16(3). doi:10.1093/gbe/evae048.
This is the simplest way to install sharkmer.
If you don't have it already, install miniconda.
To install sharkmer into an existing active conda environment:
conda install -c bioconda -c conda-forge sharkmer
To create a new environment (here called sharkmer_env, but you can name it anything you like) and install sharkmer in it:
conda create -n sharkmer_env -c bioconda -c conda-forge sharkmer
If you don't have them already, install the rust build tools.
Then clone this repository and build sharkmer:
git clone https://github.com/caseywdunn/sharkmer.git # Or download and expand the zip file
cd sharkmer
cargo build --release
You can then install the built binary into your Cargo bin directory (usually ~/.cargo/bin/), and later uninstall it:
cargo install --path .
cargo uninstall sharkmer
Alternatively, you can copy the compiled executable from target/release/sharkmer to a directory already in your PATH, or add target/release/ to your PATH.
To get full usage information, run
sharkmer --help
sharkmer reads FASTQ input from files (.fastq or .fastq.gz) or from stdin. Gzip-compressed files are detected and decompressed automatically.
sharkmer --max-reads 1000000 -s Agalma-elegans -o output/ --pcr-panel cnidaria agalma_*.fastq.gz
Raw reads, including NCBI SRA reads, can be streamed directly from ENA by accession using --ena, without requiring any additional tools:
sharkmer --max-reads 1000000 -s Agalma-elegans -o output/ --pcr-panel cnidaria --ena SRR25099394
Uncompressed FASTQ data can also be piped via stdin. For example, to trim adapters and low quality regions with fastp before counting kmers:
fastp -i agalma_1.fastq.gz -I agalma_2.fastq.gz --stdout | sharkmer --max-reads 1000000 -s Agalma-elegans -o output/ --pcr-panel cnidaria
Investigators often want to extract small genome regions from genome skimming data, for example to blast a commonly sequenced gene to verify that the sample is the expected species. This task is surprisingly challenging in practice, though. You can map reads to known sequences and then collapse them into a sequence prediction, but this does not always work well across species and can miss variable regions. You can assemble all the reads and then pull out the region of interest, but this is computationally expensive and often the region of interest is not assembled well.
in silico PCR (sPCR) is a new alternative approach. You specify a file with raw reads, and provide one or more primer pairs. sharkmer outputs a fasta file with the sequence of the region that would be amplified by PCR on the genome the reads are derived from. There are multiple advantages to this approach:
- sPCR directly leverages the decades of work to optimize PCR primers that work well across species and span informative gene regions. These primers tend to bind conserved regions that flank variable informative regions and have minimal off-target binding.
- Because sPCR is primer based, you can use it to obtain the exact same gene regions (co1, 16s, 18s, 28s, etc...) that have been PCR amplified for decades and still remain the most broadly sampled across species in public databases.
- sPCR often doesn't take much data. For small genomes or high copy regions (such as rRNA or mitochondrial genes), you can analyze small (eg one million read) subsets of your data.
- sPCR is fast and has minimum computational requirements. It can be run on a laptop in a couple minutes on a million reads.
- sPCR requires a single tool (sharkmer), not complex workflows with multiple tools.
sPCR is useful when you want specific genes from skimming datasets you have collected for other purposes. But in some cases it may be more cost effective and easier to skim and apply sPCR than to use traditional PCR. With sPCR you sequence once and then pull out as many gene regions as you want, as opposed to PCR where you amplify and sequence each region separately. There is little additional computational cost for each added primer pair, since most of the work is counting kmers and this is done once for all primer pairs. So the cost of sPCR is fixed and does not depend on the number of genes considered.
sPCR of nuclear ribosomal RNA genes (eg animal 28s, 18s, ITS) and mitochondrial genes does not take much sequence data, given the relatively high copy number of these genes. For Illumina raw reads, 0.25x average sequencing depth of the genome is often sufficient.
We will use the coral Stenogorgia casta as an example. With --ena, sharkmer downloads the SRA reads directly from ENA — no extra tools needed:
sharkmer --max-reads 1000000 -s Stenogorgia_casta -o output/ --pcr-panel cnidaria --ena SRR26955578
Alternatively, if you have local FASTQ files (gzipped or uncompressed):
sharkmer --max-reads 1000000 -s Stenogorgia_casta -o output/ --pcr-panel cnidaria data/SRR26955578_1.fastq.gz data/SRR26955578_2.fastq.gz
This is equivalent to specifying the primer pairs manually with --pcr-primers:
sharkmer \
--max-reads 1000000 \
-s Stenogorgia_casta -o output/ \
--pcr-primers "forward=GRCTGTTTACCAAAAACATA,reverse=AATTCAACATMGAGG,max-length=700,name=16S,min-length=500" \
--pcr-primers "forward=TCATAARGATATHGG,reverse=RTGNCCAAAAAACCA,max-length=800,name=CO1,min-length=600" \
--pcr-primers "forward=AACCTGGTTGATCCTGCCAGT,reverse=TGATCCTTCTGCAGGTTCACCTAC,max-length=2000,name=18S,min-length=1600" \
--pcr-primers "forward=CCYYAGTAACGGCGAGT,reverse=SWACAGATGGTAGCTTCG,max-length=3500,name=28S,min-length=2900" \
--pcr-primers "forward=TACACACCGCCCGTCGCTACTA,reverse=ACTCGCCGTTACTRRGG,max-length=1000,name=ITS,min-length=600" \
--pcr-primers "forward=GTAGGTGAACCTGCAGAAGGATCA,reverse=ACTCGCCGTTACTRRGG,max-length=1000,name=ITS-v2,min-length=600" \
--pcr-primers "forward=AMGWGGHATGGTDGCTGGTG,reverse=YTTRATNAYDCCAACAGCWAC,max-length=3000,name=EF1A,min-length=200" \
--pcr-primers "forward=CGTGAAACCGYTRRAAGGG,reverse=TTGGTCCGTGTTTCAAGACG,max-length=700,name=28S-v2,min-length=300" \
--ena SRR26955578
The --pcr-primers argument takes a string with the format key1=value1,key2=value2,..., where the required keys are forward, reverse, and name. Run sharkmer --help-pcr for details on all available keys.
The --max-reads 1000000 argument indicates that the first million reads (individual reads, not read pairs) should be used. This is plenty for nuclear rRNA sequences 18s, 28s, and ITS, since they occur in many copies in the genome, and mitochondrial sequences 16s and co1. Single copy nuclear genes require more data.
This analysis will generate one fasta file for each primer pair, named {sample}_{panel}_{gene}.fasta (e.g., Stenogorgia_casta_cnidaria_18S.fasta). If no product was found, the fasta file is not generated. The fasta file can contain more than one sequence when multiple products are found. A YAML stats file ({sample}.stats.yaml) is also produced with run statistics and per-gene PCR results.
You can see all the available built-in PCR panels with:
sharkmer --list-panels
To export a built-in panel as YAML (for customization or sideloading with --pcr-panel-file):
sharkmer --export-panel cnidaria > octocorallia.yaml
You can then edit the exported YAML to add genes, remove genes, or optimize primer sequences for your study. To run sharkmer with your custom panel, use --pcr-panel-file:
sharkmer --max-reads 1000000 -s Stenogorgia_casta -o output/ --pcr-panel-file octocorallia.yaml --ena SRR26955578
If you have other primers that you would like to have added to the built-in panels, please submit them to the issue tracker.
There are a few different strategies to take if you are not getting a sPCR product, or it is working inconsistently. If you get things working, please let me know in the issue tracker so I can improve the default primer sets and help other users. If you hit a wall, please also let me know in the issue tracker.
The things you should try first are:
-
Specify a reasonable value for the
--pcr-primersparametermax-length, which is the maximum expected length of the amplified product, based on what is known about the gene. It does not need to be exact. It does need to be longer than the amplified product, but if it is much too long it may result in amplification of spurious products or runs that take too long without finding anything. Keep in mind that if there are introns, the amplified product can be much longer than the coding sequence. -
Optimize your primer sequences. Make some multiple sequence alignments of the desired sequence region from several closely related species, and refine the primer sequences to be more specific to the target region. You can use degenerate nucleotide symbols, such as R for A or G, a variable sites in the site where you would like the sequence to bind. Remember that, just as in real PCR, the forward and reverse primers bind opposite strands with their 3' ends pointing toward each other. The reverse primer sequence should be reverse complemented relative to the reference strand.
-
Pick new primers that shorten the region you are trying to amplify, i.e. primers that are closer together in the genome sequence. Shorter amplification fragments tend to require fewer reads to assemble.
-
Adjust the
--pcr-primersparametertrim. The default is 15. This is the max number of bases to keep at the 3' end of each primer. Primers used for real PCR tend to be longer than what is required for them to be unique within the genome. This is because they are lengthened to increase melting temperature. If you don't get a product, try reducingtrimto reduce the specificity of the primer. If you are getting too many spurious products, try increasing this value. Modifyingtrimrather than adjusting the primer sequence makes subsequent adjustments easier (since you don't have to look up the primer sequence again) and also makes the provenance of primer sequences clearer. -
Adjust the number of reads. If you are not getting a product, try increasing the number of reads. You can do this with the
--max-readsargument. Likewise, if you are getting products and want to speed things up, or you are getting many products for a gene, reduce the number of reads.
If these do not work, then you can try adjusting other parameters.
-
Specify a reasonable
--pcr-primersparametermin-length. This value defaults to 0, but raising it can get rid of small spurious products. -
Adjust
--min-kmer-count. This globally filters the kmer table before sPCR, removing all kmers with counts below this value. It defaults to 2, and should generally be at least 2 to avoid one-off sequencing errors. Raising it to 3 or 4 reduces noise but requires more sequencing data. -
Adjust the per-primer
min-countparameter in--pcr-primers. This sets the minimum kmer count required to extend the de Bruijn graph during sPCR for a specific primer pair. It must be at least as high as--min-kmer-count(since lower-count kmers have already been filtered). Raising it for a particular gene can help when that gene tends to produce spurious products. For example,--pcr-primers "forward=GRCTGTTTACCAAAAACATA,reverse=AATTCAACATMGAGG,max-length=700,name=16S,min-length=500,min-count=4" -
Adjust the
--pcr-primersparametermismatches. This defaults to 2. You can try raising it to 3 or 4 if you aren't getting the desired product. This reduces specificity, but this may increase the number of spurious paths that need to be traversed and bog down the run.
Keep in mind that there is no way to assemble a sPCR product without kmer counts along its full length that meet or exceed the min-count parameter. The tool cannot output assembled sequences in the fasta file that are not in the input raw reads from the fastq file. If you are trying to amplify a single copy nuclear gene, that means your sequencing depth (average coverage) of the genome will need to be quite a bit higher than the min-count parameter, since there will be fluctuations in coverage along the length of the target region. If coverage at each site is independently distributed, then to have a 95% chance of coverage
Kmer analyses have become an essential component of many routine genomic analyses (Manekar and Sathe, 2018). There are multiple excellent highly optimized stand-alone kmer counters, including Jellyfish (Marçais and Kingsford, 2011) and KMC (Kokot et al., 2017). These tools ingest sequence data, such as raw reads, and generate a variety of intermediate products, including count tables of all observed kmers and histograms of observed kmer counts. A variety of downstream tools are available for specific biological analyses. These include genome size estimation with GenomeScope2.0 (Ranallo-Benavidez et al., 2020).
Kmer spectra analyses typically focus on a single snapshot of data - the complete set of kmers at the time the analysis is performed. Many questions that motivate kmer spectrum analyses are about what happens as data are added. Rather than performing a single analysis on all the data, one can rarefy the data and look at progressively larger nested subsets of reads. This provides more insight into the data in hand, builds better intuition for what changes as data are added, and allows the investigator to better understand what would happen if more data were added. But reanalyzing nested subsets is computationally expensive, since all the data shared across nested subsets are reanalyzed.
Incremental kmer counting, as implemented in sharkmer, allows investigators to efficiently investigate the effects of adding data without needing to re-analyze nested subsets. Instead, the data are broken into exclusive subsets. kmer spectra are calculated once for each, and then incrementally combined.
Genome size estimation takes a lot of data (about 50x coverage of the genome), and a large amount of RAM. So this example isn't practical on most laptops, given their disk and RAM limitations, and will require a workstation or cluster.
We will use a Cordagalma ordinatum dataset from NCBI SRA. With --ena, sharkmer downloads the reads directly:
sharkmer --chunks 10 -o output/ -s Cordagalma-ordinatum --ena SRR23143278
The --chunks 10 argument specifies breaking the reads into 10 incremental subsets (histograms are only produced when --chunks is greater than 0). The -o argument specifies the output directory. The -s argument specifies a prefix for the output files.
Alternatively, with local files (gzipped or uncompressed):
sharkmer --chunks 10 -o output/ -s Cordagalma-ordinatum data/SRR23143278_1.fastq.gz data/SRR23143278_2.fastq.gz
The incremental histogram files in this case will be:
output/Cordagalma-ordinatum.histo # All the incremental histograms, each in their own column. Suitable for analysis with `sharkmer_viewer.py`.
output/Cordagalma-ordinatum.final.histo # Just the final histogram with all data. Suitable for analysis with genomescope and other tools.
Optional Python tools for visualizing incremental kmer counting results are available in the sharkmer_viewer folder. See the sharkmer_viewer readme for installation and usage instructions.
Here is an overview of how kmer counting works in sharkmer:
- Fastq reads are ingested, split at any
Nbases, and distributed acrossnchunks. - Within each chunk, kmers are counted in a hashmap.
- The hashmaps for the
nchunks are summed one by one, and a histogram is generated after each chunk of counts is added in. This producesnhistograms, each summarizing more reads than the last.
Incremental kmer counting is used when ingesting reads for other features, including in silico PCR.
sharkmer can generate tab-completion scripts for your shell so that flag
names, panel names, and other arguments are completed when you press Tab.
If you installed via bioconda (conda install sharkmer), completions are
installed automatically — no extra steps needed.
For other installation methods (cargo install, manual build), you need to generate and install the completion script yourself. First, check which shell you are using:
echo $SHELLThen follow the instructions for your shell:
zsh (default on macOS, common on Linux):
# Create the completions directory if it doesn't exist
mkdir -p ~/.zfunc
# Generate the completion script
sharkmer --completions zsh > ~/.zfunc/_sharkmer
# Add to your ~/.zshrc (only needed once):
# fpath=(~/.zfunc $fpath)
# autoload -Uz compinit && compinitbash:
# System-wide (may require sudo):
sharkmer --completions bash > /etc/bash_completion.d/sharkmer
# Or per-user:
mkdir -p ~/.local/share/bash-completion/completions
sharkmer --completions bash > ~/.local/share/bash-completion/completions/sharkmerfish:
mkdir -p ~/.config/fish/completions
sharkmer --completions fish > ~/.config/fish/completions/sharkmer.fishAfter installing, open a new terminal session (or run source ~/.zshrc etc.)
for completions to take effect. Completion scripts are a static snapshot of the
available flags, so if you upgrade sharkmer you should regenerate them by
re-running the same command above. Bioconda users don't need to worry about
this — completions are updated automatically with each package upgrade.
See CONTRIBUTING.md for development and testing information.