This is a push to gather tools that are helpful for genome annotation, and to serve as a forkable, version-controlled, reusable, and citable record of our pipeline. The steps use nextflow as a workflow engine so that the individual steps are abstracted from their execution environment (SGE, MPI, or simple local multithreading).
This is not a push-button solution, but it can serve as a starting point for annotating your new genome.
The minimum prerequisites are docker and nextflow, and a fasta file (henceforth scaffolds.fasta) of your genome assembly.
Some steps require software or data with licences that restrict distribution, but I've kept them to a minimum and will make it clear when those pieces are necessary.
Each of these steps corresponds to one of the nextflow recipes provided by this repository.
Taking cues from jamg, we translate all of the open reading frames and then use hhblits to match them against a database of known transposons. A GFF file is produced that describes the position of each transposon that we find.
This uses two docker images, which will be pulled automatically from the docker registry as needed.
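To make the ORF step concrete, here is a minimal Python sketch of six-frame ORF extraction with GFF3-style output. It is an illustration only, not the code the nextflow recipe runs: the stop-to-stop ORF definition, the `min_aa` cutoff, the `sketch` source label, and the function names are all assumptions made for this example, and the transposon database search itself is not shown.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(seq, min_aa=3):
    """Yield (start, end, strand, protein) for stop-to-stop ORFs in six frames.

    Coordinates are 1-based and inclusive on the forward strand, as in GFF.
    The min_aa default is an arbitrary value chosen for this sketch.
    """
    seq = seq.upper()
    n = len(seq)
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            offset = 0  # amino-acid index where the current segment starts
            for segment in translate(template[frame:]).split("*"):
                if len(segment) >= min_aa:
                    begin = frame + 3 * offset        # 0-based on template
                    end = begin + 3 * len(segment)    # exclusive
                    if strand == "+":
                        yield begin + 1, end, strand, segment
                    else:
                        # Map template coordinates back to the forward strand.
                        yield n - end + 1, n - begin, strand, segment
                offset += len(segment) + 1  # skip the stop codon


def gff3_line(seqid, start, end, strand, feature_id, source="sketch"):
    """Format one feature as a tab-separated GFF3 line."""
    fields = [seqid, source, "ORF", start, end, ".", strand, "0",
              f"ID={feature_id}"]
    return "\t".join(str(f) for f in fields)
```

In a sketch like this, each protein yielded by `find_orfs` would be written to a fasta file for the database search, and the coordinates of the ORFs that score a hit would be written out with `gff3_line`.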
Repeats are an important part of the final genome annotation. I recommend a two-step process:
- Find de novo repeats with RepeatScout.
- Use the RepeatScout output in conjunction with the latest RepBase library as input to RepeatMasker.
I've taken care of the RepeatScout and RepeatMasker installation by bundling them as docker images. The only hiccup is that RepBase requires registration.
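One way to combine the two repeat sources is to merge them into a single fasta library before masking. The Python sketch below shows that idea under stated assumptions: the function names are hypothetical, both inputs are assumed to be plain fasta files, and duplicates are detected only by the first word of each header. The recipe in this repository may combine the libraries differently.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a fasta file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_repeat_libraries(paths, out_path):
    """Concatenate fasta libraries, keeping the first record for each name.

    Returns the number of records written. Earlier paths take priority
    when two records share a name.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for path in paths:
            for header, seq in read_fasta(path):
                # The record name is the first whitespace-delimited word.
                name = header.split(None, 1)[0] if header.strip() else ""
                if name in seen:
                    continue
                seen.add(name)
                out.write(f">{header}\n{seq}\n")
                written += 1
    return written
```

For example, `merge_repeat_libraries(["repeatscout_output.fasta", "repbase.fasta"], "combined_repeats.fasta")` (all three file names are placeholders) would produce one library that could then be handed to RepeatMasker as a custom library through its `-lib` option.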