This is a computational pipeline for DAP-seq data processing. It was designed for mapping of transcription factor binding sites in microbial genomes.
The DAP-seq pipeline consists of three major steps: sequence reads preprocessing, mapping of the reads to reference genome, peak calling with motif search.
Separate shell scripts run each step of the pipeline. You can inspect results of each step before running the next script.
The pipeline runs on OS Linux (developed and tested on Ubuntu 16.04).
Some external programs should be installed before running the pipeline:
- SolexaQA++ (http://solexaqa.sourceforge.net/)
- FOCUS (https://sourceforge.net/projects/metagenomefocus/)
- bowtie, v.1.1.2 (http://bowtie-bio.sourceforge.net/index.shtml)
- samtools (http://www.htslib.org/)
- MACS2 (https://github.com/taoliu/MACS)
- MEME Suite (http://meme-suite.org/)
bowtie, samtools and MACS2 can be installed from Ubuntu package manager. Other programs should be linked to /usr/bin or other directory, where binaries are stored normally. Pipeline depends on "SolexaQA++", "focus" and "meme" commands.
Note about FOCUS: if you have problems running focus from outside of its directory, put the following lines into /usr/bin/focus file:
#!/bin/sh
cd **PATH TO FOCUS DIRECTORY HERE **
python **PATH TO FOCUS DIRECTORY HERE**/focus.py $1 $2 $3 $4
And make it executable.
Out of the box, three reference genomes are available: Pseudomonas stutzeri RCH2, Pseudomonas fluorescens FW300-N2E2 and Pseudomonas putida KT2440. To build Bowtie indices for these reference genomes, execute script build_bowtie_indices.sh before running the pipeline.
If you want to run the pipeline for other organisms, read how to add new reference genome first.
The pipeline works with single-end sequence reads in gzipped FASTQ file. All FASTQ files must be in the same directory. All FASTQ files must have names ending with _R1_001.fastq.gz. Example: RCH2_DAP_12_R1_001.fastq.gz.
One additional file (called "directory file") with descrption of samples required for configuration the pipeline. The directory file contains six tab-separated fields for each sample in the following order:
- Sample label. Label must be unique in the file and correspond to the beginning of FASTQ file name before the second "_" symbol. Example: for RCH2_14_S110_L006_R1_001.fastq.gz file, sample label is RCH2_14
- Reference genome name (as defined in perl/libs.tsv)
- Protein ID
- Replicate number (1, 2, 3...)
- Treatment label (like "Phosphorylated", "N/A" etc.)
- Label of control sample (one of labels from column 1. Designates control sample for peak calling. Usually, all samples have the same control label)
All samples in a directory file must be from the same genome. If you have samples from different organisms in one batch, make separate directory files for each organism.
Directory file may have comment lines starting with "#".
Example:
| #Sample | Genome | Protein | Replicate | Treatment | Control |
|---|---|---|---|---|---|
| N2E2_01 | P. fluorescens N2E2 | Pf6N2E2_1296 | 1 | N/A | N2E2_99 |
To create the shell scripts running three steps of the pipeline, run configuration Perl scripts configure_dapseq_qc_mapping.pl and configure_dapseq_peak_calling.pl. If you want to run motif search, run additional configuration script configure_meme.pl
After running configuration scripts, go to working directory and run shell scripts:
run_qc.sh
run_mapping.sh
run_macs_meme.sh
Please, do not rename or move any files generated by the pipeline before you run all the steps.
If you want to run the entire pipeline with one command, create the following shell script:
#!/bin/sh
./run_qc.sh
./run_mapping.sh
./run_macs_meme.sh
Make it executable, put it into the working directory and run.
Run:
cd perl
perl configure_dapseq_qc_mapping.pl <full path to directory file> <path to directory with FASTQ files> <working directory> <output directory for BAM files> <output directory for FOCUS files>
This script creates two shell scripts in the working directory: run_qc.sh and run_mapping.sh
The run_qc.sh script runs SolexaQA++ for read trimming and removal of reads shorter than 30 bp. If you'd like to use other program for read trimming (Trimmomatic etc.), see below.
For the reads that pass read trimming, FOCUS generates phylogenetic profile. If you have many unmapped reads, inspect FOCUS output for possible cross-contamination with DNA from unrelated organisms.
Please note, that FOCUS may not recognize some reads from recently sequenced genome, if the genome is absent in its database. However, it will report presence of related species in your samples. For instance, FOCUS fails to recognize Pseudomonas fluorescens FW300-N2E2 reads, but still reports 80% abundance of Pseudomonas.
Output FASTQ files are gzipped. Files with discarded reads and moved to the directory with original FASTQ files. In addition, SolexaQA++ reports results of read trimming/filtering in text and pdf format into the working directory.
The run_mapping.sh script runs bowtie (v.1) to map reads against reference genome and samtools to convert SAM file into sorted BAM. Since bowtie runs with parameter "-m 1", only unique alignments are reported.
In addition to BAM files, this script writes bowtie run log into mapping_quality.txt file in the working directory. You can convert the bowtie log into tab-separated table with additional Perl script perl/make_quality_table.pl (full path to mapping_quality.txt and output file name required).
Output BAM files are moved to separate directory (defined by parameter of configuration script). Compressed FASTQ files are moved to the directory with original FASTQ files.
If number of mapped reads looks too small, inspect the output BAM files in genome viewer (Artemis works fine). Or look at the FOCUS output (see previous section).
Run:
cd perl
perl configure_dapseq_peak_calling.pl <full path to directory file> <working directory> <BAM files directory> <output directory>
This script creates one shell script in the working directory: run_macs_meme.sh
The run_macs_meme.sh shell script runs MACS2 for each sample in the directory and then runs MEME for selected peaks.
MACS2 and MEME output files are created in the output directory(see previous section).
MACS2 runs with parameters "-q 0.0001 --extsize 250 --nomodel --keep-dup=auto". Such parameters may result in high number of peaks. MEME runs on subset of peaks, selected by p-value (log10(p-value) < -100), q-value (log10(q-value) < -100), enrichment (> 3-fold) and length (< 12000 bp). MEME parameters are: "anr" mode (Any Number of Repetitions), one motif reported, motif length from 12 bp to 29 bp. These parameters may be changed in the source code of Perl scripts configure_dapseq_peak_calling.pl and run_meme_on_filtered_peaks.pl.
When all steps of the pipeline finished successfully, run a script generating final report in XLSX format:
cd perl
perl perl generate_report.pl <directory file> <directory with MACS2 output> <directory with MEME output> <report file name>
- Genomes with large duplicated segments (for instance, Xanthomonas oryzae) may have missing sites in the duplicated regions (because of -m 1 parameter of bowtie).
- Many microbial proteins produce high number of peaks with low enrichment.
However,
universal criteria for q-value or enrichment threshold selection havn't been
formulated. - Plasmids often produce peaks covering long regions (or even entire plasmids). It may caused by high copy number of pasmids.
- High number of false peaks complicates motif search. Even if all true peaks are found by MACS2, MEME fails to predict true motif if >90% of sequences have no occurrences of this motif. So, running MEME for more than 300 input sequences is not recommended, unless you work with global regulator that is known to have dozens of sites throughout genome.
Motif search: if you want to repeat motif search for one sample with different settings (without re-running motif calling), use Perl script perl/run_meme_on_filtered_peaks.pl
Run:
cd perl
perl run_meme_on_filtered_peaks.pl <input file name with list of MACS2 peaks without extension> <directory where output files will be written> <genome name> <enrichment cut-off> <-log10(pvalue) cutoff> <-log10(qvalue) cutoff>
For example, setting the last three parameters to 2, 50 and 50 will run motif search on peaks having enrichment > 2-fold, p-value < 10^-50 and q-value < 10^-50.
You can ignore run_qc.sh script and use any program at your own risk. Just copy trimmed FASTQ files to the working directory and rename them. If initial files have names *.fastq.gz, trimmed files must have names *.fastq.trimmed.single.gz.
Then run mapping and peak calling/motif search steps as described above.
Contact authors if you need sequence data for testing purposes.
Garber ME, Rajeev L, Kazakov AE, Trinh J, Masuno D, Thompson MG, Kaplan N, Luk J, Novichkov PS, Mukhopadhyay A. Multiple signaling systems target a core set of transition metal homeostasis genes using similar binding motifs. Mol Microbiol. 2018, 107(6):704-717. Full text at eScholarship