
Commit
Fixed bug in regular expression and Snakemake rule order in parallel execution
mazzalab committed May 26, 2024
1 parent 97bd396 commit c7be89b
Showing 4 changed files with 19 additions and 15 deletions.
17 changes: 9 additions & 8 deletions README.md
@@ -71,7 +71,7 @@ There are a <b>QUICK</b> and a <b>SLOW</b> method to configure `FastqWiper`'s wor

2. Once the image has been downloaded, type:

CMD: `docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data" mazzalab/fastqwiper paired 8 sample 50000000 33 ACGTN`
CMD: `docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data" mazzalab/fastqwiper paired 8 sample 50000000 33 ACGTN 500000`

#### Another quick way (Singularity)
1. Pull the Singularity image from the Cloud Library:
@@ -80,7 +80,7 @@ CMD: `docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/

2. Once the image (e.g., fastqwiper.sif_2024.1.89.sif) has been downloaded, type:

CMD: `singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data --writable-tmpfs fastqwiper.sif_2024.1.89.sif paired 8 sample 50000000 33 ACGTN`
CMD: `singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data --writable-tmpfs fastqwiper.sif_2024.1.89.sif paired 8 sample 50000000 33 ACGTN 500000`

If you want to bind the `.singularity` cache folder and the `logs` folder, you can omit `--writable-tmpfs`, create the folders `.singularity` and `logs` (`mkdir .singularity logs`) on the host system, and use this command instead:

@@ -95,6 +95,7 @@ For both **Docker** and **Singularity**:
- `50000000` (optional) is the number of rows per chunk (used when cores>1; it must be a multiple of 4). Increasing this number too much reduces the advantage of parallelism; decreasing it too much creates more chunks than available CPUs, making parallelism inefficient. Choose this number wisely, depending on the total number of reads in your starting file.
- `33` (optional) is the ASCII offset (33=Sanger, 64=old Solexa)
- `ACGTN` (optional) is the allowed alphabet in the SEQ line of the FASTQ file
- `500000` (optional) is the logging frequency, expressed in number of reads
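As a rough rule of thumb for the rows-per-chunk value, aiming for about one chunk per core keeps every CPU busy without over-splitting the file. A hypothetical helper (not part of FastqWiper) sketching that calculation:

```python
def rows_per_chunk(total_reads: int, cores: int) -> int:
    """Suggest a rows-per-chunk value: roughly one chunk per core,
    rounded down to a multiple of 4 (each FASTQ read spans 4 lines)."""
    rows = (total_reads * 4) // cores
    return max(4, rows - rows % 4)

# 100M reads on 8 cores -> 50_000_000 rows per chunk
print(rows_per_chunk(100_000_000, 8))
```

With fewer reads or more cores, the helper shrinks the chunk size accordingly while keeping it a valid multiple of 4.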

### <code style="color : red">The slow way (Linux & Mac OS)</code>
To enable the use of preconfigured [pipelines](https://github.com/mazzalab/fastqwiper/tree/main/pipeline), you need to install **Snakemake**. The recommended way to install Snakemake is via Conda, because it enables **Snakemake** to [handle software dependencies of your workflow](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management).
@@ -149,14 +150,14 @@ Copy the fastq files you want to fix into the `data` folder.
#### Paired-end files

- **Get a dry run** of a pipeline (e.g., `fix_wipe_pairs_reads_sequential.smk`):<br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 4`
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 4 -np`

- **Generate the planned DAG**:<br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN -s pipeline/fix_wipe_pairs_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /> <br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /> <br />
<img src="https://github.com/mazzalab/fastqwiper/blob/main/pipeline/fix_wipe_pairs_reads.png?raw=true" width="400">

- **Run the pipeline** (N.B.: during the first execution, Snakemake will download and install some required remote packages, which may take a while). The number of computing cores can be tuned accordingly:<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2`
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 2`

Fixed files will be copied into the `data` folder and suffixed with the string `_fixed_wiped_paired_interleaving`.
Recall that the `fix_wipe_pairs_reads_sequential.smk` and `fix_wipe_pairs_reads_parallel.smk` pipelines perform the following actions:
@@ -169,14 +170,14 @@ We remind that the `fix_wipe_pairs_reads_sequential.smk` and `fix_wipe_pairs_rea
`fix_wipe_single_reads_parallel.smk` and `fix_wipe_single_reads_sequential.smk` will not execute `trimmomatic` and BBmap's `repair.sh`.

- **Get a dry run** of a pipeline (e.g., `fix_wipe_single_reads_sequential.smk`):<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2 -np`
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2 -np`

- **Generate the planned DAG**:<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /><br />
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /><br />
<img src="https://github.com/mazzalab/fastqwiper/blob/main/pipeline/fix_wipe_single_reads.png?raw=true" width="200">

- **Run the pipeline** (N.B.: the number of computing cores can be tuned accordingly):<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2`
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2`

# Author
**Tommaso Mazza**
14 changes: 7 additions & 7 deletions fastq_wiper/wiper.py
@@ -10,6 +10,7 @@
import logging

logging.basicConfig(level=logging.DEBUG)
reg = None

# region Variables for final report
blank: int = 0
@@ -94,7 +95,7 @@ def check_header_line(line: str) -> str:
return header


def check_seq_line(line: str) -> str:
def check_seq_line(line: str, reg: str) -> str:
global bad_seq
raw_seq: str = ""

@@ -233,7 +234,7 @@ def print_log_to_screen():
logging.warning(f"Blank lines: {blank}/{tot_lines}")


def wipe_fastq(fastq_in: str, fastq_out: str, log_out: str, log_frequency: int):
def wipe_fastq(fastq_in: str, fastq_out: str, log_out: str, log_frequency: int, alphabet: str):
# MIN_HEADER_LEN = 10

fin = open_fastq_file(fastq_in)
@@ -247,6 +248,8 @@ def wipe_fastq(fastq_in: str, fastq_out: str, log_out: str, log_frequency: int):

# global variables anchor
global tot_lines, clean_reads, seq_len_neq_qual_len
# Allowed characters of the SEQ line
reg = re.compile(f"^[{alphabet}]+$")

# Loop through 4-line reads
for line in fin:
@@ -264,7 +267,7 @@

# PROCESS THE SEQ LINE
line = read_next_line(fin, log_frequency)
raw_seq: str = check_seq_line(line.rstrip())
raw_seq: str = check_seq_line(line.rstrip(), reg)
if not raw_seq:
continue

@@ -311,10 +314,7 @@ def main():
help='Allowed character in the SEQ line. Default: ACGTN')
args = parser.parse_args()

# Allowed character of the SEQ line
reg = re.compile(f"^[{args.alphabet}]+$")

wipe_fastq(args.fastq_in, args.fastq_out, args.log_out, args.log_frequency)
wipe_fastq(args.fastq_in, args.fastq_out, args.log_out, args.log_frequency, args.alphabet)


if __name__ == "__main__":
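The wiper.py change follows a standard pattern: the allowed-alphabet regex is now compiled once inside `wipe_fastq()` and passed explicitly down to `check_seq_line()`, instead of being compiled into a local variable in `main()` that the checker never saw. A minimal sketch of the corrected flow (simplified names and logic, not the full wiper.py; note the compiled pattern is really an `re.Pattern`, not a `str` as the diff's annotation suggests):

```python
import re

def check_seq_line(line: str, reg: re.Pattern) -> str:
    # Keep the sequence only if every character is in the allowed alphabet
    return line if reg.match(line) else ""

def wipe_fastq(seq_lines, alphabet: str):
    # Compile once per file and hand the pattern down to the checker
    reg = re.compile(f"^[{alphabet}]+$")
    return [s for s in (check_seq_line(line, reg) for line in seq_lines) if s]

print(wipe_fastq(["ACGTN", "ACXTG"], "ACGTN"))  # 'X' is rejected -> ['ACGTN']
```

Passing the pattern as an argument also makes the checker trivially unit-testable, since it no longer depends on module-level state.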
2 changes: 2 additions & 0 deletions pipeline/fix_wipe_pairs_reads_parallel.smk
@@ -40,6 +40,8 @@ checkpoint split_fastq:
split -l {params.chunk_size} --numeric-suffixes {input} data/{wildcards.sample}_chunks/chunk --additional-suffix=.fastq
'''

ruleorder: wipe_fastq_parallel > aggregate

rule wipe_fastq_parallel:
input:
"data/{sample}_chunks/chunk{i}.fastq"
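The `ruleorder` directive added to both parallel pipelines resolves an ambiguity: when two Snakemake rules can produce the same output file, Snakemake refuses to choose between them unless told which rule takes precedence, and here `wipe_fastq_parallel` must win over `aggregate`. A minimal illustration of the directive (hypothetical rules, not the actual pipeline):

```
# When two rules could produce the same output, ruleorder picks the winner
ruleorder: rule_a > rule_b

rule rule_a:
    output: "data/{sample}.txt"
    shell: "echo A > {output}"

rule rule_b:
    output: "data/{sample}.txt"
    shell: "echo B > {output}"
```

Without the directive, an ambiguous match raises an `AmbiguousRuleException` (or silently picks a rule, depending on Snakemake version), which is presumably what broke parallel execution before this commit.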
1 change: 1 addition & 0 deletions pipeline/fix_wipe_single_reads_parallel.smk
@@ -40,6 +40,7 @@ checkpoint split_fastq:
split -l {params.chunk_size} --numeric-suffixes {input} data/{wildcards.sample}_chunks/chunk --additional-suffix=.fastq
'''

ruleorder: wipe_fastq_parallel > aggregate

rule wipe_fastq_parallel:
input:
