
Commit
Fixed bug in regular expression and Snakemake rule order in parallel execution
mazzalab committed May 26, 2024
1 parent 97bd396 commit c7be89b
Showing 4 changed files with 19 additions and 15 deletions.
17 changes: 9 additions & 8 deletions README.md
@@ -71,7 +71,7 @@ There are a <b>QUICK</b> and a <b>SLOW</b> method to configure `FastqWiper`'s wor

2. Once the image has been downloaded, type:

CMD: `docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data" mazzalab/fastqwiper paired 8 sample 50000000 33 ACGTN`
CMD: `docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data" mazzalab/fastqwiper paired 8 sample 50000000 33 ACGTN 500000`

#### Another quick way (Singularity)
1. Pull the Singularity image from the Cloud Library:
@@ -80,7 +80,7 @@ CMD: `docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/

2. Once the image (e.g., fastqwiper.sif_2024.1.89.sif) has been downloaded, type:

CMD: `singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data --writable-tmpfs fastqwiper.sif_2024.1.89.sif paired 8 sample 50000000 33 ACGTN`
CMD: `singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data --writable-tmpfs fastqwiper.sif_2024.1.89.sif paired 8 sample 50000000 33 ACGTN 500000`

If you want to bind the `.singularity` cache folder and the `logs` folder, you can omit `--writable-tmpfs`, create the folders `.singularity` and `logs` (`mkdir .singularity logs`) on the host system, and use this command instead:

@@ -95,6 +95,7 @@ For both **Docker** and **Singularity**:
- `50000000` (optional) is the number of rows per chunk (used when cores>1; it must be a multiple of 4). Increasing this number too much reduces the advantage of parallelism; decreasing it too much creates more chunks than available CPUs, making parallelism inefficient. Choose this number wisely, depending on the total number of reads in your starting file.
- `33` (optional) is the ASCII offset (33=Sanger, 64=old Solexa)
- `ACGTN` (optional) is the allowed alphabet in the SEQ line of the FASTQ file
- `500000` (optional) is the logging frequency, expressed in number of reads
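As a rough rule of thumb for the rows-per-chunk value, aiming for about one chunk per core keeps every CPU busy without over-splitting the file. A hypothetical helper (not part of FastqWiper) sketching that calculation:

```python
def rows_per_chunk(total_reads: int, cores: int) -> int:
    """Suggest a rows-per-chunk value: roughly one chunk per core,
    rounded down to a multiple of 4 (each FASTQ read spans 4 lines)."""
    rows = (total_reads * 4) // cores
    return max(4, rows - rows % 4)

# 100M reads on 8 cores -> 50_000_000 rows per chunk
print(rows_per_chunk(100_000_000, 8))
```

With fewer reads or more cores, the helper shrinks the chunk size accordingly while keeping it a valid multiple of 4.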

### <code style="color : red">The slow way (Linux & Mac OS)</code>
To enable the use of preconfigured [pipelines](https://github.com/mazzalab/fastqwiper/tree/main/pipeline), you need to install **Snakemake**. The recommended way to install Snakemake is via Conda, because it enables **Snakemake** to [handle software dependencies of your workflow](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management).
@@ -149,14 +150,14 @@ Copy the fastq files you want to fix into the `data` folder.
#### Paired-end files

- **Get a dry run** of a pipeline (e.g., `fix_wipe_pairs_reads_sequential.smk`):<br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 4`
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 4 -np`

- **Generate the planned DAG**:<br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN -s pipeline/fix_wipe_pairs_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /> <br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /> <br />
<img src="https://github.com/mazzalab/fastqwiper/blob/main/pipeline/fix_wipe_pairs_reads.png?raw=true" width="400">

- **Run the pipeline** (N.B.: during the first execution, Snakemake will download and install some required remote packages, which may take a while). The number of computing cores can be tuned accordingly:<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2`
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 2`

Fixed files will be copied into the `data` folder and suffixed with the string `_fixed_wiped_paired_interleaving`.
Recall that the `fix_wipe_pairs_reads_sequential.smk` and `fix_wipe_pairs_reads_parallel.smk` pipelines perform the following actions:
@@ -169,14 +170,14 @@ We remind that the `fix_wipe_pairs_reads_sequential.smk` and `fix_wipe_pairs_rea
`fix_wipe_single_reads_parallel.smk` and `fix_wipe_single_reads_sequential.smk` will not execute `trimmomatic` and BBmap's `repair.sh`.

- **Get a dry run** of a pipeline (e.g., `fix_wipe_single_reads_sequential.smk`):<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2 -np`
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2 -np`

- **Generate the planned DAG**:<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /><br />
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /><br />
<img src="https://github.com/mazzalab/fastqwiper/blob/main/pipeline/fix_wipe_single_reads.png?raw=true" width="200">

- **Run the pipeline** (N.B.: the number of computing cores can be tuned accordingly):<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2`
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2`

# Author
**Tommaso Mazza**
14 changes: 7 additions & 7 deletions fastq_wiper/wiper.py
@@ -10,6 +10,7 @@
import logging

logging.basicConfig(level=logging.DEBUG)
reg = None

# region Variables for final report
blank: int = 0
@@ -94,7 +95,7 @@ def check_header_line(line: str) -> str:
return header


def check_seq_line(line: str) -> str:
def check_seq_line(line: str, reg: str) -> str:
global bad_seq
raw_seq: str = ""

@@ -233,7 +234,7 @@ def print_log_to_screen():
logging.warning(f"Blank lines: {blank}/{tot_lines}")


def wipe_fastq(fastq_in: str, fastq_out: str, log_out: str, log_frequency: int):
def wipe_fastq(fastq_in: str, fastq_out: str, log_out: str, log_frequency: int, alphabet: str):
# MIN_HEADER_LEN = 10

fin = open_fastq_file(fastq_in)
@@ -247,6 +248,8 @@ def wipe_fastq(fastq_in: str, fastq_out: str, log_out: str, log_frequency: int):

# global variables anchor
global tot_lines, clean_reads, seq_len_neq_qual_len
# Allowed characters of the SEQ line
reg = re.compile(f"^[{alphabet}]+$")

# Loop through 4-line reads
for line in fin:
@@ -264,7 +267,7 @@

# PROCESS THE SEQ LINE
line = read_next_line(fin, log_frequency)
raw_seq: str = check_seq_line(line.rstrip())
raw_seq: str = check_seq_line(line.rstrip(), reg)
if not raw_seq:
continue

@@ -311,10 +314,7 @@ def main():
help='Allowed character in the SEQ line. Default: ACGTN')
args = parser.parse_args()

# Allowed character of the SEQ line
reg = re.compile(f"^[{args.alphabet}]+$")

wipe_fastq(args.fastq_in, args.fastq_out, args.log_out, args.log_frequency)
wipe_fastq(args.fastq_in, args.fastq_out, args.log_out, args.log_frequency, args.alphabet)


if __name__ == "__main__":
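The wiper.py change follows a standard pattern: the allowed-alphabet regex is now compiled once inside `wipe_fastq()` and passed explicitly down to `check_seq_line()`, instead of being compiled into a local variable in `main()` that the checker never saw. A minimal sketch of the corrected flow (simplified names and logic, not the full wiper.py; note the compiled pattern is really an `re.Pattern`, not a `str` as the diff's annotation suggests):

```python
import re

def check_seq_line(line: str, reg: re.Pattern) -> str:
    # Keep the sequence only if every character is in the allowed alphabet
    return line if reg.match(line) else ""

def wipe_fastq(seq_lines, alphabet: str):
    # Compile once per file and hand the pattern down to the checker
    reg = re.compile(f"^[{alphabet}]+$")
    return [s for s in (check_seq_line(line, reg) for line in seq_lines) if s]

print(wipe_fastq(["ACGTN", "ACXTG"], "ACGTN"))  # 'X' is rejected -> ['ACGTN']
```

Passing the pattern as an argument also makes the checker trivially unit-testable, since it no longer depends on module-level state.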
2 changes: 2 additions & 0 deletions pipeline/fix_wipe_pairs_reads_parallel.smk
@@ -40,6 +40,8 @@ checkpoint split_fastq:
split -l {params.chunk_size} --numeric-suffixes {input} data/{wildcards.sample}_chunks/chunk --additional-suffix=.fastq
'''

ruleorder: wipe_fastq_parallel > aggregate

rule wipe_fastq_parallel:
input:
"data/{sample}_chunks/chunk{i}.fastq"
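The `ruleorder` directive added to both parallel pipelines resolves an ambiguity: when two Snakemake rules can produce the same output file, Snakemake refuses to choose between them unless told which rule takes precedence, and here `wipe_fastq_parallel` must win over `aggregate`. A minimal illustration of the directive (hypothetical rules, not the actual pipeline):

```
# When two rules could produce the same output, ruleorder picks the winner
ruleorder: rule_a > rule_b

rule rule_a:
    output: "data/{sample}.txt"
    shell: "echo A > {output}"

rule rule_b:
    output: "data/{sample}.txt"
    shell: "echo B > {output}"
```

Without the directive, an ambiguous match raises an `AmbiguousRuleException` (or silently picks a rule, depending on Snakemake version), which is presumably what broke parallel execution before this commit.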
1 change: 1 addition & 0 deletions pipeline/fix_wipe_single_reads_parallel.smk
@@ -40,6 +40,7 @@ checkpoint split_fastq:
split -l {params.chunk_size} --numeric-suffixes {input} data/{wildcards.sample}_chunks/chunk --additional-suffix=.fastq
'''

ruleorder: wipe_fastq_parallel > aggregate

rule wipe_fastq_parallel:
input:
