Hello,
This issue is related to #534 and #560, but I thought it was worth raising separately as it is not quite the same.
I have several fastqs containing reads with empty sequences, as described in #560, e.g.
@K00371:221:H2NKWBBXY:6:1104:25824:8260 2:N:0:AGTACAAG
AGGCCAACAGGTAGGTCTCTGAAAAATGAAGAACAGATATTCATAAGCTATAATGAAATAATTCAAACTTATTTCATTACCTCCCTTGAATACAGACTA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJ
@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
+
@K00371:221:H2NKWBBXY:6:1104:26250:8260 2:N:0:AGTACAAG
ATTTAGTATAATAAACATTACCAAATCTTTCTTTCCTAAGGCACCATTCTGATTTATAGGTCAGGCTGCCTGACTCTAAGGAAATAACTGGTAAGGATAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJ
@K00371:221:H2NKWBBXY:6:1104:26880:8260 2:N:0:AGTACAAG
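To find out how many such records a file contains, and where they are, something like the following could be used. It is only a minimal sketch: `find_empty_records` and `scan_fastq` are hypothetical helper names, not part of fastp, and the parser assumes an empty read appears either as a blank sequence line or as a `+` line directly after the header (as in the excerpt above).

```python
import gzip
from typing import Iterable, Iterator, Tuple


def find_empty_records(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, header) for every record with an empty sequence.

    Assumption: an empty read is either a blank sequence line, or the
    sequence line is missing so that '+' directly follows the header.
    """
    it = enumerate((ln.rstrip("\r\n") for ln in lines), start=1)
    for lineno, header in it:
        if not header:
            continue  # tolerate stray blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"line {lineno}: expected '@' header, got {header!r}")
        _, seq = next(it, (None, None))
        if seq is None:
            return  # file ends right after a header (truncated record)
        if seq.startswith("+"):
            # Sequence line is absent; the quality line is assumed absent too.
            yield lineno, header
            continue
        _, plus = next(it, (None, None))
        _, qual = next(it, (None, None))
        if plus is None or qual is None:
            return  # truncated final record
        if not plus.startswith("+"):
            raise ValueError(f"record at line {lineno}: expected '+', got {plus!r}")
        if seq == "":
            yield lineno, header  # blank sequence line


def scan_fastq(path: str) -> list:
    """Scan a plain or gzip-compressed FASTQ file for empty records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return list(find_empty_records(handle))
```

Calling `scan_fastq("fq2.fq.gz")` would then list the line number and header of each empty read, which can be compared with the header fastp prints before its error or warning.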
Running different versions of fastp on this file gives different outcomes. With version 0.24.1 I get the error:
@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
Expected '+', got
ERROR: '+' expected
With the latest version 1.0.1 I get a warning instead, but the run finishes:
@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
Expected '+', got
Your FASTQ may be invalid, please check the tail of your FASTQ file
WARNNIG: different read numbers of the 3435 pack
Read1 pack size: 256
Read2 pack size: 230
Ignore the unmatched reads
Read1 before filtering:
total reads: 879590
total bases: 87538237
Q20 bases: 86810309(99.1684%)
Q30 bases: 85308802(97.4532%)
Q40 bases: 75170632(85.8718%)
Read2 before filtering:
total reads: 879590
total bases: 87526817
Q20 bases: 86546608(98.8801%)
Q30 bases: 84989839(97.1015%)
Q40 bases: 74988693(85.6751%)
Read1 after filtering:
total reads: 876494
total bases: 87081659
Q20 bases: 86404863(99.2228%)
Q30 bases: 84958415(97.5618%)
Q40 bases: 74928827(86.0443%)
Read2 after filtering:
total reads: 876494
total bases: 87054021
Q20 bases: 86214331(99.0354%)
Q30 bases: 84730416(97.3308%)
Q40 bases: 74811097(85.9364%)
Filtering result:
reads passed filter: 1752988
reads failed due to low quality: 4938
reads failed due to too many N: 820
reads failed due to too short: 434
reads with adapter trimmed: 1854
bases trimmed due to adapters: 73802
Duplication rate: 7.83626%
Insert size peak (evaluated by paired-end reads): 163
But with the older version 0.20.0 it runs through with no issues.
Read1 before filtering:
total reads: 32508342
total bases: 3235377165
Q20 bases: 3201955685(98.967%)
Q30 bases: 3136726311(96.9509%)
Read2 before filtering:
total reads: 32508342
total bases: 3235090844
Q20 bases: 3175582248(98.1605%)
Q30 bases: 3085760868(95.3841%)
Read1 after filtering:
total reads: 32326557
total bases: 3213262362
Q20 bases: 3182580204(99.0451%)
Q30 bases: 3120584889(97.1158%)
Read2 aftering filtering:
total reads: 32326557
total bases: 3212818050
Q20 bases: 3162074455(98.4206%)
Q30 bases: 3076691989(95.763%)
Filtering result:
reads passed filter: 64653114
reads failed due to low quality: 343946
reads failed due to too many N: 5236
reads failed due to too short: 14388
reads with adapter trimmed: 826473
bases trimmed due to adapters: 9257428
Duplication rate: 11.3393%
Insert size peak (evaluated by paired-end reads): 161
Now, here's the worrying part. With version 0.20.0, the oldest, I get ~32.3M reads per file after filtering; with the latest version 1.0.1, I get only ~876K. The "before filtering" totals differ in the same way (32,508,342 vs 879,590 reads per file), which suggests that fastp 1.0.1 simply stops reading at the empty read rather than skipping it, and so throws away most of the reads, whereas 0.20.0 was seemingly able to deal with this appropriately.
Here's the fastp command I ran in all cases, on the same fastq files, changing only the fastp version:
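The numbers in the 1.0.1 warning are consistent with fastp stopping at the affected pack. The arithmetic below is a small check using only values copied from the logs above; it assumes that a full pack holds 256 reads and that the pack number in the warning is 0-based, neither of which I have confirmed in the fastp source.

```python
# Values copied from the fastp 1.0.1 log above.
PACK_SIZE = 256             # "Read1 pack size: 256" (assumed size of a full pack)
failing_pack_index = 3435   # "different read numbers of the 3435 pack" (assumed 0-based)
reads_in_short_pack = 230   # "Read2 pack size: 230"

# Reads consumed if processing ends inside the failing pack.
reads_seen = failing_pack_index * PACK_SIZE + reads_in_short_pack

total_reads_1_0_1 = 879590      # "total reads" before filtering, version 1.0.1
total_reads_0_20_0 = 32508342   # "total reads" before filtering, version 0.20.0

fraction_processed = total_reads_1_0_1 / total_reads_0_20_0

print(reads_seen)
print(reads_seen == total_reads_1_0_1)
print(f"{fraction_processed:.1%}")
```

Under those assumptions, 3435 full packs plus the 230 reads of the short pack give exactly the 879,590 reads that 1.0.1 reports before filtering, which is about 2.7% of the reads that 0.20.0 reports for the same input.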
fastp --in1 fq1.fq.gz --in2 fq2.fq.gz \
--out1 r1.fq.gz --out2 r2.fq.gz \
--length_required 36 \
--adapter_fasta "illumina.fa" \
--cut_mean_quality 10 \
--cut_window_size 4 \
-5 \
-3 \
--thread 1 \
--average_qual 20 \
--report_title "sample_name-flowcell-lane" \
--json sample_name-flowcell-lane.fastp.json