Updates for QC, tranches and optional EXIT-RIF gvcf dataset #92

Merged (33 commits) on Mar 29, 2022

Conversation

abhi18av
Member

@abhi18av abhi18av commented Mar 21, 2022

DRAFT PR do not merge yet.

Updates

  • Move the FASTQC/MULTIQC checks to the QUALITY_CHECK_WF stage to catch data corruption earlier 👉 fdb959b
  • Update the readme to point to the website 👉 66cc26d
  • Update the containers to reuse the same conda.yml file 👉 03fa87e
  • Add the optional GVCF file from the reference EXIT-RIF dataset 👉 aa6ed01

NOTE: This adds a requirement to run git lfs install, since the file is not downloaded properly without it; ordinary Git repositories cannot hold large files without git lfs. For now, I've sourced the file via HTTP, but it can also be downloaded as part of this repo if the git-lfs conda package is installed.

  • 🚧 Use the tranches file for computing the best set of annotations (MOVED TO A DIFFERENT PR) Tranches optimization #95
  • Tweak for the pipeline logic regression due to the updated CSV format 👉 95d450e
  • Remove the dead code for TB_PROFILER_LOAD_LIBRARY (initially needed for previous version of tb-profiler) 👉 82cd641
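
A quick way to tell whether the optional GVCF was fetched properly: without git lfs, a clone typically leaves a small text pointer in place of the real file. The following is a minimal Python sketch, assuming the standard Git LFS pointer format whose first line is `version https://git-lfs.github.com/spec/v1`. The function name `is_lfs_pointer` is hypothetical and is not part of this pipeline.

```python
# First line of a Git LFS pointer file (assumption: spec v1 pointers).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer, not the real file."""
    with open(path, "rb") as handle:
        # Only the first few bytes are needed to recognise a pointer.
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If this returns True for the GVCF path, the real content still has to be pulled with git lfs or downloaded over HTTP.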

Updated tasks after the meeting on 22-03-2022

  • Confirm whether the gzip file is corrupted within the QC_CHECK workflow; using the samples sent via Lennert, confirm whether FASTQC catches the corruption

    • Results of running gzip -t $fq -v directly (the truncated ERR779852_1.fastq.gz fails 🔴)

      (fastqc-env) PS /home/abhinav/projects/xbs-nf-dataset/bad-fastq-file-check> foreach ($fq in $listOfFastqs) {
      >> gzip -t $fq -v
      >> }
      ERR751371_1.fastq.gz:
       OK
      ERR751371_2.fastq.gz:
       OK
      ERR779852_1.fastq.gz:
      
      gzip: ERR779852_1.fastq.gz: unexpected end of file
      ERR779852_2.fastq.gz:
       OK
      
      
    • Corresponding results of fastqc $fq (fails ✅ for the bad quality ERR779852_1.fastq.gz and passes for all others)

      
      Approx 85% complete for ERR779852_1.fastq.gz
      Approx 90% complete for ERR779852_1.fastq.gz
      Approx 95% complete for ERR779852_1.fastq.gz
      Failed to process file ERR779852_1.fastq.gz
      uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry.  Your file is probably truncated
          at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:179)
          at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
          at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
          at java.base/java.lang.Thread.run(Thread.java:834)
      
      
  • Test without the optional EXIT-RIF GVCF file 👉 Update the parameters and mechanism to use optional input files #94
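
The gzip integrity check above can also be sketched in Python, for cases where calling gzip -t from a shell is inconvenient. This is an illustrative sketch only, not the check used in the QC workflow; the function name `gzip_is_intact` and the chunk size are assumptions. It decompresses the whole stream and treats any decompression error, including a premature end of file, as a failure.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end; the data itself is discarded.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: truncated stream; OSError: bad header or CRC mismatch;
        # zlib.error: corrupted compressed data.
        return False
    return True
```

For example, `[fq for fq in fastq_paths if not gzip_is_intact(fq)]` would list the files that fail, where `fastq_paths` is a hypothetical list of FASTQ paths.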

@abhi18av abhi18av changed the title Updates for QC, tranches and optional EXIT-RIF samples [DRAFT] Updates for QC, tranches and optional EXIT-RIF samples Mar 21, 2022
@abhi18av abhi18av changed the title [DRAFT] Updates for QC, tranches and optional EXIT-RIF samples [DRAFT] Updates for QC, tranches and optional EXIT-RIF gvcf dataset Mar 21, 2022
@abhi18av abhi18av self-assigned this Mar 22, 2022
Updated results_dirs #47, added a couple of notes and changed `-a` for LoFreq filtering to a more standard value.
fixed overzealous substitutions that often resulted in altered sample names
Also output a realigned bam file.
@abhi18av abhi18av linked an issue Mar 28, 2022 that may be closed by this pull request
abhi18av and others added 2 commits March 28, 2022 19:45
* add a viewer to prepare_cohort_workflow

* add another view

* reuse the TEST workflow

* fix braces mismatch

* test only till call workflow

* disable other flows

* test output of gatk_combine_gvcf as well

* print the computed value of optional file

* add optional logs and debug info

* tweak the output log

* remove print and log from combine process and reenable resistance analysis

* test directly against the parameter

* change the identifier for optional file

* Completely remove the optional exit-rif file

* update default parameter value to check if issue is due to overrides

* remove minimal NF version req

* add test profile

* update the gitignore file

* explicitly set the absent files as []

* add a dummy file

* add dummy files

* increase the test surface

* WORKING - using dummy files

* use the staged file within gatk combine process

* WORKING after integration

* simplify the usage of exit rif dataset

* use the staged file in the process

* simplify the user-interface for resistance_db parameter

* enable entire workflow again

* update the generation of file names

* update the comments

* update the maxForks for tbprofile-profile-lofreq to manage parallel who dataset on a cluster

* update the file name to source it from local folder

Co-authored-by: biosharp-ou <[email protected]>
@abhi18av
Member Author

abhi18av commented Mar 28, 2022

@TimHHH, for some reason the changes introduced in GATK_VARIANTS_TO_TABLE (7f62d81) are causing the SNPSITES process to fail with -> Warning: No SNPs were detected so there is nothing to output.

Full error message below

Error executing process > 'MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:SNPSITES (joint)'

Caused by:
  Process `MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:SNPSITES (joint)` terminated with an error exit status (1)

Command executed:

  snp-sites -o joint.variable.ExDR.ExComplex.fa joint.ExDR.ExComplex.fa

Command exit status:
  1

Command output:
  (empty)

Command error:
  Warning: No SNPs were detected so there is nothing to output.

Work dir:
  /home/biosharpou/xbs-nf-runs/xbs-nf/work/1f/dd422d5f82f6428aa557b3a7cf27d5

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
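
To narrow down a failure like this, it can help to check whether the alignment passed to snp-sites contains any variable column at all. The Python sketch below is a rough approximation under stated assumptions: it counts only columns with more than one distinct base among A, C, G and T, and the rules snp-sites itself applies to gaps and ambiguous bases may differ. The function names `read_fasta` and `count_variable_sites` are hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of {sequence name: upper-case sequence}."""
    parts = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                parts[name] = []
            elif name is not None:
                parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def count_variable_sites(sequences):
    """Count alignment columns showing more than one of A, C, G, T."""
    seqs = list(sequences.values())
    if not seqs:
        return 0
    # Guard against ragged input by using the shortest sequence length.
    length = min(len(s) for s in seqs)
    variable = 0
    for i in range(length):
        bases = {s[i] for s in seqs} & set("ACGT")
        if len(bases) > 1:
            variable += 1
    return variable
```

A result of 0 for joint.ExDR.ExComplex.fa would be consistent with the warning above and would point at the upstream steps rather than at snp-sites.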

@abhi18av abhi18av changed the title [DRAFT] Updates for QC, tranches and optional EXIT-RIF gvcf dataset Updates for QC, tranches and optional EXIT-RIF gvcf dataset Mar 29, 2022
@abhi18av abhi18av merged commit fea3bc9 into master Mar 29, 2022
@TimHHH
Collaborator

TimHHH commented Mar 29, 2022

@abhi18av I am not seeing any issue with the updated sed code, at least when I run it manually on some standard datasets. Could you check whether the input VCF for these processes has any content? E.g. zcat joint.filtered_SNP.ExDR.IncComplex.vcf.gz | grep -v "#" should give some lines of data. Alternatively, could you try again with a larger dataset (including EXIT-RIF)?
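
The same content check can be sketched in Python, which returns a count instead of printing the lines. This is a minimal sketch; the function name `count_vcf_records` is hypothetical. Note one small difference from the grep command: it skips only lines that start with "#", whereas grep -v "#" drops any line containing that character.

```python
import gzip


def count_vcf_records(path):
    """Count non-header, non-empty lines in a gzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

A count of 0 for joint.filtered_SNP.ExDR.IncComplex.vcf.gz would mean the VCF holds only header lines.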
