Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include pairtools protocol #162

Open
4 of 5 tasks
nservant opened this issue May 5, 2023 · 6 comments
Open
4 of 5 tasks

Include pairtools protocol #162

nservant opened this issue May 5, 2023 · 6 comments
Labels
enhancement New feature or request
Milestone

Comments

@nservant
Copy link
Collaborator

nservant commented May 5, 2023

Description of feature

Add a new paramater --processing hicpro or --processing pairtools

propose an alternative to HiC-Pro with the analysis protocol proposed by 'dovotail' and based on bwa-mem2 and pairtools with the following steps :

  • Mapping with bwa-mem (version 2 instead of 1)
  • pairtools parse - get valid ligation product / parameters --min_mapq
  • pairtools sort - sorting pairs
  • paritools merge - merge pairs
  • pairtools dedup - remove PCR dups / parameters --keep_dups
  • pairtools split - generate pairs files (and final bam if useful ?)
  • pairtools select - filter pairs
  • pairtools stats - generate final stats

To validate ;

  • Remove multi-hits - parameter --keep_multi ?
  • Compatibility of pairtools with --digestion, --restriction_site, --ligation_site, --chromosome_size, --restriction_fragments parameters ?
  • Filtering on fragment size with pairtools ? options --max_insert_size --min_insert_size --max_restriction_fragment_size --min_fragment_size
  • Mode --dnase with pairtools ?
  • Filtering based on the distance (--min_cis_dist) for --dnase mode with --pairtools

Other ideas:

  • Replace --dnase option by --no_digestion for DNAseq, microC, etc.
@nservant nservant added the enhancement New feature or request label May 5, 2023
@nservant nservant changed the title Include Dovotail protocol Include pairtools protocol May 5, 2023
nservant added a commit to nservant/nf-core-hic that referenced this issue Jun 15, 2023
@nservant
Copy link
Collaborator Author

First test version available with --processing pairtools

@nservant
Copy link
Collaborator Author

The options --keep_multi / --min_mapq / --min_cis_dist / --save_interaction_bam have been managed

@nservant nservant pinned this issue Jun 15, 2023
nservant added a commit to nservant/nf-core-hic that referenced this issue Jun 15, 2023
@nservant
Copy link
Collaborator Author

nservant commented Jun 21, 2023

First version tested.

To further validate before release ... still need to find a way to use the options :

  • --min_insert_size / --max_insertisize
  • --min_restriction_fragment_size / --max_restriction_fragment_size
       ext.args = { [
            "(mapq1>${params.min_mapq} and mapq2>${params.min_mapq})",
            params.min_cis_dist > 0 ? " and (abs(pos1-pos2) < ${params.min_cis_dist})" : '',
            params.keep_multi ? " and ((pair_type=='UU') or (pair_type=='UR') or (pair_type=='RU') or (pair_type=='MM') or (pair_type=='MU'))" : 
                                " and ((pair_type=='UU') or (pair_type=='UR') or (pair_type=='RU'))",
            params.dnase ? '' : " and (abs(int(rfrag1) - int(rfrag2)) > 1)",
            //params.min_insert_size > 0 ?  " and ( (rfrag_end1 - r1pos) + (rfrag_end2 - r2pos)) > ${params.min_insert_size}" : '',
            //params.max_insert_size > 0 ? " and ( (rfrag_end1 - r1pos) + (rfrag_end2 - r2pos)) < ${params.max_insert_size}" : '',
            //params.min_restriction_fragment_size > 0 ? " -t ${params.min_restriction_fragment_size}" : '',
            //params.max_restriction_fragment_size > 0 ? " -m ${params.max_restriction_fragment_size}" : '',
        ].join(' ').trim() }

@nservant nservant added this to the version-2.2.0 milestone Jan 26, 2024
@jeremymsimon
Copy link

jeremymsimon commented Jan 16, 2025

Hi @nservant
I have been using this version as a means of processing HiChIP data, with the eventual goal of running FitHiChIP on the outputs here, consistent with what is described in this vignette.

The pipeline runs fine for me with the following executable:

nextflow run nf-core-hic \
   -r dev \
   -c /jsimonlab/pipelines/nfcore/nextflow.config \
   --input HiChIP_nextflow_samplesheet.csv \
   --processing pairtools \
   --save_pairs_intermediates \
   --fasta GRCh38.primary_assembly.genome.fa \
   --bwa_index bwa_v0717/hg38 \
   --no_digestion \
   --min_cis_dist 1000 \
   --bin_size '5000,10000,20000,50000,100000,500000,1000000' \
   --outdir HiChIP_nextflow_hic_bwamem_pairtools

and in my pairtools directory, I have

*split.pairs.gz
*unselected.pairs.gz
*selected.pairs.gz.px2
*selected.pairs.gz
stats/

I thought that using the *selected.pairs.gz file as input to FitHiChIP would then work equivalently as the allValidPairs output from HiC-Pro, but I'm getting an error this way; I think it could be because these two files are formatted very slightly differently:

$ head /path/to/output.allValidPairs
GWNJ-0842:984:GW2205304471st:4:1216:15483:48459	GL000008.2	998	+	GL000008.2	1000	-	NA	NA	NA	42	23	
GWNJ-0842:984:GW2205304471st:4:1202:27590:7023	GL000008.2	1561	-	GL000008.2	3229	-	NA	NA	NA	23	42	
GWNJ-0842:984:GW2205304471st:4:1218:16782:47123	GL000008.2	1819	+	GL000008.2	2246	-	NA	NA	NA	25	42	
GWNJ-0842:984:GW2205304471st:4:1216:27874:61169	GL000008.2	2089	-	GL000008.2	2103	+	NA	NA	NA	40	42	
GWNJ-0842:984:GW2205304471st:4:2212:12357:64843	GL000008.2	2315	-	GL000008.2	2499	-	NA	NA	NA	23	40	
GWNJ-0842:984:GW2205304471st:4:1121:3021:68535	GL000008.2	2382	-	GL000008.2	2759	-	NA	NA	NA	40	23	
GWNJ-0842:984:GW2205304471st:4:2219:15179:69203	GL000008.2	2440	-	GL000008.2	2838	-	NA	NA	NA	42	42	
GWNJ-0842:984:GW2205304471st:4:1204:7415:51711	GL000008.2	2466	+	KI270722.1	180699	+	NA	NA	NA	24	26	
GWNJ-0842:984:GW2205304471st:4:2118:24688:47843	GL000008.2	2491	+	GL000008.2	3247	-	NA	NA	NA	23	42	
GWNJ-0842:984:GW2205304471st:4:2120:14519:44925	GL000008.2	2518	-	GL000008.2	2720	-	NA	NA	NA	42	24
$ zcat pairtools/DTG-HiChIP-pooled.selected.pairs.gz | grep -v '#' | head
GWNJ-0842:984:GW2205304471st:4:1119:13717:67937	GL000008.2	24	GL000008.2	1209	+	-	UU	60	60
GWNJ-0842:984:GW2205304471st:4:2205:11231:42464	GL000008.2	139	GL000008.2	1787	-	+	uU	55	22
GWNJ-0842:984:GW2205304471st:4:1116:24089:7989	GL000008.2	447	GL000008.2	3216	+	-	uU	17	37
GWNJ-0842:984:GW2205304471st:4:1224:27255:72121	GL000008.2	567	GL000008.2	1754	+	-	UU	60	60
GWNJ-0842:984:GW2205304471st:4:2205:16782:23460	GL000008.2	699	GL000008.2	2404	+	-	UU	31	60
GWNJ-0842:984:GW2205304471st:4:2219:13565:61239	GL000008.2	841	GL000008.2	2937	-	-	Uu	42	25
GWNJ-0842:984:GW2205304471st:4:2208:12997:16340	GL000008.2	1030	GL000008.2	5498	-	+	Uu	36	29
GWNJ-0842:984:GW2205304471st:4:1222:3123:30668	GL000008.2	1045	GL000008.2	111679	-	-	UU	60	22
GWNJ-0842:984:GW2205304471st:4:2218:27762:32461	GL000008.2	1091	GL000008.2	3762	+	+	uU	15	60
GWNJ-0842:984:GW2205304471st:4:1121:8643:35168	GL000008.2	1182	GL000008.2	3302	-	+	UU	60	38

Note the '#' header lines, but aside from those, the number of columns is different (12 vs 10) and the column with strand identifiers isn't the same

Is there a more appropriate file that the pipeline here creates that is equivalent to the allValidPairs format? Or do you have other suggestions for how to take the existing output here and run FitHiChIP?

Another option is to use the contact_maps/cool/*.cool files created here if the others are not compatible, which does seem to work if need be

Thanks!

@nservant
Copy link
Collaborator Author

nservant commented Feb 2, 2025

Hi @jeremymsimon
Sorry for being so slow to answer. This is because the HiC-Pro validPairs files are not exactly the same than the pairs files.
If you run the pip with --processing pairtools, the HiC-Pro outputs are not created ... If you want to have the allValidPairs file you must run --processing hicpro ... does it make sense ?
N

@jeremymsimon
Copy link

jeremymsimon commented Feb 3, 2025

Ah I see. So I'm trying to replicate the processing steps in the vignette linked above as closely as possible here, meaning I want to align with bwa and then adjust the reads with pairtools before proceeding with FitHiChIP etc (given my sample is prepped with MNase).

It sounds like from your description here that the current --processing pairtools workflow does not produce the valid pairs file I need, but the --processing hicpro workflow doesn't use the aligner I need.

Is there (or can there be) a middle-ground option? Perhaps an additional parameter we can pass in to run the bwa workflow but produce the valid pairs output? Or perhaps --save_pairs_intermediates can include this behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants