Commit 86a4f8d

Create statistical pipeline and update readme
1 parent b8f07be commit 86a4f8d

5 files changed, +338 -5 lines changed

Pipelines.md

Lines changed: 105 additions & 4 deletions
@@ -7,7 +7,12 @@ Here three possible usage pipelines of MetaFast toolkit are presented. Each pipe
 * [Pipeline 2. Unique metagenomic features finder](#pipeline-2-unique-metagenomic-features-finder)
 * [Pipeline 3. Specific metagenomic features counter](#pipeline-3-specific-metagenomic-features-counter)
 * [Pipeline 4. Colored metagenomic features finder](#pipeline-4-colored-metagenomic-features-finder)
+* [Pipeline 5. Statistical metagenomic features finder](#pipeline-5-statistical-metagenomic-features-finder)
 * [Format conversion tools](#format-conversion-tools)
+* [Components to GFA converter](#components-to-gfa-converter)
+* [Binary to Fasta converter](#binary-to-fasta-converter)
+* [Binary to tab-separated converter](#binary-to-tab-separated-converter)
+* [Sequence to components converter](#sequence-to-components-converter)

 ## Pipeline 1. Metagenomic distance estimation

@@ -60,7 +65,7 @@ java -jar metafast.jar -t heatmap-maker -i <dist_matrix_<date>_<time>_original_o
 Pipeline for extracting **unique** features from groups of metagenomic samples and manipulating with them. Also, supports features construction for samples with unknown category for further category prediction.

 `
-java -jar metafast.jar -t unique-features -k <k> -pos <postiveFiles> -neg <negativeFiles> --min-samples <int> --max-samples <int> -b <int> [--split]
+java -jar metafast.jar -t unique-features -k <k> -pos <positiveFiles> -neg <negativeFiles> --min-samples <int> --max-samples <int> -b <int> [--split]
 `

 `k` – k-mer size
@@ -213,26 +218,122 @@ java -jar metafast.jar -t features-calculator -k <k> -cm <components.bin> -ka <*
 `


+## Pipeline 5. Statistical metagenomic features finder
+
+Pipeline for extracting statistically significant features from groups of metagenomic samples and manipulating them. It also supports feature construction for samples with unknown category, for subsequent category prediction.
+
+`
+java -jar metafast.jar -t stats-features -k <k> -pos <positiveFiles> -neg <negativeFiles> -pchi2 <float> -pmw <float> -b <int> [--split]
+`
+
+
+`k` – k-mer size
+
+`pos` – list of reads files from positive group
+
+`neg` – list of reads files from negative group
+
+`pchi2` – p-value threshold for the chi-squared test (default 0.05)
+
+`pmw` – p-value threshold for the Mann-Whitney test (default 0.05)
+
+`b` – maximal frequency for a k-mer to be assumed erroneous
+
+`split` – saves each component in a separate file
+
+**Output files in _workDir_:**
+
+`kmer-counter-posneg\pos\kmers\*.kmers.bin` – k-mers from input files from **positive** group in binary format
+
+`kmer-counter-posneg\neg\kmers\*.kmers.bin` – k-mers from input files from **negative** group in binary format
+
+`stats-kmers\kmers\filtered_groupA.kmers.bin` – statistically significant k-mers for **positive** group in binary format
+
+`stats-kmers\kmers\filtered_groupB.kmers.bin` – statistically significant k-mers for **negative** group in binary format
+
+`component-extractor\components.bin` – components built around statistically significant k-mers in binary format
+
+`comp2seq\kmers-counter-many\kmers\component_*.kmers.bin` – k-mers from components in binary format
+
+`comp2seq\kmers_fasta\component_*.fasta` – k-mers from components in fasta format
+
+`comp2seq\seq-builder-many\sequences\component_*.seq.fasta` – contigs from components in fasta format
+
+`features-calculator\vectors\*` – feature vectors of components' coverage by input files
+
+
+
+![Pipeline 5](img/pipe5.svg)
+
+Step-by-step data processing is shown in the image above. The tools are run in the following order:
+
+1. **K-mers counter**
+Extracts k-mers from each metagenomic sample and saves them in the internal binary format for further processing (`workDir/kmers/*.kmers.bin`). This step can be performed separately for metagenomes with known and unknown categories. For convenience, the explanations below refer to samples with known categories as _group\_1.kmers.bin_ and _group\_2.kmers.bin_ (one per category) and to samples with unknown category as _ungroupped.kmers.bin_.
+`
+java -jar metafast.jar -t kmer-counter-many -k <k> -i <inputFiles>
+`
+2. **Statistical k-mers finder**
+Extracts k-mers with a statistically significant difference in abundance between categories of samples.
+`
+java -jar metafast.jar -t stats-kmers -k <k> -A <group_1.kmers.bin> -B <group_2.kmers.bin> -pchi2 <float> -pmw <float>
+`
+3. **Component extractor**
+Extracts the local environment in the de Bruijn graph around the specified k-mers. These subgraph components can be used as features specific to the analyzed category (`workDir/components.bin`).
+`
+java -jar metafast.jar -t component-extractor -k <k> -i <group_1.kmers.bin> --pivot <filtered_groupA.kmers.bin>
+`
+4. **Features calculator**
+Counts the coverage of each component (subgraph) by k-mers for each metagenomic sample independently. For each sample, it outputs a numerical feature vector of coverages (`workDir/vectors/*.vec`). Feature vectors for samples with known categories can then be used to train a machine learning model that predicts categories for samples with unknown categories.
+`
+java -jar metafast.jar -t features-calculator -k <k> -cm <components.bin> -ka <*.kmers.bin>
+`
+5. **Sequence extractor**
+
+Splits specific subgraph components into linear genomic sequences, which can be used for analysis with external tools. It can be run via a single command:
+`
+java -jar metafast.jar -t comp2seq -k <k> -cf <components.bin> [--split]
+`
+The `--split` flag determines whether sequences from each component are saved into separate files. Resulting sequences can be found in `workDir/seq-builder-many/sequences/*.seq.fasta`. Alternatively, the following steps result in the same sequences:
+
+1. Transform subgraphs from binary format to fasta
+`
+java -jar metafast.jar -t bin2fasta -k <k> -cf <components.bin> -o <components.fasta>
+`
+1. Extract linear sequences from subgraphs' k-mers
+`
+java -jar metafast.jar -t seq-builder-many -k <k> -i <components.fasta>
+`
+
+

 ## Format conversion tools

-#### Binary to Fasta convertor
+
+#### Components to GFA converter
+
+Tool accepts de Bruijn graph components in internal MetaFast binary format and outputs them in GFA format, which can be further visualised via [Bandage](https://github.com/rrwick/Bandage). Graph node coverage is controlled with the `-i` parameter, which takes the samples' k-mers, and the `-cov` parameter, which sets the total coverage calculation mode.
+
+`
+java -jar metafast.jar -t comp2graph -k <k> -cf <binary components> [-i <samples' k-mers>] [-cov]
+`
+
+#### Binary to Fasta converter

 Tool accepts k-mers or components in internal MetaFast binary format and outputs them in fasta format. Components are printed as a set of k-mers and different components can be printed separately (`--split` option).

 `
 java -jar metafast.jar -t bin2fasta -k <k> {-kf <binary k-mers> | -cf <binary components>} [--split]
 `

-#### Binary to tab-separated convertor
+#### Binary to tab-separated converter

 Tool accepts k-mers or components in internal MetaFast binary format and outputs them in tab-separated format. First column contains k-mers and second column contains their respective number of occurences.

 `
 java -jar metafast.jar -t view -k <k> {-kf <binary k-mers> | -cf <binary components>}
 `

-#### Sequence to components convertor
+#### Sequence to components converter

 Tool accepts genomic sequences in fasta format and converts them to components, so that they can be used for features calculations.
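The Pipeline 5 usage line from `Pipelines.md` can be assembled programmatically. The sketch below is not part of the commit: `StatsFeaturesCommand` and its `build` method are hypothetical names, and it assumes that several reads files are passed to `-pos` and `-neg` as separate arguments. Only the flags documented above are used, and the example value `k = 31` follows the maximum stated in the tool's parameter description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StatsFeaturesCommand {

    /** Builds the argument list for one stats-features run (hypothetical helper). */
    public static List<String> build(String jarPath, int k,
                                     List<String> positiveFiles, List<String> negativeFiles,
                                     double pChi2, double pMw, int maxBadFrequency,
                                     boolean split) {
        if (positiveFiles.isEmpty() || negativeFiles.isEmpty()) {
            throw new IllegalArgumentException("Both groups need at least one reads file");
        }
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-jar", jarPath));
        cmd.addAll(Arrays.asList("-t", "stats-features"));
        cmd.addAll(Arrays.asList("-k", Integer.toString(k)));
        cmd.add("-pos");
        cmd.addAll(positiveFiles); // assumption: one argument per reads file
        cmd.add("-neg");
        cmd.addAll(negativeFiles);
        cmd.addAll(Arrays.asList("-pchi2", Double.toString(pChi2)));
        cmd.addAll(Arrays.asList("-pmw", Double.toString(pMw)));
        cmd.addAll(Arrays.asList("-b", Integer.toString(maxBadFrequency)));
        if (split) {
            cmd.add("--split");
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("metafast.jar", 31,
                Arrays.asList("pos_1.fastq", "pos_2.fastq"),
                Arrays.asList("neg_1.fastq"),
                0.05, 0.05, 1, false);
        // Print the command; it could be launched with new ProcessBuilder(cmd).inheritIO().start()
        System.out.println(String.join(" ", cmd));
    }
}
```

Returning a `List<String>` rather than one string keeps file names with spaces intact if the list is later handed to `ProcessBuilder`.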

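The statistical k-mers finder step names two tests, chi-squared and Mann-Whitney, but the diff does not show how `stats-kmers` computes them. As a rough illustration only, the sketch below computes the two test statistics for a single k-mer under assumptions of my own: the chi-squared test is applied to a 2x2 presence/absence table (group by k-mer present or absent), and the Mann-Whitney statistic is computed on per-sample k-mer counts. `KmerGroupStats` and its methods are hypothetical and not MetaFast code.

```java
public class KmerGroupStats {

    /** Pearson chi-squared statistic for a 2x2 presence/absence table, without continuity correction. */
    public static double chiSquared2x2(long posWith, long posWithout, long negWith, long negWithout) {
        double a = posWith, b = posWithout, c = negWith, d = negWithout;
        double n = a + b + c + d;
        double denom = (a + b) * (c + d) * (a + c) * (b + d);
        if (denom == 0) {
            return 0.0; // degenerate table: a whole row or column is empty
        }
        double diff = a * d - b * c;
        return n * diff * diff / denom;
    }

    /** Mann-Whitney U statistic of group A against group B; ties contribute 0.5. */
    public static double mannWhitneyU(int[] countsA, int[] countsB) {
        double u = 0;
        for (int x : countsA) {
            for (int y : countsB) {
                if (x > y) {
                    u += 1.0;
                } else if (x == y) {
                    u += 0.5;
                }
            }
        }
        return u;
    }

    public static void main(String[] args) {
        // Hypothetical k-mer present in 9 of 10 positive samples and 2 of 10 negative samples
        double chi2 = chiSquared2x2(9, 1, 2, 8);
        // 3.841 is the chi-squared critical value for 1 degree of freedom at p = 0.05
        System.out.println("chi2 = " + chi2 + ", significant at 0.05: " + (chi2 > 3.841));

        // Hypothetical per-sample counts of the same k-mer in two groups of three samples
        double u = mannWhitneyU(new int[]{5, 7, 9}, new int[]{1, 2, 6});
        System.out.println("U = " + u + " out of a maximum of " + (3 * 3));
    }
}
```

Turning U into a p-value additionally needs its null distribution or a normal approximation, which is left out here to keep the sketch short.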
README.md

Lines changed: 14 additions & 1 deletion
@@ -32,6 +32,7 @@ Here is a short version of it.
 * [Citation](#citation)
 * [Contact](#contact)
 * [License](#license)
+* [Publications using MetaFast](#publications-using-metafast)
 * [See also](#see-also)


@@ -135,14 +136,26 @@ Bioinformatics, 32(18), 2760-2767.

 ## Contact

-Please report any problems directly to the GitHub [issue tracker](https://github.com/ctlab/metafast/issues).<br/>
+Please report any problems directly to the GitHub [issue tracker](https://github.com/ctlab/metafast/issues). <br/>
 Also, you can send your feedback to [[email protected]](mailto:[email protected]).


 ## License

 The MIT License (MIT)

+## Publications using MetaFast
+
+Several papers describe bioinformatics projects that used MetaFast pipelines for data analysis:
+
+* Analysis of human gut microbiota of patients with Crohn's disease, ulcerative colitis and healthy controls <br/>
+Khachatryan, L., Xiang, Y., Ivanov, A., Glaab, E., Graham, G., Granata, I., ... & Poussin, C. (2023). Results and lessons learned from the sbv IMPROVER metagenomics diagnostics for inflammatory bowel disease challenge. Scientific Reports, 13(1), [doi: 10.1038/s41598-023-33050-0](https://doi.org/10.1038/s41598-023-33050-0)
+* Analysis of human gut microbiota of patients undergoing melanoma immunotherapy <br/>
+Olekhnovich, E. I., Ivanov, A. B., Babkina, A. A., Sokolov, A. A., Ulyantsev, V. I., Fedorov, D. E., & Ilina, E. N. (2023). Consistent Stool Metagenomic Biomarkers Associated with the Response To Melanoma Immunotherapy. mSystems, 8(2), e01023-22. [doi: 10.1128/msystems.01023-22](https://doi.org/10.1128/msystems.01023-22)
+* Analysis of gut microbiota time-series samples from patients undergoing microbiome transplantation <br/>
+Olekhnovich, E. I., Ivanov, A. B., Ulyantsev, V. I., & Ilina, E. N. (2021). Separation of donor and recipient microbial diversity allows determination of taxonomic and functional features of gut microbiota restructuring following fecal transplantation. mSystems, 6(4), e00811-21. [doi: 10.1128/msystems.00811-21](https://doi.org/10.1128/msystems.00811-21)
+
+

 ## See also

img/pipe5.svg

Lines changed: 3 additions & 0 deletions
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
+package tools;
+
+import ru.ifmo.genetics.Runner;
+import ru.ifmo.genetics.statistics.Timer;
+import ru.ifmo.genetics.utils.tool.ExecutionFailedException;
+import ru.ifmo.genetics.utils.tool.Parameter;
+import ru.ifmo.genetics.utils.tool.Tool;
+import ru.ifmo.genetics.utils.tool.inputParameterBuilder.BoolParameterBuilder;
+import ru.ifmo.genetics.utils.tool.inputParameterBuilder.DoubleParameterBuilder;
+import ru.ifmo.genetics.utils.tool.inputParameterBuilder.FileMVParameterBuilder;
+import ru.ifmo.genetics.utils.tool.inputParameterBuilder.IntParameterBuilder;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.PrintWriter;
+import java.text.SimpleDateFormat;
+
+public class StatsFeaturesBuilderMain extends Tool {
+    public static final String NAME = "stats-features";
+    public static final String DESCRIPTION = "Builds statistically significant features for group of metagenomic samples";
+
+
+    static {
+        forceParameter.set(true);
+        launchOptions.remove(forceParameter);
+    }
+
+    // input parameters
+    public final Parameter<Integer> k = addParameter(new IntParameterBuilder("k")
+            .mandatory()
+            .withShortOpt("k")
+            .withDescription("k-mer size (in nucleotides, maximum 31 due to realization details)")
+            .withDescriptionShort("k-mer size")
+            .withDescriptionRu("Длина k-мера при построении графа де Брейна")
+            .withDescriptionRuShort("Длина k-мера")
+            .create());
+
+    public final Parameter<File[]> positiveFiles = addParameter(new FileMVParameterBuilder("positiveReads")
+            .withShortOpt("pos")
+            .mandatory()
+            .withDescription("list of reads files from positive group. FASTQ, FASTA")
+            .create());
+
+    public final Parameter<File[]> negativeFiles = addParameter(new FileMVParameterBuilder("negativeReads")
+            .withShortOpt("neg")
+            .mandatory()
+            .withDescription("list of reads files from negative group. FASTQ, FASTA")
+            .create());
+
+    public final Parameter<Double> PValueChi2 = addParameter(new DoubleParameterBuilder("p-value-chi2")
+            .withShortOpt("pchi2")
+            .withDescription("p-value for chi-squared test")
+            .withDefaultValue(0.05)
+            .create());
+
+    public final Parameter<Double> PValueMW = addParameter(new DoubleParameterBuilder("p-value-mw")
+            .withShortOpt("pmw")
+            .withDescription("p-value for Mann-Whitney test")
+            .withDefaultValue(0.05)
+            .create());
+
+    public final Parameter<Boolean> splitComponents = addParameter(new BoolParameterBuilder("split")
+            .withDescription("Save each component in separate file?")
+            .withDefaultValue(false)
+            .create());
+
+    public final Parameter<Integer> maximalBadFrequency = addParameter(new IntParameterBuilder("maximal-bad-frequency")
+            .important()
+            .withShortOpt("b")
+            .withDescription("maximal frequency for a k-mer to be assumed erroneous")
+            .withDefaultValue(1)
+            .withDescriptionShort("Maximal bad frequency")
+            .withDescriptionRu("Максимальная частота ошибочного k-мера")
+            .withDescriptionRuShort("Максимальная частота ошибочного k-мера")
+            .create());
+
+
+    public final File[] outputDescFiles = new File[]{
+            new File("output_description.txt"),
+            workDir.append("output_description.txt").get()};
+
+
+
+    // adding sub tools
+    public KmersCounterPositiveNegative kmersCounterPosNeg = new KmersCounterPositiveNegative();
+    {
+        setFix(kmersCounterPosNeg.k, k);
+        setFix(kmersCounterPosNeg.positiveFiles, positiveFiles);
+        setFix(kmersCounterPosNeg.negativeFiles, negativeFiles);
+        setFix(kmersCounterPosNeg.maximalBadFrequency, maximalBadFrequency);
+        setFixDefault(kmersCounterPosNeg.outputDir);
+        kmersCounterPosNeg.outputDescFiles = outputDescFiles;
+        addSubTool(kmersCounterPosNeg);
+    }
+
+    public StatsKmersFinder statsKmers = new StatsKmersFinder();
+    {
+        setFix(statsKmers.Afiles, kmersCounterPosNeg.resultingPositiveKmerFiles);
+        setFix(statsKmers.Bfiles, kmersCounterPosNeg.resultingNegativeKmerFiles);
+        setFix(statsKmers.PValueChi2, PValueChi2);
+        setFix(statsKmers.PValueMW, PValueMW);
+        setFix(statsKmers.maximalBadFrequency, maximalBadFrequency);
+        setFixDefault(statsKmers.outputDir);
+        addSubTool(statsKmers);
+    }
+
+
+    public ComponentExtractorMain compExtractor = new ComponentExtractorMain();
+    {
+        setFix(compExtractor.k, k);
+        setFix(compExtractor.inputFiles, kmersCounterPosNeg.resultingPositiveKmerFiles);
+        setFix(compExtractor.pivotFiles, statsKmers.resultingKmerFiles);
+        addSubTool(compExtractor);
+    }
+
+    public FeaturesCalculatorMain featuresCalculator = new FeaturesCalculatorMain();
+    {
+        setFix(featuresCalculator.k, k);
+        setFix(featuresCalculator.kmersFiles, kmersCounterPosNeg.resultingPositiveKmerFiles);
+        setFix(featuresCalculator.componentsFile, compExtractor.componentsFile);
+        setFix(featuresCalculator.selectedKmers, statsKmers.resultingKmerFiles);
+        addSubTool(featuresCalculator);
+    }
+
+    public ComponentsToSequences comp2seq = new ComponentsToSequences();
+    {
+        setFix(comp2seq.k, k);
+        setFix(comp2seq.componentsFile, compExtractor.componentsFile);
+        setFix(comp2seq.splitComponents, splitComponents);
+        addSubTool(comp2seq);
+    }
+
+
+    private Timer t;
+
+    @Override
+    protected void runImpl() throws ExecutionFailedException {
+        // preparing
+        t = new Timer();
+
+        info("Found " + positiveFiles.get().length + " samples in positive class and "
+                + negativeFiles.get().length + " samples in negative class to process");
+        if (positiveFiles.get().length == 0 || negativeFiles.get().length == 0) {
+            throw new ExecutionFailedException("No libraries to process!!! Can't continue the calculations.");
+        }
+
+
+        outputDescFiles[1] = workDir.append("output_description.txt").get(); // updating workdir
+        createOutputDescFiles();
+
+        // running steps
+        addStep(kmersCounterPosNeg);
+        addStep(statsKmers);
+        addStep(compExtractor);
+        addStep(featuresCalculator);
+        addStep(comp2seq);
+    }
+
+    private void createOutputDescFiles() {
+        for (File f : outputDescFiles) {
+            try {
+                PrintWriter out = new PrintWriter(f);
+                out.println("# Output files' description for run started at " +
+                        new SimpleDateFormat("dd-MMM-yyyy (EEE) HH:mm:ss").format(startDate));
+                out.println();
+                for (File ff : getLogFiles()) {
+                    out.println(ff);
+                }
+                out.println(" Identical files with run log");
+
+                out.println();
+                for (File ff : outputDescFiles) {
+                    out.println(ff);
+                }
+                out.println(" Identical files with output files' description");
+                out.close();
+            } catch (FileNotFoundException e) {
+                warn("Can't create file " + f + ", skipping");
+                debug(e.getClass().getName() + ": " + e.getMessage());
+            }
+        }
+    }
+
+
+    @Override
+    protected void cleanImpl() {
+        debug("Statistically significant features builder has finished! Time = " + t);
+    }
+
+
+    @Override
+    public void mainImpl(String[] args) {
+        if (Runner.containsOption(args, Runner.getOptKeys(continueParameter)) ||
+                Runner.containsOption(args, Runner.getOptKeys(Tool.startParameter))) {
+            forceParameter.set(false);
+        }
+        super.mainImpl(args);
+    }
+
+    public static void main(String[] args) {
+        new StatsFeaturesBuilderMain().mainImpl(args);
+    }
+
+    public StatsFeaturesBuilderMain() {
+        super(NAME, DESCRIPTION);
+    }
+}
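In `createOutputDescFiles` above, the `PrintWriter` is closed by an explicit `out.close()` at the end of the `try` body, so it stays open if anything inside the body throws an unchecked exception. One possible alternative, shown here as a standalone sketch rather than a drop-in patch, is try-with-resources. `OutputDescriptionWriter` and its `write` method are hypothetical; the MetaFast `Tool` helpers (`warn`, `debug`, `getLogFiles`, `startDate`) are replaced by plain parameters and `System.err`.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

public class OutputDescriptionWriter {

    /** Writes one description file; returns false if the file could not be created. */
    public static boolean write(File target, Date startDate, List<File> logFiles, List<File> descFiles) {
        // try-with-resources closes the writer on every exit path, including exceptions
        try (PrintWriter out = new PrintWriter(target)) {
            out.println("# Output files' description for run started at "
                    + new SimpleDateFormat("dd-MMM-yyyy (EEE) HH:mm:ss").format(startDate));
            out.println();
            for (File f : logFiles) {
                out.println(f);
            }
            out.println(" Identical files with run log");
            out.println();
            for (File f : descFiles) {
                out.println(f);
            }
            out.println(" Identical files with output files' description");
            return true;
        } catch (FileNotFoundException e) {
            // stands in for the warn(...) and debug(...) calls of the original method
            System.err.println("Can't create file " + target + ", skipping");
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("output_description", ".txt");
        tmp.deleteOnExit();
        boolean ok = write(tmp, new Date(), Arrays.asList(new File("log.txt")), Arrays.asList(tmp));
        System.out.println(ok ? "Wrote " + tmp : "Skipped " + tmp);
    }
}
```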
