Principal Component Analysis

This method performs a principal component analysis (PCA), using ANGSD and ngsPopGen for the calculation. Please see the ngsPopGen documentation for full details on this method.

Basic Usage

To run this method, use the following command:

angsd-wrapper PCA Principal_Component_Analysis_Config

where Principal_Component_Analysis_Config is the full path to the configuration file for the PCA.

Input files

All inputs should be specified in Principal_Component_Analysis_Config.

Common Variables

This method uses the following variables from Common_Config:

Variable	Function
`SAMPLE_LIST` (`GROUP_SAMPLES` on the `dev` branch)	A list of samples to be used in calculations
`PROJECT`	Name given to all outputs in ANGSD-wrapper
`SCRATCH`	Directory in which to store files; the full output path is `SCRATCH/PROJECT/PCA`
`REGIONS`	Limit the scope of ANGSD-wrapper to certain regions

Method-Specific Variables

This method has no method-specific variables.

Method Parameters

The parameters for this method can be tweaked as necessary; the defaults have been chosen to work well for general use:

Parameter	Function
`DO_MAF`	Calculate per-site frequencies
`DO_MAJORMINOR`	Estimate major/minor alleles
`DO_GENO`	Call genotypes and set up the output
`DO_POST`	Calculate the posterior probability using per-site frequencies
`N_CORES`	Number of cores to use; do not set this above the number available on your system
`CALL`	Call genotype from maximum probability
`GT_LIKELIHOOD`	Estimates genotype likelihoods
`N_SITES`	Set the maximum number of sites to use

Output files

Naming Scheme	Contents
`PROJECT_PCA.arg`	Details of arguments
`PROJECT_PCA.covar`	Results of the principal component analysis
`PROJECT_PCA.geno`	Genotype calls
`PROJECT_PCA.mafs.gz`	Per-site frequencies
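
The naming scheme above, combined with the `SCRATCH/PROJECT/PCA` output directory, determines where each file should appear. The following Python sketch builds those expected paths so they can be checked after a run. The function name `expected_outputs` and the example values are hypothetical and are not part of ANGSD-wrapper; the sketch also assumes every output lands directly in `SCRATCH/PROJECT/PCA`.

```python
from pathlib import PurePosixPath

# Suffixes taken from the output table; descriptions are copied from it.
OUTPUT_SUFFIXES = {
    ".arg": "Details of arguments",
    ".covar": "Results of the principal component analysis",
    ".geno": "Genotype calls",
    ".mafs.gz": "Per-site frequencies",
}


def expected_outputs(scratch, project):
    """Return the expected output paths for a PCA run.

    Assumes (hypothetically) that all files are written directly to
    SCRATCH/PROJECT/PCA and are named PROJECT_PCA<suffix>.
    """
    out_dir = PurePosixPath(scratch) / project / "PCA"
    return [out_dir / f"{project}_PCA{suffix}" for suffix in OUTPUT_SUFFIXES]


# Example values are placeholders, not defaults of ANGSD-wrapper.
for path in expected_outputs("/scratch/user", "Example"):
    print(path)
```

Checking for these paths after a run is a quick way to confirm that every step completed before moving on to visualization.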

Visualization

PROJECT_PCA.covar (renamed to PROJECT_PCA.graph.me during processing) can be visualized with the Shiny graphing interface. A web browser with a graphical user interface is required.
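
For a quick check without the Shiny interface, the covariance matrix can also be inspected directly. The sketch below is not how ANGSD-wrapper or the Shiny app processes the file. It assumes the `.covar` file is a plain-text, whitespace-separated square matrix with one row per sample, and it uses simple power iteration (a stand-in for a full eigendecomposition) to estimate the share of variance captured by the first principal component. All function names are hypothetical, and the two-by-two matrix is a toy example, not real output.

```python
import math


def read_covar(text):
    """Parse whitespace-separated rows into a square list-of-lists matrix."""
    rows = [[float(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]
    if not rows or any(len(row) != len(rows) for row in rows):
        raise ValueError("expected a non-empty square matrix")
    return rows


def multiply(matrix, vector):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def leading_component(matrix, iterations=1000):
    """Estimate the largest eigenvalue and its eigenvector by power iteration."""
    # Non-uniform start vector: a uniform one can be orthogonal to the
    # leading eigenvector when rows of the matrix sum to zero.
    vector = [float(i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    # Rayleigh quotient gives the eigenvalue estimate for the final vector.
    rayleigh = (sum(v * p for v, p in zip(vector, multiply(matrix, vector)))
                / sum(v * v for v in vector))
    return rayleigh, vector


def variance_share(matrix):
    """Fraction of total variance (the trace) explained by the first component."""
    trace = sum(matrix[i][i] for i in range(len(matrix)))
    eigenvalue, _ = leading_component(matrix)
    return eigenvalue / trace


# Toy matrix standing in for the contents of PROJECT_PCA.covar.
example = "2.0\t1.0\n1.0\t2.0\n"
covar = read_covar(example)
eigenvalue, loadings = leading_component(covar)
print(eigenvalue, variance_share(covar), loadings)
```

Power iteration only recovers the first component; for the full set of components, a linear algebra library would be the more practical choice.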

(Optional) Subgroups

If known subgroups exist within the samples, then create a .clst clusters file that contains labels for each sample. The file should be formatted as follows, with the CLUSTER value determining the color of the data point within the final plot:

FID	IID	CLUSTER
[Sample Name]	1	[Group Name]
...
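
A clusters file in the layout above can be generated from a list of sample and group names. The following Python sketch is one way to do this; the function name `write_clst` and the sample and group names are hypothetical, and it assumes the columns are tab-separated with the `IID` column fixed at `1`, as in the template.

```python
import csv
import io


def write_clst(sample_groups, handle):
    """Write (sample name, group name) pairs in the FID/IID/CLUSTER layout.

    Assumes tab-separated columns and a constant IID of 1, following the
    template shown above.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["FID", "IID", "CLUSTER"])
    for sample, group in sample_groups:
        writer.writerow([sample, 1, group])


# Placeholder sample and group names for illustration only.
samples = [("Sample_A", "North"), ("Sample_B", "North"), ("Sample_C", "South")]

buffer = io.StringIO()
write_clst(samples, buffer)
print(buffer.getvalue(), end="")
```

To write a real file, pass an open file handle instead of the in-memory buffer, for example `open("Example.clst", "w", newline="")`, where the file name is again a placeholder.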
