CPCA

Classification using Constrained principal component analysis and k-means clustering

Overview

The aim of classification is to group structurally different particles from a particle list into different bins. You may want to use classification directly after localization to discard obvious false positives or 'bad' particles, or you may want to classify subtomograms after alignment to gain insight into possible conformational heterogeneity of your molecules of interest.

PyTom currently implements two methods for subtomogram classification: (i) Constrained Principal Component Analysis (CPCA) in conjunction with k-means clustering and (ii) multiple correlation optimization (MCO). Both methods and their usage are explained in the following.

CPCA-based classification using calculate_correlation_matrix.py and classifyCPCA.py

CPCA is explained in detail in the original publication, Foerster et al. 2008. Classification by CPCA consists of three major steps. First, the constrained correlation matrix of all subtomograms is computed. This matrix contains all pairwise constrained correlation coefficients (CCCs) of the subtomograms. Computing the CCC matrix is by far the most computationally demanding step, so we have parallelized it. Second, the principal components of this matrix are computed. In this step, the data are compressed using a specified number of eigenvectors. Third, the reduced data are clustered using a k-means method.

The CCC matrix is computed using the script calculate_correlation_matrix.py:


      mpirun --hostfile "pathToYourHostfile" -c "numberOfCPUs" pytom
      PathToPytom/classification/calculate_correlation_matrix.py -p "ParticleList"
      -m "MaskFile" -f "Lowpass" -b "Binning" -g "gpuID(s)"

The arguments are:

ParticleList: XML file containing the aligned subtomograms.
MaskFile: File containing the mask that focuses the classification. Typically in EM format, but MRC or CCP4 are also possible. Make sure its dimensions are identical to those of the subtomograms.
Lowpass: Frequency of the lowpass filter. The lowpass filter is applied to all subtomograms after binning.
Binning: Binning factor for the subtomograms. Binning greatly increases computational speed, but you must make sure that the feature of interest is still appropriately resolved in the subtomograms. For many applications the effective pixel size should not be below ~2 nm (corresponding to Nyquist of 4 nm). A binning factor of 2 makes 1 voxel out of 2x2x2 voxels.
gpuID(s) (-g, --gpuID): Index or indices of the GPUs you want to use. The CCC computation can run on multiple GPUs simultaneously. The indices of multiple GPUs are separated by commas (no spaces), for example 0,2,3,5. **Please note that the number of MPI processes should be one more than the number of GPUs you are using.**
Verbose: More output.
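
The two pieces of arithmetic in the argument list above (one MPI process more than the number of GPUs, and the effective pixel size after binning) can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical helpers and are not part of PyTom, and the Nyquist value is taken as twice the effective pixel size, as in the example given above.

```python
def mpi_process_count(gpu_ids: str) -> int:
    """Number of MPI processes for a comma-separated GPU list such as '0,2,3,5'.

    Follows the rule stated above: one process more than the number of GPUs.
    (Hypothetical helper, not a PyTom function.)
    """
    ids = [int(token) for token in gpu_ids.split(",") if token.strip()]
    return len(ids) + 1


def effective_pixel_size_nm(pixel_size_nm: float, binning: int) -> float:
    """Pixel size after binning: the original pixel size times the binning factor."""
    return pixel_size_nm * binning


def nyquist_nm(pixel_size_nm: float, binning: int) -> float:
    """Nyquist limit after binning, taken as twice the effective pixel size."""
    return 2 * effective_pixel_size_nm(pixel_size_nm, binning)
```

For instance, mpi_process_count("0,2,3,5") gives 5, which is the value to pass as numberOfCPUs when using those four GPUs.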

The script will generate a file called 'correlation_matrix.csv'. It contains the CCC matrix in ASCII format.
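
Before classifying, it can be useful to inspect the matrix file. The sketch below loads it with NumPy and checks that it is square and symmetric, as expected for a matrix of pairwise coefficients. It assumes the file is comma-delimited, which is not confirmed here; adjust the delimiter if your file differs. The function names are hypothetical and not part of PyTom.

```python
import numpy as np


def load_ccc(path, delimiter=","):
    """Load a correlation matrix from an ASCII file.

    Assumption: values are comma-separated; pass another delimiter if needed.
    """
    ccc = np.loadtxt(path, delimiter=delimiter, ndmin=2)
    if ccc.shape[0] != ccc.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {ccc.shape}")
    return ccc


def is_symmetric(ccc, tol=1e-6):
    """Return True if the matrix equals its transpose within the tolerance."""
    return bool(np.allclose(ccc, ccc.T, atol=tol))
```

A typical call would be ccc = load_ccc("correlation_matrix.csv"), after which ccc.shape[0] should equal the number of particles in the particle list.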

The CCC matrix is then used for classification with the script classifyCPCA.py. This script computes the eigenvectors of the CCC matrix and projects the data onto the first neig eigenvectors. Subsequently, these multidimensional vectors are clustered into nclass groups using a k-means algorithm. The usage is:


      pytom PathToPytom/bin/classifyCPCA.py -p "ParticleList"
      -o "OutputParticleList" -c "CCC" -e "neig" -n "nclass"
      -a "Average"

In detail, the parameters of the script are:

ParticleList: XML file containing the aligned subtomograms.
OutputParticleList: Filename for generated XML file that includes the assigned classes for each particle.
CCC: Filename of constrained correlation matrix. It will typically be correlation_matrix.csv.
neig: Number of eigenvectors (those corresponding to the largest eigenvalues) used for clustering.
nclass: Number of classes used for k-means classification.
Average: Filename root for the generated averages of the corresponding classes. The files will be called 'Average'_iclass.em.

The outputs of the method are the ParticleList with assigned classes and the corresponding class averages.
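
To illustrate what the eigenvector projection and the k-means step do, here is a minimal NumPy sketch. It is not the code used by classifyCPCA.py: the function names are hypothetical, the projection is assumed to be the product of the matrix with its leading eigenvectors, and the k-means routine is a plain Lloyd iteration with a simple deterministic farthest-point initialization chosen for reproducibility.

```python
import numpy as np


def project_on_eigenvectors(ccc, neig):
    """Project the rows of a symmetric matrix onto its neig leading eigenvectors."""
    eigenvalues, eigenvectors = np.linalg.eigh(ccc)   # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1][:neig]      # indices of the largest ones
    return ccc @ eigenvectors[:, order]               # shape: (n_particles, neig)


def kmeans_labels(points, nclass, n_iter=100):
    """Plain Lloyd k-means; returns one class index per row of points."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [points[0]]
    for _ in range(1, nclass):
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        centers.append(points[int(np.argmax(nearest))])
    centers = np.array(centers)

    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a class became empty.
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(nclass)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

With a correlation matrix loaded into an array ccc, kmeans_labels(project_on_eigenvectors(ccc, neig), nclass) returns one class index per particle, in the order of the rows of the matrix.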
