CPCA
The aim of classification is to group structurally different particles from a particle list into separate classes. You may want to classify directly after localization to discard obvious false positives or 'bad' particles, or you may want to classify subtomograms after alignment to gain insight into possible conformational heterogeneity of your molecules of interest.
In PyTom, there are currently two different methods for subtomogram classification implemented: (i) Constrained Principal Component Analysis (CPCA) in conjunction with k-means clustering and (ii) multiple correlation optimization (MCO). Both methods and their usage are explained in the following.
CPCA is explained in detail in the original publication, Foerster et al. 2008. Classification by CPCA consists of three major steps. First, the constrained correlation matrix of all subtomograms is computed. This matrix contains all pairwise constrained correlation coefficients (CCCs) of the subtomograms. Computing the CCC matrix is by far the most computationally demanding step, so it has been parallelized. Second, the principal components of this matrix are computed, and the data are compressed using a specified number of eigenvectors. Third, the reduced data are clustered using a k-means method.
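The three steps can be illustrated on synthetic data. The following is a minimal NumPy sketch, not PyTom's implementation: it substitutes a plain Pearson correlation for the constrained correlation (so no missing-wedge constraint is applied), and the function names `correlation_matrix`, `project_on_eigenvectors` and `kmeans` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def correlation_matrix(data):
    """Pairwise Pearson correlation between rows (one flattened volume per row).
    Assumption: stands in for the constrained correlation, which it is not."""
    centered = data - data.mean(axis=1, keepdims=True)
    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    return normed @ normed.T

def project_on_eigenvectors(ccc, neig):
    """Coordinates of each particle along the neig eigenvectors
    with the largest eigenvalues, scaled by sqrt(eigenvalue)."""
    eigvals, eigvecs = np.linalg.eigh(ccc)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:neig]      # indices of the largest ones
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

def kmeans(points, nclass, n_iter=50):
    """Plain Lloyd k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, nclass):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(nclass)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic "subtomograms": two underlying structures plus noise.
rng = np.random.default_rng(0)
template_a = rng.normal(size=512)
template_b = rng.normal(size=512)
particles = np.vstack(
    [template_a + 0.5 * rng.normal(size=512) for _ in range(10)]
    + [template_b + 0.5 * rng.normal(size=512) for _ in range(10)]
)

ccc = correlation_matrix(particles)              # step 1: correlation matrix
coords = project_on_eigenvectors(ccc, neig=2)    # step 2: reduce with eigenvectors
labels = kmeans(coords, nclass=2)                # step 3: k-means clustering
print(labels)
```

Step 1 scales quadratically with the number of particles, which is why it is the step worth distributing over several processes or GPUs.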
The CCC matrix is computed using the script
mpirun --hostfile "pathToYourHostfile" -c "numberOfCPUs" pytom
PathToPytom/classification/calculate_correlation_matrix.py -p "ParticleList"
-m "MaskFile" -f "Lowpass" -b "Binning" -g "gpuID(s)"
The arguments are:
- ParticleList: XML file containing the aligned subtomograms.
- MaskFile: File containing the mask that focuses the classification on a region of interest. Typically in EM format, but MRC or CCP4 are also possible. Make sure its dimensions are identical to those of the subtomograms.
- Lowpass: Frequency of lowpass filter. The lowpass filter is applied to all subtomograms after binning.
- Binning: Binning factor for the subtomograms. Binning greatly increases computational speed, but you must make sure that the feature that distinguishes the classes is still adequately resolved in the binned subtomograms. For many applications the effective pixel size should not be below ~2 nm (corresponding to a Nyquist resolution of 4 nm). A binning factor of 2 merges each block of 2x2x2 voxels into one voxel.
- -g, --gpuID: Index or indices of the GPUs to use. The CCC computation can run on multiple GPUs simultaneously. Separate multiple indices with commas and no spaces, for example 0,2,3,5. **Please note that the number of MPI processes should be one more than the number of GPUs you are using.**
- Verbose: More output.
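The arithmetic behind the Binning and gpuID arguments can be checked with a few lines of NumPy. This is an illustrative sketch only: the helper names are hypothetical, and averaging each block is an assumption about how binning combines voxels (PyTom's own binning routine may differ).

```python
import numpy as np

def bin_volume(volume, factor):
    """Average non-overlapping factor^3 blocks of a 3D volume.
    Assumption: block averaging; edges that do not fill a block are cropped."""
    nz, ny, nx = (s // factor for s in volume.shape)
    cropped = volume[:nz * factor, :ny * factor, :nx * factor]
    return cropped.reshape(nz, factor, ny, factor, nx, factor).mean(axis=(1, 3, 5))

def effective_pixel_size(pixel_size_nm, factor):
    """Pixel size after binning by the given factor."""
    return pixel_size_nm * factor

def nyquist_resolution(pixel_size_nm):
    """Finest resolvable period is two pixels."""
    return 2.0 * pixel_size_nm

def mpi_processes_for_gpus(gpu_ids):
    """Parse a comma-separated GPU list such as '0,2,3,5' and return
    the matching number of MPI processes (one more than the GPU count)."""
    ids = [int(i) for i in gpu_ids.split(",")]
    return len(ids) + 1

volume = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
binned = bin_volume(volume, 2)                 # 4x4x4 voxels become 2x2x2

print(binned.shape)
print(effective_pixel_size(1.0, 2))            # 1 nm pixels binned by 2
print(nyquist_resolution(effective_pixel_size(1.0, 2)))
print(mpi_processes_for_gpus("0,2,3,5"))       # four GPUs
```

With a hypothetical unbinned pixel size of 1 nm, a binning factor of 2 gives the ~2 nm effective pixel size and 4 nm Nyquist resolution mentioned above, and four GPUs call for five MPI processes.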
The CCC matrix is then used for classification with the script classifyCPCA.py. This script computes the eigenvectors of the CCC matrix and projects the data onto the first neig eigenvectors. Subsequently, these multidimensional vectors are clustered into nclass groups using a k-means algorithm. The usage is:
pytom PathToPytom/bin/classifyCPCA.py -p "ParticleList"
-o "OutputParticleList" -c "CCC" -e "neig" -n "nclass"
-a "Average"
- ParticleList: XML file containing the aligned subtomograms.
- OutputParticleList: Filename for generated XML file that includes the assigned classes for each particle.
- CCC: Filename of constrained correlation matrix. It will typically be correlation_matrix.csv.
- neig: Number of eigenvectors (those corresponding to the largest eigenvalues) used for clustering.
- nclass: Number of classes used for k-means classification.
- Average: Root for generated averages of the corresponding classes. The files will be called 'Average'_iclass.em.
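To make the relation between the assigned classes and the generated averages concrete, here is a small NumPy sketch of per-class averaging. It is not the classifyCPCA.py code: `class_averages` is a hypothetical helper, the volumes are kept in memory rather than written as EM files, and class numbering starting at 0 is an assumption.

```python
import numpy as np

def class_averages(volumes, labels, root="Average"):
    """Return {filename: mean volume} with one entry per class.
    File names follow the 'Average'_iclass.em pattern described above."""
    volumes = np.asarray(volumes, dtype=float)
    labels = np.asarray(labels)
    averages = {}
    for iclass in np.unique(labels):
        # Mean over all particles assigned to this class.
        averages[f"{root}_{iclass}.em"] = volumes[labels == iclass].mean(axis=0)
    return averages

# Four toy 2x2x2 "subtomograms" with constant density, two per class.
volumes = np.stack([np.full((2, 2, 2), v) for v in (1.0, 3.0, 10.0, 20.0)])
labels = [0, 0, 1, 1]

averages = class_averages(volumes, labels, root="Average")
for name, avg in averages.items():
    print(name, avg.mean())
```

Inspecting the class averages side by side is usually the quickest way to judge whether the chosen neig and nclass separate meaningful structural differences.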