Validation Errors
This page lists all possible validation errors, along with suggested debugging directions for each:
Raised when the provided file does not have the correct .h5ad extension.
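A quick way to rule this error out before uploading is to check the file name locally. The sketch below is only an illustration: the helper name has_h5ad_extension is hypothetical, and the case-sensitive comparison is an assumption rather than the validator's documented rule.

```python
from pathlib import Path


def has_h5ad_extension(path: str) -> bool:
    """Return True if the final suffix of the file name is exactly .h5ad.

    Hypothetical helper; assumes a case-sensitive comparison.
    """
    return Path(path).suffix == ".h5ad"
```

Note that a compressed name such as sample.h5ad.gz fails this check, because only the last suffix is inspected.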
Raised when the count matrix is not found in the AnnData file.
The raw matrix is expected to be stored as follows:
- It is placed in anndata.raw.X if a raw layer exists.
- It is placed in anndata.X if no raw layer exists.
- It is stored in dense (numpy.ndarray) or sparse CSR (scipy.sparse.csr_matrix) format.
The raw matrix is valid if:
- It consists of non-negative integer values only (a float data type is accepted as long as every value is a whole number).
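The storage and value rules above can be approximated locally with NumPy and SciPy. This is a minimal sketch, not the validator's own implementation: the function name is_valid_count_matrix is hypothetical, and treating an empty matrix as valid is an assumption.

```python
import numpy as np
from scipy import sparse


def is_valid_count_matrix(matrix) -> bool:
    """Hypothetical check: dense ndarray or CSR, non-negative whole numbers only."""
    if sparse.issparse(matrix):
        # Only the CSR sparse format is accepted
        if matrix.format != "csr":
            return False
        # Implicit zeros are valid, so only explicitly stored values need checking
        values = matrix.data
    elif isinstance(matrix, np.ndarray):
        values = matrix
    else:
        return False
    if values.size == 0:
        return True  # assumption: nothing to reject in an empty matrix
    # Float dtypes pass as long as every value is a non-negative whole number
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))
```

For an AnnData object named adata, the matrix to test would be adata.raw.X if adata.raw is not None, otherwise adata.X.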
Raised if valid embeddings are not found in the file.
The embedding is valid if:
- It is placed in the anndata.obsm section.
- It is stored in dense (numpy.ndarray) form.
- Its shape is [n-cells x 2].
- Its name starts with X_ (examples: X_umap, X_tsne_1, X_pca 2).
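To see which entries of anndata.obsm would satisfy the four rules above, something like the following sketch may help. The helper name find_valid_embeddings is hypothetical, and the checks are a plain reading of the listed rules rather than the validator's actual code.

```python
import numpy as np


def find_valid_embeddings(obsm, n_cells: int) -> list:
    """Hypothetical helper: names of obsm entries that meet the listed rules."""
    valid = []
    for name in obsm.keys():
        value = obsm[name]
        if (
            name.startswith("X_")                # name starts with X_
            and isinstance(value, np.ndarray)    # stored in dense form
            and value.shape == (n_cells, 2)      # shape is [n-cells x 2]
        ):
            valid.append(name)
    return valid
```

With an AnnData object named adata, this could be called as find_valid_embeddings(adata.obsm, adata.n_obs). An empty result suggests this error would be raised.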
Raised when the .obs metadata table is missing.
The .obs section is required and must contain cell-level metadata.
Raised when one or more required general metadata columns are missing. Each AnnData file must contain the following columns in the anndata.obs section:
- organism
- assay
- disease
- tissue
All columns must contain valid string values. Supported values for the organism column are Homo sapiens and Mus musculus, but no error will be raised for other values. See the AnnDataNonStandardVarError section for details.
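The required columns can be checked before upload with a few lines of plain Python. In this sketch the names REQUIRED_OBS_COLUMNS and missing_obs_columns are hypothetical; only the four column names come from the list above.

```python
# Column names taken from the list above; the constant name is illustrative
REQUIRED_OBS_COLUMNS = ("organism", "assay", "disease", "tissue")


def missing_obs_columns(columns) -> list:
    """Hypothetical helper: required columns absent from the given column names."""
    present = set(columns)
    return [col for col in REQUIRED_OBS_COLUMNS if col not in present]
```

For an AnnData object named adata, missing_obs_columns(adata.obs.columns) returns the columns that still need to be added.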
Raised when required general metadata columns exist, but contain missing or invalid values.
All required .obs metadata fields must be filled with valid values.
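To locate the offending cells, it can help to count missing entries per column. The sketch below assumes that "missing or invalid" means NaN/None or a blank string; the validator may apply stricter rules. The function name columns_with_empty_values is hypothetical.

```python
import pandas as pd


def columns_with_empty_values(obs: pd.DataFrame, columns) -> dict:
    """Hypothetical helper: map each column to its count of NaN or blank entries."""
    problems = {}
    for col in columns:
        series = obs[col]
        # Assumption: a value is unusable if it is NaN/None or whitespace only
        is_missing = series.isna() | (series.astype(str).str.strip() == "")
        count = int(is_missing.sum())
        if count:
            problems[col] = count
    return problems
```

Calling it on adata.obs with the four required column names returns an empty dictionary when every required field is filled.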
Raised when var.index is not a list of proper gene ENSEMBL IDs. This error will only be raised for datasets with supported organisms. In the case of multiple organisms in the dataset, Homo sapiens ortholog gene IDs are required. For an unsupported organism, it is recommended to store gene symbols in the anndata.var["gene_names"] column.
The full list of supported genes can be accessed via the Gene Mapping module.
If you have placed the ENSEMBL IDs in anndata.var.index but are still facing this issue, some genes might be missing from the CAP mapping. To check for missing genes, use the following example:
```python
from cap_upload_validator.gene_mapping import GeneMap, EnsemblOrganism

# `adata` is the AnnData object being validated, loaded beforehand
cap_genes = GeneMap.data_frame(EnsemblOrganism.HUMAN.value, index_col=0)
# Select the rows of adata.var whose IDs are absent from the CAP mapping
missing_genes = adata.var.loc[~adata.var.index.isin(cap_genes.index)]
print(missing_genes)
```

If you need to add additional gene ENSEMBL IDs for already supported organisms or request support for a new organism, please contact us via GitHub Issues or email us at support@celltype.info.
Raised when the .var.index is missing or empty.
The var.index is expected to contain gene identifiers (typically ENSEMBL IDs) for all features in the dataset.
Raised when the .var.index contains numeric values instead of gene identifiers.
Gene identifiers must be string-based ENSEMBL IDs, not integers or numeric placeholders.
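A numeric index often appears when a table is loaded without its gene column set as the index. The following sketch flags such entries. The helper name numeric_gene_ids is hypothetical, and treating digit-only strings as numeric is an assumption about what the validator rejects.

```python
def numeric_gene_ids(gene_ids) -> list:
    """Hypothetical helper: entries that are non-strings or digit-only strings."""
    return [
        gene
        for gene in gene_ids
        # Flag anything that is not a string, plus strings made of digits only
        if not isinstance(gene, str) or gene.strip().isdigit()
    ]
```

For an AnnData object named adata, numeric_gene_ids(adata.var.index) should return an empty list once the index holds ENSEMBL IDs.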
Raised when var.index is not a subset of raw.var.index.
When a raw layer exists, the processed gene list (adata.var.index) must remain consistent with the raw gene list.
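The subset relationship can be verified by listing the processed genes that do not appear in the raw gene list. This is a minimal sketch with a hypothetical helper name, genes_missing_from_raw.

```python
def genes_missing_from_raw(var_index, raw_var_index) -> list:
    """Hypothetical helper: genes in var_index that are absent from raw_var_index."""
    raw_genes = set(raw_var_index)
    return [gene for gene in var_index if gene not in raw_genes]
```

For an AnnData object named adata with a raw layer, genes_missing_from_raw(adata.var.index, adata.raw.var.index) returns the identifiers that break the subset rule. An empty list means the two gene lists are consistent.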
Raised when var.index contains duplicate gene identifiers.
Gene identifiers must be unique across all features.
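Duplicates can be listed with the standard library alone. The helper name duplicated_gene_ids in this sketch is hypothetical.

```python
from collections import Counter


def duplicated_gene_ids(gene_ids) -> dict:
    """Hypothetical helper: map each repeated identifier to its occurrence count."""
    counts = Counter(gene_ids)
    return {gene: n for gene, n in counts.items() if n > 1}
```

For an AnnData object named adata, duplicated_gene_ids(adata.var.index) shows which identifiers need to be merged or removed.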
Raised when the file does not contain valid ENSEMBL gene identifiers in var.index.
We currently support only the following organisms:
- Homo sapiens
- Mus musculus
In the case of multiple species in the dataset, orthologous Homo sapiens gene IDs are required.
Raised when the gene expression matrix is stored in CSC (Compressed Sparse Column) format. This error occurs if anndata.X and/or anndata.raw.X is detected as a CSC matrix.
Only dense (numpy.ndarray) and CSR sparse (scipy.sparse.csr_matrix) formats are supported for gene expression matrices. CSC matrices are not supported and must be converted to CSR (e.g., using .tocsr()) before saving the AnnData file and re-running validation.
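The conversion mentioned above can be wrapped in a small function that leaves dense and CSR inputs untouched. This is a sketch under the assumption that SciPy is available; the name ensure_csr is hypothetical.

```python
import numpy as np
from scipy import sparse


def ensure_csr(matrix):
    """Hypothetical helper: convert a CSC matrix to CSR, return other inputs as-is."""
    if sparse.issparse(matrix) and matrix.format == "csc":
        return matrix.tocsr()
    return matrix
```

For an AnnData object named adata, the processed matrix can be fixed with adata.X = ensure_csr(adata.X). If the raw layer is also stored as CSC, one possible approach is to copy it out with adata.raw.to_adata(), convert the X of that copy, and assign the copy back to adata.raw before saving the file again.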