
Validation Errors

Andrey Isaev edited this page Feb 18, 2026 · 8 revisions

This page lists all possible validation errors and suggests how to debug each one:

BadAnnDataFile

Raised when the provided file does not have the correct .h5ad extension.

AnnDataMissingCountMatrix

Raised when the count matrix is not found in the AnnData file.

The raw matrix is expected to be stored as follows:

  • It is placed in anndata.raw.X if a raw layer exists.
  • It is placed in anndata.X if no raw layer exists.
  • It is stored in dense (numpy.ndarray) or sparse CSR (scipy.sparse.csr_matrix) format.
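The lookup rules above can be sketched as follows. This is a minimal illustration rather than the validator's own code: the helper names `find_count_matrix` and `is_supported_format` are hypothetical, and a `SimpleNamespace` stands in for a real AnnData object so the example runs without a file.

```python
from types import SimpleNamespace

import numpy as np
from scipy import sparse


def find_count_matrix(adata):
    """Return raw.X when a raw layer exists, otherwise X."""
    raw = getattr(adata, "raw", None)
    return raw.X if raw is not None else adata.X


def is_supported_format(matrix) -> bool:
    """True for a dense numpy.ndarray or a scipy.sparse.csr_matrix."""
    return isinstance(matrix, (np.ndarray, sparse.csr_matrix))


# Stand-in object for illustration only; a real AnnData exposes the same attributes.
counts = sparse.csr_matrix(np.array([[1, 0], [0, 2]]))
example = SimpleNamespace(X=counts.toarray() / 2, raw=SimpleNamespace(X=counts))
matrix = find_count_matrix(example)
print(is_supported_format(matrix))
```

With a real dataset, `find_count_matrix(adata)` can be called on the object returned by `anndata.read_h5ad`.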

AnnDataInvalidCountMatrix

Raised when the count matrix contains invalid values.

The raw matrix is valid if:

  • It consists of non-negative integer values only. Float data types are accepted as long as every value is a whole number.
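To pre-check a matrix before uploading, something like the sketch below may help. The function name `is_valid_count_matrix` is hypothetical, and the check reflects one reading of the rule above (whole, non-negative numbers, regardless of dtype); the validator's actual implementation may differ.

```python
import numpy as np
from scipy import sparse


def is_valid_count_matrix(matrix) -> bool:
    """True if every value is a non-negative whole number (any numeric dtype)."""
    # For sparse input only the explicitly stored entries need checking;
    # the implicit zeros are already valid counts.
    values = matrix.data if sparse.issparse(matrix) else np.asarray(matrix)
    if values.size == 0:
        return True
    non_negative = np.all(values >= 0)
    whole_numbers = np.all(np.mod(values, 1) == 0)
    return bool(non_negative and whole_numbers)
```

A normalized or log-transformed matrix fails this check because it contains fractional values, which is a common reason for this error.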

AnnDataMissingEmbeddings

Raised if valid embeddings are not found in the file.

The embedding is valid if:

  • It is placed in the anndata.obsm section
  • It is stored in dense (numpy.ndarray) form
  • Its shape is [n-cells x 2]
  • Its name starts with X_ (examples: X_umap, X_tsne_1, X_pca 2)
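The four conditions above can be checked together. The following is a rough sketch with a hypothetical helper name, `valid_embedding_keys`; it accepts any mapping of names to arrays, so `adata.obsm` can be passed directly.

```python
import numpy as np


def valid_embedding_keys(obsm, n_cells: int) -> list:
    """Return the obsm keys that satisfy every embedding rule listed above."""
    return [
        key
        for key, value in obsm.items()
        if key.startswith("X_")              # name starts with X_
        and isinstance(value, np.ndarray)    # stored in dense form
        and value.shape == (n_cells, 2)      # shape is [n-cells x 2]
    ]
```

For a real dataset the call would look like `valid_embedding_keys(adata.obsm, adata.n_obs)`. An empty result suggests that AnnDataMissingEmbeddings would be raised.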

AnnDataMissingObs

Raised when the .obs metadata table is missing.

The .obs section is required and must contain cell-level metadata.

AnnDataMisingObsColumns

Raised when one or more required general metadata columns are missing. Each AnnData file must contain the following columns in the anndata.obs section:

  • organism
  • assay
  • disease
  • tissue

All columns must contain valid string values. Supported values for the organism column are Homo sapiens and Mus musculus, but no error will be raised for other values. See the AnnDataNonStandardVarError section for details.
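A quick way to see which required columns are absent is sketched below. The helper name `missing_obs_columns` is hypothetical; the column list is taken from this section.

```python
import pandas as pd

# Required general metadata columns, as listed in this section.
REQUIRED_OBS_COLUMNS = ("organism", "assay", "disease", "tissue")


def missing_obs_columns(obs: pd.DataFrame) -> list:
    """Return the required column names that are absent from the obs table."""
    return [column for column in REQUIRED_OBS_COLUMNS if column not in obs.columns]
```

Calling `missing_obs_columns(adata.obs)` returns an empty list when all four columns are present.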

AnnDataEmptyOrNoneInGeneralMetadata

Raised when required general metadata columns exist, but contain missing or invalid values.

All required .obs metadata fields must be filled with valid values.
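To find the offending columns, a sketch along these lines may be useful. The helper name `columns_with_empty_values` is hypothetical, and treating None, NaN and blank strings as "missing or invalid" is an assumption; the validator may apply additional rules.

```python
import pandas as pd

REQUIRED_OBS_COLUMNS = ("organism", "assay", "disease", "tissue")


def _is_empty(value) -> bool:
    """Assumed definition of an empty value: None, NaN, or a blank string."""
    return bool(pd.isna(value)) or str(value).strip() == ""


def columns_with_empty_values(obs: pd.DataFrame, columns=REQUIRED_OBS_COLUMNS) -> list:
    """Return the required columns that contain at least one empty value."""
    return [
        column
        for column in columns
        if column in obs.columns and any(_is_empty(v) for v in obs[column])
    ]
```

Columns that are absent altogether are skipped here, since that case is reported separately by AnnDataMisingObsColumns.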

AnnDataNonStandardVarError

Raised when var.index does not consist of valid ENSEMBL gene IDs. This error is raised only for datasets with supported organisms. If the dataset contains multiple organisms, Homo sapiens ortholog gene IDs are required. For an unsupported organism, it is recommended to store gene symbols in the anndata.var["gene_names"] column.

The full list of supported genes can be accessed via the Gene Mapping module.

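If you have placed the ENSEMBL IDs in anndata.var.index but are still facing this issue, some genes might be missing from the CAP gene mapping. To check for missing genes, use the following example:

from cap_upload_validator.gene_mapping import GeneMap, EnsemblOrganism

# `adata` is assumed to be your dataset, already loaded (e.g. with anndata.read_h5ad).
# Load the CAP gene mapping for human, indexed by its first column.
cap_genes = GeneMap.data_frame(EnsemblOrganism.HUMAN.value, index_col=0)
# Keep the rows of adata.var whose index is not present in the CAP mapping.
missing_genes = adata.var.loc[~adata.var.index.isin(cap_genes.index)]
print(missing_genes)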

If you need to add ENSEMBL gene IDs for an already supported organism or want to request support for a new organism, please contact us via GitHub Issues or email us at support@celltype.info.

AnnDataMissingVarIndex

Raised when the .var.index is missing or empty.

The var.index is expected to contain gene identifiers (typically ENSEMBL IDs) for all features in the dataset.

AnnDataNumericVarIndex

Raised when the .var.index contains numeric values instead of gene identifiers.

Gene identifiers must be string-based ENSEMBL IDs, not integers or numeric placeholders.
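One possible way to spot numeric labels is sketched below. The helper name `numeric_gene_ids` is hypothetical, and flagging both true numbers and digit-only strings (such as "0", "1", "2", which appear when a default integer index is converted to strings) is an assumption about what the validator considers numeric.

```python
import numbers

import pandas as pd


def numeric_gene_ids(var_index: pd.Index) -> list:
    """Return index labels that are numbers or digit-only strings."""
    return [
        label
        for label in var_index
        if isinstance(label, numbers.Number) or str(label).isdigit()
    ]
```

If `numeric_gene_ids(adata.var.index)` returns anything, the gene identifiers were probably lost when the table was built and need to be restored from the original source.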

AnnDataVarNotSubsetOfRawVar

Raised when var.index is not a subset of raw.var.index.

When a raw layer exists, the processed gene list (adata.var.index) must remain consistent with the raw gene list.
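The subset condition can be checked with a few lines of pandas. This is a minimal sketch with a hypothetical helper name, `genes_missing_from_raw`.

```python
import pandas as pd


def genes_missing_from_raw(var_index: pd.Index, raw_var_index: pd.Index) -> list:
    """Return the labels of var_index that do not occur in raw_var_index."""
    return list(var_index[~var_index.isin(raw_var_index)])
```

With a real dataset the call would be `genes_missing_from_raw(adata.var.index, adata.raw.var.index)`. A non-empty result lists the genes that break the subset requirement.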

AnnDataGeneIndexIsNotUnique

Raised when var.index contains duplicate gene identifiers.

Gene identifiers must be unique across all features.
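Duplicates can be listed before upload with a sketch like this one (the helper name `duplicated_gene_ids` is hypothetical):

```python
import pandas as pd


def duplicated_gene_ids(var_index: pd.Index) -> list:
    """Return each identifier that occurs more than once, in order of first repeat."""
    # Index.duplicated() marks the second and later occurrences of a label.
    return list(var_index[var_index.duplicated()].unique())
```

Calling `duplicated_gene_ids(adata.var.index)` returns an empty list when every identifier is unique.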

AnnDataUnsupportedGenes

Raised when the file does not contain valid ENSEMBL gene identifiers in var.index.

We currently support only the following organisms:

  • Homo sapiens
  • Mus musculus

In the case of multiple species in the dataset, orthologous Homo sapiens gene IDs are required.

CSCMatrixInX

Raised when the gene expression matrix is stored in CSC (Compressed Sparse Column) format. This error occurs if anndata.X and/or anndata.raw.X is detected as a CSC matrix.

Only dense (numpy.ndarray) and CSR sparse (scipy.sparse.csr_matrix) formats are supported for gene expression matrices. CSC matrices are not supported and must be converted to CSR (e.g., using .tocsr()) before saving the AnnData file and re-running validation.
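The conversion mentioned above can be sketched as a small helper. The function name `to_supported_format` is hypothetical; the comments show how it might be applied to an AnnData object, under the assumption that `adata` is already loaded.

```python
import numpy as np
from scipy import sparse


def to_supported_format(matrix):
    """Convert a CSC matrix to CSR; return any other input unchanged."""
    if isinstance(matrix, sparse.csc_matrix):
        return matrix.tocsr()
    return matrix


# Possible usage with a loaded AnnData object (not executed here):
#   adata.X = to_supported_format(adata.X)
#   if adata.raw is not None:
#       raw = adata.raw.to_adata()   # editable copy of the raw layer
#       raw.X = to_supported_format(raw.X)
#       adata.raw = raw
#   adata.write_h5ad("converted.h5ad")  # hypothetical output path
```

The conversion changes only the storage layout, so the expression values stay the same.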