Commit 24a9cca

SoP based metadata columns and the validation script updated
1 parent f16dc3e commit 24a9cca

4 files changed: 21 additions, 15 deletions

docs/add_new_markers_quick.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ For detailed instructions, refer to the detailed [README file](../src/markers/in

 3. **Add Metadata**
 - Update the `src/markers/input/metadata.csv` file with a new row describing your input file.
-- Include fields like `file_name`, `Organ_region`, `Species`, and others as specified in the detailed guide.
+- Include required fields like `file_name`, `Organ_region`, `Species`, `Species_abbreviation`, `Parent`, `Marker_set_xref` and others as specified in the detailed guide.

 4. **Commit and Push Changes**
 - Commit your changes and push them to your branch.

src/markers/input/README.md

Lines changed: 11 additions & 9 deletions
@@ -6,29 +6,31 @@ To contribute new marker data, please submit a pull request with the appropriate

 Place your marker data file(s) in the `src/markers/input` directory. Each file must include the following required columns:

-| clusterName | f_score | NSForest_markers | cxg_dataset_title |
-|-------------|---------|------------------|-------------------|
+| clusterName | f_score | NSForest_markers |
+|-------------|---------|------------------|

 You may include additional columns if needed. See NS-Forest SOP [here](https://docs.google.com/document/d/1gkBGF5EIATI_ki0hRjC99irbr7dsuLFk/edit).

 ## 2. Add Metadata

 Alongside the input data, include a corresponding metadata entry in the `src/markers/input/metadata.csv` file. Each row should describe one input file and should include the following fields:

-| file_name | Species | Species_abbreviation | Organ_region | Parent | Marker_set_xref | CxG_collection | CxG_dataset |
-|-----------|---------|------------------------|----------------|---------|--------------------|------------------|---------------|
+| file_name | Species | Species_abbreviation | Organ_region | Parent | Marker_set_xref | CxG_collection | CxG_dataset | software_version | cluster_header |
+|-----------|---------|----------------------|--------------|--------|-----------------|----------------|-------------|------------------|----------------|

 Example metadata:

-| file_name | Species | Species_abbreviation | Organ_region | Parent | Marker_set_xref | CxG_collection | CxG_dataset |
-|-----------|---------|------------------------|----------------|---------|--------------------|------------------|---------------|
-| HLCA_CellRef_MarkerPerformance_forDOS.csv | NCBITaxon:9606 | Human | UBERON:0002048 | SO:0001260 | https://doi.org/10.5281/zenodo.11165918 | https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293 | *An integrated cell atlas of the human lung in health and disease (core)* |
-| nsforest_human_neocortex_global_cluster_combinatorial_results.csv | NCBITaxon:9606 | Human | UBERON:0001950 | SO:0001260 | https://doi.org/10.5281/zenodo.11165918 | https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f | |
-| nsforest_human_neocortex_global_subclass_results.csv | NCBITaxon:9606 | Human | UBERON:0001950 | SO:0001260 | https://doi.org/10.5281/zenodo.11165918 | https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f | |
+| file_name | Species | Species_abbreviation | Organ_region | Parent | Marker_set_xref | CxG_collection | CxG_dataset | software_version | cluster_header |
+|-------------------------------------------------------------------|----------------|----------------------|----------------|------------|-----------------------------------------|-----------------------------------------------------------------------------------|---------------------------------------------------------------------------|------------------|----------------|
+| HLCA_CellRef_MarkerPerformance_forDOS.csv | NCBITaxon:9606 | Human | UBERON:0002048 | SO:0001260 | https://doi.org/10.5281/zenodo.11165918 | https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293 | *An integrated cell atlas of the human lung in health and disease (core)* | | |
+| nsforest_human_neocortex_global_cluster_combinatorial_results.csv | NCBITaxon:9606 | Human | UBERON:0001950 | SO:0001260 | https://doi.org/10.5281/zenodo.11165918 | https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f | | | cluster |
+| nsforest_human_neocortex_global_subclass_results.csv | NCBITaxon:9606 | Human | UBERON:0001950 | SO:0001260 | https://doi.org/10.5281/zenodo.11165918 | https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f | | | subclass |

 ### Notes:
 - `CxG_dataset` and `CxG_collection` are optional. If provided, the pipeline will use them to query the CL_KG.
 - If `CxG_dataset` is omitted, the pipeline will default to the `cxg_dataset_title` in the input file.
+- `software_version`: NS-Forest version used.
+- `cluster_header`: The obs key used in the NS-Forest analysis. Usually, it refers to the annotation level.

 ## 3. GitHub Action: Validate Input
 After adding your files and metadata, create a pull request. This will trigger an automated GitHub Action that validates the metadata and input files. The action will check for:
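
The Notes in this hunk describe a fallback: `CxG_dataset` is optional, and when it is omitted the pipeline uses `cxg_dataset_title` from the input file. The same hunk drops `cxg_dataset_title` from the required input columns, so the fallback can come up empty. A minimal sketch of that lookup order follows. The helper name `resolve_dataset_title` and the dict-based row shapes are assumptions made for illustration; the pipeline code that performs this step is not part of this commit.

```python
def resolve_dataset_title(metadata_row, input_rows):
    """Hypothetical helper: pick a dataset title in the order the README notes describe.

    metadata_row: dict for one row of metadata.csv.
    input_rows:   list of dicts, one per row of the marker input file.
    """
    # 1. Prefer the explicit value from metadata.csv when it is non-empty.
    title = (metadata_row.get("CxG_dataset") or "").strip()
    if title:
        return title

    # 2. Otherwise fall back to the first non-empty cxg_dataset_title
    #    found in the input file (the column is no longer required).
    for row in input_rows:
        candidate = (row.get("cxg_dataset_title") or "").strip()
        if candidate:
            return candidate

    # 3. Neither source provided a title.
    return None
```

With an empty `CxG_dataset` and an input file that lacks `cxg_dataset_title`, the sketch returns `None`, which is the case a caller would need to handle.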

src/markers/input/metadata.csv

Lines changed: 5 additions & 4 deletions
@@ -1,4 +1,5 @@
-file_name,Species,Species_abbreviation,Organ_region,Parent,Marker_set_xref,CxG_collection,CxG_dataset
-HLCA_CellRef_MarkerPerformance_forDOS.csv,NCBITaxon:9606,Human,UBERON:0002048,SO:0001260,https://doi.org/10.5281/zenodo.11165918,https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293,An integrated cell atlas of the human lung in health and disease (core)
-nsforest_human_neocortex_global_cluster_combinatorial_results.csv,NCBITaxon:9606,Human,UBERON:0001950,SO:0001260,https://doi.org/10.5281/zenodo.11165918,https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f,
-nsforest_human_neocortex_global_subclass_results.csv,NCBITaxon:9606,Human,UBERON:0001950,SO:0001260,https://doi.org/10.5281/zenodo.11165918,https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f,
+file_name,Species,Species_abbreviation,Organ_region,Parent,Marker_set_xref,CxG_collection,CxG_dataset,software_version,cluster_header
+HLCA_CellRef_MarkerPerformance_forDOS.csv,NCBITaxon:9606,Human,UBERON:0002048,SO:0001260,https://doi.org/10.5281/zenodo.11165918,https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293,An integrated cell atlas of the human lung in health and disease (core),,
+nsforest_human_neocortex_global_cluster_combinatorial_results.csv,NCBITaxon:9606,Human,UBERON:0001950,SO:0001260,https://doi.org/10.5281/zenodo.11165918,https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f,,,cluster
+nsforest_human_neocortex_global_subclass_results.csv,NCBITaxon:9606,Human,UBERON:0001950,SO:0001260,https://doi.org/10.5281/zenodo.11165918,https://cellxgene.cziscience.com/collections/d17249d2-0e6e-4500-abb8-e6c93fa1ac6f,,,subclass
+,,,,,,,,
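
The row added at line 5 contains eight commas, which makes nine fields against the ten-column header. Whether the pipeline tolerates a short, empty row is not shown in this commit. The sketch below uses only the standard library `csv` module to flag rows whose width differs from the header. The function name `rows_with_wrong_width` and the sample data row are made up for illustration and are not part of the repository.

```python
import csv
import io


def rows_with_wrong_width(csv_text):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != len(header)
    ]


# Header copied from the updated metadata.csv; the data row is a made-up example.
sample = (
    "file_name,Species,Species_abbreviation,Organ_region,Parent,Marker_set_xref,"
    "CxG_collection,CxG_dataset,software_version,cluster_header\n"
    "example.csv,NCBITaxon:9606,Human,UBERON:0001950,SO:0001260,,,,,cluster\n"
    ",,,,,,,,\n"
)

print(rows_with_wrong_width(sample))  # [(3, 9)]
```

Here the trailing all-comma row is reported as line 3 with nine fields, while the ten-field data row passes.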

src/scripts/validate_input_files.py

Lines changed: 4 additions & 1 deletion
@@ -39,9 +39,12 @@ def validate_metadata_record(file_name, metadata_df):
     if metadata_record.empty:
         return f"Metadata record not found for file: {file_name}. Please update the src/markers/input/metadata.csv file."

+    missing_columns = list()
     for col in METADATA_REQUIRED_COLUMNS:
         if pd.isna(metadata_record.iloc[0][col]) or metadata_record.iloc[0][col] == "":
-            return f"Metadata record in metadata.csv for {file_name} is missing value in column: {col}"
+            missing_columns.append(col)
+    if missing_columns:
+        return f"Metadata record in metadata.csv for {file_name} is missing columns: {', '.join(missing_columns)}"
     return None

 def validate_metadata(issues):
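
The change above switches the check from returning on the first empty required field to collecting all of them and reporting once. A minimal standalone sketch of that pattern follows. It assumes `METADATA_REQUIRED_COLUMNS` holds the six fields the quick guide lists as required and that the record is looked up by `file_name`; neither the constant nor the lookup appears in this hunk, and `find_missing_values` is a hypothetical helper name.

```python
import pandas as pd

# Assumption: the real constant is defined elsewhere in validate_input_files.py.
# These names are the required fields listed in docs/add_new_markers_quick.md.
METADATA_REQUIRED_COLUMNS = [
    "file_name", "Species", "Species_abbreviation",
    "Organ_region", "Parent", "Marker_set_xref",
]


def find_missing_values(file_name, metadata_df):
    """Hypothetical helper: list every required column that is empty for file_name."""
    record = metadata_df[metadata_df["file_name"] == file_name]
    if record.empty:
        return None  # the real function returns a separate message for this case
    row = record.iloc[0]
    # Collect all empty required fields instead of stopping at the first one.
    return [
        col for col in METADATA_REQUIRED_COLUMNS
        if pd.isna(row[col]) or row[col] == ""
    ]


metadata_df = pd.DataFrame([{
    "file_name": "example_markers.csv",  # made-up file name
    "Species": "NCBITaxon:9606",
    "Species_abbreviation": None,        # missing (NaN-like)
    "Organ_region": "UBERON:0002048",
    "Parent": "",                        # missing (empty string)
    "Marker_set_xref": "https://doi.org/10.5281/zenodo.11165918",
}])

missing = find_missing_values("example_markers.csv", metadata_df)
print(", ".join(missing))  # Species_abbreviation, Parent
```

Reporting every empty field in one message lets a contributor fix the metadata row in a single pass rather than one validation run per field.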
