13 changes: 13 additions & 0 deletions Claude.md
@@ -0,0 +1,13 @@
You are a coding review agent. Your job is to:
- Analyse the code base for test coverage (unit tests and integration tests).
- Test and log performance.
- Analyse code to report on potential causes of poor performance (what are the rate-limiting steps?).
- Report on any code which follows poor practices, especially where this could lead to bugs - e.g. in corner/edge cases that we should be testing in integration tests.

Background material:

UberGraph: https://github.com/INCATools/ubergraph

CellXGene Census API: https://chanzuckerberg.github.io/cellxgene-census/

CxG Schema: https://github.com/chanzuckerberg/single-cell-curation/tree/main/schema/5.2.0
164 changes: 164 additions & 0 deletions code_review_analysis.md
@@ -0,0 +1,164 @@
# Code Review Analysis: cxg-query-enhancer

## Test Coverage Analysis

### ✅ Unit Tests - Good Coverage
- **15 passing tests** covering core functionality
- **Three test modules**: `test_enhance.py`, `test_ontology_extractor.py`, `test_sparql_client.py`
- **Good mocking strategy**: Proper use of mocks for external dependencies (SPARQL queries, Census API)
- **Edge case coverage**: Tests handle error conditions, unsupported categories, missing organisms

### ❌ Integration Tests - Missing
- **No real integration tests**: All tests use mocks, no actual API calls to Ubergraph or CellXGene Census
- **Missing end-to-end validation**: `run_cxg_query_enhancer.py` exists but is not integrated into the test suite
- **No performance benchmarks**: No tests measuring actual response times from external APIs

## Performance Analysis

### ❌ Critical Performance Issues Identified

**Major Bottleneck: SPARQL Query Latency**
- Live test timed out at 60 seconds
- SPARQL queries to Ubergraph take 40+ seconds each
- Multiple sequential queries (one set per ontology term) compound the delay
- No timeout configuration for SPARQL requests

**Rate-Limiting Steps:**
1. **SPARQL queries to Ubergraph** (`enhancer.py:130-139`) - Each term requires 2+ queries
2. **CellXGene Census data loading** (`enhancer.py:42-46`) - Full dataset scan per category
3. **Sequential processing** - No parallelization of independent queries

### Performance Issues in Code

1. **No request timeouts** (`enhancer.py:111-139`):
```python
# SPARQLClient has no timeout configuration
self.sparql = SPARQLWrapper(self.endpoint)
```

2. **Inefficient Census filtering** (`enhancer.py:11-54`):
```python
# Loads entire Census columns into memory
df = (obs_reader.read(column_names=[ontology_column_name])
.concat().to_pandas())
```

3. **No connection pooling or caching** beyond @lru_cache on census data

## Code Quality & Bug Risk Analysis

### ❌ Poor Practices & Bug Risks

**High-Risk Areas:**

1. **Regex-based parsing** (`enhancer.py:447-483`):
- Complex regex for query parsing prone to edge case failures
- No validation of malformed queries
- Potential injection vulnerabilities with unsanitized inputs

2. **Exception handling** (`enhancer.py:51-53`):
```python
except Exception as e:
logging.error(f"Error accessing CellXGene Census: {e}")
return None
```
- Overly broad exception catching masks specific error types
- Silent failures could lead to incorrect results

3. **String manipulation bugs** (`enhancer.py:575-591`):
```python
# Dangerous regex replacement without validation
query_filter = re.sub(
rf"{category}\s+in\s+\[.*?\]",
f"{category} in [{expanded_terms_str}]",
query_filter, flags=re.IGNORECASE
)
```
- Could malform queries if regex matches incorrectly
- No validation of final query syntax
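The broad `except` shown in item 2 could be narrowed so that only expected network failures are logged, while programming errors surface immediately. A minimal sketch, using a hypothetical `fetch_census_data` wrapper (not the package's actual API), where `loader` stands in for the real Census call:

```python
import logging

def fetch_census_data(loader):
    # Hypothetical wrapper: catch only the expected failure modes and
    # let programming errors (KeyError, TypeError, ...) propagate
    # instead of being masked as "no data".
    try:
        return loader()
    except (ConnectionError, TimeoutError) as e:
        logging.error(f"CellXGene Census unreachable: {e}")
        return None
```

With this shape, a bug in the loading code raises loudly instead of silently producing `None`.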

**Edge Cases Needing Integration Tests:**

1. **Network failures** - No graceful degradation when Ubergraph or Census is unavailable
2. **Large result sets** - Memory issues with thousands of ontology terms
3. **Malformed queries** - No input sanitization or validation
4. **Mixed organism queries** - Development stage handling with wrong organisms
5. **Empty results** - Behavior when no terms match census data
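Edge case 1 could be handled with a generic retry guard before falling back. A sketch with a hypothetical `with_retries` helper (`fn` stands in for any SPARQL or Census call), using exponential backoff:

```python
import logging
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    # Hypothetical guard for transient network failures: retry with
    # exponential backoff, re-raising only after the final attempt.
    for i in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError) as e:
            if i == attempts - 1:
                raise
            delay = base_delay * (2 ** i)
            logging.warning(f"Attempt {i + 1} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

An integration test can then assert that the enhancer degrades gracefully when all attempts are exhausted.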

### ✅ Good Practices Found

1. **Proper logging** throughout codebase
2. **LRU caching** for census data (`@lru_cache`)
3. **Comprehensive parameter validation** for organisms/categories
4. **Clear separation of concerns** (SPARQLClient, OntologyExtractor, enhance)

## Repository Structure Assessment

- **Clean organization**: Source in `src/`, tests in `tests/`, docs in `notebooks/`
- **Modern Python packaging**: Uses Poetry for dependency management
- **Good documentation**: Comprehensive README with examples
- **Missing CI/CD**: No GitHub Actions for automated testing

## Recommendations

### Critical (Fix Immediately)
1. **Add request timeouts** to SPARQL queries (30s max)
2. **Implement query validation** to prevent malformed outputs
3. **Add integration tests** with real API calls (marked as slow tests)
4. **Add error boundaries** for network failures
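Item 2 could be prototyped cheaply by parsing candidate filters against a dummy DataFrame. This is a heuristic sketch resting on the assumption that SOMA `value_filter` syntax largely overlaps pandas' `query` syntax; the helper name and default column set are illustrative, not part of the package:

```python
import pandas as pd

def is_plausible_value_filter(expr, columns=("cell_type", "tissue", "n_genes")):
    # Heuristic sketch (assumption: value_filter grammar overlaps
    # pandas.DataFrame.query): parse the expanded filter against a
    # dummy frame so grossly malformed output is caught before return.
    dummy = pd.DataFrame({c: [0] for c in columns})
    try:
        dummy.query(expr)
        return True
    except Exception:
        return False
```

Any column referenced by the filter must appear in `columns`, so the caller would pass the category names it actually expanded.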

### Important (Next Sprint)
1. **Optimize Census data loading** - column-specific queries vs full scans
2. **Implement parallel SPARQL queries** for multiple terms
3. **Add input sanitization** for query filters
4. **Create performance benchmarks** as part of test suite
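Item 2 of the list above is a small change because the per-term queries are independent and I/O-bound. A sketch with a hypothetical `resolve_one` callable standing in for the real per-term SPARQL call:

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_terms_parallel(resolve_one, terms, max_workers=4):
    # Sketch: overlap the network wait of independent SPARQL lookups
    # with a thread pool; `resolve_one` is a stand-in for the real call.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(terms, pool.map(resolve_one, terms)))
```

A modest `max_workers` keeps load on the public Ubergraph endpoint polite while still collapsing N sequential round trips into roughly N/4.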

### Nice-to-Have
1. **Connection pooling** for SPARQL client
2. **Query result caching** beyond Census data
3. **Async/await support** for better performance
4. **More granular logging levels**

## Query Syntax Documentation Research

### ✅ SOMA/TileDB Query Syntax Found

**Official Documentation**: The `obs_value_filter` uses **SOMA value_filter syntax** which is well-documented:

**Supported Syntax Pattern**:
```
value_filter="column_name operator value [logical_operator ...]"
```

**Supported Operators**:
- **Equality**: `==`, `!=`
- **Comparison**: `>`, `<`, `>=`, `<=`
- **Membership**: `in`, `not in`
- **Logical**: `and`, `or`
- **Grouping**: `()` for precedence

**Valid Examples**:
```python
# Simple equality
"cell_type == 'neuron'"

# List membership
"tissue in ['kidney', 'liver']"

# Complex conditions
"cell_type == 'neuron' and tissue in ['brain', 'spinal cord']"

# Numeric comparisons
"n_genes > 500 and total_counts < 10000"
```

### 🔧 Recommendation: Replace Regex Parser

The current regex-based parsing (`enhancer.py:447-483`) should be **replaced with proper SOMA syntax validation**:

1. **Use AST parsing** instead of regex to validate query structure
2. **Leverage existing SOMA validation** by testing queries against dummy DataFrames
3. **Add syntax validation** before executing expensive SPARQL queries
4. **Support full SOMA syntax** rather than limited regex patterns

This would eliminate the major bug risk from malformed query generation and ensure compatibility with the official SOMA specification.
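Because `value_filter` expressions are syntactically valid Python expressions, recommendation 1 can be sketched with the standard `ast` module: parse the filter, then whitelist exactly the node types implied by the operator table above. The helper name is hypothetical and the whitelist is a sketch, not a complete SOMA grammar:

```python
import ast

# Node types matching the operator table above: comparisons,
# membership, and/or, grouping, column names, and literals.
_ALLOWED = (ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.Compare,
            ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
            ast.In, ast.NotIn, ast.Name, ast.Load, ast.Constant, ast.List)

def validate_filter(expr):
    # Reject anything that is not a pure comparison expression:
    # function calls, attribute access, etc. never appear in a value_filter.
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    return all(isinstance(n, _ALLOWED) for n in ast.walk(tree))
```

Running this before the expensive SPARQL expansion (recommendation 3) rejects malformed or hostile input in microseconds.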
115 changes: 115 additions & 0 deletions performance_bottleneck_analysis.md
@@ -0,0 +1,115 @@
# Performance Bottleneck Analysis: Label Resolution

## 🔍 Test Results Summary

**Test Case**: `get_subclasses('astrocyte', 'cell_type')`

### Critical Performance Findings

| Operation | Time (seconds) | Percentage of Total |
|-----------|----------------|-------------------|
| **Label Resolution** | **40.49s - 53.53s** | **~99%** |
| Subclass Query | 0.42s - 0.55s | ~1% |
| **TOTAL** | **~41-54s** | **100%** |

### 🐛 Root Cause: Complex SPARQL Label Resolution Query

The bottleneck is in `OntologyExtractor.get_ontology_id_from_label()` at **lines 207-246** of `enhancer.py`.

**The Problematic Query Structure**:
```sparql
SELECT DISTINCT ?term WHERE {
# 5 UNION clauses searching across multiple synonym types
{ ?term rdfs:label ?label . FILTER(LCASE(?label) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasExactSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasRelatedSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasBroadSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasNarrowSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }

FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
}
```

### Why This Query Is Slow

1. **5 UNION Operations**: Each UNION clause requires a separate scan
2. **String Functions**: `LCASE()` and `STRSTARTS()` are expensive on large datasets
3. **Multiple Property Searches**: Searching across 5 different synonym properties
4. **Full Graph Scan**: Without proper indexes, this scans the entire knowledge graph

### Performance Comparison

```
Direct SPARQL Queries:
├── Label Resolution: 40.49s - 53.53s ⚠️ BOTTLENECK
└── Subclass Lookup: 0.42s - 0.55s ✅ FAST

Workflow Impact:
├── With Labels: ~41s total (99% label resolution)
└── With Direct ID: ~0.4s total (100x faster!)
```

## 💡 Optimization Recommendations

### Immediate Fixes (High Impact)

1. **Optimize SPARQL Query** - Rewrite to use more efficient patterns:
```sparql
# Instead of 5 UNIONs, use a single pattern with property paths
SELECT DISTINCT ?term WHERE {
?term (rdfs:label|oboInOwl:hasExactSynonym|oboInOwl:hasRelatedSynonym|
oboInOwl:hasBroadSynonym|oboInOwl:hasNarrowSynonym) ?labelValue .
FILTER(LCASE(STR(?labelValue)) = LCASE("astrocyte"))
FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
}
```

2. **Add Local Caching Layer**:
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_ontology_id_from_label_cached(label, category, organism=None):
    # Cache resolved label -> ID mappings; arguments must be hashable.
    # `extractor` is assumed to be a module-level OntologyExtractor.
    return extractor.get_ontology_id_from_label(label, category, organism)
```

3. **Implement Request Timeouts**:
```python
# In SPARQLClient.__init__
self.sparql.setTimeout(30) # 30-second timeout
```

### Medium-Term Solutions

1. **Pre-build Label Index**: Create a local SQLite database with label→ID mappings
2. **Batch Processing**: Resolve multiple labels in a single SPARQL query
3. **Fallback Strategy**: Try exact label match first, then synonyms only if needed
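Item 2 might look like the following sketch, which folds many labels into a single `VALUES` block so N labels cost one round trip instead of N. It matches `rdfs:label` only; extending it to the synonym properties follows the same pattern, and the prefix declarations are assumed to be prepended as in the existing queries:

```python
def build_batch_label_query(labels):
    # Sketch: one SPARQL round trip for N labels instead of N requests.
    values = " ".join(f'"{label}"' for label in labels)
    return f"""
SELECT DISTINCT ?label ?term WHERE {{
  VALUES ?label {{ {values} }}
  ?term rdfs:label ?rdfsLabel .
  FILTER(LCASE(STR(?rdfsLabel)) = LCASE(?label))
  FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
}}"""
```

The result bindings come back as (label, term) pairs, so unresolved labels are simply absent and can fall through to the synonym search.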

### Long-Term Architecture Changes

1. **Local Ontology Cache**: Download and index commonly used ontologies locally
2. **Async Processing**: Use asyncio for parallel SPARQL queries
3. **Smart Defaults**: Provide common cell type mappings as built-in constants
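Item 2 of the list above need not wait for a fully async SPARQL client: the existing blocking call can be pushed onto worker threads and awaited together. A sketch where `resolve_one` stands in for the real per-label lookup:

```python
import asyncio

async def resolve_labels_async(resolve_one, labels):
    # Sketch: run the blocking SPARQL call in threads (Python 3.9+)
    # and await all lookups concurrently.
    results = await asyncio.gather(
        *(asyncio.to_thread(resolve_one, label) for label in labels)
    )
    return dict(zip(labels, results))
```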

## 📊 Impact Analysis

**Current User Experience**:
- Simple query: 40+ seconds (unusable)
- Multiple terms: 40+ seconds each, minutes in total (completely unusable)

**After Optimization**:
- Target: <5 seconds for label resolution
- Best case: <1 second with proper caching
- Improvement: **8-40x speedup**

## 🎯 Recommended Implementation Priority

1. **CRITICAL**: Add SPARQL timeout (5 minutes to implement)
2. **HIGH**: Optimize SPARQL query structure (2 hours)
3. **HIGH**: Add LRU caching for label resolution (1 hour)
4. **MEDIUM**: Build label→ID lookup table (1 day)
5. **LOW**: Implement async processing (1 week)

The **40+ second delay** for a single label lookup makes the current system unusable for production. The label resolution SPARQL query needs immediate optimization.