13 changes: 13 additions & 0 deletions Claude.md
@@ -0,0 +1,13 @@
You are a coding review agent. Your job is to:
- Analyse the code base for test coverage (unit tests and integration tests).
- Test and log performance.
- Analyse code to report on potential causes of poor performance (what are the rate-limiting steps?).
- Report on any code which follows poor practices, especially where this could lead to bugs - e.g. in corner/edge cases that we should be testing in integration tests.

Background material:

UberGraph: https://github.com/INCATools/ubergraph

CellXGene Census API: https://chanzuckerberg.github.io/cellxgene-census/

CxG Schema: https://github.com/chanzuckerberg/single-cell-curation/tree/main/schema/5.2.0
164 changes: 164 additions & 0 deletions code_review_analysis.md
@@ -0,0 +1,164 @@
# Code Review Analysis: cxg-query-enhancer

## Test Coverage Analysis

### ✅ Unit Tests - Good Coverage
- **15 passing tests** covering core functionality
- **Three test modules**: `test_enhance.py`, `test_ontology_extractor.py`, `test_sparql_client.py`
- **Good mocking strategy**: Proper use of mocks for external dependencies (SPARQL queries, Census API)
- **Edge case coverage**: Tests handle error conditions, unsupported categories, missing organisms

### ❌ Integration Tests - Missing
- **No real integration tests**: All tests use mocks, no actual API calls to Ubergraph or CellXGene Census
- **Missing end-to-end validation**: `run_cxg_query_enhancer.py` exists but is not integrated into the test suite
- **No performance benchmarks**: No tests measuring actual response times from external APIs

## Performance Analysis

### ❌ Critical Performance Issues Identified

**Major Bottleneck: SPARQL Query Latency**
- Live test timed out at 60 seconds
- SPARQL queries to Ubergraph take 40+ seconds each
- Multiple sequential queries (one set per ontology term) compound the delay
- No timeout configuration for SPARQL requests

**Rate-Limiting Steps:**
1. **SPARQL queries to Ubergraph** (`enhancer.py:130-139`) - Each term requires 2+ queries
2. **CellXGene Census data loading** (`enhancer.py:42-46`) - Full dataset scan per category
3. **Sequential processing** - No parallelization of independent queries

### Performance Issues in Code

1. **No request timeouts** (`enhancer.py:111-139`):
```python
# SPARQLClient has no timeout configuration
self.sparql = SPARQLWrapper(self.endpoint)
```

2. **Inefficient Census filtering** (`enhancer.py:11-54`):
```python
# Loads entire Census columns into memory
df = (obs_reader.read(column_names=[ontology_column_name])
.concat().to_pandas())
```

3. **No connection pooling or caching** beyond @lru_cache on census data

## Code Quality & Bug Risk Analysis

### ❌ Poor Practices & Bug Risks

**High-Risk Areas:**

1. **Regex-based parsing** (`enhancer.py:447-483`):
- Complex regex for query parsing prone to edge case failures
- No validation of malformed queries
- Potential injection vulnerabilities with unsanitized inputs

2. **Exception handling** (`enhancer.py:51-53`):
```python
except Exception as e:
logging.error(f"Error accessing CellXGene Census: {e}")
return None
```
- Overly broad exception catching masks specific error types
- Silent failures could lead to incorrect results

3. **String manipulation bugs** (`enhancer.py:575-591`):
```python
# Dangerous regex replacement without validation
query_filter = re.sub(
rf"{category}\s+in\s+\[.*?\]",
f"{category} in [{expanded_terms_str}]",
query_filter, flags=re.IGNORECASE
)
```
- Could malform queries if regex matches incorrectly
- No validation of final query syntax
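The broad `except` shown in item 2 could be narrowed so that only expected network failures are logged, while programming errors surface immediately. A minimal sketch, using a hypothetical `fetch_census_data` wrapper (not the package's actual API), where `loader` stands in for the real Census call:

```python
import logging

def fetch_census_data(loader):
    # Hypothetical wrapper: catch only the expected failure modes and
    # let programming errors (KeyError, TypeError, ...) propagate
    # instead of being masked as "no data".
    try:
        return loader()
    except (ConnectionError, TimeoutError) as e:
        logging.error(f"CellXGene Census unreachable: {e}")
        return None
```

With this shape, a bug in the loading code raises loudly instead of silently producing `None`.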

**Edge Cases Needing Integration Tests:**

1. **Network failures** - No graceful degradation when Ubergraph or Census is unavailable
2. **Large result sets** - Memory issues with thousands of ontology terms
3. **Malformed queries** - No input sanitization or validation
4. **Mixed organism queries** - Development stage handling with wrong organisms
5. **Empty results** - Behavior when no terms match census data
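Edge case 1 could be handled with a generic retry guard before falling back. A sketch with a hypothetical `with_retries` helper (`fn` stands in for any SPARQL or Census call), using exponential backoff:

```python
import logging
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    # Hypothetical guard for transient network failures: retry with
    # exponential backoff, re-raising only after the final attempt.
    for i in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError) as e:
            if i == attempts - 1:
                raise
            delay = base_delay * (2 ** i)
            logging.warning(f"Attempt {i + 1} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

An integration test can then assert that the enhancer degrades gracefully when all attempts are exhausted.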

### ✅ Good Practices Found

1. **Proper logging** throughout codebase
2. **LRU caching** for census data (`@lru_cache`)
3. **Comprehensive parameter validation** for organisms/categories
4. **Clear separation of concerns** (SPARQLClient, OntologyExtractor, enhance)

## Repository Structure Assessment

- **Clean organization**: Source in `src/`, tests in `tests/`, docs in `notebooks/`
- **Modern Python packaging**: Uses Poetry for dependency management
- **Good documentation**: Comprehensive README with examples
- **Missing CI/CD**: No GitHub Actions for automated testing

## Recommendations

### Critical (Fix Immediately)
1. **Add request timeouts** to SPARQL queries (30s max)
2. **Implement query validation** to prevent malformed outputs
3. **Add integration tests** with real API calls (marked as slow tests)
4. **Add error boundaries** for network failures
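Item 2 could be prototyped cheaply by parsing candidate filters against a dummy DataFrame. This is a heuristic sketch resting on the assumption that SOMA `value_filter` syntax largely overlaps pandas' `query` syntax; the helper name and default column set are illustrative, not part of the package:

```python
import pandas as pd

def is_plausible_value_filter(expr, columns=("cell_type", "tissue", "n_genes")):
    # Heuristic sketch (assumption: value_filter grammar overlaps
    # pandas.DataFrame.query): parse the expanded filter against a
    # dummy frame so grossly malformed output is caught before return.
    dummy = pd.DataFrame({c: [0] for c in columns})
    try:
        dummy.query(expr)
        return True
    except Exception:
        return False
```

Any column referenced by the filter must appear in `columns`, so the caller would pass the category names it actually expanded.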

### Important (Next Sprint)
1. **Optimize Census data loading** - column-specific queries vs full scans
2. **Implement parallel SPARQL queries** for multiple terms
3. **Add input sanitization** for query filters
4. **Create performance benchmarks** as part of test suite
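Item 2 of the list above is a small change because the per-term queries are independent and I/O-bound. A sketch with a hypothetical `resolve_one` callable standing in for the real per-term SPARQL call:

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_terms_parallel(resolve_one, terms, max_workers=4):
    # Sketch: overlap the network wait of independent SPARQL lookups
    # with a thread pool; `resolve_one` is a stand-in for the real call.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(terms, pool.map(resolve_one, terms)))
```

A modest `max_workers` keeps load on the public Ubergraph endpoint polite while still collapsing N sequential round trips into roughly N/4.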

### Nice-to-Have
1. **Connection pooling** for SPARQL client
2. **Query result caching** beyond Census data
3. **Async/await support** for better performance
4. **More granular logging levels**

## Query Syntax Documentation Research

### ✅ SOMA/TileDB Query Syntax Found

**Official Documentation**: The `obs_value_filter` uses **SOMA value_filter syntax** which is well-documented:

**Supported Syntax Pattern**:
```
value_filter="column_name operator value [logical_operator ...]"
```

**Supported Operators**:
- **Equality**: `==`, `!=`
- **Comparison**: `>`, `<`, `>=`, `<=`
- **Membership**: `in`, `not in`
- **Logical**: `and`, `or`
- **Grouping**: `()` for precedence

**Valid Examples**:
```python
# Simple equality
"cell_type == 'neuron'"

# List membership
"tissue in ['kidney', 'liver']"

# Complex conditions
"cell_type == 'neuron' and tissue in ['brain', 'spinal cord']"

# Numeric comparisons
"n_genes > 500 and total_counts < 10000"
```

### 🔧 Recommendation: Replace Regex Parser

The current regex-based parsing (`enhancer.py:447-483`) should be **replaced with proper SOMA syntax validation**:

1. **Use AST parsing** instead of regex to validate query structure
2. **Leverage existing SOMA validation** by testing queries against dummy DataFrames
3. **Add syntax validation** before executing expensive SPARQL queries
4. **Support full SOMA syntax** rather than limited regex patterns

This would eliminate the major bug risk from malformed query generation and ensure compatibility with the official SOMA specification.
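Because `value_filter` expressions are syntactically valid Python expressions, recommendation 1 can be sketched with the standard `ast` module: parse the filter, then whitelist exactly the node types implied by the operator table above. The helper name is hypothetical and the whitelist is a sketch, not a complete SOMA grammar:

```python
import ast

# Node types matching the operator table above: comparisons,
# membership, and/or, grouping, column names, and literals.
_ALLOWED = (ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.Compare,
            ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
            ast.In, ast.NotIn, ast.Name, ast.Load, ast.Constant, ast.List)

def validate_filter(expr):
    # Reject anything that is not a pure comparison expression:
    # function calls, attribute access, etc. never appear in a value_filter.
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    return all(isinstance(n, _ALLOWED) for n in ast.walk(tree))
```

Running this before the expensive SPARQL expansion (recommendation 3) rejects malformed or hostile input in microseconds.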
115 changes: 115 additions & 0 deletions performance_bottleneck_analysis.md
@@ -0,0 +1,115 @@
# Performance Bottleneck Analysis: Label Resolution

## 🔍 Test Results Summary

**Test Case**: `get_subclasses('astrocyte', 'cell_type')`

### Critical Performance Findings

| Operation | Time (seconds) | Percentage of Total |
|-----------|----------------|-------------------|
| **Label Resolution** | **40.49s - 53.53s** | **~99%** |
| Subclass Query | 0.42s - 0.55s | ~1% |
| **TOTAL** | **~41-54s** | **100%** |

### 🐛 Root Cause: Complex SPARQL Label Resolution Query

The bottleneck is in `OntologyExtractor.get_ontology_id_from_label()` at **lines 207-246** of `enhancer.py`.

**The Problematic Query Structure**:
```sparql
SELECT DISTINCT ?term WHERE {
# 5 UNION clauses searching across multiple synonym types
{ ?term rdfs:label ?label . FILTER(LCASE(?label) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasExactSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasRelatedSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasBroadSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
UNION
{ ?term oboInOwl:hasNarrowSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }

FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
}
```

### Why This Query Is Slow

1. **5 UNION Operations**: Each UNION clause requires a separate scan
2. **String Functions**: `LCASE()` and `STRSTARTS()` are expensive on large datasets
3. **Multiple Property Searches**: Searching across 5 different synonym properties
4. **Full Graph Scan**: Without proper indexes, this scans the entire knowledge graph

### Performance Comparison

```
Direct SPARQL Queries:
├── Label Resolution: 40.49s - 53.53s ⚠️ BOTTLENECK
└── Subclass Lookup: 0.42s - 0.55s ✅ FAST

Workflow Impact:
├── With Labels: ~41s total (99% label resolution)
└── With Direct ID: ~0.4s total (100x faster!)
```

## 💡 Optimization Recommendations

### Immediate Fixes (High Impact)

1. **Optimize SPARQL Query** - Rewrite to use more efficient patterns:
```sparql
# Instead of 5 UNIONs, use a single pattern with property paths
SELECT DISTINCT ?term WHERE {
?term (rdfs:label|oboInOwl:hasExactSynonym|oboInOwl:hasRelatedSynonym|
oboInOwl:hasBroadSynonym|oboInOwl:hasNarrowSynonym) ?labelValue .
FILTER(LCASE(STR(?labelValue)) = LCASE("astrocyte"))
FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
}
```

2. **Add Local Caching Layer**:
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_ontology_id_from_label_cached(label, category, organism=None):
    # Cache resolved label -> ID mappings; arguments must be hashable.
    # `extractor` is assumed to be a module-level OntologyExtractor.
    return extractor.get_ontology_id_from_label(label, category, organism)
```

3. **Implement Request Timeouts**:
```python
# In SPARQLClient.__init__
self.sparql.setTimeout(30) # 30-second timeout
```

### Medium-Term Solutions

1. **Pre-build Label Index**: Create a local SQLite database with label→ID mappings
2. **Batch Processing**: Resolve multiple labels in a single SPARQL query
3. **Fallback Strategy**: Try exact label match first, then synonyms only if needed
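Item 2 might look like the following sketch, which folds many labels into a single `VALUES` block so N labels cost one round trip instead of N. It matches `rdfs:label` only; extending it to the synonym properties follows the same pattern, and the prefix declarations are assumed to be prepended as in the existing queries:

```python
def build_batch_label_query(labels):
    # Sketch: one SPARQL round trip for N labels instead of N requests.
    values = " ".join(f'"{label}"' for label in labels)
    return f"""
SELECT DISTINCT ?label ?term WHERE {{
  VALUES ?label {{ {values} }}
  ?term rdfs:label ?rdfsLabel .
  FILTER(LCASE(STR(?rdfsLabel)) = LCASE(?label))
  FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
}}"""
```

The result bindings come back as (label, term) pairs, so unresolved labels are simply absent and can fall through to the synonym search.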

### Long-Term Architecture Changes

1. **Local Ontology Cache**: Download and index commonly used ontologies locally
2. **Async Processing**: Use asyncio for parallel SPARQL queries
3. **Smart Defaults**: Provide common cell type mappings as built-in constants
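Item 2 of the list above need not wait for a fully async SPARQL client: the existing blocking call can be pushed onto worker threads and awaited together. A sketch where `resolve_one` stands in for the real per-label lookup:

```python
import asyncio

async def resolve_labels_async(resolve_one, labels):
    # Sketch: run the blocking SPARQL call in threads (Python 3.9+)
    # and await all lookups concurrently.
    results = await asyncio.gather(
        *(asyncio.to_thread(resolve_one, label) for label in labels)
    )
    return dict(zip(labels, results))
```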

## 📊 Impact Analysis

**Current User Experience**:
- Simple query: 40+ seconds (unusable)
- Multiple terms: 40+ seconds each, minutes in total (completely unusable)

**After Optimization**:
- Target: <5 seconds for label resolution
- Best case: <1 second with proper caching
- Improvement: **8-40x speedup**

## 🎯 Recommended Implementation Priority

1. **CRITICAL**: Add SPARQL timeout (5 minutes to implement)
2. **HIGH**: Optimize SPARQL query structure (2 hours)
3. **HIGH**: Add LRU caching for label resolution (1 hour)
4. **MEDIUM**: Build label→ID lookup table (1 day)
5. **LOW**: Implement async processing (1 week)

The **40+ second delay** for a single label lookup makes the current system unusable for production. The label resolution SPARQL query needs immediate optimization.