diff --git a/Claude.md b/Claude.md
new file mode 100644
index 0000000..c417099
--- /dev/null
+++ b/Claude.md
@@ -0,0 +1,13 @@

You are a coding review agent. Your job is to:
 - Analyse the code base for test coverage - unit tests and integration tests.
 - Test and log performance
 - Analyse code to report on potential causes of poor performance (what are the rate-limiting steps?)
 - Report on any code which follows poor practices, especially where this could lead to bugs - e.g. in corner/edge cases that we should be testing in integration tests.

Background material:

UberGraph: https://github.com/INCATools/ubergraph

CellCensus API: https://chanzuckerberg.github.io/cellxgene-census/

CxG Schema: https://github.com/chanzuckerberg/single-cell-curation/tree/main/schema/5.2.0

diff --git a/code_review_analysis.md b/code_review_analysis.md
new file mode 100644
index 0000000..6203ea1
--- /dev/null
+++ b/code_review_analysis.md
@@ -0,0 +1,164 @@

# Code Review Analysis: cxg-query-enhancer

## Test Coverage Analysis

### ✅ Unit Tests - Good Coverage
- **15 passing tests** covering core functionality
- **Three test modules**: `test_enhance.py`, `test_ontology_extractor.py`, `test_sparql_client.py`
- **Good mocking strategy**: Proper use of mocks for external dependencies (SPARQL queries, Census API)
- **Edge case coverage**: Tests handle error conditions, unsupported categories, missing organisms

### ❌ Integration Tests - Missing
- **No real integration tests**: All tests use mocks; there are no actual API calls to Ubergraph or the CellXGene Census
- **Missing end-to-end validation**: `run_cxg_query_enhancer.py` exists but is not integrated into the test suite
- **No performance benchmarks**: No tests measure actual response times from external APIs

## Performance Analysis

### ❌ Critical Performance Issues Identified

**Major Bottleneck: SPARQL Query Latency**
- Live test timed out at 60 seconds
- SPARQL queries to Ubergraph take 40+ seconds each
- Multiple sequential queries compound delays (per ontology term)
- No timeout configuration for SPARQL requests

**Rate-Limiting Steps:**
1. **SPARQL queries to Ubergraph** (`enhancer.py:130-139`) - Each term requires 2+ queries
2. **CellXGene Census data loading** (`enhancer.py:42-46`) - Full dataset scan per category
3. **Sequential processing** - No parallelization of independent queries

### Performance Issues in Code

1. **No request timeouts** (`enhancer.py:111-139`):
```python
# SPARQLClient has no timeout configuration
self.sparql = SPARQLWrapper(self.endpoint)
```

2. **Inefficient Census filtering** (`enhancer.py:11-54`):
```python
# Loads entire Census columns into memory
df = (obs_reader.read(column_names=[ontology_column_name])
      .concat().to_pandas())
```

3. **No connection pooling or caching** beyond `@lru_cache` on Census data

## Code Quality & Bug Risk Analysis

### ❌ Poor Practices & Bug Risks

**High-Risk Areas:**

1. **Regex-based parsing** (`enhancer.py:447-483`):
   - Complex regex for query parsing is prone to edge-case failures
   - No validation of malformed queries
   - Potential injection vulnerabilities with unsanitized inputs

2. **Exception handling** (`enhancer.py:51-53`):
```python
except Exception as e:
    logging.error(f"Error accessing CellXGene Census: {e}")
    return None
```
   - Overly broad exception catching masks specific error types
   - Silent failures could lead to incorrect results

3. **String manipulation bugs** (`enhancer.py:575-591`):
```python
# Dangerous regex replacement without validation
query_filter = re.sub(
    rf"{category}\s+in\s+\[.*?\]",
    f"{category} in [{expanded_terms_str}]",
    query_filter, flags=re.IGNORECASE
)
```
   - Could malform queries if the regex matches incorrectly
   - No validation of the final query syntax

**Edge Cases Needing Integration Tests:**

1. **Network failures** - No graceful degradation when Ubergraph/Census are unavailable
2. **Large result sets** - Memory issues with thousands of ontology terms
3. **Malformed queries** - No input sanitization or validation
4. **Mixed organism queries** - Development-stage handling with the wrong organisms
5. **Empty results** - Behavior when no terms match Census data

### ✅ Good Practices Found

1. **Proper logging** throughout the codebase
2. **LRU caching** for Census data (`@lru_cache`)
3. **Comprehensive parameter validation** for organisms/categories
4. **Clear separation of concerns** (SPARQLClient, OntologyExtractor, enhance)

## Repository Structure Assessment

- **Clean organization**: Source in `src/`, tests in `tests/`, docs in `notebooks/`
- **Modern Python packaging**: Uses Poetry for dependency management
- **Good documentation**: Comprehensive README with examples
- **Missing CI/CD**: No GitHub Actions for automated testing

## Recommendations

### Critical (Fix Immediately)
1. **Add request timeouts** to SPARQL queries (30s max)
2. **Implement query validation** to prevent malformed outputs
3. **Add integration tests** with real API calls (marked as slow tests)
4. **Add error boundaries** for network failures

### Important (Next Sprint)
1. **Optimize Census data loading** - column-specific queries instead of full scans
2. **Implement parallel SPARQL queries** for multiple terms
3. **Add input sanitization** for query filters
4. **Create performance benchmarks** as part of the test suite

### Nice-to-Have
1. **Connection pooling** for the SPARQL client
2. **Query result caching** beyond Census data
3. **Async/await support** for better performance
4. **More granular logging levels**

## Query Syntax Documentation Research

### ✅ SOMA/TileDB Query Syntax Found

**Official Documentation**: `obs_value_filter` uses the **SOMA value_filter syntax**, which is well documented.

**Supported Syntax Pattern**:
```
value_filter="column_name operator value [logical_operator ...]"
```

**Supported Operators**:
- **Equality**: `==`, `!=`
- **Comparison**: `>`, `<`, `>=`, `<=`
- **Membership**: `in`, `not in`
- **Logical**: `and`, `or`
- **Grouping**: `()` for precedence

**Valid Examples**:
```python
# Simple equality
"cell_type == 'neuron'"

# List membership
"tissue in ['kidney', 'liver']"

# Complex conditions
"cell_type == 'neuron' and tissue in ['brain', 'spinal cord']"

# Numeric comparisons
"n_genes > 500 and total_counts < 10000"
```

### 🔧 Recommendation: Replace Regex Parser

The current regex-based parsing (`enhancer.py:447-483`) should be **replaced with proper SOMA syntax validation**:

1. **Use AST parsing** instead of regex to validate query structure
2. **Leverage existing SOMA validation** by testing queries against dummy DataFrames
3. **Add syntax validation** before executing expensive SPARQL queries
4. **Support the full SOMA syntax** rather than limited regex patterns

This would eliminate the major bug risk from malformed query generation and ensure compatibility with the official SOMA specification.
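The AST-parsing recommendation above can be sketched with the standard library alone. This is a minimal illustration, not part of the package or the SOMA spec: the helper name `validate_value_filter` and the whitelist of allowed node types are assumptions, relying on the observation that the value_filter examples above are also syntactically valid Python expressions.

```python
import ast

# Node types corresponding to the SOMA value_filter grammar shown above:
# comparisons (==, !=, <, >, <=, >=, in, not in), boolean logic (and, or),
# column names (identifiers), literals, and lists/tuples of literals.
_ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    ast.In, ast.NotIn, ast.Name, ast.Load, ast.Constant,
    ast.List, ast.Tuple, ast.UnaryOp, ast.USub,  # USub allows negative numbers
)


def validate_value_filter(expr: str) -> bool:
    """Return True if expr parses as a plausible SOMA value_filter string."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    # Reject anything outside the whitelist (function calls, subscripts, etc.)
    return all(isinstance(node, _ALLOWED_NODES) for node in ast.walk(tree))


print(validate_value_filter("tissue in ['kidney', 'liver']"))            # True
print(validate_value_filter("cell_type == 'neuron' and n_genes > 500"))  # True
print(validate_value_filter("tissue in ['kidney'"))                      # False: unbalanced bracket
print(validate_value_filter("__import__('os')"))                         # False: calls not allowed
```

Running such a check before any SPARQL work would catch malformed filters (and regex-mangled output) cheaply; a stricter version could also verify column names against the Census schema.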
diff --git a/performance_bottleneck_analysis.md b/performance_bottleneck_analysis.md
new file mode 100644
index 0000000..c04c09d
--- /dev/null
+++ b/performance_bottleneck_analysis.md
@@ -0,0 +1,115 @@

# Performance Bottleneck Analysis: Label Resolution

## 🔍 Test Results Summary

**Test Case**: `get_subclasses('astrocyte', 'cell_type')`

### Critical Performance Findings

| Operation | Time (seconds) | Percentage of Total |
|-----------|----------------|---------------------|
| **Label Resolution** | **40.49s - 53.53s** | **~99%** |
| Subclass Query | 0.42s - 0.55s | ~1% |
| **TOTAL** | **~41-54s** | **100%** |

### 🐛 Root Cause: Complex SPARQL Label Resolution Query

The bottleneck is in `OntologyExtractor.get_ontology_id_from_label()` at **lines 207-246** of `enhancer.py`.

**The Problematic Query Structure**:
```sparql
SELECT DISTINCT ?term WHERE {
  # 5 UNION clauses searching across multiple synonym types
  { ?term rdfs:label ?label . FILTER(LCASE(?label) = LCASE("astrocyte")) }
  UNION
  { ?term oboInOwl:hasExactSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
  UNION
  { ?term oboInOwl:hasRelatedSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
  UNION
  { ?term oboInOwl:hasBroadSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }
  UNION
  { ?term oboInOwl:hasNarrowSynonym ?synonym . FILTER(LCASE(?synonym) = LCASE("astrocyte")) }

  FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
}
```

### Why This Query Is Slow

1. **5 UNION operations**: Each UNION clause requires a separate scan
2. **String functions**: `LCASE()` and `STRSTARTS()` are expensive on large datasets
3. **Multiple property searches**: Searching across 5 different synonym properties
4. **Full graph scan**: Without proper indexes, this scans the entire knowledge graph

### Performance Comparison

```
Direct SPARQL Queries:
├── Label Resolution: 40.49s - 53.53s ⚠️ BOTTLENECK
└── Subclass Lookup:  0.42s - 0.55s   ✅ FAST

Workflow Impact:
├── With Labels:    ~41s total (99% label resolution)
└── With Direct ID: ~0.4s total (100x faster!)
```

## 💡 Optimization Recommendations

### Immediate Fixes (High Impact)

1. **Optimize the SPARQL query** - rewrite it to use more efficient patterns:
   ```sparql
   # Instead of 5 UNIONs, use a single alternative property path
   SELECT DISTINCT ?term WHERE {
     ?term (rdfs:label|oboInOwl:hasExactSynonym|oboInOwl:hasRelatedSynonym|
            oboInOwl:hasBroadSynonym|oboInOwl:hasNarrowSynonym) ?labelValue .
     FILTER(LCASE(STR(?labelValue)) = LCASE("astrocyte"))
     FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_"))
   }
   ```

2. **Add a local caching layer**:
   ```python
   @lru_cache(maxsize=1000)
   def get_ontology_id_from_label_cached(label, category, organism=None):
       # Cache resolved label->ID mappings across calls
       ...
   ```

3. **Implement request timeouts**:
   ```python
   # In SPARQLClient.__init__
   self.sparql.setTimeout(30)  # 30-second timeout
   ```

### Medium-Term Solutions

1. **Pre-build a label index**: Create a local SQLite database with label→ID mappings
2. **Batch processing**: Resolve multiple labels in a single SPARQL query
3. **Fallback strategy**: Try an exact label match first, then synonyms only if needed

### Long-Term Architecture Changes

1. **Local ontology cache**: Download and index commonly used ontologies locally
2. **Async processing**: Use asyncio for parallel SPARQL queries
3. **Smart defaults**: Provide common cell-type mappings as built-in constants

## 📊 Impact Analysis

**Current User Experience**:
- Simple query: 40+ seconds (unusable)
- Multiple terms: 2+ minutes per term (completely unusable)

**After Optimization**:
- Target: <5 seconds for label resolution
- Best case: <1 second with proper caching
- Improvement: **8-40x speedup**

## 🎯 Recommended Implementation Priority

1. **CRITICAL**: Add a SPARQL timeout (5 minutes to implement)
2. **HIGH**: Optimize the SPARQL query structure (2 hours)
3. **HIGH**: Add LRU caching for label resolution (1 hour)
4. **MEDIUM**: Build a label→ID lookup table (1 day)
5. **LOW**: Implement async processing (1 week)

The **40+ second delay** for a single label lookup makes the current system unusable in production. The label-resolution SPARQL query needs immediate optimization.

diff --git a/performance_test.py b/performance_test.py
new file mode 100644
index 0000000..044f747
--- /dev/null
+++ b/performance_test.py
@@ -0,0 +1,221 @@

#!/usr/bin/env python3
"""
Performance testing script to identify bottlenecks in OntologyExtractor operations.
Focus on testing the 'astrocyte' label resolution and subclass queries.
+""" + +import time +import logging +from cxg_query_enhancer import OntologyExtractor, SPARQLClient + +# Configure logging to see timing information +logging.basicConfig( + level=logging.INFO, + format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", + handlers=[logging.StreamHandler()], +) + +logger = logging.getLogger(__name__) + + +def time_function(func_name, func, *args, **kwargs): + """Time a function and log results.""" + logger.info(f"🔍 Starting {func_name}...") + start_time = time.time() + + try: + result = func(*args, **kwargs) + end_time = time.time() + duration = end_time - start_time + + logger.info(f"✅ {func_name} completed in {duration:.2f} seconds") + return result, duration + except Exception as e: + end_time = time.time() + duration = end_time - start_time + logger.error(f"❌ {func_name} failed after {duration:.2f} seconds: {e}") + return None, duration + + +def test_label_resolution_bottleneck(): + """Test the specific bottleneck for 'astrocyte' label resolution.""" + logger.info("=" * 60) + logger.info("PERFORMANCE TEST: Label Resolution Bottleneck Analysis") + logger.info("=" * 60) + + # Initialize components + sparql_client = SPARQLClient() + extractor = OntologyExtractor(sparql_client) + + # Test case: 'astrocyte' label + test_label = "astrocyte" + test_category = "cell_type" + + logger.info(f"Testing with label: '{test_label}', category: '{test_category}'") + + # Step 1: Test get_ontology_id_from_label (suspected bottleneck) + resolved_id, duration1 = time_function( + "get_ontology_id_from_label", + extractor.get_ontology_id_from_label, + test_label, test_category + ) + + if resolved_id: + logger.info(f"📍 Resolved '{test_label}' to ID: {resolved_id}") + + # Step 2: Test get_subclasses with the resolved ID + subclasses, duration2 = time_function( + "get_subclasses (with resolved ID)", + extractor.get_subclasses, + resolved_id, test_category + ) + + if subclasses: + logger.info(f"📊 Found {len(subclasses)} subclasses") + 
logger.info("First 3 subclasses:") + for i, subclass in enumerate(subclasses[:3]): + logger.info(f" {i+1}. {subclass['ID']} - {subclass['Label']}") + + # Step 3: Compare with direct ID input (should be faster) + logger.info("\n" + "=" * 40) + logger.info("COMPARISON: Direct ID vs Label Input") + logger.info("=" * 40) + + direct_subclasses, duration3 = time_function( + "get_subclasses (direct ID input)", + extractor.get_subclasses, + resolved_id, test_category + ) + + # Summary + logger.info("\n" + "=" * 40) + logger.info("PERFORMANCE SUMMARY") + logger.info("=" * 40) + logger.info(f"1. Label Resolution: {duration1:.2f}s") + logger.info(f"2. Subclasses (label): {duration2:.2f}s") + logger.info(f"3. Subclasses (direct): {duration3:.2f}s") + logger.info(f"Total (label workflow): {duration1 + duration2:.2f}s") + logger.info(f"Direct ID speedup: {(duration2/duration3):.1f}x faster") + + # Identify bottleneck + if duration1 > duration2: + logger.warning("🐛 BOTTLENECK IDENTIFIED: Label resolution is the slowest step!") + else: + logger.warning("🐛 BOTTLENECK IDENTIFIED: Subclass retrieval is the slowest step!") + + else: + logger.error("❌ Could not resolve label - cannot continue with subclass testing") + + +def test_sparql_query_components(): + """Test individual SPARQL query components to identify specific bottlenecks.""" + logger.info("\n" + "=" * 60) + logger.info("DETAILED SPARQL QUERY ANALYSIS") + logger.info("=" * 60) + + sparql_client = SPARQLClient() + + # Test 1: Simple label resolution query (what get_ontology_id_from_label does) + label_query = """ + PREFIX rdfs: + PREFIX obo: + PREFIX oboInOwl: + + SELECT DISTINCT ?term + WHERE { + # Match main label + { + ?term rdfs:label ?label . + FILTER(LCASE(?label) = LCASE("astrocyte")) + } + UNION + # Match exact synonyms + { + ?term oboInOwl:hasExactSynonym ?synonym . + FILTER(LCASE(?synonym) = LCASE("astrocyte")) + } + UNION + # Match related synonyms + { + ?term oboInOwl:hasRelatedSynonym ?synonym . 
+ FILTER(LCASE(?synonym) = LCASE("astrocyte")) + } + UNION + # Match broad synonyms + { + ?term oboInOwl:hasBroadSynonym ?synonym . + FILTER(LCASE(?synonym) = LCASE("astrocyte")) + } + UNION + # Match narrow synonyms + { + ?term oboInOwl:hasNarrowSynonym ?synonym . + FILTER(LCASE(?synonym) = LCASE("astrocyte")) + } + FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_")) + } + LIMIT 1 + """ + + result1, duration1 = time_function( + "Direct label resolution SPARQL", + sparql_client.query, + label_query + ) + + # Test 2: Subclasses query (assumes astrocyte is CL:0000127) + subclass_query = """ + PREFIX obo: + PREFIX rdfs: + + SELECT DISTINCT ?term (STR(?term_label) as ?label) + WHERE { + VALUES ?inputTerm { obo:CL_0000127 } + + { + ?term rdfs:subClassOf ?inputTerm . + } + UNION + { + ?term obo:BFO_0000050 ?inputTerm . + } + + ?term rdfs:label ?term_label . + FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/CL_")) + } + LIMIT 1000 + """ + + result2, duration2 = time_function( + "Direct subclasses SPARQL", + sparql_client.query, + subclass_query + ) + + logger.info("\n" + "=" * 40) + logger.info("SPARQL QUERY TIMING COMPARISON") + logger.info("=" * 40) + logger.info(f"Label resolution query: {duration1:.2f}s") + logger.info(f"Subclasses query: {duration2:.2f}s") + + if duration1 > duration2: + logger.warning("🐛 CONFIRMED: Label resolution SPARQL is the bottleneck!") + logger.info("💡 RECOMMENDATION: Cache label-to-ID mappings or use direct IDs when possible") + else: + logger.warning("🐛 UNEXPECTED: Subclasses query is slower than label resolution") + + +if __name__ == "__main__": + logger.info("🚀 Starting Performance Analysis for 'astrocyte' label...") + + try: + # Test the full workflow + test_label_resolution_bottleneck() + + # Test individual SPARQL components + test_sparql_query_components() + + logger.info("\n🎯 Performance testing completed!") + + except Exception as e: + logger.error(f"💥 Performance test failed: {e}", exc_info=True) \ No 
newline at end of file