Feat/algolia migration #20302
✅ Deploy Preview for cockroachdb-api-docs canceled.
✅ Deploy Preview for cockroachdb-interactivetutorials-docs canceled.
| **`check_ranking_parity.py`** | Production parity verification | ❌ Optional validation |
| **`compare_to_prod_explain.py`** | Index comparison analysis | ❌ Optional analysis |
| **`test_all_files.py`** | File processing validation | ❌ Dev only |
| **`algolia_index_prod_match.py`** | Legacy production matcher | ❌ Reference only |
What does this mean going forward? Will this become obsolete when the new indexing version is released?
Sorry, I didn't follow. What would become obsolete?
**Indexing Rules:**
- ✅ Always include: `/releases/`, `/cockroachcloud/`, `/advisories/`, `/molt/`
- ✅ Include stable version files: Files containing `v25.3`
- ❌ Exclude old versions: `v24.x`, `v23.x`, etc.
Would corrections to old versions of code be indexed if changes are detected?
No. Corrections to old versions (v24.x, v23.x, etc.) will not be indexed, and this behavior is the same in both systems. Both the current Jekyll Algolia system and the new Python system use the versions.stable configuration to filter what gets indexed. They index only:
- The current stable version (v25.3)
- Evergreen content such as /releases/, /cockroachcloud/, /advisories/, and /molt/
Even if old-version files are modified, the version detection logic in both systems explicitly excludes them. This is intentional: it keeps the search index focused on current, relevant documentation and prevents users from finding outdated information in search results.
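The filtering rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `should_index`, the `EVERGREEN_SECTIONS` constant, and the choice to match on whole path segments are all assumptions.

```python
# Hypothetical sketch of the version filter; names are illustrative only.
EVERGREEN_SECTIONS = ("releases", "cockroachcloud", "advisories", "molt")


def should_index(path: str, stable_version: str = "v25.3") -> bool:
    """Return True if a built page should be sent to the search index."""
    segments = [s for s in path.strip("/").split("/") if s]
    # Evergreen sections are always indexed, regardless of version.
    if any(segment in EVERGREEN_SECTIONS for segment in segments):
        return True
    # Otherwise only pages under the current stable version qualify
    # (assumption: the version appears as its own path segment).
    return stable_version in segments
```

Under these assumptions, `should_index("/docs/v24.1/create-table.html")` is rejected even if the file was just edited, while anything under `/releases/` is accepted.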
## 🧠 Intelligent Bloat Removal

### What Gets Removed
- **85K+ Duplicate Records**: Content deduplication using MD5 hashing
Does this process run every time? Will we need to run this after the first re-indexing?
The MD5 deduplication runs on every indexing run, but it is very fast (< 1 second). It is necessary because Jekyll generates many duplicate content blocks across pages. It will continue to run after the first re-indexing because it is part of the core content extraction logic: it prevents the same paragraph from appearing in multiple search results.
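For readers unfamiliar with hash-based deduplication, here is a minimal sketch of the idea. The record shape (a dict with a `content` key), the function name `dedupe_records`, and the whitespace normalization step are assumptions for illustration; the PR's real extraction code may differ.

```python
import hashlib


def dedupe_records(records):
    """Keep the first record for each distinct content block."""
    seen = set()
    unique = []
    for record in records:
        # Assumption: collapse whitespace so trivially different copies match.
        normalized = " ".join(record["content"].split())
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # Same content already emitted for another page.
        seen.add(digest)
        unique.append(record)
    return unique
```

MD5 is used here only as a fast content fingerprint, not for security, which is why a single pass over the records stays cheap.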
2. **Force Override**: `ALGOLIA_FORCE_FULL=true`
3. **Corrupted State**: Invalid state file
4. **Stale State**: State file >7 days old
5. **Content Changes**: Git commits affecting source files
What does this mean? Affecting which source files?
"Content Changes" refers to any Git commits that modify files in the Jekyll source directories and therefore affect the built site. Specifically:
- Any .md files in /src/current/docs/
- Any .html files in /src/current/_includes/ or /src/current/_layouts/
- Any changes to _config_cockroachdb.yml
- Any files that Jekyll processes to generate the _site/ directory
The system uses git log to detect whether any commits have occurred since the last successful indexing run.
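One way such a check could look is sketched below. It assumes the state file stores the commit hash of the last successful run and that the config file lives under `src/current/`; neither detail is confirmed in this thread, and `commits_since` and `WATCHED_PATHS` are hypothetical names.

```python
import subprocess

# Assumed locations, based on the directories listed above.
WATCHED_PATHS = (
    "src/current/docs",
    "src/current/_includes",
    "src/current/_layouts",
    "src/current/_config_cockroachdb.yml",  # assumed path
)


def commits_since(last_commit, paths=WATCHED_PATHS, run=subprocess.run):
    """Return hashes of commits after last_commit that touch watched paths."""
    cmd = ["git", "log", "--format=%H", f"{last_commit}..HEAD", "--", *paths]
    result = run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.split()
```

A non-empty return value would then count as "Content Changes". The `run` parameter exists only so the function can be exercised without a real repository.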
## 📊 Performance Metrics

### Size Optimization
What do these numbers represent?
These numbers are actual Algolia index record counts:
- 157,471 records: Current production index size (cockroachcloud_docs)
- ~350,000 records: What naive indexing would produce without bloat removal (these figures were generated earlier and were not uploaded)
- 55% smaller: The size reduction achieved by intelligent bloat removal (1 - 157,471 / 350,000 ≈ 0.55)
They are measured by counting searchable records in the Algolia index, not file sizes.
- **Cause**: First run or state file was deleted
- **Solution**: Normal - will do full indexing automatically

**❌ "Git commits detected"**
What changes result in incremental indexing?
Incremental indexing happens when:
- The state file exists and is less than 7 days old
- No Git commits have been made since the last successful indexing
- The _config_cockroachdb.yml file hasn't changed
- The previous indexing run completed successfully (>100 files tracked)
- ALGOLIA_FORCE_FULL is not set to true
Any other scenario triggers a full reindex for safety.
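The conditions above can be read as a single decision function. The following is a sketch under stated assumptions: the state is modeled as a dict with hypothetical `timestamp` and `tracked_files` keys, and the name `choose_mode` is illustrative rather than taken from the PR.

```python
import os
import time

MAX_STATE_AGE_SECONDS = 7 * 24 * 60 * 60  # state older than 7 days is stale
MIN_TRACKED_FILES = 100  # previous run must have tracked more than this


def choose_mode(state, has_new_commits, config_changed, env=None, now=None):
    """Return "incremental" only when every safety condition holds."""
    env = os.environ if env is None else env
    now = time.time() if now is None else now
    if env.get("ALGOLIA_FORCE_FULL", "").lower() == "true":
        return "full"
    try:
        age = now - float(state["timestamp"])
        tracked = int(state["tracked_files"])
    except (TypeError, KeyError, ValueError):
        return "full"  # missing or corrupted state
    if age >= MAX_STATE_AGE_SECONDS or tracked <= MIN_TRACKED_FILES:
        return "full"
    if has_new_commits or config_changed:
        return "full"
    return "incremental"
```

Defaulting to a full reindex whenever anything looks off matches the "for safety" rationale: a wrong incremental run can leave stale records, while an unnecessary full run only costs time.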
- ✅ Comprehensive test coverage (100% pass rate)
- ✅ Performance optimization and bloat removal

### Phase 2: Staging Deployment (Next)
Is this complete?
Yes. The code is ready, but the TeamCity job itself has not been created yet.
Once approved, it should not take long.
### 5. **Zero-Downtime Deployment**
Incremental indexing allows continuous updates without search interruption.

## 📞 Support
List contacts or a channel for support.
Makes sense, will do that
- v23.1/**
index_name: cockroachcloud_docs
index_name: stage_cockroach_docs
search_api_key: 372a10456f4ed7042c531ff3a658771b
We might consider making this an env var rather than including directly in the config in plain text.
Sure
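On the Python side, the suggestion could be handled with a small lookup that prefers the environment over the config file. This is only a sketch: the variable name `ALGOLIA_SEARCH_API_KEY` and the helper `resolve_search_key` are hypothetical and do not appear in the PR.

```python
import os


def resolve_search_key(config, env=None):
    """Prefer an environment variable; fall back to the config value."""
    env = os.environ if env is None else env
    # ALGOLIA_SEARCH_API_KEY is an assumed name, not one defined by the PR.
    return env.get("ALGOLIA_SEARCH_API_KEY") or config.get("search_api_key")
```

Keeping the config fallback would let local builds keep working while CI supplies the key through its own settings.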
]

# Content that should ALWAYS be preserved (even if short)
self.preserve_patterns = [
Are you sure we've captured all SQL commands and keywords?
Algolia Search Migration: Jekyll to Python
Replaces the Jekyll Algolia gem with a custom Python indexing system that provides intelligent content extraction, incremental updates, and production-ready CI/CD integration.
Key Benefits
(15-20 min)
removal
support
decision logic
index
Performance Improvements
Intelligent Features
Smart Decision Logic
Automatically chooses full vs incremental indexing based on:
Intelligent Bloat Removal
download repetition
release notes
Dynamic Version Detection
Automatically reads from _config_cockroachdb.yml
versions:
  stable: v25.3 # Detected and used for filtering
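As a rough illustration of the detection step, the sketch below pulls `versions.stable` out of the config text using only the standard library. It is an assumption-laden stand-in: the real indexer more likely uses a proper YAML parser, and `read_stable_version` is a hypothetical name.

```python
import re


def read_stable_version(config_text):
    """Return the value of versions.stable, or None if it is absent."""
    in_versions = False
    for line in config_text.splitlines():
        if re.match(r"^versions:\s*(#.*)?$", line):
            in_versions = True
            continue
        if not in_versions:
            continue
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments inside the block
        if not line.startswith((" ", "\t")):
            in_versions = False  # a new top-level key ends the block
            continue
        match = re.match(r"^\s+stable:\s*([^\s#]+)", line)
        if match:
            return match.group(1).strip("'\"")
    return None
```

Reading the value at run time means a new stable release only requires a config change, not an edit to the indexer.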
Files Changed
New Production Files
TeamCity
with bloat removal
Modified Files
detection
Removed Legacy Files
TeamCity Integration
Simple Deployment
Build Steps
Environment Variables
ALGOLIA_APP_ID=7RXZLDVR5F
ALGOLIA_ADMIN_API_KEY=
ALGOLIA_INDEX_ENVIRONMENT=staging|production
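A build step could validate these variables up front before touching the index. The sketch below is illustrative only: `AlgoliaSettings` and `load_settings` are hypothetical names, and the fail-fast behavior is an assumption about how the job should react to a missing or unexpected value.

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = ("ALGOLIA_APP_ID", "ALGOLIA_ADMIN_API_KEY", "ALGOLIA_INDEX_ENVIRONMENT")
VALID_ENVIRONMENTS = ("staging", "production")


@dataclass(frozen=True)
class AlgoliaSettings:
    app_id: str
    admin_api_key: str
    environment: str


def load_settings(env=None):
    """Read and validate the three variables listed above."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    environment = env["ALGOLIA_INDEX_ENVIRONMENT"]
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"Unexpected ALGOLIA_INDEX_ENVIRONMENT: {environment!r}")
    return AlgoliaSettings(
        app_id=env["ALGOLIA_APP_ID"],
        admin_api_key=env["ALGOLIA_ADMIN_API_KEY"],
        environment=environment,
    )
```

Failing before any API call keeps a misconfigured job from writing to the wrong index.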
Zero-Configuration Operation
Comprehensive Testing
validation, parity testing
coverage, full field compatibility