Configure Elasticsearch indexes with 1 shard and 0 replicas#962
Configure Elasticsearch indexes with 1 shard and 0 replicas#962
Conversation
| field :raw_publication_date, value: ->(manifestation) { manifestation.expression.date } | ||
| field :orig_publication_date, type: 'date', value: ->(manifestation) { normalize_date(manifestation.expression.date) } | ||
| # field :video_count, type: 'integer', value: ->(manifestation){ manifestation.video_count} | ||
| # field :recommendation_count, type: 'integer', value: ->(manifestation){manifestation.recommendations.all_approved.count} |
There was a problem hiding this comment.
Layout/LineLength: Line is too long. [124/120]
There was a problem hiding this comment.
Pull request overview
This PR claims to configure Elasticsearch indexes with optimized settings (1 shard, 0 replicas), but actually contains substantial additional changes that address a concurrency issue in the RefreshUncollectedWorksCollection service. The PR includes:
Changes:
- Elasticsearch index settings updated across 6 index files to use 1 shard and 0 replicas
- Major refactoring of RefreshUncollectedWorksCollection service with pessimistic locking and transaction handling to prevent race conditions
- New rake task for cleaning up orphaned uncollected collections
- Comprehensive test coverage for concurrency scenarios and cleanup task
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| app/chewy/manifestations_index.rb | Added Elasticsearch settings and applied code formatting improvements |
| app/chewy/authorities_index.rb | Added Elasticsearch settings configuration |
| app/chewy/collections_index.rb | Added Elasticsearch settings configuration |
| app/chewy/dict_index.rb | Added Elasticsearch settings and applied code formatting |
| app/chewy/manifestations_autocomplete_index.rb | Added Elasticsearch settings configuration |
| app/chewy/authorities_autocomplete_index.rb | Added Elasticsearch settings configuration |
| app/services/refresh_uncollected_works_collection.rb | Complete refactoring with database transactions and pessimistic locking to prevent race conditions |
| lib/tasks/cleanup_orphaned_uncollected_collections.rake | New rake task to identify and clean up orphaned collections with dry-run and execute modes |
| spec/services/refresh_uncollected_works_collection_spec.rb | Added concurrency tests for the service |
| spec/lib/tasks/cleanup_orphaned_uncollected_collections_rake_spec.rb | Comprehensive test coverage for the new rake task |
| # Helper method to identify the owning authority from collection items | ||
| def identify_authority_from_items(collection) | ||
| # Get all authorities from collection items by examining manifestations | ||
| authority_ids = [] | ||
|
|
||
| collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci| | ||
| next if ci.item.blank? | ||
|
|
||
| manifestation = ci.item | ||
|
|
||
| # Get authorities from both work and expression level (authors, translators, editors) | ||
| # We prioritize work-level authorities (authors) over expression-level (translators, editors) | ||
| work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work') | ||
| .where(role: %i(author illustrator)) | ||
| .pluck(:authority_id) | ||
| expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression') | ||
| .where(role: %i(translator editor)) | ||
| .pluck(:authority_id) | ||
|
|
||
| # Prioritize work-level authorities (authors) since uncollected collections typically belong to authors | ||
| authority_ids += if work_authority_ids.any? | ||
| work_authority_ids | ||
| else | ||
| expression_authority_ids | ||
| end | ||
| end | ||
|
|
||
| authority_ids = authority_ids.uniq | ||
|
|
||
| # If all items belong to same authority, that's likely the owner | ||
| return nil unless authority_ids.length == 1 | ||
|
|
||
| Authority.find_by(id: authority_ids.first) | ||
| end |
There was a problem hiding this comment.
This helper method is defined at the top level of the rake file, which means it becomes a global method that pollutes the global namespace. In Rails rake tasks, helper methods should be defined within the task block or in a proper Ruby module/class. Consider moving this to a service class or at minimum wrapping it in a module to avoid namespace pollution.
| # Helper method to identify the owning authority from collection items | |
| def identify_authority_from_items(collection) | |
| # Get all authorities from collection items by examining manifestations | |
| authority_ids = [] | |
| collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci| | |
| next if ci.item.blank? | |
| manifestation = ci.item | |
| # Get authorities from both work and expression level (authors, translators, editors) | |
| # We prioritize work-level authorities (authors) over expression-level (translators, editors) | |
| work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work') | |
| .where(role: %i(author illustrator)) | |
| .pluck(:authority_id) | |
| expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression') | |
| .where(role: %i(translator editor)) | |
| .pluck(:authority_id) | |
| # Prioritize work-level authorities (authors) since uncollected collections typically belong to authors | |
| authority_ids += if work_authority_ids.any? | |
| work_authority_ids | |
| else | |
| expression_authority_ids | |
| end | |
| end | |
| authority_ids = authority_ids.uniq | |
| # If all items belong to same authority, that's likely the owner | |
| return nil unless authority_ids.length == 1 | |
| Authority.find_by(id: authority_ids.first) | |
| end | |
| module CleanupOrphanedUncollectedCollectionsHelper | |
| # Helper method to identify the owning authority from collection items | |
| def identify_authority_from_items(collection) | |
| # Get all authorities from collection items by examining manifestations | |
| authority_ids = [] | |
| collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci| | |
| next if ci.item.blank? | |
| manifestation = ci.item | |
| # Get authorities from both work and expression level (authors, translators, editors) | |
| # We prioritize work-level authorities (authors) over expression-level (translators, editors) | |
| work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work') | |
| .where(role: %i(author illustrator)) | |
| .pluck(:authority_id) | |
| expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression') | |
| .where(role: %i(translator editor)) | |
| .pluck(:authority_id) | |
| # Prioritize work-level authorities (authors) since uncollected collections typically belong to authors | |
| authority_ids += if work_authority_ids.any? | |
| work_authority_ids | |
| else | |
| expression_authority_ids | |
| end | |
| end | |
| authority_ids = authority_ids.uniq | |
| # If all items belong to same authority, that's likely the owner | |
| return nil unless authority_ids.length == 1 | |
| Authority.find_by(id: authority_ids.first) | |
| end | |
| end | |
| include CleanupOrphanedUncollectedCollectionsHelper |
| index_scope DictionaryEntry.all | ||
|
|
||
| field :id, type: 'integer' | ||
| field :manifestation_id, type: 'integer' | ||
| field :defhead | ||
| field :deftext, value: ->(entry) {html2txt(entry.deftext).gsub("\n\n\n","\n\n")} | ||
| field :aliases, type: 'keyword', value: ->(entry){ entry.aliases.map(&:alias) } | ||
| field :deftext, value: ->(entry) { html2txt(entry.deftext).gsub("\n\n\n", "\n\n") } | ||
| field :aliases, type: 'keyword', value: ->(entry) { entry.aliases.map(&:alias) } |
There was a problem hiding this comment.
This file includes formatting changes (spacing in lambdas, line 10 and 15-16) that are not mentioned in the PR description. These should be documented or ideally separated into a style-focused commit.
| # frozen_string_literal: true | ||
|
|
||
| class ManifestationsIndex < Chewy::Index | ||
| settings index: { | ||
| number_of_shards: 1, | ||
| number_of_replicas: 0 | ||
| } |
There was a problem hiding this comment.
The PR title and description only mention adding Elasticsearch index settings (number_of_shards and number_of_replicas), but this PR actually includes substantial additional changes:
- Major refactoring of RefreshUncollectedWorksCollection service with database locking and transaction handling
- A new rake task for cleaning up orphaned collections
- Extensive test coverage for both
These additional changes are significant and should be documented in the PR description. The service refactoring introduces concurrency control mechanisms (pessimistic locking, transactions) that are not mentioned at all in the PR description, which could be misleading to reviewers.
| # Save collection first | ||
| collection.save! | ||
|
|
||
| # Link collection to authority and save - this ensures referential integrity within transaction | ||
| # If this was a new collection, link it to the authority now | ||
| if is_new_collection | ||
| locked_authority.uncollected_works_collection = collection | ||
| locked_authority.save! |
There was a problem hiding this comment.
The pessimistic locking strategy used here (Authority.lock.find) will acquire a row-level lock on the authority record. However, the collection record is created and saved before being linked to the authority. This creates a window where:
- The collection exists in the database (line 48: collection.save!)
- But is not yet linked to the authority (lines 52-54)
If the transaction is rolled back after line 48 but before line 54 (e.g., due to an error in recalculate_manifestations_count!), an orphaned collection will be created. Consider either:
- Creating and linking the collection in a single atomic operation
- Using a savepoint before collection.save! and rolling back to it if the authority linking fails
- Moving the collection.save! to after the authority link is established (though this may have ordering constraints with collection_items)
| # Re-enable automatic updates and manually recalculate the count | ||
| collection.skip_manifestations_count_update = false | ||
| collection.recalculate_manifestations_count! |
There was a problem hiding this comment.
After the collection is saved (line 48), if recalculate_manifestations_count! (line 61) raises an exception, the entire transaction will be rolled back. However, this means that any Elasticsearch updates or other side effects that occurred during collection.save! or during the build operations may leave the system in an inconsistent state. Consider whether Chewy.strategy(:atomic) should be used within this transaction to ensure Elasticsearch updates are deferred until after the transaction commits successfully.
Updates all 6 Chewy Elasticsearch index classes to use: - number_of_shards: 1 - number_of_replicas: 0 This configuration is appropriate for single-node development environments and reduces resource overhead. Indexes updated: - ManifestationsIndex - AuthoritiesIndex - CollectionsIndex - DictIndex - ManifestationsAutocompleteIndex - AuthoritiesAutocompleteIndex 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1813e17 to
58cfac3
Compare
Summary
Updates all 6 Chewy Elasticsearch index classes to use optimized settings for single-node environments:
number_of_shards: 1number_of_replicas: 0Changes
Modified the following index files to add
settingsconfiguration:app/chewy/manifestations_index.rbapp/chewy/authorities_index.rbapp/chewy/collections_index.rbapp/chewy/dict_index.rbapp/chewy/manifestations_autocomplete_index.rbapp/chewy/authorities_autocomplete_index.rbBenefits
Test Plan
Notes
These settings are particularly beneficial for:
For production multi-node clusters with large datasets, these settings may need adjustment based on scaling requirements.
🤖 Generated with Claude Code