Skip to content

Configure Elasticsearch indexes with 1 shard and 0 replicas#962

Merged
abartov merged 2 commits intomasterfrom
feature/elasticsearch-single-shard-no-replicas
Jan 29, 2026
Merged

Configure Elasticsearch indexes with 1 shard and 0 replicas#962
abartov merged 2 commits intomasterfrom
feature/elasticsearch-single-shard-no-replicas

Conversation

@abartov
Copy link
Collaborator

@abartov abartov commented Jan 29, 2026

Summary

Updates all 6 Chewy Elasticsearch index classes to use optimized settings for single-node environments:

  • number_of_shards: 1
  • number_of_replicas: 0

Changes

Modified the following index files to add settings configuration:

  • app/chewy/manifestations_index.rb
  • app/chewy/authorities_index.rb
  • app/chewy/collections_index.rb
  • app/chewy/dict_index.rb
  • app/chewy/manifestations_autocomplete_index.rb
  • app/chewy/authorities_autocomplete_index.rb

Benefits

  1. Reduced resource overhead - Single shard requires less memory and CPU
  2. No replica overhead - Zero replicas appropriate for development/single-node setups
  3. Faster indexing - Fewer shards means faster document indexing
  4. Simpler management - Easier to reason about index health with single shard

Test Plan

  • All existing Elasticsearch and search tests pass (72 examples)
  • Full test suite passes
  • RuboCop linting applied to modified files

Notes

These settings are particularly beneficial for:

  • Development environments
  • Single-node Elasticsearch instances
  • Smaller datasets that don't require horizontal scaling

For production multi-node clusters with large datasets, these settings may need adjustment based on scaling requirements.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings January 29, 2026 18:36
field :raw_publication_date, value: ->(manifestation) { manifestation.expression.date }
field :orig_publication_date, type: 'date', value: ->(manifestation) { normalize_date(manifestation.expression.date) }
# field :video_count, type: 'integer', value: ->(manifestation){ manifestation.video_count}
# field :recommendation_count, type: 'integer', value: ->(manifestation){manifestation.recommendations.all_approved.count}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/LineLength: Line is too long. [124/120]

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR claims to configure Elasticsearch indexes with optimized settings (1 shard, 0 replicas), but actually contains substantial additional changes that address a concurrency issue in the RefreshUncollectedWorksCollection service. The PR includes:

Changes:

  • Elasticsearch index settings updated across 6 index files to use 1 shard and 0 replicas
  • Major refactoring of RefreshUncollectedWorksCollection service with pessimistic locking and transaction handling to prevent race conditions
  • New rake task for cleaning up orphaned uncollected collections
  • Comprehensive test coverage for concurrency scenarios and cleanup task

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
app/chewy/manifestations_index.rb Added Elasticsearch settings and applied code formatting improvements
app/chewy/authorities_index.rb Added Elasticsearch settings configuration
app/chewy/collections_index.rb Added Elasticsearch settings configuration
app/chewy/dict_index.rb Added Elasticsearch settings and applied code formatting
app/chewy/manifestations_autocomplete_index.rb Added Elasticsearch settings configuration
app/chewy/authorities_autocomplete_index.rb Added Elasticsearch settings configuration
app/services/refresh_uncollected_works_collection.rb Complete refactoring with database transactions and pessimistic locking to prevent race conditions
lib/tasks/cleanup_orphaned_uncollected_collections.rake New rake task to identify and clean up orphaned collections with dry-run and execute modes
spec/services/refresh_uncollected_works_collection_spec.rb Added concurrency tests for the service
spec/lib/tasks/cleanup_orphaned_uncollected_collections_rake_spec.rb Comprehensive test coverage for the new rake task

Comment on lines 114 to 147
# Helper method to identify the owning authority from collection items
def identify_authority_from_items(collection)
# Get all authorities from collection items by examining manifestations
authority_ids = []

collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci|
next if ci.item.blank?

manifestation = ci.item

# Get authorities from both work and expression level (authors, translators, editors)
# We prioritize work-level authorities (authors) over expression-level (translators, editors)
work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work')
.where(role: %i(author illustrator))
.pluck(:authority_id)
expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression')
.where(role: %i(translator editor))
.pluck(:authority_id)

# Prioritize work-level authorities (authors) since uncollected collections typically belong to authors
authority_ids += if work_authority_ids.any?
work_authority_ids
else
expression_authority_ids
end
end

authority_ids = authority_ids.uniq

# If all items belong to same authority, that's likely the owner
return nil unless authority_ids.length == 1

Authority.find_by(id: authority_ids.first)
end
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper method is defined at the top level of the rake file, which means it becomes a global method that pollutes the global namespace. In Rails rake tasks, helper methods should be defined within the task block or in a proper Ruby module/class. Consider moving this to a service class or at minimum wrapping it in a module to avoid namespace pollution.

Suggested change
# Helper method to identify the owning authority from collection items
def identify_authority_from_items(collection)
# Get all authorities from collection items by examining manifestations
authority_ids = []
collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci|
next if ci.item.blank?
manifestation = ci.item
# Get authorities from both work and expression level (authors, translators, editors)
# We prioritize work-level authorities (authors) over expression-level (translators, editors)
work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work')
.where(role: %i(author illustrator))
.pluck(:authority_id)
expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression')
.where(role: %i(translator editor))
.pluck(:authority_id)
# Prioritize work-level authorities (authors) since uncollected collections typically belong to authors
authority_ids += if work_authority_ids.any?
work_authority_ids
else
expression_authority_ids
end
end
authority_ids = authority_ids.uniq
# If all items belong to same authority, that's likely the owner
return nil unless authority_ids.length == 1
Authority.find_by(id: authority_ids.first)
end
module CleanupOrphanedUncollectedCollectionsHelper
# Helper method to identify the owning authority from collection items
def identify_authority_from_items(collection)
# Get all authorities from collection items by examining manifestations
authority_ids = []
collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci|
next if ci.item.blank?
manifestation = ci.item
# Get authorities from both work and expression level (authors, translators, editors)
# We prioritize work-level authorities (authors) over expression-level (translators, editors)
work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work')
.where(role: %i(author illustrator))
.pluck(:authority_id)
expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression')
.where(role: %i(translator editor))
.pluck(:authority_id)
# Prioritize work-level authorities (authors) since uncollected collections typically belong to authors
authority_ids += if work_authority_ids.any?
work_authority_ids
else
expression_authority_ids
end
end
authority_ids = authority_ids.uniq
# If all items belong to same authority, that's likely the owner
return nil unless authority_ids.length == 1
Authority.find_by(id: authority_ids.first)
end
end
include CleanupOrphanedUncollectedCollectionsHelper

Copilot uses AI. Check for mistakes.
Comment on lines +10 to +16
index_scope DictionaryEntry.all

field :id, type: 'integer'
field :manifestation_id, type: 'integer'
field :defhead
field :deftext, value: ->(entry) {html2txt(entry.deftext).gsub("\n\n\n","\n\n")}
field :aliases, type: 'keyword', value: ->(entry){ entry.aliases.map(&:alias) }
field :deftext, value: ->(entry) { html2txt(entry.deftext).gsub("\n\n\n", "\n\n") }
field :aliases, type: 'keyword', value: ->(entry) { entry.aliases.map(&:alias) }
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file includes formatting changes (spacing in lambdas, line 10 and 15-16) that are not mentioned in the PR description. These should be documented or ideally separated into a style-focused commit.

Copilot uses AI. Check for mistakes.
Comment on lines 1 to 7
# frozen_string_literal: true

class ManifestationsIndex < Chewy::Index
settings index: {
number_of_shards: 1,
number_of_replicas: 0
}
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title and description only mention adding Elasticsearch index settings (number_of_shards and number_of_replicas), but this PR actually includes substantial additional changes:

  1. Major refactoring of RefreshUncollectedWorksCollection service with database locking and transaction handling
  2. A new rake task for cleaning up orphaned collections
  3. Extensive test coverage for both

These additional changes are significant and should be documented in the PR description. The service refactoring introduces concurrency control mechanisms (pessimistic locking, transactions) that are not mentioned at all in the PR description, which could be misleading to reviewers.

Copilot uses AI. Check for mistakes.
Comment on lines 47 to 54
# Save collection first
collection.save!

# Link collection to authority and save - this ensures referential integrity within transaction
# If this was a new collection, link it to the authority now
if is_new_collection
locked_authority.uncollected_works_collection = collection
locked_authority.save!
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pessimistic locking strategy used here (Authority.lock.find) will acquire a row-level lock on the authority record. However, the collection record is created and saved before being linked to the authority. This creates a window where:

  1. The collection exists in the database (line 48: collection.save!)
  2. But is not yet linked to the authority (lines 52-54)

If the transaction is rolled back after line 48 but before line 54 (e.g., due to an error in recalculate_manifestations_count!), an orphaned collection will be created. Consider either:

  • Creating and linking the collection in a single atomic operation
  • Using a savepoint before collection.save! and rolling back to it if the authority linking fails
  • Moving the collection.save! to after the authority link is established (though this may have ordering constraints with collection_items)

Copilot uses AI. Check for mistakes.
Comment on lines 59 to 61
# Re-enable automatic updates and manually recalculate the count
collection.skip_manifestations_count_update = false
collection.recalculate_manifestations_count!
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the collection is saved (line 48), if recalculate_manifestations_count! (line 61) raises an exception, the entire transaction will be rolled back. However, this means that any Elasticsearch updates or other side effects that occurred during collection.save! or during the build operations may leave the system in an inconsistent state. Consider whether Chewy.strategy(:atomic) should be used within this transaction to ensure Elasticsearch updates are deferred until after the transaction commits successfully.

Copilot uses AI. Check for mistakes.
Updates all 6 Chewy Elasticsearch index classes to use:
- number_of_shards: 1
- number_of_replicas: 0

This configuration is appropriate for single-node development
environments and reduces resource overhead.

Indexes updated:
- ManifestationsIndex
- AuthoritiesIndex
- CollectionsIndex
- DictIndex
- ManifestationsAutocompleteIndex
- AuthoritiesAutocompleteIndex

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@abartov abartov force-pushed the feature/elasticsearch-single-shard-no-replicas branch from 1813e17 to 58cfac3 Compare January 29, 2026 18:55
@abartov abartov requested a review from Copilot January 29, 2026 18:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

@abartov abartov merged commit 85e385e into master Jan 29, 2026
4 checks passed
@abartov abartov deleted the feature/elasticsearch-single-shard-no-replicas branch January 29, 2026 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant