Configure Elasticsearch indexes with 1 shard and 0 replicas by abartov · Pull Request #962 · projectbenyehuda/bybe

abartov · 2026-01-29T18:36:34Z

Summary

Updates all 6 Chewy Elasticsearch index classes to use optimized settings for single-node environments:

number_of_shards: 1
number_of_replicas: 0
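
For reference, these two settings correspond to the `settings.index` section of the body Elasticsearch accepts when an index is created. A minimal sketch in plain Ruby, with no Chewy or Elasticsearch client involved; the constant and helper names are illustrative assumptions, not code from this PR:

```ruby
require 'json'

# The two settings from this PR as a plain Ruby hash.
SINGLE_NODE_SETTINGS = {
  number_of_shards: 1,   # fixed once the index has been created
  number_of_replicas: 0  # can be changed later on a live index
}.freeze

# Hypothetical helper: builds the JSON body of a create-index request.
# Chewy assembles the real request itself; this only shows the shape.
def create_index_body(index_settings)
  JSON.generate({ settings: { index: index_settings } })
end

puts create_index_body(SINGLE_NODE_SETTINGS)
```

Because the shard count cannot be changed in place, moving an existing index to a different shard count generally means recreating and reindexing it.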

Changes

Modified the following index files to add settings configuration:

app/chewy/manifestations_index.rb
app/chewy/authorities_index.rb
app/chewy/collections_index.rb
app/chewy/dict_index.rb
app/chewy/manifestations_autocomplete_index.rb
app/chewy/authorities_autocomplete_index.rb

Benefits

Reduced resource overhead - A single shard per index needs less memory and CPU than several shards
No replica overhead - Zero replicas suit development and single-node setups, where a replica could not be allocated anyway
Faster indexing - With no replicas, each document is written only once
Simpler management - Index health is easier to reason about with a single shard

Test Plan

All existing Elasticsearch and search tests pass (72 examples)
Full test suite passes
RuboCop linting applied to modified files

Notes

These settings are particularly beneficial for:

Development environments
Single-node Elasticsearch instances
Smaller datasets that don't require horizontal scaling

For production multi-node clusters with large datasets, these settings may need adjustment based on scaling requirements.
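
One way to make that adjustment explicit is to derive the replica count from the number of data nodes, since a replica is never allocated on the same node as its primary. The sketch below is hypothetical: the method names, the cap of one replica, and the node counts are assumptions, not part of this PR.

```ruby
# Pick a replica count the cluster can actually allocate.
# A replica shard cannot live on the same node as its primary, so a
# single-node cluster cannot place any replicas.
def replicas_for(data_node_count, max_replicas: 1)
  raise ArgumentError, 'need at least one data node' if data_node_count < 1

  [data_node_count - 1, max_replicas].min
end

# Hypothetical helper returning the settings hash for a given cluster size.
def index_settings_for(data_node_count)
  { number_of_shards: 1, number_of_replicas: replicas_for(data_node_count) }
end

p index_settings_for(1)
p index_settings_for(3)
```

Keeping the decision in one place would let development and production share the same index classes while differing only in the computed replica count.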

🤖 Generated with Claude Code

github-actions · 2026-01-29T18:37:28Z

app/chewy/manifestations_index.rb

+  field :raw_publication_date, value: ->(manifestation) { manifestation.expression.date }
+  field :orig_publication_date, type: 'date', value: ->(manifestation) { normalize_date(manifestation.expression.date) }
+  # field :video_count, type: 'integer', value: ->(manifestation){ manifestation.video_count}
+  # field :recommendation_count, type: 'integer', value: ->(manifestation){manifestation.recommendations.all_approved.count}


Layout/LineLength: Line is too long. [124/120]

Copilot

Pull request overview

This PR claims to configure Elasticsearch indexes with optimized settings (1 shard, 0 replicas), but actually contains substantial additional changes that address a concurrency issue in the RefreshUncollectedWorksCollection service. The PR includes:

Changes:

Elasticsearch index settings updated across 6 index files to use 1 shard and 0 replicas
Major refactoring of RefreshUncollectedWorksCollection service with pessimistic locking and transaction handling to prevent race conditions
New rake task for cleaning up orphaned uncollected collections
Comprehensive test coverage for concurrency scenarios and cleanup task
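
The race in question is a check-then-create: two workers both see that an authority has no uncollected-works collection, and both create one. As a rough in-memory analogy only, the sketch below uses a Ruby `Mutex` to stand in for the database row lock; the class and method names are invented for illustration and are not the service's code.

```ruby
# In-memory stand-in for "find or create the collection for an authority".
class CollectionRegistry
  attr_reader :created_count

  def initialize
    @collections = {}   # authority_id => collection name
    @created_count = 0
    @lock = Mutex.new   # plays the role of the row-level lock
  end

  # The check and the create happen under one lock, so a second caller
  # waits and then sees the collection the first caller created.
  def find_or_create(authority_id)
    @lock.synchronize do
      @collections[authority_id] ||= begin
        @created_count += 1
        "uncollected-#{authority_id}"
      end
    end
  end
end

registry = CollectionRegistry.new
threads = Array.new(8) { Thread.new { registry.find_or_create(42) } }
results = threads.map(&:value)
```

With the lock held across both steps, all eight callers receive the same collection and only one is ever created.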

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

File	Description
app/chewy/manifestations_index.rb	Added Elasticsearch settings and applied code formatting improvements
app/chewy/authorities_index.rb	Added Elasticsearch settings configuration
app/chewy/collections_index.rb	Added Elasticsearch settings configuration
app/chewy/dict_index.rb	Added Elasticsearch settings and applied code formatting
app/chewy/manifestations_autocomplete_index.rb	Added Elasticsearch settings configuration
app/chewy/authorities_autocomplete_index.rb	Added Elasticsearch settings configuration
app/services/refresh_uncollected_works_collection.rb	Complete refactoring with database transactions and pessimistic locking to prevent race conditions
lib/tasks/cleanup_orphaned_uncollected_collections.rake	New rake task to identify and clean up orphaned collections with dry-run and execute modes
spec/services/refresh_uncollected_works_collection_spec.rb	Added concurrency tests for the service
spec/lib/tasks/cleanup_orphaned_uncollected_collections_rake_spec.rb	Comprehensive test coverage for the new rake task

Copilot · 2026-01-29T18:40:57Z

lib/tasks/cleanup_orphaned_uncollected_collections.rake

+# Helper method to identify the owning authority from collection items
+def identify_authority_from_items(collection)
+  # Get all authorities from collection items by examining manifestations
+  authority_ids = []
+
+  collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci|
+    next if ci.item.blank?
+
+    manifestation = ci.item
+
+    # Get authorities from both work and expression level (authors, translators, editors)
+    # We prioritize work-level authorities (authors) over expression-level (translators, editors)
+    work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work')
+                                          .where(role: %i(author illustrator))
+                                          .pluck(:authority_id)
+    expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression')
+                                                .where(role: %i(translator editor))
+                                                .pluck(:authority_id)
+
+    # Prioritize work-level authorities (authors) since uncollected collections typically belong to authors
+    authority_ids += if work_authority_ids.any?
+                       work_authority_ids
+                     else
+                       expression_authority_ids
+                     end
+  end
+
+  authority_ids = authority_ids.uniq
+
+  # If all items belong to same authority, that's likely the owner
+  return nil unless authority_ids.length == 1
+
+  Authority.find_by(id: authority_ids.first)
+end


This helper method is defined at the top level of the rake file, which means it becomes a global method that pollutes the global namespace. In Rails rake tasks, helper methods should be defined within the task block or in a proper Ruby module/class. Consider moving this to a service class or at minimum wrapping it in a module to avoid namespace pollution.

Suggested change

module CleanupOrphanedUncollectedCollectionsHelper
  # Helper method to identify the owning authority from collection items
  def identify_authority_from_items(collection)
    # Get all authorities from collection items by examining manifestations
    authority_ids = []

    collection.collection_items.where(item_type: 'Manifestation').includes(item: { expression: :work }).find_each do |ci|
      next if ci.item.blank?

      manifestation = ci.item

      # Get authorities from both work and expression level (authors, translators, editors)
      # We prioritize work-level authorities (authors) over expression-level (translators, editors)
      work_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression.work_id, item_type: 'Work')
                                            .where(role: %i(author illustrator))
                                            .pluck(:authority_id)
      expression_authority_ids = InvolvedAuthority.where(item_id: manifestation.expression_id, item_type: 'Expression')
                                                  .where(role: %i(translator editor))
                                                  .pluck(:authority_id)

      # Prioritize work-level authorities (authors) since uncollected collections typically belong to authors
      authority_ids += if work_authority_ids.any?
                         work_authority_ids
                       else
                         expression_authority_ids
                       end
    end

    authority_ids = authority_ids.uniq

    # If all items belong to same authority, that's likely the owner
    return nil unless authority_ids.length == 1

    Authority.find_by(id: authority_ids.first)
  end
end

include CleanupOrphanedUncollectedCollectionsHelper
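
One caveat on the namespacing point: in Ruby, a top-level `include` mixes the module into `Object`, so the helper stays callable on every object. Calling the helper through the module avoids that. The sketch below shows that variant using `module_function`, with the ownership rule reduced to plain data so it runs without Rails; the module name, method names, and input shape are assumptions made for illustration.

```ruby
# Hypothetical helper module. The input is plain hashes rather than
# ActiveRecord objects so the sketch is self-contained.
module OrphanedCollectionOwner
  module_function

  # items: array of hashes with :work_authority_ids and
  # :expression_authority_ids (arrays of integer ids).
  # Returns the single owning authority id, or nil when there is no
  # candidate or more than one.
  def sole_owner_id(items)
    ids = items.flat_map { |item| preferred_ids(item) }.uniq
    ids.length == 1 ? ids.first : nil
  end

  # Work-level ids win; expression-level ids are only a fallback.
  def preferred_ids(item)
    work_ids = item.fetch(:work_authority_ids, [])
    work_ids.any? ? work_ids : item.fetch(:expression_authority_ids, [])
  end
end

items = [
  { work_authority_ids: [7], expression_authority_ids: [9] },
  { work_authority_ids: [],  expression_authority_ids: [7] }
]
p OrphanedCollectionOwner.sole_owner_id(items)
```

Inside a rake task the call site would then read `OrphanedCollectionOwner.sole_owner_id(...)`, with nothing added to `Object`.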

app/chewy/manifestations_index.rb

Copilot · 2026-01-29T18:40:58Z

app/chewy/dict_index.rb

+  index_scope DictionaryEntry.all

  field :id, type: 'integer'
  field :manifestation_id, type: 'integer'
  field :defhead
-  field :deftext, value: ->(entry) {html2txt(entry.deftext).gsub("\n\n\n","\n\n")}
-  field :aliases, type: 'keyword', value: ->(entry){ entry.aliases.map(&:alias) }
+  field :deftext, value: ->(entry) { html2txt(entry.deftext).gsub("\n\n\n", "\n\n") }
+  field :aliases, type: 'keyword', value: ->(entry) { entry.aliases.map(&:alias) }


This file includes formatting changes (spacing in lambdas, line 10 and 15-16) that are not mentioned in the PR description. These should be documented or ideally separated into a style-focused commit.

Copilot · 2026-01-29T18:40:58Z

app/chewy/manifestations_index.rb

+# frozen_string_literal: true
+
 class ManifestationsIndex < Chewy::Index
+  settings index: {
+    number_of_shards: 1,
+    number_of_replicas: 0
+  }


The PR title and description only mention adding Elasticsearch index settings (number_of_shards and number_of_replicas), but this PR actually includes substantial additional changes:

Major refactoring of RefreshUncollectedWorksCollection service with database locking and transaction handling

A new rake task for cleaning up orphaned collections

Extensive test coverage for both

These additional changes are significant and should be documented in the PR description. The service refactoring introduces concurrency control mechanisms (pessimistic locking, transactions) that are not mentioned at all in the PR description, which could be misleading to reviewers.

Copilot · 2026-01-29T18:40:58Z

app/services/refresh_uncollected_works_collection.rb

+      # Save collection first
+      collection.save!
+
+      # Link collection to authority and save - this ensures referential integrity within transaction
+      # If this was a new collection, link it to the authority now
+      if is_new_collection
+        locked_authority.uncollected_works_collection = collection
+        locked_authority.save!


The pessimistic locking strategy used here (Authority.lock.find) will acquire a row-level lock on the authority record. However, the collection record is created and saved before being linked to the authority. This creates a window where:

The collection exists in the database (line 48: collection.save!)

But is not yet linked to the authority (lines 52-54)

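If an exception is raised after line 48 but before line 54 (e.g., in recalculate_manifestations_count!), the collection is left orphaned only if collection.save! is not covered by the same transaction as the authority link; a rollback of a single transaction spanning both steps also undoes the save. To rule out the orphan case, consider either: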

Creating and linking the collection in a single atomic operation

Using a savepoint before collection.save! and rolling back to it if the authority linking fails

Moving the collection.save! to after the authority link is established (though this may have ordering constraints with collection_items)
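
Whether an orphan can actually appear depends on whether the save and the authority link share one transaction, because a rollback undoes every write made inside it. The toy store below only models that all-or-nothing behaviour in memory. It is not ActiveRecord, and all names in it are invented for illustration.

```ruby
# Toy in-memory store; it only models all-or-nothing writes.
class ToyStore
  attr_reader :collections, :authority_links

  def initialize
    @collections = []       # saved collection names
    @authority_links = {}   # authority_id => collection name
  end

  # Restore both "tables" if the block raises, then re-raise.
  def transaction
    saved = [@collections.dup, @authority_links.dup]
    yield self
  rescue StandardError
    @collections, @authority_links = saved
    raise
  end
end

store = ToyStore.new
begin
  store.transaction do |tx|
    tx.collections << 'uncollected-42'   # like collection.save!
    raise 'count recalculation failed'   # failure before the link is made
  end
rescue RuntimeError
  # swallowed for the demo; the store is back to its previous state
end
```

After the failed block, the store holds neither the collection nor a link, which is the outcome a single enclosing database transaction would be expected to give.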

Copilot · 2026-01-29T18:40:58Z

app/services/refresh_uncollected_works_collection.rb

+      # Re-enable automatic updates and manually recalculate the count
+      collection.skip_manifestations_count_update = false
+      collection.recalculate_manifestations_count!


After the collection is saved (line 48), if recalculate_manifestations_count! (line 61) raises an exception, the entire transaction will be rolled back. However, this means that any Elasticsearch updates or other side effects that occurred during collection.save! or during the build operations may leave the system in an inconsistent state. Consider whether Chewy.strategy(:atomic) should be used within this transaction to ensure Elasticsearch updates are deferred until after the transaction commits successfully.
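
The idea of deferring index updates until the database work has succeeded can be pictured as a buffer that flushes only when its block finishes without raising. This is a plain-Ruby illustration with invented names; it is not Chewy's implementation of its update strategies.

```ruby
# Buffers "index this id" requests and flushes them only on success.
class DeferredIndexer
  attr_reader :indexed_ids

  def initialize
    @indexed_ids = []   # stands in for documents sent to Elasticsearch
  end

  def batch
    pending = []
    yield pending
    # Reached only if the block did not raise.
    @indexed_ids.concat(pending.uniq)
  end
end

indexer = DeferredIndexer.new
indexer.batch { |pending| pending << 1 << 2 << 1 }

begin
  indexer.batch do |pending|
    pending << 3
    raise 'database transaction rolled back'
  end
rescue RuntimeError
  # nothing from the failed batch was flushed
end
```

For this to help in the service, the deferring block would need to wrap the whole transaction, so that a rollback discards the pending updates along with the database writes.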

Updates all 6 Chewy Elasticsearch index classes to use:
- number_of_shards: 1
- number_of_replicas: 0

This configuration is appropriate for single-node development environments and reduces resource overhead.

Indexes updated:
- ManifestationsIndex
- AuthoritiesIndex
- CollectionsIndex
- DictIndex
- ManifestationsAutocompleteIndex
- AuthoritiesAutocompleteIndex

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

app/chewy/authorities_autocomplete_index.rb

app/chewy/manifestations_autocomplete_index.rb

app/chewy/dict_index.rb

app/chewy/collections_index.rb

app/chewy/authorities_index.rb

Copilot AI review requested due to automatic review settings January 29, 2026 18:36

Copilot started reviewing on behalf of abartov January 29, 2026 18:36

github-actions bot reviewed Jan 29, 2026

Copilot AI reviewed Jan 29, 2026

abartov force-pushed the feature/elasticsearch-single-shard-no-replicas branch from 1813e17 to 58cfac3 January 29, 2026 18:55

abartov requested a review from Copilot January 29, 2026 18:56

Copilot started reviewing on behalf of abartov January 29, 2026 18:56

Copilot AI reviewed Jan 29, 2026

Commit 2a7ad01: 1 replica instead of 0

abartov merged commit 85e385e into master Jan 29, 2026
4 checks passed

abartov deleted the feature/elasticsearch-single-shard-no-replicas branch January 29, 2026 19:15

abartov merged 2 commits into master from feature/elasticsearch-single-shard-no-replicas
