Impossible to index Elasticsearch docs with no _id field, to leverage performance gains from ES generated IDs #3016

Open
SeanBarry opened this issue Nov 19, 2024 · 0 comments
Labels: enhancement, needs investigation, outputs

SeanBarry commented Nov 19, 2024

Hey there, I'm using Connect to sink documents from Kafka topics to Elasticsearch.
My config looks something like the following:

- switch:
    cases:
      # Messages with a record_id: use it as the document _id
      - check: meta("record_id") != nil
        output:
          elasticsearch:
            urls: ["${ELASTICSEARCH_ADDRESS}"]
            index: ${! meta("index_name") }
            id: ${! meta("record_id") }
            action: "index"
            # ... other config ...
      # Everything else: no id set, so the default id applies
      - output:
          elasticsearch:
            urls: ["${ELASTICSEARCH_ADDRESS}"]
            index: ${! meta("index_name") }
            action: "index"
            # ... other config ...

The purpose here is that when a record_id is present, we want to use that as the document _id in Elasticsearch. This means that the document can be updated as new messages containing updates arrive.

However, not all datasets require updates. For some, each document is seen only once and only ever needs to be indexed. In these cases, we don't want to provide an _id to Elasticsearch at all. This is a best practice, directly recommended in the ES docs:

When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.

In other words, indexing is faster because Elasticsearch does not need to check whether the ID already exists before indexing. Link to ES docs here

The specific issue I see is that when my messages do not contain a record_id field, Connect falls back to the default id, ${!counter()}-${!timestamp_unix()}, which is documented here. So the _id is still generated by Connect rather than by Elasticsearch.

I can't see any way to avoid this. I've tried setting the id explicitly to nil, null and "" but all of these result in a single document in ES with the ID of "null", and all messages overwrite that single document.

My request is to change the default behaviour so that, instead of falling back to ${!counter()}-${!timestamp_unix()} as the default _id, Connect provides no _id and lets ES generate one.
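
To illustrate what this would mean at the bulk API level: as far as I understand it, an `index` action whose metadata line carries no `_id` makes Elasticsearch generate one itself. Below is a rough Python sketch of that idea. It is not Connect's actual code; the helper name `bulk_index_lines`, the index name `events` and the sample document are made up for illustration.

```python
import json


def bulk_index_lines(index, doc, doc_id=None):
    """Build the two NDJSON lines of an Elasticsearch bulk `index` action.

    Hypothetical helper for illustration only. `_id` is left out of the
    action metadata when doc_id is None or an empty string.
    """
    meta = {"_index": index}
    if doc_id:  # None and "" both mean "no explicit id"
        meta["_id"] = str(doc_id)
    return [json.dumps({"index": meta}), json.dumps(doc)]


# Explicit id: the action line carries the _id supplied by the client.
with_id = bulk_index_lines("events", {"msg": "hello"}, doc_id="abc-123")

# No id: the action line has no _id, leaving ID generation to Elasticsearch.
without_id = bulk_index_lines("events", {"msg": "hello"})

print("\n".join(with_id))
print("\n".join(without_id))
```

The point of the sketch is only that an empty or missing id maps to "omit `_id`" rather than to a client-generated fallback or the literal string "null".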

Thanks

mihaitodor added the enhancement, outputs, and needs investigation labels on Nov 19, 2024