Impossible to index Elasticsearch docs with no _id field, to leverage performance gains from ES generated IDs #3016

SeanBarry · 2024-11-19T15:42:50Z

Hey there, I'm using Connect to sink documents from Kafka topics to Elasticsearch.
I've got a config that looks something like the following:

- switch:
    cases:
      # Records with a record_id: use it as the document _id so updates overwrite it.
      - check: meta("record_id") != nil
        output:
          elasticsearch:
            urls: ["${ELASTICSEARCH_ADDRESS}"]
            index: ${! meta("index_name") }
            id: ${! meta("record_id") }
            action: "index"
            # ... other config ...
      # Records without a record_id: no id is set here.
      - output:
          elasticsearch:
            urls: ["${ELASTICSEARCH_ADDRESS}"]
            index: ${! meta("index_name") }
            action: "index"
            # ... other config ...

The purpose here is that when a record_id is present, we want to use that as the document _id in Elasticsearch. This means that the document can be updated as new messages containing updates arrive.

However, not all datasets require updates. For some, each document is only seen once and needs to only ever be indexed. In these cases, we don't want to provide an _id to Elasticsearch. This is a best practice, directly recommended in the ES docs:

"When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster."

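In short, auto-generated IDs speed up indexing because Elasticsearch doesn't need to check whether the ID already exists before indexing. This comes from the "Use auto-generated ids" section of the ES docs on tuning for indexing speed.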

The specific issue I see is that when my messages do not contain a record_id field, Connect falls back to the default id of ${!counter()}-${!timestamp_unix()}, as documented for the elasticsearch output. So the _id is still being generated by Connect rather than by ES.

I can't see any way to avoid this. I've tried setting the id explicitly to nil, null and "" but all of these result in a single document in ES with the ID of "null", and all messages overwrite that single document.

My request is to change the default behaviour so that, instead of falling back to ${!counter()}-${!timestamp_unix()} as a default _id, Connect provides no _id at all and allows ES to generate one.
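
To make the requested behaviour concrete, here is a minimal Python sketch of the two lines an Elasticsearch _bulk "index" action consists of, with the _id key left out when no record_id is available. This is not Connect's code. The helper name `bulk_index_lines`, the index names and the sample documents are all hypothetical and only illustrate the shape of the request.

```python
import json


def bulk_index_lines(index_name, doc, record_id=None):
    """Build the action line and source line of a _bulk "index" operation.

    When record_id is None, the "_id" key is omitted entirely, which is
    what lets Elasticsearch auto-generate the document ID.
    """
    action = {"_index": index_name}
    if record_id is not None:
        # Explicit ID: later messages with the same ID replace this document.
        action["_id"] = record_id
    return [json.dumps({"index": action}), json.dumps(doc)]


# Hypothetical sample data, for illustration only.
with_id = bulk_index_lines("orders", {"total": 10}, record_id="abc-123")
without_id = bulk_index_lines("events", {"kind": "click"})

print("\n".join(with_id))
print("\n".join(without_id))
```

The point is that the second action line carries no _id key at all, as opposed to an _id of null or "", which is what ended up producing the single document with the ID "null".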

Thanks

mihaitodor added the "enhancement", "outputs" and "needs investigation" labels on Nov 19, 2024
