Impossible to index Elasticsearch docs with no _id field, to leverage performance gains from ES generated IDs #3016
Labels: enhancement, needs investigation, outputs
Hey there, I'm using Connect to sink documents from Kafka topics to Elasticsearch.
I've got a config that looks something like the following:
The purpose here is that when a `record_id` is present, we want to use that as the document `_id` in Elasticsearch. This means that the document can be updated as new messages containing updates arrive.

However, not all datasets require updates. For some, each document is only seen once and only ever needs to be indexed. In these cases, we don't want to provide an `_id` to Elasticsearch. This is a best practice, directly recommended in the ES docs: it speeds up indexing because Elasticsearch does not need to check whether the ID already exists before indexing. Link to ES docs here
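
To make the two cases concrete, here is a minimal sketch of what the action lines in an Elasticsearch `_bulk` request body look like with and without an `_id`. It is not how Connect builds its requests; the helper name `bulk_index_lines` and the index name `events` are hypothetical and only serve the illustration. The assumption is that an empty or missing ID should mean "omit `_id` and let Elasticsearch generate one".

```python
import json


def bulk_index_lines(index, doc, doc_id=None):
    """Return the two NDJSON lines for one `index` action in a _bulk body.

    Hypothetical helper for illustration only, not part of Connect.
    """
    meta = {"_index": index}
    # None or "" means: leave `_id` out so Elasticsearch auto-generates it.
    if doc_id:
        meta["_id"] = str(doc_id)
    return [json.dumps({"index": meta}), json.dumps(doc)]


# A message with a record_id: the document can later be updated by ID.
with_id = bulk_index_lines("events", {"record_id": "abc", "v": 1}, doc_id="abc")

# A message without a record_id: no `_id` is sent at all.
without_id = bulk_index_lines("events", {"v": 2})

# A _bulk body is newline-delimited JSON and must end with a newline.
body = "\n".join(with_id + without_id) + "\n"
print(body)
```

The second action line carries only `_index`, which is the shape that would let Elasticsearch skip the existing-ID lookup described above.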
The specific issue I see is that when my messages do not contain a `record_id` field, Connect falls back to the default `${!counter()}-${!timestamp_unix()}`, which is documented here. This `_id` is being generated by Connect.

I can't see any way to avoid this. I've tried setting the `id` explicitly to `nil`, `null` and `""`, but all of these result in a single document in ES with the ID of "null", and all messages overwrite that single document.

My request is to change the default behaviour such that instead of falling back to `${!counter()}-${!timestamp_unix()}` as a default `_id`, Connect instead by default provides no `_id` and allows ES to generate one?

Thanks