
Conversation

@mieciu (Member) commented on Jun 5, 2025

The challenge here is to fix the "surrounding documents" view in Kibana given the new unique ID (_id field) implementation (ref: #1435). See the Related screenshots section.

At the moment Kibana just says "No documents newer/older than the anchor could be found".

Present situation

Our current implementation of the _id field is based on rendering this value dynamically, after fetching all the data from ClickHouse. It looks like this:

{hex-encoded timestamp field}qqq{hex-encoded hash of the document}

While at the query parsing/execution phase we can of course access the timestamp field, the "hash of the document" part is computed during JSON response rendering.
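
For illustration only, here is a minimal sketch of how such a composite _id could be put together - the timestamp rendering and the FNV hash below are assumptions, not necessarily what Quesma actually uses:

```go
// Illustrative sketch only: buildID mimics the described format
// {hex(timestamp)}qqq{hex(hash)}. The hash function (FNV-1a) and the
// timestamp string format are stand-ins, not Quesma's actual choices.
package main

import (
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"time"
)

func buildID(timestamp time.Time, doc map[string]any) string {
	// hex-encode the timestamp rendered as a string
	tsPart := hex.EncodeToString([]byte(timestamp.UTC().String()))

	// hex-encode a hash computed over the whole document
	h := fnv.New64a()
	fmt.Fprintf(h, "%v", doc)
	hashPart := hex.EncodeToString(h.Sum(nil))

	return tsPart + "qqq" + hashPart
}

func main() {
	fmt.Println(buildID(time.Now(), map[string]any{"some_field": "some_value"}))
}
```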

The current implementation stores a list of IDs in ClickhouseQueryTranslator (UniqueIDs) - therefore we know that the query used the _id field and that extra logic has to be applied during JSON response rendering. The situation is quite clear for a simple filtering query:

    "query": {
        "bool": {
            "filter": [
                {
                    "ids": {
                        "values": [
                       "323032352d30362d30352031323a33303a33322e323036202b3030303020555443qqq4qqq3634363633353337363233323631333533333632363333353337363533303339363333333336363133393633333836323332333033393337363533323634333336333333363236343331333036343331333636313332333033303334363233303338333236343330363433333334333433333336333236353633333836343336"
                        ]
                    }
                },
                {
                    "term": {
                        "_index": "kibana_sample_data_flights"
                    }
                }
            ]
        }
    },

When parsing the SQL we simply take the first (timestamp) part of the _id value, build the relevant WHERE clause which filters out all the non-matching timestamps, and then compare the document hashes during JSON response rendering (see platform/parsers/elastic_query_dsl/query_translator.go). Of course we have to make sure that we don't fall back to the default LIMIT 10 in our SQL clause, because we might not have enough documents left to filter from (see (cw *ClickhouseQueryTranslator) parseSize).
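
For illustration, a minimal sketch of that decode step (decodeID and the "@timestamp" column name are assumptions, not the actual Quesma code):

```go
// Hypothetical helper, not the actual Quesma code: split the _id on the "qqq"
// separator, hex-decode the timestamp half into a WHERE condition, and keep
// the hash half for comparison while rendering the JSON response.
package ids

import (
	"encoding/hex"
	"fmt"
	"strings"
)

func decodeID(id string) (whereClause string, docHash string, err error) {
	parts := strings.SplitN(id, "qqq", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("unexpected _id format: %q", id)
	}
	tsBytes, err := hex.DecodeString(parts[0])
	if err != nil {
		return "", "", err
	}
	// only the timestamp condition can be pushed into SQL; the hash part is
	// usable solely in the post-query (rendering) phase
	whereClause = fmt.Sprintf(`"@timestamp" = '%s'`, string(tsBytes))
	return whereClause, parts[1], nil
}
```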

Problem

When fetching "surrounding documents", Kibana sends the following query:

    "query": {
        "bool": {
            "filter": [],
            "must": [
                {
                    "bool": {
                        "must": {
                            "constant_score": {
                                "filter": {
                                    "range": {
                                        "@timestamp": {
                                            "format": "strict_date_optional_time",
                                            "gte": "2025-06-04T12:30:32.206Z",
                                            "lte": "2025-06-05T12:30:32.206Z"
                                        }
                                    }
                                }
                            }
                        },
                        "must_not": {
                            "ids": {
                                "values": [
                                    "323032352d30362d30352031323a33303a33322e323036202b3030303020555443qqq4qqq3634363633353337363233323631333533333632363333353337363533303339363333333336363133393633333836323332333033393337363533323634333336333333363236343331333036343331333636313332333033303334363233303338333236343330363433333334333433333336333236353633333836343336"
                                ]
                            }
                        }
                    }
                }
            ],
            "must_not": [],
            "should": []
        }
    },

And the must_not query becomes quite problematic. It's pretty obvious that when fetching the next/previous N documents, we don't want to include the anchor document itself.
So at the SQL level we cannot simply filter out all the documents with a matching timestamp, because the next document might have exactly the same timestamp. One possible approach here is to add a schema transformer for this case.
During response rendering we cannot rely on the current logic either, which keeps only the hits with matching ids, because here we want the opposite effect. However, that is the post-query phase and at that level we're unaware of the query - we just have the resulting hits, and we don't know whether ids appeared within must_not (or any other logical clause).
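
One way to carry that missing piece of information from parsing to rendering could look roughly like this (the type and field names are hypothetical, not the actual ClickhouseQueryTranslator internals):

```go
// Hypothetical bookkeeping, not the actual ClickhouseQueryTranslator fields:
// besides the hash parts extracted from the _id values, remember whether the
// ids clause was negated (i.e. appeared under must_not).
package translator

type idFilter struct {
	hashes  map[string]bool // doc-hash parts extracted from the _id values
	negated bool            // true when "ids" appeared under must_not
}

// keepHit decides at JSON-rendering time whether a hit with the given
// recomputed document hash should stay in the response.
func (f idFilter) keepHit(docHash string) bool {
	matched := f.hashes[docHash]
	if f.negated {
		// surrounding-documents case: drop only the anchor document itself
		return !matched
	}
	// plain ids filter: keep only the matching documents
	return matched
}
```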

There are a few gotchas here:

  • the size passed in the query lands in the SQL LIMIT clause - we have to ignore it
  • relying on _id in any aggregation might produce completely absurd results
  • .... ?

Possible solution

TODO: check this PR

Related screenshots

[screenshots: Kibana "surrounding documents" view]

@mieciu mentioned this pull request on Jun 5, 2025
@mieciu changed the title from '🚧 _id fixing "surrounding documents" view with new _id implementation' to '🚧 [WIP] Fixing "surrounding documents" view with new _id implementation' on Jun 5, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jun 6, 2025
An attempt to introduce unique object IDs in the ClickHouse realm.

A returned document id (`_id` field) would have the following syntax:
```
{hex-encoded timestamp field}qqq{hex-encoded hash of the document}
```
Of course, the `{hex-encoded hash of the document}` lives only in Quesma
memory:
1. It needs to be re-calculated when returning search hits
2. Quesma needs to filter based on it when rendering hits

Therefore, when fetching a document with a specific `_id`, we can filter
documents down to those with the matching timestamp and then pick the one
whose fields (and thus hash) match, returning the right entry.
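
A minimal sketch of that second, in-memory step (the hit representation and the `hashDoc` helper are assumptions for illustration):

```go
// Hypothetical second step, for illustration: keep only the hits whose
// recomputed hash matches the hash part of the requested _id. hashDoc stands
// in for whatever hashing Quesma applies when rendering search hits.
package hits

func filterByHash(allHits []map[string]any, wantHash string, hashDoc func(map[string]any) string) []map[string]any {
	var matching []map[string]any
	for _, hit := range allHits {
		if hashDoc(hit) == wantHash {
			matching = append(matching, hit)
		}
	}
	return matching
}
```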


This fixes the issue where the JSON view of a single document could return a
random object from the search hits, not necessarily the one that was clicked.
![image](https://github.com/user-attachments/assets/0b808dba-0255-4be0-ac5f-acb7a21569ba)

**However** (and it's a pretty big however), the "surrounding documents"
view **cannot return the surrounding documents**. While the query
doesn't error, it also doesn't return any documents. The experiment to
get that working [is carried out in a separate
PR](#1446), although it's not
clear yet whether this way of doing things is going to guarantee 100%
correctness.