
Conversation

@mieciu (Member) commented on Jun 5, 2025

The challenge here is to fix the "surrounding documents" view in Kibana given the new unique ID (_id field) implementation (ref: #1435). See the Related screenshots section.

At the moment Kibana just says "No documents newer/older than the anchor could be found".

Present situation

Our current implementation of the _id field is based on rendering this value dynamically, after fetching all the data from ClickHouse. It looks like this:

{hex-encoded timestamp field}qqq{hex-encoded hash of the document}

While at the query parsing/execution phase we can of course access the timestamp field, the "hash of the document" part is computed during JSON response rendering.
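
For illustration only, here is a minimal sketch of how such a composite _id could be put together - the timestamp rendering and the FNV hash below are assumptions, not necessarily what Quesma actually uses:

```go
// Illustrative sketch only: buildID mimics the described format
// {hex(timestamp)}qqq{hex(hash)}. The hash function (FNV-1a) and the
// timestamp string format are stand-ins, not Quesma's actual choices.
package main

import (
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"time"
)

func buildID(timestamp time.Time, doc map[string]any) string {
	// hex-encode the timestamp rendered as a string
	tsPart := hex.EncodeToString([]byte(timestamp.UTC().String()))

	// hex-encode a hash computed over the whole document
	h := fnv.New64a()
	fmt.Fprintf(h, "%v", doc)
	hashPart := hex.EncodeToString(h.Sum(nil))

	return tsPart + "qqq" + hashPart
}

func main() {
	fmt.Println(buildID(time.Now(), map[string]any{"some_field": "some_value"}))
}
```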

The current implementation stores a list of IDs in ClickhouseQueryTranslator (UniqueIDs) - therefore we know that the query used the _id field and that extra logic has to be applied during JSON response rendering. The situation is quite clear for a simple filtering query:

    "query": {
        "bool": {
            "filter": [
                {
                    "ids": {
                        "values": [
                       "323032352d30362d30352031323a33303a33322e323036202b3030303020555443qqq4qqq3634363633353337363233323631333533333632363333353337363533303339363333333336363133393633333836323332333033393337363533323634333336333333363236343331333036343331333636313332333033303334363233303338333236343330363433333334333433333336333236353633333836343336"
                        ]
                    }
                },
                {
                    "term": {
                        "_index": "kibana_sample_data_flights"
                    }
                }
            ]
        }
    },

When parsing the SQL we simply take the first (timestamp) part of the _id value, build the relevant WHERE clause which filters out all the non-matching timestamps, and then compare the document hashes during JSON response rendering (see platform/parsers/elastic_query_dsl/query_translator.go). Of course we have to make sure that we don't fall back to the default LIMIT 10 in our SQL clause, because we might not have enough documents left to filter from (see (cw *ClickhouseQueryTranslator) parseSize).
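
For illustration, a minimal sketch of that decode step (decodeID and the "@timestamp" column name are assumptions, not the actual Quesma code):

```go
// Hypothetical helper, not the actual Quesma code: split the _id on the "qqq"
// separator, hex-decode the timestamp half into a WHERE condition, and keep
// the hash half for comparison while rendering the JSON response.
package ids

import (
	"encoding/hex"
	"fmt"
	"strings"
)

func decodeID(id string) (whereClause string, docHash string, err error) {
	parts := strings.SplitN(id, "qqq", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("unexpected _id format: %q", id)
	}
	tsBytes, err := hex.DecodeString(parts[0])
	if err != nil {
		return "", "", err
	}
	// only the timestamp condition can be pushed into SQL; the hash part is
	// usable solely in the post-query (rendering) phase
	whereClause = fmt.Sprintf(`"@timestamp" = '%s'`, string(tsBytes))
	return whereClause, parts[1], nil
}
```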

Problem

When fetching "surrounding documents", Kibana sends the following query:

    "query": {
        "bool": {
            "filter": [],
            "must": [
                {
                    "bool": {
                        "must": {
                            "constant_score": {
                                "filter": {
                                    "range": {
                                        "@timestamp": {
                                            "format": "strict_date_optional_time",
                                            "gte": "2025-06-04T12:30:32.206Z",
                                            "lte": "2025-06-05T12:30:32.206Z"
                                        }
                                    }
                                }
                            }
                        },
                        "must_not": {
                            "ids": {
                                "values": [
                                    "323032352d30362d30352031323a33303a33322e323036202b3030303020555443qqq4qqq3634363633353337363233323631333533333632363333353337363533303339363333333336363133393633333836323332333033393337363533323634333336333333363236343331333036343331333636313332333033303334363233303338333236343330363433333334333433333336333236353633333836343336"
                                ]
                            }
                        }
                    }
                }
            ],
            "must_not": [],
            "should": []
        }
    },

And the must_not query becomes quite problematic. It's pretty obvious that when fetching the next/previous N documents, we don't want to include the anchor document itself.
So at the SQL level we cannot simply filter out all the documents with a matching timestamp, because the next document might have exactly the same timestamp. One possible approach here is to add a schema transformer for this case.
During response rendering we cannot rely on the current logic either, which keeps only the hits with matching ids, because here we want the opposite effect. However, that is the post-query phase and at that level we're unaware of the query - we just have the resulting hits, and we don't know whether ids appeared within must_not (or any other logical clause).
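
One way to carry that missing piece of information from parsing to rendering could look roughly like this (the type and field names are hypothetical, not the actual ClickhouseQueryTranslator internals):

```go
// Hypothetical bookkeeping, not the actual ClickhouseQueryTranslator fields:
// besides the hash parts extracted from the _id values, remember whether the
// ids clause was negated (i.e. appeared under must_not).
package translator

type idFilter struct {
	hashes  map[string]bool // doc-hash parts extracted from the _id values
	negated bool            // true when "ids" appeared under must_not
}

// keepHit decides at JSON-rendering time whether a hit with the given
// recomputed document hash should stay in the response.
func (f idFilter) keepHit(docHash string) bool {
	matched := f.hashes[docHash]
	if f.negated {
		// surrounding-documents case: drop only the anchor document itself
		return !matched
	}
	// plain ids filter: keep only the matching documents
	return matched
}
```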

There are a few gotchas here:

  • the size passed in the query lands in the SQL LIMIT clause - we have to ignore it
  • relying on _id in any aggregation might produce completely absurd results
  • .... ?

Possible solution

TODO: check this PR

Related screenshots

[screenshots: Kibana "surrounding documents" view]

@mieciu mentioned this pull request on Jun 5, 2025
@mieciu changed the title from '🚧 _id fixing "surrounding documents" view with new _id implementation' to '🚧 [WIP] Fixing "surrounding documents" view with new _id implementation' on Jun 5, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jun 6, 2025
An attempt to introduce unique object IDs in the ClickHouse realm.

A returned document id (`_id` field) would have the following syntax:
```
{hex-encoded timestamp field}qqq{hex-encoded hash of the document}
```
Of course, the `{hex-encoded hash of the document}` lives only in Quesma
memory:
1. It needs to be re-calculated when returning search hits
2. Quesma needs to filter based on it when rendering hits

Therefore, when fetching a document with a specific `_id`, we can filter
documents down to those with the matching timestamp and then pick the one
whose fields (and thus hash) match, returning the right entry.
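
A minimal sketch of that second, in-memory step (the hit representation and the `hashDoc` helper are assumptions for illustration):

```go
// Hypothetical second step, for illustration: keep only the hits whose
// recomputed hash matches the hash part of the requested _id. hashDoc stands
// in for whatever hashing Quesma applies when rendering search hits.
package hits

func filterByHash(allHits []map[string]any, wantHash string, hashDoc func(map[string]any) string) []map[string]any {
	var matching []map[string]any
	for _, hit := range allHits {
		if hashDoc(hit) == wantHash {
			matching = append(matching, hit)
		}
	}
	return matching
}
```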


This fixes the issue where the JSON view of a single document could return a
random object from the search hits, not necessarily the one that was clicked.
![image](https://github.com/user-attachments/assets/0b808dba-0255-4be0-ac5f-acb7a21569ba)

**However** (and it's a pretty big however), the "surrounding documents"
view **cannot return the surrounding documents**. While the query
doesn't error, it also doesn't return any documents. The experiment to
get that working [is carried out in a separate
PR](#1446), although it's not
clear yet whether this way of doing things is going to guarantee 100%
correctness.