sort symbols in order of frequency rather than lexicographically #280
base: master
Thanks for working on this! Currently we appear to have some severe bugs in Prometheus 2.1 tied to the storage. Thus, I'd suggest we freeze any new features to prometheus/tsdb until things go back to being stable. So this PR will probably be on hold for a bit.
@fabxc no problem! I can pick this (#249) back up when we're in a more stable state. Is there anything I can help with regarding the bugs in 2.1?
@fabxc will the 2.2.0 release unblock this?
Rebased off master and fixed a merge conflict that I'd missed when I last pushed.
LGTM. I'll run it locally for a while and report back.
head.go (outdated)
  	for s := range h.head.symbols {
- 		res[s] = struct{}{}
+ 		res[s] = 0
Why is this 0? It should be:
    for s, num := range h.head.symbols {
        res[s] = num
    }
If I had a reason, I don't remember at this point :)
Changed the value assigned to the symbols. Let me know if there's anything I can do to help test this :)
@gouthamve @fabxc are these changes still relevant?
Shouldn't you sort the symbols in the index writer, before writing them to disk? https://github.com/prometheus/tsdb/blob/c848349f07c83bd38d5d19faa5ea71c7fd8923ea/index/index.go#L343
@krasi-georgiev I'll have to double check, I haven't looked at this in a while.
@krasi-georgiev is that not what is happening here: https://github.com/prometheus/tsdb/pull/280/files#diff-71ebe2bcf31a915b1fa3b3b289d5d31dR354 ?
Rebased off master to fix the conflict in head.go.
Failing tests.
Can you measure how much this change saves?
Also, I don't see any test that ensures symbols are saved ordered by frequency.
Something like:
- Add some symbols
- Save index
- Read index
- Check symbols order.
Signed-off-by: Callum Styan <[email protected]>
Added a test for the sorting of symbols. I'm not sure if we want to get back the frequency numbers when we read the symbols back out of the table?
I don't see a reason to expose that in the API.
Failing tests also.
Signed-off-by: Callum Styan <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
Yes, I'll have a look at that, I guess by including a benchmark test? But if you read #249, the goal is to reduce the size of the index file.
I'll have to double check. When I was reading how the index and block reader are used to get the symbols, I saw that when compaction happens we read the current symbols, determine which are still in use, and then write those to a new index. In that case we would want the frequencies to persist across that write. But I've probably just misread what's happening.
Yes, this is what I meant: how much is the index file size reduced by this change?
@cstyan would you have time to continue with this?
@krasi-georgiev yeah, I should have some time next week. If you want to try something before then, feel free.
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Updated to the latest master and resolved the conflicts. Now I will run some tests locally to compare the index file savings with this change.
Using the following test I don't see any difference in the index file size. The index file size with or without the changes in this PR is 180 MB.
This generates random series, so it should generate enough churn.
ping @cstyan
For #249.
As we keep track of symbols in head, we now also keep track of how many times we've seen each symbol. When we write the symbols in IndexWriter, we sort them in order of frequency seen before writing them.