Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-33966: buf_page_make_young() is a contention point #3522

Open
wants to merge 3 commits into
base: 10.6
Choose a base branch
from

Conversation

dr-m
Copy link
Contributor

@dr-m dr-m commented Sep 17, 2024

  • The Jira issue number for this PR is: MDEV-33966

Description

The buf_pool.LRU list needs to reasonably accurately reflect recently accessed blocks, so that they will not be evicted prematurely. Because the list is protected by buf_pool.mutex, it is not a good idea to maintain the position on every page access.

Instead of maintaining the LRU position on each access, we will set a "recently accessed" flag in the block descriptor. The buf_flush_page_cleaner() thread as well as some traversal of the buf_pool.LRU list will test and reset this flag on each visited block. If the flag was set, the position in the buf_pool.LRU list may be adjusted.

As a result of this, the fields buf_pool_t::freed_page_clock and buf_page_t::freed_page_clock will lose their purpose and will be removed.

The field buf_page_t::access_time will store uint16_t(time(nullptr)) instead of storing uint32_t(ut_time_ms()). So, instead of wrapping around in 49.7 days, the counter will wrap around in 18.2 hours. In the worst case, the wrap-around could cause a rather recently used block to be wrongly evicted, but there are some remedies against this in buf_page_t::make_young().

Also the 64-bit field buf_block_t::modify_clock will be moved to 32+16 bits in buf_page_t so that it can be stored within the same 64 bits with buf_page_t::access_time. This the removal of freed_page_clock lead to a reduction of sizeof(buf_page_t).

Release Notes

freed_page_clock will always be reported as 0 in the INFORMATION_SCHEMA.INNODB_ tables.

The granularity of innodb_old_blocks_time will be reduced from milliseconds to seconds.

How can this PR be tested?

With a rather extreme case (Sysbench oltp_update_index with 1 thread, 1 table, 100k rows, 5MiB buffer pool, 240 seconds run time), I was able to observe a significant improvement:

revision average throughput
231900e 2843.39 tps
c8f4726 3312.46 tps

In some multi-threaded runs, I observed a slight to significant performance regression. So, more testing will be needed to determine if this is a good or bad thing. It is possible that the approximated LRU replacement policy is now totally incorrect. If that turns out to be the case based on more testing, we’d need to find a small and short test workload that shows a significant regression.

I checked that on my GNU/Linux system, time(nullptr) will read the time stamp from a memory location, not invoke any system call.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This bug affects 10.5 as well, but there are significant performance differences between 10.5 and 10.6. Therefore we target 10.6.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m self-assigned this Sep 17, 2024
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@dr-m dr-m force-pushed the 10.6-MDEV-33966 branch 8 times, most recently from e5d11b3 to f14f819 Compare September 24, 2024 14:34
Instead of repurposing buf_page_t::access_time for state()==MEMORY
blocks that are part of recv_sys.pages, let us define an anonymous
union around buf_page_t::hash.  In this way, we will be able to
declare access_time private.
buf_block_alloc(): Define as an alias in buf0lru.h, which defines
the underlying buf_LRU_get_free_block().

buf_block_free(): Define as an alias of the non-inline function
buf_pool.free_block(block).
The buf_pool.LRU list needs to reasonably accurately reflect recently
accessed blocks, so that they will not be evicted prematurely.
Because the list is protected by buf_pool.mutex, it is not a good idea
to maintain the position on every page access.

Instead of maintaining the LRU position on each access, we will
set a "recently accessed" flag in the block descriptor. The
buf_flush_page_cleaner() thread as well as some traversal of the
buf_pool.LRU list will test and reset this flag on each visited
block. If the flag was set, the position in the buf_pool.LRU list
may be adjusted.

buf_pool_t::freed_page_clock, buf_page_t::freed_page_clock: Remove.
This is no longer meaningful in the revised design.

page_zip_des_t::state: An atomic 16-bit field that will include
the "accessed" and "old" flags that would more logically belong
to buf_page_t. We maintain them here (along with some
ROW_FORMAT=COMPRESSED specific state that is protected by page
latches) in order to avoid race conditions and unnecessary
memory overhead.

buf_block_t::init(): Replaces buf_block_init().

buf_page_t::invalidate(): Replaces buf_block_modify_clock_inc().
Instead of maintaining a 64-bit counter, we will maintain one
comprising 32+16=48 bits, in modify_clock_low,modify_clock_high.
Worst case there will be exactly n<<48 calls to
buf_page_t::invalidate() before some operation such as
btr_pcur_t::restore_position() is executed. Such a count should
be extremely unlikely but not completely impossible. It is worth
noting that the DB_TRX_ID is only 48 bits, and each transaction
start and commit/rollback will consume an identifier.

buf_page_t::modify_clock(): Replaces the read access of
buf_page_t::modify_clock. Assert that the caller is holding
a page latch. Note: because invalidate() and modify_clock() are
protected with buf_pool.mutex or the buf_page_t::lock, there can
be no issue with regard to the atomicity of accessing the 48-bit field.

buf_page_t::access_time: Store uint16_t(time(nullptr)). Yes, it will
wrap around every 18.2 hours, and in the worst case, a rather recently
accessed block may end up being evicted as "least recently used".
But in this way we will avoid any alignment loss: the adjacent fields
modify_clock_low, modify_clock_high, access_time of 32+16+16 bits
will nicely add up to 64 bits.

buf_page_t::make_young(): Clear the "recently accessed" flag of a block
and move the block to the "not recently used" end of buf_pool.LRU
if it qualifies for that. When we make a block young, we zero out
access_time so that the comparison is more likely to hold on a
subsequent invocation, or so that flag_accessed() will update the
current access time, which due to wrap-around could look like less
than the previously assigned access_time.

buf_LRU_scan_and_free_block(): Declare static.

buf_pool_invalidate(): Define in the same compilation unit with
buf_LRU_scan_and_free_block().

buf_pool.LRU_old_time_threshold: Replaces buf_LRU_old_threshold_ms.

PageConverter::run(): Renamed from fil_iterate(). In debug builds,
acquire a dummy exclusive latch on the block, so that the assertion
in buf_block_t::modify() will be satisfied.

ut_time_ms(): Remove. Invoking my_hrtime_coarse() is sufficient for the
remaining purposes.
@dr-m dr-m marked this pull request as ready for review September 25, 2024 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants