Change VERIFY BuildRequest to KILL_TABLET #7901

Open
kruall opened this issue Aug 16, 2024 · 3 comments · May be fixed by #13766

Comments

@kruall
Collaborator

kruall commented Aug 16, 2024

VERIFY failed (2024-08-16T12:53:32.890340+0300): action_id=34ea0688-5bb511ef-9d455037-e1d78fd8;tablet_id=72075186224039667;verification=++it->second.RequestsCount < 10;fline=gc.cpp:62;event=build_gc_request;address=g=2181038085;c=2;;current_gen=1;gen=1:0;count=10;
  ydb/library/actors/core/log.cpp:744
  ~TVerifyFormattedRecordWriter(): requirement false f
0. /home/kruall/rya-wd/ydb/util/system/yassert.cpp:83: NPrivate::InternalPanicImpl(int, char const*, char const*, int, int, int, TBasicStringBuf<char, std::__y1::char_traits<char>>, char const*, unsigned long) @ 0xDABE3EC
1. /home/kruall/rya-wd/ydb/util/system/yassert.cpp:55: NPrivate::Panic(NPrivate::TStaticBuf const&, int, char const*, char const*, char const*, ...) @ 0xDAB95EB
2. /home/kruall/rya-wd/ydb/ydb/library/actors/core/log.cpp:744: NActors::TVerifyFormattedRecordWriter::~TVerifyFormattedRecordWriter() @ 0xE60E0A4
3. /home/kruall/rya-wd/ydb/ydb/core/tx/columnshard/blobs_action/bs/gc.cpp:62: NKikimr::NOlap::NBlobOperations::NBlobStorage::TGCTask::BuildRequest(NKikimr::NOlap::NBlobOperations::NBlobStorage::TBlobAddress const&) const @ 0x14F01BCA
4. /home/kruall/rya-wd/ydb/ydb/core/tx/columnshard/blobs_action/bs/gc_actor.cpp:17: NKikimr::NOlap::NBlobOperations::NBlobStorage::TGarbageCollectionActor::Handle(TAutoPtr<NActors::TEventHandle<NKikimr::TEvBlobStorage::TEvCollectGarbageResult>, TDelete>&) @ 0x14F13366
5. /home/kruall/rya-wd/ydb/ydb/core/tx/columnshard/blobs_action/bs/gc_actor.h:34: NKikimr::NOlap::NBlobOperations::NBlobStorage::TGarbageCollectionActor::StateWork(TAutoPtr<NActors::IEventHandle, TDelete>&) @ 0x14EFEAD9
6. /home/kruall/rya-wd/ydb/ydb/library/actors/core/executor_thread.cpp:251: NActors::TGenericExecutorThread::TProcessingResult NActors::TGenericExecutorThread::Execute<NActors::TMailboxTable::THTSwapMailbox>(NActors::TMailboxTable::THTSwapMailbox*, unsigned int, bool) @ 0xE5F12B3
7. /home/kruall/rya-wd/ydb/ydb/library/actors/core/executor_thread.cpp:439: NActors::TGenericExecutorThread::ProcessExecutorPool(NActors::IExecutorPool*)::$_0::operator()(unsigned int, bool) const @ 0xE5E9741
8. /home/kruall/rya-wd/ydb/ydb/library/actors/core/executor_thread.cpp:492: NActors::TGenericExecutorThread::ProcessExecutorPool(NActors::IExecutorPool*) @ 0xE5E9139
9. /home/kruall/rya-wd/ydb/ydb/library/actors/core/executor_thread.cpp:523: NActors::TExecutorThread::ThreadProc() @ 0xE5E9ED6
10. /home/kruall/rya-wd/ydb/util/system/thread.cpp:244: (anonymous namespace)::TPosixThread::ThreadProxy(void*) @ 0xDAC2629
11. ??:0: ?? @ 0x7FEAAF149608
12. ??:0: ?? @ 0x7FEAAF069352

https://github.com/ydb-platform/ydb/blob/main/ydb/core/tx/columnshard/blobs_action/bs/gc.cpp#L62

Currently, when blob storage has problems, this VERIFY fires and the entire process crashes. We should terminate only the affected tablet instead.

If blob storage is already struggling, killing the node and restarting every tablet on it may make the situation worse.
It is better to limit the blast radius to a single tablet.
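
As a rough illustration of the proposed behavior (not the actual columnshard code), the sketch below counts garbage collection requests per address and, once the limit is reached, marks only the owning tablet as dead instead of aborting the process. `TAddress`, `TGCRetryTracker`, `TTablet`, and `OnCollectGarbageFailed` are hypothetical names made up for this example; the limit of 10 is taken from the `++it->second.RequestsCount < 10` check in the VERIFY message above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical stand-in for the blob address (group, channel) seen in the log.
struct TAddress {
    uint32_t Group;
    uint32_t Channel;
    bool operator<(const TAddress& other) const {
        return std::tie(Group, Channel) < std::tie(other.Group, other.Channel);
    }
};

enum class EBuildResult { Ok, LimitExceeded };

// Hypothetical per-address request counter. It applies the same condition as
// the failing check, but reports the violation to the caller instead of
// aborting the whole process.
class TGCRetryTracker {
public:
    explicit TGCRetryTracker(uint32_t limit) : Limit(limit) {
        assert(limit > 0);
    }

    EBuildResult RegisterRequest(const TAddress& address) {
        uint32_t& count = Counts[address];  // starts at 0 for a new address
        if (++count >= Limit) {
            return EBuildResult::LimitExceeded;
        }
        return EBuildResult::Ok;
    }

private:
    uint32_t Limit;
    std::map<TAddress, uint32_t> Counts;
};

// Hypothetical tablet. The Alive flag stands in for whatever mechanism
// actually stops a single tablet; that mechanism is an assumption here.
struct TTablet {
    uint64_t Id;
    bool Alive = true;
    TGCRetryTracker Tracker;

    explicit TTablet(uint64_t id) : Id(id), Tracker(10) {}

    void OnCollectGarbageFailed(const TAddress& address) {
        if (!Alive) {
            return;
        }
        if (Tracker.RegisterRequest(address) == EBuildResult::LimitExceeded) {
            Alive = false;  // only this tablet stops; others keep running
        }
    }
};
```

With this shape, ten failed requests for the same address stop one tablet, while other tablets on the same node are unaffected.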

@maximyurchuk
Collaborator

maximyurchuk commented Jan 14, 2025

I'm seeing the same problem:

maxim-yurchuk@vla5-2570:/Berkanavt/kikimr_31003/logs$ cat kikimr.start.13.gz | gzip --d
Jan 14 00:05:54 vla5-2570 kikimr_31003[3869885]: VERIFY failed (2025-01-14T00:05:54.910827+0300): action_id=dfedc02-d1f211ef-9e84f3c3-3de25b8b;tablet_id=72075186229319608;verification=++it->second.RequestsCount < 10;fline=gc.cpp:62;event=build_gc_request;address=g=2181038081;c=21;;current_gen=1;gen=1:0;count=10;
Jan 14 00:05:54 vla5-2570 kikimr_31003[3869885]:   ydb/library/actors/core/log.cpp:754
Jan 14 00:05:54 vla5-2570 kikimr_31003[3869885]:   ~TVerifyFormattedRecordWriter(): requirement false failed
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 0. /-S/util/system/yassert.cpp:83: NPrivate::InternalPanicImpl(int, char const*, char const*, int, int, int, TBasicStringBuf<char, std::__y1::char_traits<char>>, char const*, unsigned long) @ 0x9A80EDB
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 1. /-S/util/system/yassert.cpp:55: NPrivate::Panic(NPrivate::TStaticBuf const&, int, char const*, char const*, char const*, ...) @ 0x9A7A98C
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 2. /-S/ydb/library/actors/core/log.cpp:754: NActors::TVerifyFormattedRecordWriter::~TVerifyFormattedRecordWriter() @ 0xA812CE3
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 3. /-S/ydb/core/tx/columnshard/blobs_action/bs/gc.cpp:62: NKikimr::NOlap::NBlobOperations::NBlobStorage::TGCTask::BuildRequest(NKikimr::NOlap::NBlobOperations::NBlobStorage::TBlobAddress const&) const @ 0x1256055C
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 4. /-S/ydb/core/tx/columnshard/blobs_action/bs/gc_actor.cpp:17: NKikimr::NOlap::NBlobOperations::NBlobStorage::TGarbageCollectionActor::Handle(TAutoPtr<NActors::TEventHandle<NKikimr::TEvBlobStorage::TEvCollectGarbageResult>, TDelete>&) @ 0x125712CE
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 5. /-S/ydb/core/tx/columnshard/blobs_action/bs/gc_actor.h:34: NKikimr::NOlap::NBlobOperations::NBlobStorage::TGarbageCollectionActor::StateWork(TAutoPtr<NActors::IEventHandle, TDelete>&) @ 0x1255D65E
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 6. /-S/ydb/library/actors/core/executor_thread.cpp:281: NActors::TGenericExecutorThread::Execute(NActors::TMailbox*, bool) @ 0xA804C93
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 7. /-S/ydb/library/actors/core/executor_thread.cpp:475: NActors::TGenericExecutorThread::ProcessExecutorPool(NActors::IExecutorPool*)::$_0::operator()(NActors::TMailbox*, bool) const @ 0xA808780
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 8. /-S/ydb/library/actors/core/executor_thread.cpp:529: NActors::TGenericExecutorThread::ProcessExecutorPool(NActors::IExecutorPool*) @ 0xA8082D2
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 9. /-S/ydb/library/actors/core/executor_thread.cpp:560: NActors::TExecutorThread::ThreadProc() @ 0xA808F2E
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 10. /-S/util/system/thread.cpp:244: (anonymous namespace)::TPosixThread::ThreadProxy(void*) @ 0x9A84FBC
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 11. ??:0: ?? @ 0x7FDDBF0A5608
Jan 14 00:05:56 vla5-2570 kikimr_31003[3869885]: 12. ??:0: ?? @ 0x7FDDBEFC5352
Jan 14 00:05:58 vla5-2570 kikimr_31003[371544]: GRPCs port is not defined.
Jan 14 00:05:58 vla5-2570 kikimr_31003[371544]: Determined node ID: 0
Jan 14 00:05:58 vla5-2570 kikimr_31003[371544]: Trying to register dynamic node to vla5-2569.search.yandex.net:2135
Jan 14 00:05:59 vla5-2570 kikimr_31003[371544]: Registration error: Status: TRANSPORT_UNAVAILABLE
Jan 14 00:05:59 vla5-2570 kikimr_31003[371544]: Issues:
Jan 14 00:05:59 vla5-2570 kikimr_31003[371544]: <main>: Error: GRpc error: (14): failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B2a02:6b8:c34:14:0:1517:eb1e:f456%5D:2135: Failed to connect to remote host: Connection refused
Jan 14 00:05:59 vla5-2570 kikimr_31003[371544]: <main>: Error: Grpc error response on endpoint vla5-2569.search.yandex.net:2135
Jan 14 00:05:59 vla5-2570 kikimr_31003[371544]: Trying to register dynamic node to vla5-2566.search.yandex.net:2135
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Registration error: Status: TRANSPORT_UNAVAILABLE
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Issues:
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: <main>: Error: GRpc error: (14): recvmsg:Connection reset by peer
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: <main>: Error: Grpc error response on endpoint vla5-2566.search.yandex.net:2135
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Trying to register dynamic node to vla5-2568.search.yandex.net:2135
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Success. Registered as 50005
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Node name:
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Trying to get configs from vla5-2570.search.yandex.net:2135
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Success.
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: configured
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: Starting Kikimr r-1 built by maxim-yurchuk
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: UDFsDir is not specified, no dynamic UDFs will be loaded.
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: 2025-01-14T00:06:06.798512+03:00 371544 15038761320904901870 INFO ua_0 created, uri [[fd53::1]:16400]
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: GRpc memory quota was set but disabled due to issues with grpc quoter, to enable it use EnableGRpcMemoryQuota option
Jan 14 00:06:06 vla5-2570 kikimr_31003[371544]: 2025-01-14T00:06:06.883837+03:00 371544 17193755286649186654 INFO ua_0/0 grpc call initialized, session_id [706f5fac-739c7f25-a4adc398-fdce3d30], last_seq_no [0]

Version: d425a70

@naspirato
Collaborator

naspirato commented Jan 20, 2025

Reproduced on http://ydb-sas-testing-0000.search.yandex.net:8765/monitoring/cluster/nodes
version 7e240c6
2025-01-14 16:47:55.000

@maximyurchuk
Collaborator

prio:high because it blocks testing with nemesis

@avevad avevad linked a pull request Jan 23, 2025 that will close this issue