@UESTC-AHao (Contributor) commented Oct 21, 2025

Purpose

What this PR does / why we need it?

This feature introduces GPU Direct Storage (GDS) support to enable direct data transfer between GPU memory and storage, bypassing CPU memory as an intermediate buffer. This significantly reduces memory bandwidth bottlenecks and improves KV cache loading/offloading performance.

Modifications

Does this PR introduce any user-facing change?

A new parameter, useDirect, has been added to the service-startup command, as shown below. When set to true, GDS transfer is enabled and the data path is Device Memory <-> Storage (direct). When set to false, the data path is Device Memory <-> Host Memory <-> Storage.

"kv_connector_extra_config": {
    "ucm_connector_name": "UcmNfsStore",
    "ucm_connector_config": {
        "storage_backends": "/home/nfs",
        "useDirect": true
    }
}
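
The difference between the two data paths can be illustrated with a host-only simulation. This is a sketch, not the PR's implementation: plain byte vectors stand in for storage, host memory, and device memory, and the names TransferPath, SelectPath, and LoadBlock are hypothetical. It only shows that the non-direct path needs an extra copy through a host bounce buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical names for illustration; none of these come from the PR.
enum class TransferPath { Direct, ViaHost };

// Maps the useDirect option to a data path.
inline TransferPath SelectPath(bool useDirect)
{
    return useDirect ? TransferPath::Direct : TransferPath::ViaHost;
}

// Copies `storage` into `device` and returns the number of copies performed.
// All buffers are ordinary host memory here; a real GDS transfer would go
// through the cuFile API instead of memcpy.
inline std::size_t LoadBlock(TransferPath path, const std::vector<char>& storage,
                             std::vector<char>& device)
{
    assert(device.size() >= storage.size());
    if (storage.empty()) { return 0; }
    if (path == TransferPath::Direct) {
        std::memcpy(device.data(), storage.data(), storage.size());  // Storage -> Device
        return 1;
    }
    std::vector<char> host(storage.size());                          // host bounce buffer
    std::memcpy(host.data(), storage.data(), storage.size());        // Storage -> Host
    std::memcpy(device.data(), host.data(), host.size());            // Host -> Device
    return 2;
}
```

With useDirect set to true the simulated load performs one copy; with false it performs two, which is the overhead the direct path is meant to avoid.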

Test

How was this patch tested?

embed without GDS:
img_v3_02r9_6b47c813-3ecc-4a9f-9b85-8f027bdee16g

embed with GDS:
img_v3_02r9_5d4e7d1e-d099-44b4-a75b-b53189f3f03g

fetch without GDS:
img_v3_02r9_e8fe7199-9ad2-4e45-98d4-54e49f16865g

fetch with GDS:
img_v3_02r9_6050b6b2-34b7-40ae-86e0-1b42f7c9d45g

CUfileHandle_t cuFileHandle = nullptr;
auto status = CuFileHandleRecorder::Instance().Get(path, cuFileHandle,
[&path](CUfileHandle_t& handle, int& fd) -> Status {
return CreateCuFileHandle(path, O_WRONLY | O_CREAT | O_DIRECT, handle, fd);
Contributor

Do we need O_CREAT here?

Contributor Author

O_CREAT is unnecessary in D2SSync. The file is already created at the Python level before the dump occurs.
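
As a sketch of this point: if the file is guaranteed to exist before the dump, the open call can drop O_CREAT, and a missing file then surfaces as an error (ENOENT) instead of being silently created. OpenExistingForDump is a hypothetical helper, not code from this PR. It uses only POSIX open(2), and O_DIRECT is applied only on request because some filesystems reject it.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: opens an already-existing file for writing.
// Returns the file descriptor, or -1 with errno set (ENOENT if the file is missing).
inline int OpenExistingForDump(const std::string& path, bool useDirect)
{
    assert(!path.empty());
    int flags = O_WRONLY;  // no O_CREAT: the file must already exist
#ifdef O_DIRECT
    if (useDirect) { flags |= O_DIRECT; }  // bypass the page cache where supported
#else
    (void)useDirect;  // O_DIRECT is not available on this platform
#endif
    return ::open(path.c_str(), flags);
}
```

A caller would treat a return value of -1 as a failed shard rather than retrying with O_CREAT.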

if (!success) { this->failureSet_->Insert(shard.owner); }
if (!shard.done) { return; }
if (device) {
if (device->Synchronized().Failure()) { this->failureSet_->Insert(shard.owner); }
Contributor

Why does the synchronous GDS path need Synchronized() here?

Contributor Author

You're right: the current S2DSync/D2SSync operations are already synchronous, so there are no pending operations to wait for.

virtual ~TaskQueue() = default;
virtual void Push(std::list<Task::Shard>& shards) noexcept = 0;
virtual Status Setup(const int32_t deviceId, const size_t bufferSize, const size_t bufferNumber,
class TaskSet* failureSet, const class SpaceLayout* layout, const size_t timeoutMs, bool transferUseDirect) = 0;
Contributor

Can the class keyword be removed here?

Contributor Author

Sure, but the relevant header files need to be included. I’ll make the necessary changes.
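
One possible way to drop the class keyword without including the full headers is to forward-declare the types once at the top of the header, which is sufficient for pointer parameters. The sketch below is a simplified stand-in for the TaskQueue interface: Status is reduced to a hypothetical enum, the Push method is omitted, and NullQueue exists only to show that the header compiles without the definitions of TaskSet or SpaceLayout.

```cpp
#include <cstddef>
#include <cstdint>

// Forward declarations: enough because the interface only uses pointers.
class TaskSet;
class SpaceLayout;

// Hypothetical stand-in for the project's Status type.
enum class Status { Ok, InvalidParam };

class TaskQueue {
public:
    virtual ~TaskQueue() = default;
    virtual Status Setup(std::int32_t deviceId, std::size_t bufferSize, std::size_t bufferNumber,
                         TaskSet* failureSet, const SpaceLayout* layout,
                         std::size_t timeoutMs, bool transferUseDirect) = 0;
};

// Hypothetical minimal implementation; it never dereferences the pointers,
// so the incomplete types are never needed.
class NullQueue : public TaskQueue {
public:
    Status Setup(std::int32_t deviceId, std::size_t bufferSize, std::size_t bufferNumber,
                 TaskSet* failureSet, const SpaceLayout* layout,
                 std::size_t timeoutMs, bool transferUseDirect) override
    {
        (void)failureSet; (void)layout; (void)timeoutMs; (void)transferUseDirect;
        if (deviceId < 0 || bufferSize == 0 || bufferNumber == 0) { return Status::InvalidParam; }
        return Status::Ok;
    }
};
```

Only the .cc files that actually call into TaskSet or SpaceLayout would then need the corresponding headers.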

}
virtual ~IDevice() = default;
virtual Status Setup() = 0;
virtual Status Setup(bool transferUseDirect) = 0;
Contributor

The IDevice base class should not be aware of this config option.

Contributor Author

The device should decide whether to invoke cuFileDriverOpen() during setup based on the transferUseDirect parameter. Moreover, cuFileDriverOpen() is CUDA-specific and cannot appear in nfsstore.
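
One way to reconcile the reviewer's concern with this constraint, sketched under assumptions: keep IDevice::Setup() parameterless and hand the option to the concrete CUDA device at construction, so the base interface never sees it and the driver call stays inside the CUDA-specific class. CudaDevice, MakeCudaDevice, DriverOpened, and the Status enum are hypothetical; the real cuFileDriverOpen() call is replaced by a comment so the sketch compiles without CUDA.

```cpp
#include <memory>

// Hypothetical stand-in for the project's Status type.
enum class Status { Ok, Error };

class IDevice {
public:
    virtual ~IDevice() = default;
    virtual Status Setup() = 0;  // no config parameters in the interface
};

class CudaDevice : public IDevice {
public:
    explicit CudaDevice(bool useDirect) : useDirect_(useDirect) {}

    Status Setup() override
    {
        if (useDirect_) {
            // A real implementation would call cuFileDriverOpen() here
            // and map its result to Status.
            driverOpened_ = true;
        }
        return Status::Ok;
    }

    bool DriverOpened() const { return driverOpened_; }

private:
    bool useDirect_;
    bool driverOpened_ = false;
};

// Hypothetical factory: the only place that needs to know about the option.
inline std::unique_ptr<CudaDevice> MakeCudaDevice(bool useDirect)
{
    return std::make_unique<CudaDevice>(useDirect);
}
```

Callers that hold an IDevice pointer keep calling Setup() with no arguments, and only the code that constructs the device reads the configuration.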

}
virtual Status D2SSync(const std::string& path, void* address, const size_t length, const size_t file_offset, const size_t dev_offset) {
return Status::Unsupported();
}
Contributor

  1. Pass a file handle, not the file path, to Device.
  2. Use lowerCamelCase for parameter names.
  3. Declare interface methods in the interface class as virtual functions.

Contributor Author

1. Passing a file handle would require using CUfileDescr_t or CUfileHandle_t to obtain the fd inside directstorage_queue.cc, but the cuFile API is CUDA-specific and cannot appear in that file.
2. Received, I will fix it.
3. Received, I will fix it.


#include "task_shard.h"
#include "task_set.h"
#include "../nfsstore/cc/domain/space/space_layout.h"
Contributor

StoreTask is a base class and should not depend on NfsStore.

Contributor Author

If the Setup function is not declared virtual and does not accept parameters such as SpaceLayout, then the call q->Setup() in trans_manager.h cannot decide whether to invoke cuFileDriverOpen() based on the arguments passed.
