docs(dcp): document DCP-optimized S3 reader in README and docstrings
- Add documentation to README, constructor, and DCPOptimizedS3Reader class
- Include class docstrings for S3FileSystem, S3StorageWriter, and S3StorageReader
- Update reader configurations in README with examples
- Use sphinx-friendly formatting for docstrings
- Remove some unplanned TODOs and update some comments
README.md: 29 additions & 12 deletions
@@ -128,7 +128,9 @@ Amazon S3 Connector for PyTorch provides robust support for PyTorch distributed

 - `S3StorageWriter`: Implementation of PyTorch's StorageWriter interface.

-- `S3StorageReader`: Implementation of PyTorch's StorageReader interface. Supports configurable reading strategies via the `reader_constructor` parameter (see [Reader Configurations](#reader-configurations)).
+- `S3StorageReader`: Implementation of PyTorch's StorageReader interface.
+  - Supports configurable reading strategies via the `reader_constructor` parameter (see [Reader Configurations](#reader-configurations)).
+  - `S3ReaderConstructor.dcp_optimized()` is recommended for up to 2x faster loading with partial checkpoint optimizations.
 - `S3FileSystem`: An implementation of PyTorch's FileSystemBase.

 These tools enable seamless integration of Amazon S3 with
@@ -151,6 +153,7 @@ can be found in the [examples/dcp](https://github.com/awslabs/s3-connector-for-p

 ```py
 from s3torchconnector.dcp import S3StorageWriter, S3StorageReader
reader_constructor=reader_constructor, # optional; constructor for S3Reader types
187
+
)
 DCP.load(
     state_dict=model_state_dict,
     storage_reader=s3_storage_reader,
@@ -409,7 +418,7 @@ data = s3reader.read()

 ## Reader Configurations

-Amazon S3 Connector for PyTorch supports two types of readers, configurable through `S3ReaderConstructor`.
+Amazon S3 Connector for PyTorch supports three types of readers, configurable through `S3ReaderConstructor`.

 ### Reader Types

@@ -420,21 +429,32 @@ Amazon S3 Connector for PyTorch supports two types of readers, configurable thro

 #### 2. Range-based Reader

-- Performs byte-range requests to read specific portions of S3 objects without downloading the entire file.
-- Prioritizes memory efficiency, with performance gains only for sparse partial reads.
+- Performs byte-range requests to read specific portions of S3 objects without downloading the entire object.
+- Prioritizes memory efficiency, with performance gains only for sparse partial reads in large objects.
 - Features adaptive buffering with forward overlap handling:
   - **Small reads** (< `buffer_size`): Use internal buffer to reduce S3 API calls.
   - **Large reads** (≥ `buffer_size`): Bypass buffer for direct transfer.

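Reviewer note: the small-read / large-read split in the range-based reader bullets above can be pictured with a toy model. This is only a sketch under stated assumptions, not the connector's implementation: `BufferedRangeReader` and `fake_fetch` are hypothetical names, `fetch(start, end)` stands in for an S3 byte-range request, and forward overlap handling is left out for brevity.

```python
from typing import Callable, List, Tuple


class BufferedRangeReader:
    """Toy model of adaptive buffering (hypothetical; not the library's code)."""

    def __init__(self, fetch: Callable[[int, int], bytes], size: int, buffer_size: int):
        self._fetch = fetch            # assumed: fetch(start, end) returns bytes of [start, end)
        self._size = size              # total object size in bytes
        self._buffer_size = buffer_size
        self._buf = b""
        self._buf_start = 0            # object offset of the first buffered byte
        self._pos = 0                  # current read position

    def read(self, n: int) -> bytes:
        start = self._pos
        end = min(start + n, self._size)
        if end <= start:
            return b""
        if end - start >= self._buffer_size:
            # Large read: bypass the buffer and request exactly this range.
            data = self._fetch(start, end)
        else:
            buf_end = self._buf_start + len(self._buf)
            if not (self._buf_start <= start and end <= buf_end):
                # Small read outside the buffer: refill one buffer-sized window.
                self._buf_start = start
                window_end = min(start + self._buffer_size, self._size)
                self._buf = self._fetch(start, window_end)
            offset = start - self._buf_start
            data = self._buf[offset:offset + (end - start)]
        self._pos = end
        return data


# Demo against an in-memory "object" that records every simulated request.
blob = bytes(range(256)) * 4
calls: List[Tuple[int, int]] = []


def fake_fetch(start: int, end: int) -> bytes:
    calls.append((start, end))
    return blob[start:end]


reader = BufferedRangeReader(fake_fetch, size=len(blob), buffer_size=64)
first = reader.read(16)    # small read: fills the buffer with one request
second = reader.read(16)   # small read: served from the buffer, no request
big = reader.read(200)     # large read: bypasses the buffer
```

In this toy run, three reads cost two simulated requests, which is the effect the "reduce S3 API calls" bullet describes.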
+#### 3. DCP-Optimized Reader (DCP only)
+
+- Specialized usage for PyTorch Distributed Checkpoint (DCP) loading.
+- Provides up to 2x performance improvement through zero-copy buffers and sequential access patterns.
+- Enables efficient partial checkpoint loading (e.g. model-only) through range-based streams and range coalescing.
+- Automatically handles range metadata injection from DCP load plan.
+- Requires sequential access patterns (automatically enforced in `S3StorageReader.prepare_local_plan()`)
+
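Reviewer note: "range coalescing" in the DCP-optimized reader bullets above generally means merging nearby byte ranges so that fewer requests are issued. The sketch below illustrates that general idea only. It makes no claim about how the DCP-optimized reader implements it, and `coalesce_ranges` and its `max_gap` threshold are hypothetical names chosen for the example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def coalesce_ranges(ranges: List[Range], max_gap: int) -> List[Range]:
    """Merge byte ranges that overlap or sit at most max_gap bytes apart."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Four requested tensor ranges collapse into two request ranges.
requested = [(0, 100), (120, 200), (5000, 5100), (150, 260)]
coalesced = coalesce_ranges(requested, max_gap=32)
```

The trade-off is that a coalesced request also transfers the gap bytes between ranges, so a real threshold would balance the cost of an extra request against the cost of those extra bytes.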
 ### When to Use Each Reader

-- **Sequential Reader**: For processing entire files, and when repeated access to the data is required. Best for most general use cases.
+- **Sequential Reader**: For processing entire objects, and when repeated access to the data is required. Best for most general use cases.
 - **Range-based Reader**: For larger objects (100MB+) that require sparse partial reads, and in memory-constrained environments.
+- **DCP-Optimized Reader**: For PyTorch Distributed Checkpoint loading scenarios.

 **Note**: S3Reader instances are not thread-safe and should not be shared across threads. For multiprocessing with DataLoader, each worker process creates its own S3Reader instance automatically.

 ### Examples

+For `S3ReaderConstructor` usage details, please refer to the [`S3ReaderConstructor` documentation](https://awslabs.github.io/s3-connector-for-pytorch/autoapi/s3torchconnector/s3reader/constructor/index.html). Below are some examples for `S3ReaderConstructor` usage.
+
 Direct method - `S3Client` usage with range-based reader without buffer:
 ```py
 # Direct S3Client usage for zero-copy partial reads into pre-allocated buffers, for memory efficiency and fast data transfer
For `S3ReaderConstructor` usage details, please refer to the [`S3ReaderConstructor` documentation](https://awslabs.github.io/s3-connector-for-pytorch/autoapi/s3torchconnector/s3reader/constructor/index.html).