Commit 567b077
docs(dcp): document DCP-optimized S3 reader in README and docstrings
- Add documentation to README, constructor, and DCPOptimizedS3Reader class
- Include class docstrings for S3FileSystem, S3StorageWriter, and S3StorageReader
- Update reader configurations in README with examples
- Use sphinx-friendly formatting for docstrings
- Remove some unplanned TODOs and update some comments
1 parent f88c0da commit 567b077

File tree: 5 files changed (+109, -32)

README.md (29 additions, 12 deletions)
````diff
@@ -128,7 +128,9 @@ Amazon S3 Connector for PyTorch provides robust support for PyTorch distributed
 - `S3StorageWriter`: Implementation of PyTorch's StorageWriter interface.
-- `S3StorageReader`: Implementation of PyTorch's StorageReader interface. Supports configurable reading strategies via the `reader_constructor` parameter (see [Reader Configurations](#reader-configurations)).
+- `S3StorageReader`: Implementation of PyTorch's StorageReader interface.
+  - Supports configurable reading strategies via the `reader_constructor` parameter (see [Reader Configurations](#reader-configurations)).
+  - `S3ReaderConstructor.dcp_optimized()` is recommended for up to 2x faster loading with partial checkpoint optimizations.
 - `S3FileSystem`: An implementation of PyTorch's FileSystemBase.

 These tools enable seamless integration of Amazon S3 with
````
````diff
@@ -151,6 +153,7 @@ can be found in the [examples/dcp](https://github.com/awslabs/s3-connector-for-p
 ```py
 from s3torchconnector.dcp import S3StorageWriter, S3StorageReader
+from s3torchconnector import S3ReaderConstructor

 import torchvision
 import torch.distributed.checkpoint as DCP
````
````diff
@@ -175,7 +178,13 @@ DCP.save(
 # Load distributed checkpoint from S3
 model = torchvision.models.resnet18()
 model_state_dict = model.state_dict()
-s3_storage_reader = S3StorageReader(region=REGION, path=CHECKPOINT_URI)
+# Use DCP-optimized reader for faster loading
+reader_constructor = S3ReaderConstructor.dcp_optimized()
+s3_storage_reader = S3StorageReader(
+    region=REGION,
+    path=CHECKPOINT_URI,
+    reader_constructor=reader_constructor,  # optional; constructor for S3Reader types
+)
 DCP.load(
     state_dict=model_state_dict,
     storage_reader=s3_storage_reader,
````
````diff
@@ -409,7 +418,7 @@ data = s3reader.read()
 ## Reader Configurations

-Amazon S3 Connector for PyTorch supports two types of readers, configurable through `S3ReaderConstructor`.
+Amazon S3 Connector for PyTorch supports three types of readers, configurable through `S3ReaderConstructor`.

 ### Reader Types
````

````diff
@@ -420,21 +429,32 @@ Amazon S3 Connector for PyTorch supports two types of readers, configurable thro
 #### 2. Range-based Reader

-- Performs byte-range requests to read specific portions of S3 objects without downloading the entire file.
-- Prioritizes memory efficiency, with performance gains only for sparse partial reads.
+- Performs byte-range requests to read specific portions of S3 objects without downloading the entire object.
+- Prioritizes memory efficiency, with performance gains only for sparse partial reads in large objects.
 - Features adaptive buffering with forward overlap handling:
   - **Small reads** (< `buffer_size`): Use internal buffer to reduce S3 API calls.
   - **Large reads** (≥ `buffer_size`): Bypass buffer for direct transfer.

+#### 3. DCP-Optimized Reader (DCP only)
+
+- Specialized usage for PyTorch Distributed Checkpoint (DCP) loading.
+- Provides up to 2x performance improvement through zero-copy buffers and sequential access patterns.
+- Enables efficient partial checkpoint loading (e.g. model-only) through range-based streams and range coalescing.
+- Automatically handles range metadata injection from the DCP load plan.
+- Requires sequential access patterns (automatically enforced in `S3StorageReader.prepare_local_plan()`).
+
 ### When to Use Each Reader

-- **Sequential Reader**: For processing entire files, and when repeated access to the data is required. Best for most general use cases.
+- **Sequential Reader**: For processing entire objects, and when repeated access to the data is required. Best for most general use cases.
 - **Range-based Reader**: For larger objects (100MB+) that require sparse partial reads, and in memory-constrained environments.
+- **DCP-Optimized Reader**: For PyTorch Distributed Checkpoint loading scenarios.

 **Note**: S3Reader instances are not thread-safe and should not be shared across threads. For multiprocessing with DataLoader, each worker process creates its own S3Reader instance automatically.

 ### Examples

+For `S3ReaderConstructor` usage details, please refer to the [`S3ReaderConstructor` documentation](https://awslabs.github.io/s3-connector-for-pytorch/autoapi/s3torchconnector/s3reader/constructor/index.html). Below are some examples for `S3ReaderConstructor` usage.
+
 Direct method - `S3Client` usage with range-based reader without buffer:
 ```py
 # Direct S3Client usage for zero-copy partial reads into pre-allocated buffers, for memory efficiency and fast data transfer
````
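The adaptive-buffering rule described in the Range-based Reader bullets can be sketched as a toy model. This is an illustrative assumption-laden sketch, not the connector's internals: `CountingRangeReader`, its fields, and the 64KB buffer size are all made up for the example; the point is only that sequential small reads share one buffered fetch while large reads go direct.

```python
# Illustrative model (not the connector's code) of adaptive buffering:
# small reads (< buffer_size) are served from an internal buffer filled by
# one ranged GET; large reads (>= buffer_size) bypass the buffer entirely.
class CountingRangeReader:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.api_calls = 0       # number of simulated S3 byte-range requests
        self.buf_start = None    # currently buffered range [buf_start, buf_end)
        self.buf_end = None

    def read(self, offset, size):
        if size >= self.buffer_size:
            self.api_calls += 1  # large read: direct byte-range GET, no buffering
            return
        covered = (self.buf_start is not None
                   and self.buf_start <= offset
                   and offset + size <= self.buf_end)
        if not covered:
            self.api_calls += 1  # fill buffer with one ranged GET
            self.buf_start = offset
            self.buf_end = offset + self.buffer_size
        # small read served from the buffer: no extra API call

reader = CountingRangeReader(buffer_size=64 * 1024)
for i in range(16):              # sixteen sequential 4KB reads
    reader.read(i * 4096, 4096)
print(reader.api_calls)          # 1: one buffered fetch covers all sixteen reads
```

A single 128KB read through the same reader would bypass the buffer and cost one additional call, which is why the README recommends the range-based reader only for sparse partial reads.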
````diff
@@ -456,15 +476,13 @@ s3reader.seek(100 * 1024 * 1024)  # Skip to 100MB offset
 bytes_read = s3reader.readinto(buffer)  # Direct read into buffer
 ```

-DCP interface - `S3StorageReader` usage with range-based reader with buffer:
+DCP interface - `S3StorageReader` usage with DCP-optimized reader:
 ```py
-# Load distributed checkpoint with range-based reader to optimize memory usage for large checkpoint files
+# Load checkpoint with DCP-optimized reader for better performance
 from s3torchconnector.dcp import S3StorageReader
 from s3torchconnector import S3ReaderConstructor

-reader_constructor = S3ReaderConstructor.range_based(
-    buffer_size=16*1024*1024  # 16MB buffer
-)
+reader_constructor = S3ReaderConstructor.dcp_optimized()
 s3_storage_reader = S3StorageReader(
     region=REGION,
     path=CHECKPOINT_URI,
````
````diff
@@ -492,7 +510,6 @@ for item in dataset:
     ...
 ```

-For `S3ReaderConstructor` usage details, please refer to the [`S3ReaderConstructor` documentation](https://awslabs.github.io/s3-connector-for-pytorch/autoapi/s3torchconnector/s3reader/constructor/index.html).

 ## Contributing
````

s3torchconnector/pyproject.toml (1 addition, 1 deletion)
```diff
@@ -35,7 +35,7 @@ test = [
     "flake8",
     "black",
     "mypy",
-    "importlib_metadata; python_version == '3.9'",  # PyTorch 2.7.0+ DCP w/ Python 3.9 requires this module
+    "importlib_metadata; python_version == '3.9'",  # PyTorch 2.7.0+ DCP w/ Python 3.9 requires this module; for dcp_optimized reader unit tests
 ]

 e2e = [
```

s3torchconnector/src/s3torchconnector/dcp/s3_file_system.py (12 additions, 1 deletion)
```diff
@@ -44,6 +44,7 @@

 class S3FileSystem(FileSystemBase):
+    """S3-based implementation of PyTorch's FileSystemBase for distributed checkpointing."""
     def __init__(
         self,
         region: str,
@@ -267,6 +268,7 @@ class StorageMetadata:

 class S3StorageWriter(FileSystemWriter):
+    """S3 implementation of PyTorch's FileSystemWriter for distributed checkpoints."""
     def __init__(
         self,
         region: str,
@@ -321,6 +323,7 @@ def validate_checkpoint_id(cls, checkpoint_id: Union[str, os.PathLike]) -> bool:

 class S3StorageReader(FileSystemReader):
+    """S3 implementation of PyTorch's FileSystemReader with configurable reader strategies."""
     def __init__(
         self,
         region: str,
@@ -356,13 +359,21 @@ def validate_checkpoint_id(cls, checkpoint_id: Union[str, os.PathLike]) -> bool:

     def prepare_local_plan(self, plan: LoadPlan) -> LoadPlan:
         """
-        Sort load items by storage offset for sequential access optimization.
+        Performs two key optimizations:
+
+        1. **Load Ordering**: Sorts load items by storage offset to enable sequential access
+
+        2. **Range Injection**: Provides byte range metadata to DCP reader constructors to enable
+           usage of DCPOptimizedS3Reader for range-based streams and range coalescing

         Args:
             plan (LoadPlan): The load plan from PyTorch DCP.

         Returns:
             LoadPlan: The same plan with items sorted by storage offset.
+
+        Note:
+            Both optimizations are required for DCPOptimizedS3Reader.
         """
         # Sort items in plan based on their offset in checkpoint shards
         plan.items.sort(key=lambda item: self.storage_data[item.storage_index].offset)
```
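The load-ordering half of `prepare_local_plan()` can be illustrated with a minimal sketch. The `Item` and `Info` classes below are simplified stand-ins for DCP's load-plan items and `_StorageInfo` metadata (they are not the real types); the sort key mirrors the `plan.items.sort(...)` line in the diff above.

```python
# Minimal sketch of load ordering: sorting load items by the byte offset of
# their stored tensor turns scattered checkpoint reads into a forward-only
# sequential scan, which the DCP-optimized reader requires.
from dataclasses import dataclass

@dataclass
class Item:
    storage_index: str  # stand-in for DCP's MetadataIndex

@dataclass
class Info:
    offset: int  # stand-in for _StorageInfo.offset

def sort_items_by_offset(items, storage_data):
    """Sort load items in place by their stored byte offset (load ordering)."""
    items.sort(key=lambda item: storage_data[item.storage_index].offset)
    return items

# Three weights stored at scattered offsets within one checkpoint shard:
storage_data = {"w1": Info(offset=300), "w2": Info(offset=0), "w3": Info(offset=120)}
items = [Item("w1"), Item("w2"), Item("w3")]
sort_items_by_offset(items, storage_data)
print([i.storage_index for i in items])  # ['w2', 'w3', 'w1']
```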

s3torchconnector/src/s3torchconnector/s3reader/constructor.py (34 additions, 7 deletions)
```diff
@@ -38,7 +38,6 @@ def set_item_ranges_by_file(
     storage_data: "Dict[MetadataIndex, _StorageInfo]",
 ) -> None:

-    # TODO: Check if we want to return DCPOptimizedConstructor for immutability here instead
     if not plan_items:
         return  # Allow lack of plan_items, for SequentialS3Reader fallbacks

@@ -142,15 +141,43 @@ def range_based(buffer_size: Optional[int] = None) -> S3ReaderConstructorProtoco
     def dcp_optimized(
         max_gap_size: Union[int, float] = DEFAULT_MAX_GAP_SIZE,
     ) -> DCPS3ReaderConstructorProtocol:
-        """
-        Creates a DCPOptimizedConstructor that uses DCPOptimizedS3Reader when ranges are available
+        """Creates a constructor for DCP-optimized readers for faster checkpoint loading.
+
+        The DCP-optimized reader provides up to 2x performance improvement over the default sequential reader through:
+
+        - Zero-copy buffer management by storing data as memoryview segments
+        - Sequential access optimization to reduce buffer sizes from file-level to item-level
+        - Range-based fetching that downloads only required byte ranges and coalesces nearby ranges to reduce S3 request latency

         Args:
-            max_gap_size: Maximum gap size in bytes to coalesce ranges into multiple ranged-streams.
-                Use float("inf") to coalesce all ranges regardless of gaps.
-                Use 0 to disable coalescing.
+            max_gap_size: Maximum gap size in bytes between ranges to coalesce into the same S3 read stream.
+                Most users should use the default value.
+
+                - Default: 32MB (``32 * 1024 * 1024``)
+                - Use ``float("inf")`` to coalesce all ranges regardless of gaps
+                - Use 0 to disable coalescing, which creates a new range-based stream for each gap
+
+        Returns:
+            DCPOptimizedConstructorProtocol:
+                Constructor that creates DCPOptimizedS3Reader when ranges are available, falling back to
+                SequentialS3Reader otherwise.
+
+        Requirements:
+            Should be used with S3StorageReader, in which ``prepare_local_plan()`` automatically handles:
+
+            - Load ordering: Sorts items by storage offset for sequential access
+            - Range injection: Provides byte ranges from the DCP load plan to the reader
+
+            Advanced users implementing custom readers must include these optimizations
+            in their ``prepare_local_plan()``/``read_data()`` implementation to use the DCP-optimized reader.
+
+        Example::
+
+            reader_constructor = S3ReaderConstructor.dcp_optimized()
+            storage_reader = S3StorageReader(region, path, reader_constructor=reader_constructor)
+            DCP.load(state_dict, storage_reader=storage_reader)
+
         """
-        # TODO update docstring with guide and requirements to use this reader for DCP
         return DCPOptimizedConstructor(max_gap_size=max_gap_size)

     @staticmethod
```
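The `max_gap_size` semantics documented above (merge ranges whose gap fits, `float("inf")` merges everything, `0` disables coalescing) can be sketched in a few lines. This is a hypothetical standalone function, not the library's `DCPOptimizedConstructor` internals; it only demonstrates the coalescing rule on sorted `(start, end)` byte ranges.

```python
# Sketch of max_gap_size-style range coalescing: merge sorted byte ranges
# whose gap is <= max_gap_size, so each merged range becomes one S3 stream.
def coalesce_ranges(ranges, max_gap_size):
    """Merge sorted (start, end) byte ranges whose inter-range gap fits."""
    if not ranges:
        return []
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= max_gap_size:
            merged[-1][1] = max(merged[-1][1], end)  # extend current stream
        else:
            merged.append([start, end])              # gap too large: new stream
    return [tuple(r) for r in merged]

# Gaps of 50 bytes merge under max_gap_size=64; the 100-byte gap does not:
print(coalesce_ranges([(0, 100), (150, 300), (400, 500)], max_gap_size=64))
# [(0, 300), (400, 500)]
```

With `max_gap_size=0` every gap starts a new stream (more requests, no wasted bytes); with `float("inf")` all ranges collapse into one stream (fewest requests, some discarded gap bytes), which is the trade-off the docstring describes.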

s3torchconnector/src/s3torchconnector/s3reader/dcp_optimized.py (33 additions, 11 deletions)
```diff
@@ -164,18 +164,38 @@ def readinto(self, buf) -> int:

 class DCPOptimizedS3Reader(S3Reader):
-    """
-    This reader optimizes PyTorch Distributed Checkpoint (DCP) partial loading by
-    1. exploiting sequential access patterns to avoid BytesIO buffer copy, and
-    2. only fetching required byte ranges instead of entire objects.
+    """S3 reader implementation optimized for PyTorch Distributed Checkpoint (DCP) loading.
+
+    Provides up to 2x performance improvement over the default sequential reader through:
+
+    1. **Zero-Copy Buffer**: Custom ``_ItemViewBuffer`` storing data as memoryview
+       segments to eliminate BytesIO allocation and copy overhead.
+
+    2. **Sequential Access Optimization**: Exploits sequential access patterns over tensors
+       enforced by ``S3StorageReader.prepare_local_plan()`` to reduce buffer sizes from file-level to
+       item-level.
+
+    3. **Range-based Fetching**: For partial checkpoint loading, uses load plan item range information
+       to group nearby byte ranges within ``max_gap_size`` to minimize S3 first-byte latency (compared to
+       the range-based reader), while only fetching required byte ranges instead of entire files
+       (compared to the sequential reader).

-    REQUIRES:
-    - DCP Loading - reader is only designed for usage via dcp_optimized reader_constructor for dcp.load()
-    - Load Ordering, applied automatically in prepare_local_plan, to ensure sequential access patterns.
-    - item_ranges provided (List[ItemRange]) must be pre-sorted - also applied in prepare_local_plan.
-    - Only supports sequentially reading exact item_ranges provided - otherwise would result in errors.
-      Non-sequential access will result in errors.
+    **Requirements**:
+
+    - DCP Loading: reader is only designed for usage via the dcp_optimized reader_constructor for ``dcp.load()``.
+    - Pre-sorted list of item_ranges, injected automatically in ``prepare_local_plan``.
+    - Sequential access over the exact item_ranges provided, also enforced automatically by ``prepare_local_plan``.

+    **Usage**:
+        Typically created automatically by ``DCPOptimizedConstructor`` when used with ``S3StorageReader`` and
+        ``S3ReaderConstructor.dcp_optimized()``::
+
+            reader_constructor = S3ReaderConstructor.dcp_optimized(max_gap_size=32*1024*1024)
+            storage_reader = S3StorageReader(region, path, reader_constructor=reader_constructor)
+            DCP.load(state_dict, storage_reader=storage_reader)
+
+    **Error Handling**:
+        Non-sequential access attempts raise ValueError with descriptive messages.
     """

     def __init__(
@@ -392,7 +412,6 @@ def _get_item_buffer(self, item: ItemRange) -> _ItemViewBuffer:

         chunk_len = len(chunk)

-        # TODO: separate skip part and take part for clearer logic
         # Skip past unwanted data (due to coalescing)
         if pos < item.start:
             skip_bytes = min(item.start - pos, chunk_len)
@@ -532,6 +551,9 @@ def tell(self) -> int:
         return self._position

     def close(self) -> None:
+        """
+        Close the stream and release resources.
+        """
         if not self._closed:
             self._closed = True
             self._stream = None
```
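The zero-copy buffer idea behind `_ItemViewBuffer` can be sketched with a small hypothetical class. `ViewBuffer` below is not the library's implementation; it only shows the technique the docstring names: keep incoming chunks as memoryview segments and serve `readinto()` by slicing views, instead of concatenating everything through a `BytesIO` copy.

```python
# Hypothetical sketch of a memoryview-segment buffer: append() keeps a view
# of each chunk (no concatenation copy); readinto() copies bytes once,
# directly into the caller's buffer, consuming segments front to back.
class ViewBuffer:
    def __init__(self):
        self._segments = []  # list of memoryview segments, in arrival order

    def append(self, chunk: bytes) -> None:
        self._segments.append(memoryview(chunk))  # view only, no data copy

    def readinto(self, out: bytearray) -> int:
        """Copy buffered bytes into a caller-provided buffer, front first."""
        written = 0
        while self._segments and written < len(out):
            seg = self._segments[0]
            n = min(len(seg), len(out) - written)
            out[written:written + n] = seg[:n]  # single copy, into caller's buffer
            if n == len(seg):
                self._segments.pop(0)           # segment fully consumed
            else:
                self._segments[0] = seg[n:]     # re-slice the view, still no copy
            written += n
        return written

buf = ViewBuffer()
buf.append(b"hello")
buf.append(b"world")
out = bytearray(7)
print(buf.readinto(out), bytes(out))  # 7 b'hellowo'
```

Deferring the copy until `readinto()` is what lets a reader hand tensor bytes to the caller with one copy total, which is the "eliminate BytesIO allocation and copy overhead" claim in the docstring.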
