feat(clp-core): Add the write path for single-file archives. #646

Open · wants to merge 35 commits into base: main

davemarco
Contributor

@davemarco davemarco commented Dec 28, 2024

Description

Reimplementation of the unstructured single-file archive (SFA) writer from the private branch in open source. The reader will follow after this PR is merged.

The flow of this PR is loosely inspired by #563. We effectively "tar" the multi-file archive after it's created to write a single-file archive.

The open-source SFA format is similar, but not identical, to the private branch spec. New features were added to open source after the private fork, and vice versa. This private Google doc notes the differences. The doc also describes the changes the private branch will need at a later date so that open source can read its archives.

Validation performed

I modified the private branch reader so that it could read the open-source SFA. I compressed Loghub/HDFS2 (multiple segments) with the open-source SFA writer and was able to decompress it with the private reader. I verified with diff that the decompressed files were the same.

Note that the disk format has changed slightly since the original test as a result of code review. Overall, the test still shows that the serialization approach works; we can do final validation in the open-source reader PR.

Summary by CodeRabbit

  • New Features
    • Introduced a new command-line option (--single-file-archive), enabling users to generate a single consolidated archive file rather than multiple files.
    • Enhanced the archive creation workflow to respect the single-file setting, preventing file splitting and streamlining the output process for improved file management.
    • Added functionalities for managing single-file archives, including metadata generation and file size management.
    • Implemented robust error handling for archive operations to ensure reliability during file processing.
  • Documentation
    • Updated documentation to reflect new features and usage instructions for the single-file archive option.

Contributor

coderabbitai bot commented Dec 28, 2024

Walkthrough

The changes introduce support for creating single-file archives by adding new source files, command-line options, and configuration flags. This includes new header and source files for handling definitions, metadata, and writing single-file archives, modifications to the CommandLineArguments and compression logic to support a new --single-file-archive option, and extensions to the Archive classes to conditionally create single-file archives during the close process.

Changes

File(s) | Change Summary
components/core/src/clp/clp/CMakeLists.txt, components/core/CMakeLists.txt | Added new source files for the single-file archive feature, integrating files for definitions and writer functionality.
components/core/src/clp/clp/CommandLineArguments.cpp, components/core/src/clp/clp/CommandLineArguments.hpp | Added a new command-line flag (--single-file-archive), a corresponding boolean member variable, and an accessor method to handle the single-file archive option.
components/core/src/clp/clp/compression.cpp | Updated the compress function with an additional boolean parameter (use_single_file_archive) to adjust the splitting logic based on the new option.
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp, components/core/src/clp/streaming_archive/single_file_archive/writer.cpp, components/core/src/clp/streaming_archive/single_file_archive/writer.hpp | Introduced a new header and writer files defining version constants, archive header, file info structures, exception class, and functions for writing single-file archives.
components/core/src/clp/streaming_archive/writer/Archive.cpp, components/core/src/clp/streaming_archive/writer/Archive.hpp | Extended Archive classes and UserConfig with new flags and methods to support single-file archive creation at close, including modified file splitting logic.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as CommandLineArguments
    participant Comp as Compression
    participant Arch as Archive
    participant SFA as SingleFileArchiveWriter

    User->>CLI: Run command with --single-file-archive flag
    CLI-->>Arch: Pass configuration (m_single_file_archive = true)
    Arch->>Comp: Compress files with single-file flag
    Arch->>SFA: On close, invoke write_single_file_archive
    SFA-->>Arch: Single-file archive created

Possibly related PRs

  • feat(clp-s): Add end to end search tests for clp-s. #668: The changes in the main PR, which involve adding new source files for single-file archive functionality, are related to the modifications in the retrieved PR that also incorporate support for single-file archives in the compress function and testing framework.
  • feat(core-clp)!: Migrate archive metadata file format to MessagePack. #700: The changes in the main PR, which introduce new source files for handling single-file archives, are related to the modifications in the retrieved PR that involve the ArchiveMetadata class and its serialization using MessagePack, as both PRs enhance the functionality of the archiving system.

Suggested reviewers

  • haiqi96
  • LinZhihao-723

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (12)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (5)

19-20: Consider externalizing the read block size
The constant cReadBlockSize is set to 4096. For different environments or performance optimizations, externalizing this block size into a configuration or a compile-time parameter might improve flexibility.


133-147: Reassess large archive handling
A single-file archive exceeding cFileSizeWarningThreshold triggers a warning, but it might be beneficial to add user guidance or a more detailed strategy for dealing with very large archives (e.g. automatically switching to multi-file).


179-190: Potentially make version dynamic
Within write_archive_header, the cArchiveVersion is hard-coded. If changes are expected in future releases, consider introducing a mechanism to set the version at build time or to read it from project configuration.


192-195: Minor inefficiency with repeated .str() calls
Repetitively calling packed_metadata.str() can lead to unnecessary string object creation. While not critical for smaller metadata, consider assigning packed_metadata.str() once to a local variable for more efficiency.


197-213: Graceful recovery on partial reads
The loop for reading from the file and writing to archive_writer is straightforward and robust. For future improvements, consider adding logging or partial recovery in case of transient errors (e.g. a network file) instead of throwing an exception outright.
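
For reference, a block-copy loop of the kind discussed here can be sketched as follows. This is a minimal sketch that uses standard streams as stand-ins for the PR's reader and writer classes; `copy_in_blocks` is a hypothetical name, and only the 4096-byte block size comes from the review comment above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Block size mentioned in the review (cReadBlockSize = 4096).
constexpr std::size_t cReadBlockSize = 4096;

// Hypothetical helper: copies everything from `in` to `out` in fixed-size
// blocks and returns the number of bytes copied. A short final block is
// handled by writing only the bytes that were actually read.
std::size_t copy_in_blocks(std::istream& in, std::ostream& out) {
    std::array<char, cReadBlockSize> buf{};
    std::size_t total = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        auto const num_read = in.gcount();
        if (num_read <= 0) {
            break;
        }
        out.write(buf.data(), num_read);
        total += static_cast<std::size_t>(num_read);
    }
    return total;
}
```

Comparing the returned byte count against the size recorded in the metadata is one way a caller could detect a truncated copy.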

components/core/src/clp/streaming_archive/writer/Archive.cpp (3)

23-24: Check necessity of newly added includes.

Please confirm that both Defs.hpp and writer.hpp are required here. If these headers are no longer needed, consider removing them to reduce compilation overhead.


249-252: Verify multi–file archive cleanup.

When single–file archive mode is enabled, the archive transitions to a single–file format at closure. Ensure that the multi–file artefacts are either cleaned up or that the user is aware they remain. Inadvertent leftover files could confuse users.


341-342: Revisit dictionary–driven splitting logic.

Currently, splitting only occurs if false == m_use_single_file_archive. Consider whether large dictionary sizes warrant splitting in single–file archive mode too.

components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (2)

50-55: Assess return type for better clarity.

Returning a std::stringstream from create_single_file_archive_metadata is straightforward, but consider a custom struct or type alias for readability and future extension.


65-69: Recover from partial writes.

write_single_file_archive can remove the existing multi–file archive. Think about potential rollback or error–handling strategies if the write fails partway.

components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)

16-21: Confirm versioning approach.

Your major/minor/patch shift bits are standard. Confirm that the product’s versioning policy aligns with these values if future releases require increments.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

115-120: Evolution of metadata.

The newly introduced fields (m_variable_encoding_methods_version, m_variables_schema_version, and m_compression_type) address single–file archival requirements. Consider placing them in a derived metadata class if multi–file archives remain unaffected.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 329edf6 and 5c75147.

📒 Files selected for processing (11)
  • components/core/src/clp/clp/CMakeLists.txt (1 hunks)
  • components/core/src/clp/clp/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp/clp/CommandLineArguments.hpp (3 hunks)
  • components/core/src/clp/clp/FileCompressor.cpp (3 hunks)
  • components/core/src/clp/clp/compression.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (5 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.hpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (10)
components/core/src/clp/clp/CommandLineArguments.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clp/FileCompressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clp/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clp/compression.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/writer/Archive.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp/clp/FileCompressor.cpp (2)
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/clp/FileCompressor.hpp:58-78
Timestamp: 2024-11-10T16:46:53.300Z
Learning: When reviewing legacy code refactors, avoid suggesting changes that would extend the scope of the PR.
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/clp/FileCompressor.cpp:189-220
Timestamp: 2024-11-10T16:46:58.543Z
Learning: Ensure that before flagging functions like `parse_and_encode` for throwing exceptions while declared with `noexcept`, verify that the function is actually declared with `noexcept` to avoid false positives.
🔇 Additional comments (32)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

100-112: Validate file existence before attempting to get its size
While the exception handling is robust, it may be helpful to add an explicit check if the file exists before calling file_size. This would provide a clearer error message if the file is missing and avoid filesystem errors.


114-147: Confirm directory existence when retrieving file info
When assembling the file_infos, consider verifying that the directory holding segments actually exists (e.g. segment_dir_path). If it does not, an early error message might clarify that no segments were found.


256-285: Ensure user awareness before deleting original archive
std::filesystem::remove_all(multi_file_archive_path) irreversibly deletes the multi-file archive after writing the single-file archive. For safety, either confirm user intent or allow an option to retain the source.

Would you like a script to confirm the presence of any leftover files before full deletion?

components/core/src/clp/streaming_archive/writer/Archive.cpp (3)

16-16: Confirm correct spdlog header include.

Typically the common header is <spdlog/spdlog.h> rather than <spdlog.h>. Please verify that this is intentional and that the correct symbols are available.


62-62: Initialization looks fine.

Assigning user_config.use_single_file_archive to m_use_single_file_archive supports the new single–file archive feature as intended. No issues found.


662-679: Handle edge cases for segment IDs and metadata.

  1. If m_next_segment_id is 0, m_next_segment_id - 1 becomes negative, potentially leading to unexpected behaviour in get_segment_ids().
  2. Consider checking for empty or invalid segment_ids before proceeding.
  3. Ensure that partial writes or exceptions during write_single_file_archive are either rolled back or leave a consistent state.
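
One way to sidestep the first edge case is to pass the segment count rather than the last ID. The sketch below assumes segment IDs are an unsigned 64-bit type (in which case `0 - 1` wraps to a very large value rather than going negative) and that the IDs are returned as strings; `get_segment_ids_from_count` is a hypothetical name, not the PR's function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: segment IDs are unsigned, so `next_segment_id - 1` would wrap
// around when no segment has been written yet.
using segment_id_t = std::uint64_t;

// Hypothetical variant that takes the *next* segment ID (i.e. the number of
// segments written so far). An archive with no segments yields an empty list.
std::vector<std::string> get_segment_ids_from_count(segment_id_t next_segment_id) {
    std::vector<std::string> segment_ids;
    for (segment_id_t i = 0; i < next_segment_id; ++i) {
        segment_ids.emplace_back(std::to_string(i));
    }
    return segment_ids;
}
```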
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (2)

16-34: Custom exception design appears consistent.

The OperationFailed class neatly extends TraceableException and provides a clear error message.


40-41: Validate last_segment_id usage.

Please ensure that calling get_segment_ids(last_segment_id) with zero or negative values is handled gracefully in the implementation.

components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (6)

23-27: Basic definitions look acceptable.

The magic number, file extension, and file–size warning threshold appear suitable for single–file archives.


28-35: Static file handling looks straightforward.

cStaticArchiveFileNames is a helpful container for known archive files. No issues observed here.


37-43: Packed struct alignment caution.

__attribute__((packed)) on SingleFileArchiveHeader can cause cross–platform alignment mismatches. Confirm that your usage environment supports it consistently.
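
A compile-time check can make the layout assumption explicit. The struct below is purely illustrative (its fields are not the PR's SingleFileArchiveHeader) and relies on the GCC/Clang `__attribute__((packed))` extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical header; the field names and order are assumptions chosen so
// that natural alignment would insert padding without the packed attribute.
struct __attribute__((packed)) ExampleHeader {
    std::uint8_t magic[4];
    std::uint64_t metadata_size;
    std::uint32_t version;
};

// Fail the build if the on-disk layout ever changes unexpectedly.
static_assert(sizeof(ExampleHeader) == 16, "ExampleHeader must be exactly 16 bytes");
static_assert(
        std::is_trivially_copyable<ExampleHeader>::value,
        "ExampleHeader must be safe to copy byte-for-byte"
);
```

Packing removes padding but does not fix byte order, so a reader on a platform with different endianness would still need explicit conversion.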


45-49: FileInfo struct.

No concerns: the usage of MSGPACK_DEFINE_MAP is consistent with message pack patterns.


51-72: MultiFileArchiveMetadata structure is appropriate.

The fields here match multi–file archiving logic. Good usage of MSGPACK_DEFINE_MAP.


74-79: SingleFileArchiveMetadata structure appropriateness.

Combining archive_files, archive_metadata, and num_segments is logical for single–file mode. Nicely integrated with msgpack.

components/core/src/clp/clp/CommandLineArguments.hpp (3)

26-26: Default boolean initialisation.

m_single_file_archive(false) is clear and consistent with the default behaviour of multi–file archives.


49-50: Accessor method is straightforward.

get_use_single_file_archive() properly reflects the new member variable. No issues found.


98-98: Member variable integration.

m_single_file_archive fits in seamlessly with existing arguments. No contradictions observed.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)

10-10: Added include for encoding methods.

Including encoding_methods.hpp is logical, given usage of ffi::cVariableEncodingMethodsVersion further in the file.


13-14: New compression type constant.

cCompressionTypeZstd = "ZSTD"; is a welcome addition, clarifying the default compression type used.


86-91: Accessors for new metadata fields.

Providing methods to retrieve variable encoding and schema versions, along with compression type, aligns well with single–file archive needs.

components/core/src/clp/clp/compression.cpp (3)

110-110: No issues found with the addition of the single-file-archive configuration.
This line properly forwards the command-line argument into the archive writer’s configuration.


139-140: Logical check for archive splitting is correct.
The code correctly checks whether the dictionary size threshold is reached and if single-file mode is disabled. This ensures that splitting only occurs under the intended conditions.


168-169: Consistent archive splitting logic for grouped files.
These lines mirror the logic above and maintain consistent behaviour for grouped file compression.

components/core/src/clp/streaming_archive/writer/Archive.hpp (3)

51-51: New boolean flag introduced.
Adding the flag use_single_file_archive to UserConfig is clear and self-explanatory, facilitating better configurability of the archiving process.


197-197: Getter method aligns with the coding guidelines.
This accessor cleanly exposes the new flag, supporting usage in external logic.


285-289: Implementation detail for single-file archive creation.
The create_single_file_archive() method is a suitable extension point, though it’s not fully shown here. It should be well-documented and tested to ensure it handles any corner cases (e.g. partially written archives).

components/core/src/clp/clp/CommandLineArguments.cpp (1)

376-379: New command-line option successfully introduced.
The addition of --single-file-archive provides enhanced flexibility for output formats. Ensure that the help text and documentation clearly describe any potential performance or storage trade-offs.

components/core/src/clp/clp/FileCompressor.cpp (3)

246-247: Single-file check prevents increments of archive splits.
This usage respects the new flag to avoid unnecessary splitting, preserving consistent behaviour with single-file archiving.


341-342: Conditional logic consistent with the single-file feature.
Retaining the archival split only if false == … ensures that the newly introduced flag remains orthogonal to existing compression workflows.


542-543: Proper check before splitting IR streams.
The logic parallels other file-splitting checks, keeping single-file intentions intact across different compression paths.

components/core/src/clp/clp/CMakeLists.txt (2)

111-113: LGTM! New source files follow project conventions.

The new single-file archive implementation files are logically organized and follow the project's established patterns for source organization.


Line range hint 1-150: Consider making GlobalMetadataDB optional for single-file archives.

Based on the PR objectives, GlobalMetadataDB might not be required when writing single-file archives. Consider making it conditionally included based on the archive type to avoid unnecessary dependencies.

Let's verify the GlobalMetadataDB usage in the single-file archive implementation:

✅ Verification successful

Based on the examination of the single file archive implementation in the codebase, particularly in single_file_archive/writer.hpp and writer.cpp, I can now provide a final response:

GlobalMetadataDB is not used in single-file archive operations

The single-file archive implementation is completely independent of GlobalMetadataDB. It operates by:

  • Creating a self-contained archive with its own metadata structure (SingleFileArchiveMetadata)
  • Directly copying files from the multi-file archive
  • Managing file offsets and archive structure internally

The GlobalMetadataDB dependency can be safely made optional for single-file archives as they:

  • Store all necessary metadata within the single file
  • Don't require external database operations
  • Handle their own file management and metadata tracking
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check GlobalMetadataDB usage in single-file archive code

# Check for GlobalMetadataDB usage in new SFA files
rg -A 5 "GlobalMetadataDB" "../streaming_archive/single_file_archive/"

# Check for any conditional usage patterns
ast-grep --pattern 'if ($condition) {
  $$$
  GlobalMetadataDB
  $$$
}'

Length of output: 224


Script:

#!/bin/bash
# Let's try searching in the entire codebase
# Check GlobalMetadataDB usage in single file archive related code
rg -A 5 "GlobalMetadataDB" --type cpp

# Check writer implementation specifically
rg -A 10 "single_file_archive.*writer" --type cpp

# Look for single file archive implementation
fd "single.*archive"

Length of output: 48285


Script:

#!/bin/bash
# Let's check the single file archive writer implementation
rg -A 10 "class.*writer" components/core/src/clp/streaming_archive/single_file_archive/writer.hpp

# Check if GlobalMetadataDB is used in single file archive implementation
rg -A 5 "GlobalMetadataDB" components/core/src/clp/streaming_archive/single_file_archive/writer.cpp

# Check Archive constructor to see how single file archive mode is handled
rg -A 10 "Archive::Archive.*Config" components/core/src/clp/streaming_archive/writer/Archive.cpp

Length of output: 298


Script:

#!/bin/bash
# Check the content of single file archive writer files
cat components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
cat components/core/src/clp/streaming_archive/single_file_archive/writer.cpp

# Check Archive constructor implementation
cat components/core/src/clp/streaming_archive/writer/Archive.cpp | rg -A 20 "Archive::Archive"

Length of output: 12781

@LinZhihao-723
Member

The archive version was changed to use uint32_t in the private branch to match the rule of semantic version encoding. In the current OSS, the archive format version is defined here. In the private branch, we use version = (major_version << 24) | (minor_version << 16) | (patch_version). @kirkrodrigues shall we update the OSS to also use semantic version for the archive format?
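
The encoding described above can be sketched as follows. The function names are hypothetical; the field widths (8-bit major, 8-bit minor, 16-bit patch) are inferred from the shift amounts in the formula.

```cpp
#include <cassert>
#include <cstdint>

// Packs major.minor.patch into a uint32_t:
// version = (major << 24) | (minor << 16) | patch
constexpr std::uint32_t
encode_version(std::uint8_t major, std::uint8_t minor, std::uint16_t patch) {
    return (static_cast<std::uint32_t>(major) << 24)
           | (static_cast<std::uint32_t>(minor) << 16)
           | static_cast<std::uint32_t>(patch);
}

constexpr std::uint8_t get_major(std::uint32_t version) {
    return static_cast<std::uint8_t>(version >> 24);
}

constexpr std::uint8_t get_minor(std::uint32_t version) {
    return static_cast<std::uint8_t>((version >> 16) & 0xFF);
}

constexpr std::uint16_t get_patch(std::uint32_t version) {
    return static_cast<std::uint16_t>(version & 0xFFFF);
}

static_assert(encode_version(0, 1, 2) == 0x00010002U, "unexpected version encoding");
```

With this layout, comparing two encoded values as plain integers orders them the same way as comparing (major, minor, patch) tuples.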

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (9)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (3)

15-21: Use consistent naming convention for version constants
The naming “cArchiveMajorVersion”, “cArchiveMinorVersion” and “cArchivePatchVersion” is informative, but ensure consistent naming style (e.g., PascalCase or snake_case) if there’s a global standard.


23-27: Remove magic number duplication
The .clp extension and the cUnstructuredSfaMagicNumber might be repeated in other parts of the code. If so, consider centralizing them in a single constants file to avoid potential inconsistencies.


76-81: Validate memory usage
Storing an entire vector of FileInfo might risk memory issues if the number of archived files grows very large in single-file mode. Consider a streaming approach for extremely large archives.

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

19-20: Enhance documentation for cReadBlockSize
Elaborate on the rationale for using 4096. This helps maintainers decide whether to tune it for performance.


137-145: Large file warning
This warning is valuable. Consider adding a user-configurable threshold to accommodate different environment needs.


215-231: Possible parallelization
The loop writing each segment’s file content can be parallelized if the order doesn’t matter for performance gains.

components/core/src/clp/clp/compression.cpp (2)

139-141: Maintain DRY
This condition repeats in lines 169-171. Factor out an inline function or macro if duplication grows.
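
A minimal sketch of such a helper is shown below. The name and parameters are hypothetical, and the `>=` threshold comparison is an assumption about the existing condition rather than something shown in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: an archive is split only when the dictionaries have
// reached the target size and single-file archive mode is disabled.
inline bool should_split_archive(
        std::size_t dict_data_size,
        std::size_t target_dict_data_size,
        bool use_single_file_archive
) {
    return dict_data_size >= target_dict_data_size && false == use_single_file_archive;
}
```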


139-141: Prefer “false == expr” check
Per coding guidelines, replace && false == archive_writer.get_use_single_file_archive() with && (false == archive_writer.get_use_single_file_archive()).

- && false == archive_writer.get_use_single_file_archive())
+ && (false == archive_writer.get_use_single_file_archive()))

Also applies to: 169-171

components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

61-61: Boolean naming
m_use_single_file_archive is clear, but consider naming it m_is_single_file_archive or m_single_file_mode for consistent convention.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c75147 and 82b9802.

📒 Files selected for processing (7)
  • components/core/src/clp/clp/FileCompressor.cpp (3 hunks)
  • components/core/src/clp/clp/compression.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • components/core/src/clp/clp/FileCompressor.cpp
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (4)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clp/compression.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (9)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (3)

19-21: Confirm the semantic version scheme
You are using a bitwise composition of major/minor/patch. Confirm that it remains compatible with your internal versioning constraints discussed in the PR comments.


47-51: Verify no external references to field names
The FileInfo struct serializes fields as “n” and “o”. Ensure there are no external references expecting different property names.


53-73: Serialization definitions
The fields in MultiFileArchiveMetadata are properly included in MSGPACK_DEFINE_MAP. Everything looks coherent and consistent with the usage.

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)

100-112: Potential concurrency considerations
update_offset aggregates sizes by reading the filesystem. If future concurrency is introduced, ensure atomic or thread-safe operations around offset calculations.


117-147: Good usage of sentinel
The sentinel FileInfo{"", offset} is a handy approach for communicating total file size.
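
The sentinel idea can be illustrated with a simplified offset table. In this sketch, `FileInfo` is a stand-in with only a name and an offset, and `build_file_infos` is a hypothetical helper that takes file sizes directly instead of reading them from the filesystem.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the PR's FileInfo: a file name and its start offset.
struct FileInfo {
    std::string name;
    std::uint64_t offset;
};

// Builds the offset table for files laid out back to back. The final entry has
// an empty name and holds the total size, i.e. the offset just past the last file.
std::vector<FileInfo> build_file_infos(
        std::vector<std::pair<std::string, std::uint64_t>> const& names_and_sizes
) {
    std::vector<FileInfo> file_infos;
    std::uint64_t offset = 0;
    for (auto const& entry : names_and_sizes) {
        file_infos.push_back(FileInfo{entry.first, offset});
        offset += entry.second;
    }
    file_infos.push_back(FileInfo{"", offset});  // Sentinel entry
    return file_infos;
}
```

With the sentinel in place, the size of file `i` is `file_infos[i + 1].offset - file_infos[i].offset`, with no special case for the last file.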


149-177: Ensure consistent metadata
Data from ArchiveMetadata is mirrored here. If new metadata fields are added in the future, they must be carefully synchronized.


197-213: Handle partial read cases
In a multi-threaded or partial read scenario, validate that your loop handles truncated reads or concurrency issues from external processes.

components/core/src/clp/clp/compression.cpp (1)

110-110: Check code style
The line updates the user config with archive_user_config.use_single_file_archive = command_line_args.get_use_single_file_archive();. Everything looks consistent.

components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

248-250: Single-File archiving
Calling create_single_file_archive() in the close() method is logical. Ensure that any exceptions raised do not leave the archive in an inconsistent state.

Comment on lines 662 to 685
void Archive::create_single_file_archive() {
    std::filesystem::path multi_file_archive_path = m_path;

    auto segment_ids
            = clp::streaming_archive::single_file_archive::get_segment_ids(m_next_segment_id - 1);

    if (false == m_local_metadata.has_value()) {
        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
    }

    auto& multi_file_archive_metadata = m_local_metadata.value();
    auto packed_metadata
            = clp::streaming_archive::single_file_archive::create_single_file_archive_metadata(
                    multi_file_archive_metadata,
                    multi_file_archive_path,
                    segment_ids
            );

    clp::streaming_archive::single_file_archive::write_single_file_archive(
            multi_file_archive_path,
            packed_metadata,
            segment_ids
    );
}
Contributor

⚠️ Potential issue

Consider cleanup ordering
You remove the multi-file archive only after writing the single-file archive. Double-check that partial failures mid-operation do not remove the source prematurely.

@davemarco davemarco closed this Dec 29, 2024
@davemarco davemarco reopened this Dec 29, 2024
@davemarco
Contributor Author

sorry closing was by accident...

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)

19-21: Consider documenting the rationale for the block size value.

Adding a comment explaining why 4096 was chosen as the block size would help future maintainers understand if this value needs adjustment.

 namespace {
-constexpr size_t cReadBlockSize = 4096;
+// 4096 bytes is a common page size and provides good balance between memory usage and I/O performance
+constexpr size_t cReadBlockSize = 4096;

195-211: Enhance error handling with more context in error messages.

The error handling could be improved by including more context when throwing exceptions.

         }
         if (ErrorCode_Success != error_code) {
-            throw OperationFailed(error_code, __FILENAME__, __LINE__);
+            throw OperationFailed(
+                error_code,
+                __FILENAME__,
+                __LINE__,
+                fmt::format("Failed to read from file '{}': {}", 
+                    file_path.string(),
+                    get_error_message(error_code))
+            );
         }

265-268: Improve error messages for file operations.

The error handling should include more context about the operation that failed.

     if (std::filesystem::exists(single_file_archive_path)) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Single-file archive '{}' already exists",
+                single_file_archive_path.string())
+        );
     }

     // ... (other code remains the same)

     try {
         std::filesystem::remove_all(multi_file_archive_path);
     } catch (std::filesystem::filesystem_error& e) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Failed to clean up multi-file archive '{}': {}",
+                multi_file_archive_path.string(),
+                e.what())
+        );
     }

Also applies to: 279-284


255-284: Consider architectural improvements for better operability.

Two suggestions for improving the implementation:

  1. Add logging for successful operations to aid in debugging and monitoring.
  2. Consider making the cleanup of the multi-file archive configurable, as there might be cases where keeping both versions is desired (e.g., for verification or backup purposes).

Example implementation:

void write_single_file_archive(
        std::filesystem::path const& multi_file_archive_path,
        std::stringstream const& packed_metadata,
        std::vector<std::string> const& segment_ids,
        bool cleanup_multi_file_archive = true  // New parameter
) {
    SPDLOG_INFO("Creating single-file archive from '{}'", 
        multi_file_archive_path.string());
    
    // ... (existing implementation) ...

    archive_writer.close();
    SPDLOG_INFO("Successfully created single-file archive '{}'", 
        single_file_archive_path.string());

    if (cleanup_multi_file_archive) {
        SPDLOG_INFO("Cleaning up multi-file archive '{}'", 
            multi_file_archive_path.string());
        try {
            std::filesystem::remove_all(multi_file_archive_path);
            SPDLOG_INFO("Successfully cleaned up multi-file archive");
        } catch (std::filesystem::filesystem_error& e) {
            // ... (existing error handling) ...
        }
    }
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82b9802 and 393049b.

📒 Files selected for processing (1)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
🔇 Additional comments (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

1-18: LGTM! Well-organized includes and namespace declaration.

The includes are properly grouped and all necessary headers are present.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (5)

101-113: Enhance error handling with more specific information.

The error handling could be improved by including the file path in the error message and preserving the original error code.

Consider this enhancement:

     } catch (std::filesystem::filesystem_error const& e) {
         throw OperationFailed(
                 ErrorCode_Failure,
                 __FILENAME__,
                 __LINE__,
-                fmt::format("Failed to get file size: {}", e.what())
+                fmt::format("Failed to get file size for '{}': {} (error code: {})",
+                          file_path.string(), e.what(), e.code().value())
         );
     }

138-145: Enhance warning message with size information.

The warning message could be more helpful by including the actual archive size for comparison.

Consider this enhancement:

     if (offset > cFileSizeWarningThreshold) {
         SPDLOG_WARN(
-                "Single file archive size exceeded {}. "
+                "Single file archive size ({}) exceeded threshold ({}). "
                 "The single-file archive format is not intended for large archives, "
                 " consider using multi-file archive format instead.",
+                offset,
                 cFileSizeWarningThreshold
         );
     }

196-212: Consider adding progress reporting for large file operations.

For better observability during large file operations, consider adding progress reporting. Also, the error handling could be more specific.

Consider these enhancements:

 auto write_archive_file(std::filesystem::path const& file_path, FileWriter& archive_writer)
         -> void {
     FileReader reader(file_path.string());
+    auto total_size = std::filesystem::file_size(file_path);
+    uint64_t bytes_processed = 0;
     std::array<char, cReadBlockSize> read_buffer{};
     while (true) {
         size_t num_bytes_read{};
         ErrorCode const error_code
                 = reader.try_read(read_buffer.data(), cReadBlockSize, num_bytes_read);
         if (ErrorCode_EndOfFile == error_code) {
             break;
         }
         if (ErrorCode_Success != error_code) {
-            throw OperationFailed(error_code, __FILENAME__, __LINE__);
+            throw OperationFailed(
+                error_code,
+                __FILENAME__,
+                __LINE__,
+                fmt::format("Failed to read from file: {}", file_path.string())
+            );
         }
         archive_writer.write(read_buffer.data(), num_bytes_read);
+        bytes_processed += num_bytes_read;
+        if (total_size > cFileSizeWarningThreshold) {
+            SPDLOG_DEBUG(
+                "Processing file {}: {:.1f}% ({}/{} bytes)",
+                file_path.filename().string(),
+                (bytes_processed * 100.0) / total_size,
+                bytes_processed,
+                total_size
+            );
+        }
     }
 }

266-268: Improve error handling with specific error messages.

The error handling for file existence check and archive removal could be more informative.

Consider these enhancements:

     if (std::filesystem::exists(single_file_archive_path)) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Single-file archive already exists: {}", 
+                       single_file_archive_path.string())
+        );
     }

     // ... (other code)

     try {
         std::filesystem::remove_all(multi_file_archive_path);
     } catch (std::filesystem::filesystem_error& e) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Failed to remove multi-file archive at '{}': {}",
+                       multi_file_archive_path.string(), e.what())
+        );
     }

Also applies to: 280-284


233-245: Optimize vector allocation and add range validation.

Consider pre-allocating the vector and adding a reasonable range check for segment IDs.

Consider these enhancements:

 auto get_segment_ids(segment_id_t last_segment_id) -> std::vector<std::string> {
     std::vector<std::string> segment_ids;
 
     if (last_segment_id < 0) {
         return segment_ids;
     }
 
+    if (last_segment_id > cMaxSegmentId) {
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Segment ID {} exceeds maximum allowed value {}",
+                       last_segment_id, cMaxSegmentId)
+        );
+    }
+
+    segment_ids.reserve(last_segment_id + 1);
     for (size_t i = 0; i <= last_segment_id; ++i) {
         segment_ids.emplace_back(std::to_string(i));
     }
 
     return segment_ids;
 }
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 393049b and 5428403.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🔇 Additional comments (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

1-17: LGTM! Well-organized includes and namespace setup.

The includes are properly organized into system headers, third-party libraries, and project-specific headers. The organization follows good C++ practices.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)

135-135: Document the sentinel file info entry

The empty filename sentinel is used to mark the total size of all files. Consider adding a comment explaining why an empty filename is used as a sentinel and how it should be handled by the reader.
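
To illustrate how a reader might use the sentinel, a minimal sketch follows. The FileInfo here mirrors the two fields described in this review (n for the name, o for the offset) but omits the MsgPack macro; compute_file_sizes is a hypothetical helper, and it assumes each entry's offset marks where that file begins, so the sentinel's offset equals the total size of all files.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the struct under review (no MsgPack serialization).
struct FileInfo {
    std::string n;    // File name; empty for the sentinel entry
    std::uint64_t o;  // Offset at which the file begins
};

// Hypothetical helper: derives each file's size from consecutive offsets. The last entry must be
// the sentinel (empty name), whose offset marks the end of the final file.
inline auto compute_file_sizes(std::vector<FileInfo> const& file_infos)
        -> std::vector<std::pair<std::string, std::uint64_t>> {
    if (file_infos.empty() || false == file_infos.back().n.empty()) {
        throw std::invalid_argument("Missing sentinel entry");
    }

    std::vector<std::pair<std::string, std::uint64_t>> sizes;
    sizes.reserve(file_infos.size() - 1);
    for (std::size_t i = 0; i + 1 < file_infos.size(); ++i) {
        if (file_infos[i + 1].o < file_infos[i].o) {
            throw std::invalid_argument("Offsets must be non-decreasing");
        }
        sizes.emplace_back(file_infos[i].n, file_infos[i + 1].o - file_infos[i].o);
    }
    return sizes;
}
```

Without the sentinel, the size of the last file could not be recovered from the offsets alone, which is the behaviour worth documenting next to the code.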


197-213: Enhance error handling in file reading

Consider adding more specific error information when throwing OperationFailed. The current implementation passes the error code through but gives no indication of which file failed to read.

-            throw OperationFailed(error_code, __FILENAME__, __LINE__);
+            throw OperationFailed(
+                error_code,
+                __FILENAME__,
+                __LINE__,
+                fmt::format("Failed to read from file: {}", file_path.string())
+            );

234-246: Improve type consistency and clarity in segment ID generation

The function mixes signed and unsigned types, and the negative check could be more explicit.

-auto get_segment_ids(segment_id_t last_segment_id) -> std::vector<std::string> {
+auto get_segment_ids(segment_id_t const last_segment_id) -> std::vector<std::string> {
     std::vector<std::string> segment_ids;
 
     if (last_segment_id < 0) {
+        SPDLOG_WARN("Negative segment ID provided: {}", last_segment_id);
         return segment_ids;
     }
 
-    for (size_t i = 0; i <= last_segment_id; ++i) {
+    auto const last_id = static_cast<size_t>(last_segment_id);
+    for (size_t i = 0; i <= last_id; ++i) {
         segment_ids.emplace_back(std::to_string(i));
     }

1-287: Overall implementation is well-structured and robust

The implementation demonstrates good separation of concerns, proper error handling, and clear documentation. The code is organized logically with helper functions in an anonymous namespace and a clean public interface.

Consider adding the following improvements in future iterations:

  1. Add unit tests to verify the archive format compatibility
  2. Consider implementing a dry-run mode to validate the archive structure before writing
  3. Add progress reporting for large archives
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (2)

7-7: Consider standardizing header file extensions

The codebase mixes .h and .hpp extensions. Consider standardizing to .hpp for consistency, as it's more specific to C++ headers.


47-51: Use descriptive field names in FileInfo

The FileInfo structure uses abbreviated field names (n, o) which reduce code readability.

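Consider using more descriptive names. Note that MSGPACK_DEFINE_MAP uses the member names as the serialized map keys, so this rename would also change the on-disk keys from "n" and "o" to "name" and "offset":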

 struct FileInfo {
-    std::string n;
-    uint64_t o;
-    MSGPACK_DEFINE_MAP(n, o);
+    std::string name;
+    uint64_t offset;
+    MSGPACK_DEFINE_MAP(name, offset);
 };
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 5428403 and 7e261f7.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
🔇 Additional comments (5)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

1-18: LGTM! Well-organized includes

The includes are properly organized into logical groups (standard library, third-party, and project-specific) with clear separation between groups.

components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (4)

13-21: Well-structured version encoding

The version encoding aligns with the semantic versioning approach discussed in the PR comments, using bit shifting to combine major (0), minor (1), and patch (1) versions into a single uint32_t.
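
As a rough illustration of this kind of encoding, the sketch below packs the three components into a uint32_t. The bit widths (8 bits major, 8 bits minor, 16 bits patch) and the function names are assumptions; the actual layout in Defs.hpp is not shown in this thread.

```cpp
#include <cstdint>

// Assumed layout (not confirmed by the PR): [major: 8 bits][minor: 8 bits][patch: 16 bits]
constexpr auto encode_version(std::uint8_t major, std::uint8_t minor, std::uint16_t patch)
        -> std::uint32_t {
    return (static_cast<std::uint32_t>(major) << 24) | (static_cast<std::uint32_t>(minor) << 16)
           | static_cast<std::uint32_t>(patch);
}

constexpr auto get_major(std::uint32_t version) -> std::uint8_t {
    return static_cast<std::uint8_t>(version >> 24);
}

constexpr auto get_minor(std::uint32_t version) -> std::uint8_t {
    return static_cast<std::uint8_t>((version >> 16) & 0xFFU);
}

constexpr auto get_patch(std::uint32_t version) -> std::uint16_t {
    return static_cast<std::uint16_t>(version & 0xFFFFU);
}

// Under the assumed layout, version 0.1.1 encodes to 0x00010001.
static_assert(0x00010001U == encode_version(0, 1, 1), "Unexpected encoding for 0.1.1");
```

With this layout, encoded versions compare in the same order as their (major, minor, patch) tuples, which keeps compatibility checks to a single integer comparison.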


38-45: Document the purpose of unused fields and packed attribute

The SingleFileArchiveHeader structure:

  1. Contains 6 unused fields without explanation
  2. Uses __attribute__((packed)) which might cause alignment issues on some architectures

Please document:

  • The purpose of reserving unused fields
  • Why packing is necessary
  • Any alignment requirements for different architectures
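
A sketch of how the layout could be pinned down and documented in code. The field widths below are assumptions chosen for illustration (the real SingleFileArchiveHeader is not reproduced in this thread), and __attribute__((packed)) is a GCC/Clang extension.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Illustrative layout only; the field widths are assumptions, not the PR's definition.
struct __attribute__((packed)) ExampleArchiveHeader {
    std::array<std::uint8_t, 4> magic;
    std::uint32_t version;
    std::uint64_t metadata_size;
    std::array<std::uint64_t, 6> unused;  // Reserved so fields can be added without resizing
};

// Compile-time checks document the on-disk contract and fail the build if it changes.
static_assert(64 == sizeof(ExampleArchiveHeader), "On-disk header size must not change");
static_assert(
        std::is_trivially_copyable_v<ExampleArchiveHeader>,
        "Header must be safe to copy byte-for-byte"
);

// Reads a header out of a raw byte buffer with memcpy, which avoids forming references to
// potentially misaligned members of a packed struct.
inline auto read_header(unsigned char const* bytes) -> ExampleArchiveHeader {
    ExampleArchiveHeader header{};
    std::memcpy(&header, bytes, sizeof(header));
    return header;
}
```

The static_assert on the size is what turns the packing requirement into something a future maintainer cannot silently break.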

27-27: Verify file size warning threshold

The 100MB threshold (cFileSizeWarningThreshold) seems arbitrary. Please confirm if this value aligns with the private branch implementation and document the rationale.

✅ Verification successful

Let me gather more context about the usage of this threshold.


The file size warning threshold is correctly implemented and documented

The 100MB threshold is used to warn users when they're creating single-file archives that are too large. The implementation shows clear rationale in the warning message: single-file archives are not split and can consume excessive memory during decompression. Users are advised to use the multi-file archive format instead for large archives.

The warning message in writer.cpp provides sufficient documentation:

"Single file archive size exceeded {}. "
"The single-file archive format is not intended for large archives, "
"consider using multi-file archive format instead."
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for similar thresholds in the codebase
rg -i "size.*threshold|threshold.*size" --type cpp

Length of output: 1094


Script:

#!/bin/bash
# Check the usage context of the warning threshold
rg -B 5 -A 5 "cFileSizeWarningThreshold" --type cpp

# Look for any documentation or comments about file size limits
rg -i "file.*size.*limit|size.*limit.*file" --type cpp

Length of output: 3399


53-81: Verify metadata compatibility with private branch

The PR objectives mention differences in metadata formats between branches. Please confirm:

  1. All required fields from the private branch are included
  2. Data type changes are backward compatible
  3. The order of fields in MSGPACK_DEFINE_MAP maintains compatibility
✅ Verification successful

The previous ast-grep pattern might have failed due to syntax. Let's try different approaches to find metadata structures and their msgpack definitions in the codebase. This will help us verify the compatibility concerns.


Let's gather more information about version compatibility and metadata structures from the streaming archive implementation.


Based on the gathered information, I can now provide a final response about the metadata compatibility:

Metadata structures appear compatible with existing implementation

The metadata structures in the PR maintain compatibility with the existing codebase:

  • All required fields are present and match the existing implementation in ArchiveMetadata.hpp
  • Data types are consistent with the current usage:
    • archive_format_version_t for version tracking
    • std::string for version identifiers and compression type
    • uint64_t for size fields
  • The order of fields in MSGPACK_DEFINE_MAP matches the struct field declarations, maintaining proper serialization
  • The version constants (cArchiveFormatVersion, cVariableEncodingMethodsVersion, cVariablesSchemaVersion) are properly referenced

The changes appear to be a refactoring that preserves the existing metadata format while organizing it into more structured types (MultiFileArchiveMetadata and SingleFileArchiveMetadata).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for similar metadata structures in the codebase
ast-grep --pattern 'struct $_ {
  $$$
  MSGPACK_DEFINE_MAP($$$);
}'

Length of output: 69


Script:

#!/bin/bash
# Search for metadata structures using ripgrep
rg -U "struct.*Metadata.*\{[\s\S]*?MSGPACK_DEFINE" -A 20

Length of output: 5375


Script:

#!/bin/bash
# Search for any MSGPACK_DEFINE usage
rg "MSGPACK_DEFINE" -A 5

Length of output: 3013


Script:

#!/bin/bash
# Look for any files containing 'Metadata' in their name
fd "Metadata" --type f

Length of output: 1265


Script:

#!/bin/bash
# Let's check the content of ArchiveMetadata files for any additional metadata structures
rg -U "struct.*\{[\s\S]*?\}" components/core/src/clp/streaming_archive/ArchiveMetadata.hpp

Length of output: 92


Script:

#!/bin/bash
# Check for any version-related constants or definitions
rg "version" -i components/core/src/clp/streaming_archive/

Length of output: 5135

Comment on lines 281 to 285
    try {
        std::filesystem::remove_all(multi_file_archive_path);
    } catch (std::filesystem::filesystem_error& e) {
        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
    }
Contributor

🛠️ Refactor suggestion

Improve error handling for cleanup operation

  1. The error message is not propagated when file removal fails
  2. Consider verifying if the archive was written successfully before removing the original files
     archive_writer.close();
+    // Verify the archive was written successfully
+    if (false == std::filesystem::exists(single_file_archive_path) ||
+        0 == std::filesystem::file_size(single_file_archive_path)) {
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            "Failed to write single-file archive"
+        );
+    }
     try {
         std::filesystem::remove_all(multi_file_archive_path);
     } catch (std::filesystem::filesystem_error& e) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Failed to remove original archive: {}", e.what())
+        );
     }

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)

341-343: Consider naming the combined condition for better readability.

The condition for file splitting now combines two checks in one expression; giving the result a name would make the intent clearer.

Consider this alternative structure:

-    if (get_data_size_of_dictionaries() >= m_target_data_size_of_dicts
-        && false == m_use_single_file_archive)
-    {
+    bool const should_split = get_data_size_of_dictionaries() >= m_target_data_size_of_dicts
+                              && false == m_use_single_file_archive;
+    if (should_split) {

669-671: Add error context to the metadata validation.

The error message could be more descriptive about why the metadata is missing.

-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            "Local metadata is missing when creating single-file archive"
+        );

681-685: Consider adding progress logging for large archives.

For large archives, it would be helpful to log the progress of the single-file archive creation process.

+    SPDLOG_INFO("Creating single-file archive at {}", multi_file_archive_path.string());
     clp::streaming_archive::single_file_archive::write_single_file_archive(
             multi_file_archive_path,
             packed_metadata,
             segment_ids
     );
+    SPDLOG_INFO("Successfully created single-file archive");
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)

24-24: Consider documenting the rationale for the block size value

Adding a comment explaining why 4096 was chosen as the block size would help future maintainers understand if this value needs adjustment.

-constexpr size_t cReadBlockSize = 4096;
+// 4KB block size chosen to match common page size for optimal I/O performance
+constexpr size_t cReadBlockSize = 4096;

199-215: Enhance error handling with descriptive messages

The error handling could be more informative by including the file path and specific error details in the exception message.

         }
         if (ErrorCode_Success != error_code) {
-            throw OperationFailed(error_code, __FILENAME__, __LINE__);
+            throw OperationFailed(
+                error_code,
+                __FILENAME__,
+                __LINE__,
+                fmt::format("Failed to read from file '{}': {}", 
+                    file_path.string(),
+                    get_error_message(error_code))
+            );
         }
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 7e261f7 and 0bd9b27.

📒 Files selected for processing (4)
  • components/core/CMakeLists.txt (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🔇 Additional comments (5)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

16-24: LGTM! Header includes and dependencies look good.

The addition of the single-file archive writer header and its dependencies is properly organized.


663-686: Verify cleanup order in single-file archive creation.

The function creates a single-file archive after closing the multi-file archive. This ordering needs careful consideration for partial failures.

Let's verify the cleanup handling:

#!/bin/bash
# Search for error handling patterns around archive creation
rg -A 5 "create_single_file_archive|write_single_file_archive" --type cpp

# Look for cleanup/rollback mechanisms
ast-grep --pattern 'try {
  $$$
  create_single_file_archive();
  $$$
} catch ($_) {
  $$$
}'
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

1-21: LGTM! Well-organized includes and namespace structure.

The includes are properly grouped and contain all necessary headers for the implementation.


283-287: Improve error handling for cleanup operation

This section needs better error handling as identified in a previous review.


236-248: Add validation for segment ID range

Consider adding an upper bound check for last_segment_id to prevent excessive memory allocation for very large segment IDs.

 auto get_segment_ids(segment_id_t last_segment_id) -> std::vector<std::string> {
     std::vector<std::string> segment_ids;
 
     if (last_segment_id < 0) {
         return segment_ids;
     }
 
+    // Prevent excessive memory allocation
+    if (last_segment_id > cMaxSegmentId) {
+        throw OperationFailed(
+            ErrorCode_InvalidArgument,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Segment ID {} exceeds maximum allowed value {}", 
+                last_segment_id, cMaxSegmentId)
+        );
+    }
+
     for (size_t i = 0; i <= last_segment_id; ++i) {

components/core/CMakeLists.txt (resolved)
@davemarco davemarco requested a review from haiqi96 January 6, 2025 16:50
@haiqi96
Contributor

haiqi96 commented Jan 6, 2025

High level comment: let's use auto func() -> [return type] {} for all methods. There are still a few methods using the old style signature
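
For reference, a small illustration of the two signature styles side by side. ExampleWriter and its methods are placeholders, not code from this PR.

```cpp
#include <cstddef>
#include <string>

// Placeholder class showing both signature styles.
class ExampleWriter {
public:
    // Old style: the return type precedes the function name.
    void append_old_style(std::string const& data) { m_buffer += data; }

    std::size_t get_size_old_style() const { return m_buffer.size(); }

    // Requested style: `auto` up front, return type after `->`.
    auto append(std::string const& data) -> void { m_buffer += data; }

    [[nodiscard]] auto get_size() const -> std::size_t { return m_buffer.size(); }

private:
    std::string m_buffer;
};
```

Both forms compile to the same thing; the trailing form simply keeps function names aligned at the start of the line regardless of how long the return type is.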

Contributor Author

@davemarco davemarco left a comment

Some responses


auto write_archive_header(FileWriter& archive_writer, size_t packed_metadata_size) -> void {
    SingleFileArchiveHeader header{
            .magic{},
Contributor Author

I took this from clp-s, but I think it is actually unnecessary because of the fancy std::array I am using. It looks like I can assign directly.

            .unused{}
    };

    static_assert(cUnstructuredSfaMagicNumber.size() == header.magic.size());
Contributor Author

Also, because of the fancy std::array, I don't think we need to do the size assertion. It should fail at compile time if there is a size mismatch.

auto write_single_file_archive(
        ArchiveMetadata const& multi_file_archive_metadata,
        std::filesystem::path const& multi_file_archive_path,
        segment_id_t next_segment_id
Contributor Author

I guess I missed this. last_segment_id is slightly inaccurate since it is actually an ID for a segment that does not exist (it is the next segment to be written). The actual last segment ID is next_segment_id - 1.
I think num_segments could also work. Let me know what you think.
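
To make the naming question concrete, here is a sketch of the ID generation expressed in terms of a count. get_segment_ids_from_count is a hypothetical variant, not the function in this PR; taking an unsigned num_segments sidesteps both the off-by-one between next_segment_id and the last written segment and the signed/unsigned comparison in the loop.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant: returns the IDs "0" through "num_segments - 1" as strings. With zero
// segments, the result is empty and no special-casing of negative IDs is needed.
inline auto get_segment_ids_from_count(std::size_t num_segments) -> std::vector<std::string> {
    std::vector<std::string> segment_ids;
    segment_ids.reserve(num_segments);
    for (std::size_t i = 0; i < num_segments; ++i) {
        segment_ids.emplace_back(std::to_string(i));
    }
    return segment_ids;
}
```

If segment IDs start at zero and are contiguous, next_segment_id and num_segments hold the same value, so a caller could pass either.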

Comment on lines 50 to 53
/**
 * Combines file section metadata, multi-file archive metadata, and the number of segments into
 * single-file archive metadata. Once combined, serializes the metadata into MsgPack format.
 *
Contributor Author

How about the following? I feel like the second "and" after the list makes the sentence awkward.

Combines file section metadata, multi-file archive metadata, and the number of segments into
single-file archive metadata, then serializes the metadata into MsgPack format.


write_archive_header(archive_writer, packed_metadata.str().size());
write_archive_metadata(archive_writer, packed_metadata);
write_archive_files(archive_writer, multi_file_archive_path, next_segment_id);
Contributor Author

Sure, it's possible that in the few milliseconds between when the metadata is created and when the single-file archive is written, a rogue process could corrupt the files. But at the same time, the single-file archive, or any archive for that matter, could be attacked by a rogue process after it is written, and it would be just as corrupt. Basically, what I'm saying is that this might go beyond a reasonable threat model.

That is to say, we can make this change, but it could also be overkill.

Anyway, let me know your thoughts; I will change it if you want.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

660-672: 🛠️ Refactor suggestion

Enhance error handling and validation in single-file archive creation.

Consider these improvements:

  1. Add error handling for filesystem operations
  2. Add cleanup on failure
  3. Add progress indication for large archives
  4. Add validation of the created single-file archive

Apply this diff to improve the implementation:

 void Archive::create_single_file_archive() {
     if (false == m_local_metadata.has_value()) {
         throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
     }

     auto const& multi_file_archive_metadata = m_local_metadata.value();
+    auto single_file_archive_path = m_path + ".sfa";
+
+    try {
+        SPDLOG_INFO("Creating single-file archive: {}", single_file_archive_path);

         clp::streaming_archive::single_file_archive::write_single_file_archive(
                 multi_file_archive_metadata,
-                m_path,
+                single_file_archive_path,
                 m_next_segment_id
         );
+
+        // Validate the created archive
+        if (!validate_single_file_archive(single_file_archive_path)) {
+            throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        }
+
+        // Rename to final path only after validation
+        std::filesystem::rename(single_file_archive_path, m_path);
+
+        SPDLOG_INFO("Successfully created single-file archive: {}", m_path);
+    } catch (const std::exception& e) {
+        // Clean up on failure
+        std::filesystem::remove(single_file_archive_path);
+        throw;
+    }
 }
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

156-171: Ensure pack_single_file_archive_metadata scales.

When packing metadata for large archives, consider verifying that the size of the metadata remains within reasonable limits to avoid out-of-memory scenarios.


173-182: Add a compile-time check for magic number size.

Guarantee that 'header.magic' matches 'cUnstructuredSfaMagicNumber.size()' at compile-time.

 SingleFileArchiveHeader header{
     .magic = cUnstructuredSfaMagicNumber,
     .version = cArchiveVersion,
     .metadata_size = packed_metadata_size,
     .unused{}
 };
+static_assert(
+    cUnstructuredSfaMagicNumber.size() == sizeof(header.magic),
+    "Magic number size mismatch"
+);
🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 173-173: code should be clang-formatted


189-205: Handle missing file scenario in write_archive_file().

When creating FileReader, a missing or inaccessible file triggers an exception. Ensure that the resulting exception text or logs contain the problematic file path to aid debugging.

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 189-189: code should be clang-formatted

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 265b4e4 and 3ab46fd.

📒 Files selected for processing (4)
  • components/core/CMakeLists.txt (1 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (5 hunks)
✅ Files skipped from review due to trivial changes (1)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
📓 Learnings (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
🪛 GitHub Actions: clp-lint
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp

[error] 71-71: code should be clang-formatted

[error] 79-79: code should be clang-formatted

[error] 88-88: code should be clang-formatted

[error] 91-91: code should be clang-formatted

[error] 173-173: code should be clang-formatted

[error] 184-184: code should be clang-formatted

[error] 189-189: code should be clang-formatted

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (12)
components/core/src/clp/streaming_archive/writer/Archive.cpp (6)

16-16: LGTM!

The added includes are necessary for the new single-file archive functionality and follow the standard include ordering.

Also applies to: 18-18, 24-24


62-62: LGTM!

The member variable is properly initialized from the user configuration, following best practices for member variable initialization.


249-252: Verify cleanup ordering for single-file archive creation.

The single-file archive is created after closing the multi-file archive. Ensure that any failures during single-file archive creation don't leave the system in an inconsistent state.


338-340: LGTM!

The condition correctly prevents file splitting for single-file archives and follows the coding guidelines for boolean expressions.


665-665: Verify metadata consistency between archive types.

Based on previous discussions, ensure that the metadata format is consistent between normal archives and single-file archives, particularly regarding the GlobalMetadataDB.


76-76: Verify archive versioning scheme.

Based on the PR comments, consider adopting the semantic versioning scheme from the private branch where version is encoded as (major_version << 24) | (minor_version << 16) | (patch_version) using uint32_t.
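
As a rough illustration of the encoding described above, the sketch below packs and unpacks a version number. The function names are hypothetical, and the field widths (8 bits major, 8 bits minor, 16 bits patch) are an assumption inferred from the shift amounts rather than something the PR states.

```cpp
#include <cstdint>

// Packs a semantic version into a uint32_t as (major << 24) | (minor << 16) | patch.
// Assumed widths: 8 bits major, 8 bits minor, 16 bits patch.
constexpr auto encode_version(uint8_t major, uint8_t minor, uint16_t patch) -> uint32_t {
    return (static_cast<uint32_t>(major) << 24) | (static_cast<uint32_t>(minor) << 16)
           | static_cast<uint32_t>(patch);
}

// Extracts the individual components from an encoded version.
constexpr auto get_major(uint32_t version) -> uint8_t {
    return static_cast<uint8_t>(version >> 24);
}

constexpr auto get_minor(uint32_t version) -> uint8_t {
    return static_cast<uint8_t>((version >> 16) & 0xFF);
}

constexpr auto get_patch(uint32_t version) -> uint16_t {
    return static_cast<uint16_t>(version & 0xFFFF);
}

// Version 1.2.3 occupies the bytes 01 | 02 | 0003.
static_assert(0x0102'0003 == encode_version(1, 2, 3));
```

One consequence of this layout is that encoded versions compare in the same order as their (major, minor, patch) tuples, so a reader can check compatibility with a plain integer comparison.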

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (6)

1-22: Header and namespace declarations look good.

All necessary includes and namespace usages appear properly structured.


120-154: Validate file existence in get_file_infos().

This function will throw if certain files (e.g., static archive files) are missing. Depending on your intended design, consider checking for file existence here or handling it upstream.


184-187: write_archive_metadata appears correct.

This straightforward write of MsgPack data looks fine.

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 184-184: code should be clang-formatted


207-223: write_archive_files logic is clear.

Iterating static files and segments is straightforward. Be sure that skipping the metadata file is consistent with the rest of the codebase.


226-261: Verify success before removing original archive.

It's safer to confirm the single-file archive is completely written and valid before removing the multi-file archive.


34-118: 🛠️ Refactor suggestion

Consider adding overflow checks in get_file_size_and_update_offset().

When accumulating file sizes into 'offset', a very large file could cause 'offset' to overflow. Add a bounds check before increasing 'offset'.

 auto get_file_size_and_update_offset(std::filesystem::path const& file_path, uint64_t& offset)
         -> void {
     try {
         auto size = std::filesystem::file_size(file_path);
+        if (size > (std::numeric_limits<uint64_t>::max() - offset)) {
+            throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "File size causes overflow in offset"
+            );
+        }
         offset += size;
     } catch (std::filesystem::filesystem_error const& e) {
         ...
     }
 }

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 71-71: code should be clang-formatted

[error] 79-79: code should be clang-formatted

[error] 88-88: code should be clang-formatted

[error] 91-91: code should be clang-formatted

* @param single_file_archive_writer
* @param packed_metadata_size
*/
auto write_archive_header(FileWriter& single_file_archive_writer, size_t packed_metadata_size) -> void;
Contributor

🛠️ Refactor suggestion

Resolve clang-format pipeline failures.

Multiple lines violate the required code style. Please run clang-format or apply the suggested changes to pass the lint checks.

#!/bin/bash
clang-format -i components/core/src/clp/streaming_archive/single_file_archive/writer.cpp

Also applies to: 79-79, 88-88, 91-91, 173-173, 184-184, 189-189

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 71-71: code should be clang-formatted

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)

16-20: Run clang-format on the version constants.

The pipeline indicates a clang-format violation at line 20. Please run clang-format to ensure consistent formatting across this block.

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 20-20: code should be clang-formatted

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)

71-71: Apply clang-format to resolve pipeline failures.

The pipeline highlights multiple lines with formatting violations. Please run clang-format (or equivalent) to fix these.

Also applies to: 79-79, 82-82, 153-153, 164-164

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 71-71: code should be clang-formatted


226-226: Use stream position rather than creating a new string.

When determining metadata size, consider using tellp() to avoid extra string creation. For example:

- write_archive_header(single_file_archive_writer, packed_metadata.str().size());
+ write_archive_header(single_file_archive_writer, static_cast<size_t>(packed_metadata.tellp()));
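
A self-contained sketch of the tellp() idea follows. It assumes the packed metadata lives in a std::stringstream, that writing started at position 0, and that the put position was never moved with seekp(); get_written_size is a hypothetical helper name, not part of the PR. Under those assumptions the put position equals the number of bytes written, so no copy of the buffer is needed.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <stdexcept>

// Returns the number of bytes written to `stream` without copying its buffer.
// Assumes writes began at position 0 and seekp() was never called.
auto get_written_size(std::stringstream& stream) -> size_t {
    auto const offset = static_cast<std::streamoff>(stream.tellp());
    if (offset < 0) {
        // tellp() reports failure as position -1.
        throw std::runtime_error("tellp() failed");
    }
    return static_cast<size_t>(offset);
}
```

The failure check matters because casting a failed position of -1 straight to size_t would silently produce a huge metadata size in the header.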
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ab46fd and a12ac42.

📒 Files selected for processing (3)
  • components/core/src/clp/clp/compression.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
  • components/core/src/clp/clp/compression.cpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
🪛 GitHub Actions: clp-lint
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp

[error] 20-20: code should be clang-formatted

components/core/src/clp/clp/compression.cpp

[error] 139-139: code should be clang-formatted


[error] 168-168: code should be clang-formatted

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp

[error] 71-71: code should be clang-formatted


[error] 79-79: code should be clang-formatted


[error] 82-82: code should be clang-formatted


[error] 153-153: code should be clang-formatted


[error] 164-164: code should be clang-formatted

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (6)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (3)

46-51: Repeated feedback on short field names.

This file redefines "n" and "o" in MSGPACK_NVP for compatibility with the single-file archive specification. However, note that previous reviewers raised similar concerns.


53-58: Struct definition looks good.

The usage of SingleFileArchiveMetadata as a simple data holder, with MSGPACK-based serialization, appears cohesive.


39-44: Validate packed struct usage for portability.

Using __attribute__((packed)) may cause alignment issues on differing platforms. Consider verifying behaviour on compilers beyond GCC/Clang, especially if cross-platform support is intended.

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Verify usage of packed structs across the repository to ensure no alignment issues arise
ast-grep --pattern $'struct $_ __attribute__((packed)) {\n  $$$\n}'

Length of output: 66


Below is a new shell script that searches for all occurrences of __attribute__((packed)), both in the specific file and across the repository, to help us manually verify that no unexpected alignment issues exist on non-GCC/Clang compilers:

Please run this script to review the output. If the only usage of __attribute__((packed)) appears in the expected locations (such as in Defs.hpp) and you can confirm that the code is only intended for GCC/Clang (or is otherwise guarded for other compilers), then the portability concern may be acceptable. However, if cross-platform support is intended, consider verifying how other compilers (e.g. MSVC) handle the packed attribute and possibly introduce alternative, compiler-specific macros or tests.


🏁 Script executed:

#!/bin/bash
echo "Searching for '__attribute__((packed))' in Defs.hpp:"
rg '__attribute__\(\(packed\)\)' components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp -n

echo "---------------------------"
echo "Searching for '__attribute__((packed))' across the entire repository:"
rg '__attribute__\(\(packed\)\)' -n .

Length of output: 680


Packed Struct Verification in SingleFileArchiveHeader

  • The search confirms that the only occurrence of __attribute__((packed)) is at line 39 in Defs.hpp.
  • No other usages were found across the repository.
  • While the attribute is widely supported by GCC/Clang, ensure cross-platform builds (e.g. MSVC) are tested or consider adopting compiler-specific macros if broader portability is required.
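
If broader portability ever becomes a goal, one option is #pragma pack, which GCC, Clang, and MSVC all accept. The sketch below is only an illustration: DemoPackedHeader is a hypothetical stand-in for SingleFileArchiveHeader, and the field types (including the element type of the six-entry unused array) are assumptions. The static_asserts pin down the on-disk layout so that any compiler that lays the struct out differently fails the build instead of writing an incompatible header.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Request 1-byte packing in a way that GCC, Clang, and MSVC all understand.
#pragma pack(push, 1)
// Hypothetical stand-in for SingleFileArchiveHeader; the field types are assumed.
struct DemoPackedHeader {
    std::array<uint8_t, 4> magic;
    uint32_t version;
    uint64_t metadata_size;
    std::array<uint64_t, 6> unused;
};
#pragma pack(pop)

// Verify the layout at compile time rather than trusting the packing directive.
static_assert(64 == sizeof(DemoPackedHeader), "header contains unexpected padding");
static_assert(0 == offsetof(DemoPackedHeader, magic));
static_assert(4 == offsetof(DemoPackedHeader, version));
static_assert(8 == offsetof(DemoPackedHeader, metadata_size));
static_assert(16 == offsetof(DemoPackedHeader, unused));
// The header is written as raw bytes, so it must be trivially copyable.
static_assert(std::is_trivially_copyable_v<DemoPackedHeader>);
```

With the field order assumed here, every member already falls on its natural alignment, so the packing directive changes nothing and the assertions alone document the format.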
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

96-109: Overflow check for file offset.

There's a risk of offset overflow when large file sizes are added. A previous reviewer mentioned adding defensive checks.

components/core/src/clp/clp/compression.cpp (2)

110-110: Single-file archive user configuration.

Assigning archive_user_config.use_single_file_archive from command_line_args.single_file_archive() looks correct. No issues found here.


139-140: Check if splitting logic conflicts with single-file archive.

If the single-file archive flag is set, confirm whether splitting archives is still desired. If not, guard these conditions with a check on archive_user_config.use_single_file_archive. Also, clang-format is flagged for these lines.

Also applies to: 168-169

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 139-139: code should be clang-formatted

@davemarco davemarco changed the title from "feat(clp): Add the write path for single-file archives." to "feat(clp-core): Add the write path for single-file archives." on Feb 9, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (3)

29-39: Add documentation for the unused array.

While the static file names are clear, please add a comment explaining the purpose of cNumUnused and why it's set to 6. This will help future maintainers understand if this space is reserved for future use.


41-46: Add documentation for the SingleFileArchiveHeader struct.

Please add documentation explaining:

  • Why the struct is packed
  • The purpose of each field
  • The relationship between this header and the file format

55-60: Add documentation for the SingleFileArchiveMetadata struct.

Please add documentation explaining:

  • The purpose of this struct in the context of single-file archives
  • The relationship between archive_files, archive_metadata, and num_segments
  • Any invariants that must be maintained
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

133-133: Use a clear sentinel marker.
Storing a file info entry with an empty name can be confusing. Consider describing its purpose in a comment or using an explicit sentinel key (e.g. "_TOTAL_SIZE") to clarify intent.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a12ac42 and 4c26177.

📒 Files selected for processing (5)
  • components/core/src/clp/clp/compression.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
  • tools/yscope-dev-utils (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tools/yscope-dev-utils
🚧 Files skipped from review as they are similar to previous changes (2)
  • components/core/src/clp/clp/compression.cpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build (macos-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (6)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (4)

1-11: LGTM! Header guards and includes are well-structured.

The header guards follow the correct naming convention, and the includes are organized logically.


14-22: LGTM! Versioning scheme aligns with semantic versioning requirements.

The version encoding using bitwise operations (major << 24 | minor << 16 | patch) matches the versioning scheme used in the private branch, ensuring compatibility.


24-27: LGTM! Magic number and file extension constants are well-defined.

The magic number 'YCLP' and file extension '.clp' are appropriately defined as constexpr values.


48-53: LGTM! FileInfo struct follows the specification.

The struct appropriately uses MSGPACK_DEFINE_MAP to rename variables during serialization to match the single-file archive specification. The comment explains the renaming convention.

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)

102-102: Prevent potential overflow in offset calculation.
If the single-file archive contains very large or numerous files, "offset += size" can overflow a 64-bit integer.

Apply the patch from previous comments to perform an overflow check:

     auto size = std::filesystem::file_size(file_path);
+    if (size > std::numeric_limits<uint64_t>::max() - offset) {
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            "File size would cause offset overflow"
+        );
+    }
     offset += size;

235-238: Propagate the original exception message when file removal fails.
Re-throwing OperationFailed without "e.what()" obscures the root cause. Including the detailed error message helps diagnose issues.

Example improvement:

} catch (std::filesystem::filesystem_error& e) {
-    throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    throw OperationFailed(
+        ErrorCode_Failure,
+        __FILENAME__,
+        __LINE__,
+        fmt::format("Failed to remove original archive: {}", e.what())
+    );
}

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)

80-80: Use clang-format to fix style.
The pipeline flagged this line for formatting issues. Please run clang-format or manually align parameters to pass the lint checks.

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 80-80: code should be clang-formatted


125-132: Reserve vector capacity in advance to improve performance.
You already know how many file records you will push, so calling reserve(...) can reduce reallocation overhead.

For instance:

files.reserve(cNumStaticFiles + next_segment_id + 1);

141-142: Clarify intent of sentinel entry.
An entry with an empty name storing the total size may be confusing to readers. Adding a short comment would improve clarity.

Example:

- files.emplace_back(FileInfo{"", offset});
+ // Sentinel entry to store the cumulative size of all files.
+ files.emplace_back(FileInfo{"", offset});
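
To show why such a sentinel is useful, here is a hedged sketch of how a reader could recover each file's size from consecutive offsets. It assumes each entry stores the file's starting offset and that the final, empty-named entry stores the total size; DemoFileInfo and compute_file_sizes are hypothetical names, not the PR's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mirror of FileInfo: a file name and the file's starting offset.
struct DemoFileInfo {
    std::string name;
    uint64_t offset;
};

// Derives each file's size as the difference between consecutive offsets.
// The last entry must be the sentinel (empty name, offset == total size).
auto compute_file_sizes(std::vector<DemoFileInfo> const& files)
        -> std::vector<std::pair<std::string, uint64_t>> {
    if (files.empty() || false == files.back().name.empty()) {
        throw std::invalid_argument("missing sentinel entry");
    }

    std::vector<std::pair<std::string, uint64_t>> sizes;
    sizes.reserve(files.size() - 1);
    for (size_t i = 0; i + 1 < files.size(); ++i) {
        if (files[i + 1].offset < files[i].offset) {
            throw std::invalid_argument("offsets must be non-decreasing");
        }
        sizes.emplace_back(files[i].name, files[i + 1].offset - files[i].offset);
    }
    return sizes;
}
```

Without the sentinel, the size of the last real file could not be computed from the offsets alone.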

176-176: Use clang-format to fix style.
The pipeline flagged this line for formatting issues. Please run clang-format or manually adjust indentation to pass the lint checks.

🧰 Tools
🪛 GitHub Actions: clp-lint

[error] 176-176: code should be clang-formatted

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c26177 and 7916bb9.

📒 Files selected for processing (3)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
🪛 GitHub Actions: clp-lint
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp

[error] 80-80: code should be clang-formatted


[error] 176-176: code should be clang-formatted

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (2)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

107-120: 🛠️ Refactor suggestion

Add offset overflow check.
Large archive files could cause the offset to exceed the range of uint64_t. Adding a check avoids silent corruption.

Example fix:

 try {
     auto size = std::filesystem::file_size(file_path);
+    if (size > std::numeric_limits<uint64_t>::max() - offset) {
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            "File size would cause offset overflow"
+        );
+    }
     offset += size;
 } catch (std::filesystem::filesystem_error const& e) {

Likely invalid or redundant comment.

components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)

16-22: Semantic versioning approach looks consistent.
Shifting and encoding major, minor, and patch into a uint32_t neatly aligns with semantic versioning.

Comment on lines 249 to 251
try {
std::filesystem::remove_all(multi_file_archive_path);
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
Contributor

⚠️ Potential issue

Remove or justify forced failure.
Throwing an error unconditionally after removing the original archive forces every operation to fail. If this is intentional, explain the rationale. Otherwise, remove the throw statement.

Example fix:

try {
    std::filesystem::remove_all(multi_file_archive_path);
-    throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
} catch (std::filesystem::filesystem_error& e) {
    throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}

Contributor Author

@davemarco davemarco left a comment

More responses

components/core/src/clp/clp/FileCompressor.cpp (outdated, resolved)
try {
std::filesystem::remove_all(multi_file_archive_path);
} catch (std::filesystem::filesystem_error& e) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
Contributor Author

[Image: Screenshot 2025-02-09 at 4 56 09 PM]

try {
std::filesystem::remove_all(multi_file_archive_path);
} catch (std::filesystem::filesystem_error& e) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
Contributor Author

The exception propagates to the top, which is good. If you want, I can change the message from the generic single_file_archive operation-failed message to something custom about the multi-file archive not being deleted.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1)

35-37: Fix the docstring typo.

The docstring contains a redundant word "removes deletes".

-* Writes a single-file archive then removes deletes the multi-file archive.
+* Writes a single-file archive then deletes the multi-file archive.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

141-142: Revisit sentinel entry usage.

Storing a sentinel FileInfo with an empty string and a total offset is functional but might be unclear to future maintainers. Consider adding documentation clarifying its use or changing the entry name to something like "<offset_sentinel>" for improved discoverability.


151-160: Consolidate metadata logic if possible.

The SingleFileArchiveMetadata struct largely reflects the metadata from the multi-file archive but adds segment information. If you anticipate more fields in the future, consider creating a unifying interface or method that merges them in one step to maintain clarity. This can reduce duplication and keep the data model consistent.


182-198: Use a larger buffer for file copying to improve performance.

While a 4 KB buffer is acceptable for common cases, consider using a larger read buffer for writing big files. This can reduce the number of I/O operations and potentially speed up the transfer of data. For instance, 64 KB or 128 KB might be reasonable defaults, depending on your system constraints.
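
A minimal sketch of a chunked copy with a configurable buffer size is shown below. It swaps in standard iostreams for the PR's FileReader and FileWriter classes, whose interfaces are not shown here, and cDemoReadBufSize, copy_stream, and demo_copy are hypothetical names. Whether 64 KiB actually beats 4 KiB depends on the filesystem and workload, so treat the default as something to benchmark rather than a recommendation.

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Assumed default; the review suggests 64 KB or 128 KB as candidates.
constexpr size_t cDemoReadBufSize = 64 * 1024;

// Copies all remaining bytes from `in` to `out` in chunks of `buf_size` bytes.
// Returns the total number of bytes copied.
auto copy_stream(std::istream& in, std::ostream& out, size_t buf_size = cDemoReadBufSize)
        -> size_t {
    std::vector<char> buf(buf_size);
    size_t total{0};
    while (true) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        // gcount() reports how many bytes the last read produced, even on a short read at EOF.
        auto const num_read = in.gcount();
        if (num_read <= 0) {
            break;
        }
        out.write(buf.data(), num_read);
        total += static_cast<size_t>(num_read);
    }
    return total;
}

// Usage sketch: round-trips an in-memory payload through copy_stream.
auto demo_copy(std::string const& payload) -> std::string {
    std::istringstream in{payload};
    std::ostringstream out;
    copy_stream(in, out);
    return out.str();
}
```

Allocating the buffer on the heap, as the vector does here, avoids placing a large array on the stack when the buffer size grows.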

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7916bb9 and 1b7d73e.

📒 Files selected for processing (4)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (4 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.hpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/writer/Archive.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
📓 Learnings (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Learnt from: davemarco
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/single_file_archive/writer.cpp:87-88
Timestamp: 2025-02-09T21:58:56.178Z
Learning: In C++, function signature mismatches between declaration and definition within an anonymous namespace are internal to the translation unit and don't cause compilation errors, as they are not part of the public API.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (7)
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (2)

16-33: LGTM!

The exception class is well-designed with proper inheritance, constructor parameters, and method overrides.


42-45: LGTM!

The function signature is clear and well-designed.

components/core/src/clp/streaming_archive/writer/Archive.cpp (3)

62-62: LGTM!

The member variable is correctly initialized in the constructor, as per the learnings.


249-251: LGTM!

The single-file archive creation is correctly placed after all metadata is persisted and resources are cleaned up.


338-340: LGTM!

The condition correctly prevents file splitting when using single-file archive and follows the coding guidelines for boolean expressions.

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)

128-131: Verify existence of static archive files.

You assume that each filename in cStaticArchiveFileNames is present within the multi-file archive path. If any file is missing, get_file_size_and_update_offset will throw an exception. Consider confirming that each static file exists to produce a more precise error message or to handle missing files gracefully.


107-120: 🛠️ Refactor suggestion

Prevent potential offset overflow.

While adding the file size to "offset", there is a risk that "offset + size" could overflow if the single-file archive is very large. Consider including a bounds check before adding:

     auto size = std::filesystem::file_size(file_path);
+    if (size > std::numeric_limits<uint64_t>::max() - offset) {
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            "File size would cause offset overflow"
+        );
+    }
     offset += size;

Likely invalid or redundant comment.

Contributor Author

@davemarco davemarco left a comment

@haiqi96 - You can re-review. I did three things: 1) merged main; 2) removed the multi-file archive metadata from the single-file archive metadata, so it is now written like the other static archive files (we can do this now that the multi-file archive metadata uses msgpack); 3) added back splitting.

@davemarco davemarco requested a review from haiqi96 February 9, 2025 22:10
* @param single_file_archive_writer
* @param packed_metadata Packed metadata.
*/
auto write_archive_metadata(
Contributor

I might have asked this before and forgotten the answer:

This function is only used once, and it is literally a single-line call with a getter. Maybe we don't really need this wrapper unless we plan to use it in other places in the future.

But I guess it's also OK to keep this method so the code is clear to the reader. Consider renaming it to write_packed_archive_metadata?

Contributor Author

I renamed it. Yes, it's one line, so we can remove it if you want. It does look nicer here (screenshot) though. Let me know which you prefer and I'll do it.

Contributor Author

Screenshot 2025-02-10 at 4 40 10 PM

Contributor

Yeah, let's keep this method. I do feel it looks nice.

segment_id_t next_segment_id
) -> void {
FileWriter single_file_archive_writer;
std::filesystem::path single_file_archive_path
Contributor

ditto

try {
std::filesystem::remove_all(multi_file_archive_path);
} catch (std::filesystem::filesystem_error& e) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
Contributor

@kirkrodrigues I wonder if we should strictly throw an exception here, or just report a warning.
Technically, the single-file archive is created correctly; we just fail to remove the multi-file archive.

Contributor Author

@davemarco davemarco left a comment

haiqi review responses


for (auto const& static_archive_file_name : cStaticArchiveFileNames) {
files.emplace_back(FileInfo{std::string(static_archive_file_name), offset});
get_file_size_and_update_offset(multi_file_archive_path / static_archive_file_name, offset);
Contributor Author

How do you feel about removing the function completely? I don't think we have an existing function in utils that calls stat on a closed file.

Something like

offset += std::filesystem::file_size(multi_file_archive_path / static_archive_file_name);

I think I originally created the function, so I could rethrow the filesystem exception as our own exception. But on second thought, I think it's fine to just throw the filesystem exception, and document it.

try {
std::filesystem::remove_all(multi_file_archive_path);
} catch (std::filesystem::filesystem_error& e) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
Contributor Author

Changed to SPDLOG_WARN
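
A minimal sketch of the warn-instead-of-throw behaviour, using the non-throwing `std::error_code` overload of `std::filesystem::remove_all`. To stay dependency-free it writes to `std::cerr` as a stand-in for SPDLOG_WARN, and `try_remove_dir` is a hypothetical name, so this is an illustration rather than the committed change.

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>

// Tries to delete `dir` recursively. On failure, prints a warning and returns
// false instead of throwing, since the caller's main output is already complete.
auto try_remove_dir(std::filesystem::path const& dir) -> bool {
    std::error_code ec;
    std::filesystem::remove_all(dir, ec);
    if (ec) {
        // Stand-in for a logging call such as SPDLOG_WARN.
        std::cerr << "Warning: failed to delete " << dir << ": " << ec.message() << '\n';
        return false;
    }
    return true;
}
```

Note that removing a path that does not exist is not reported as an error by this overload, so the helper returns true in that case.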

@davemarco davemarco requested a review from haiqi96 February 10, 2025 22:30
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)

216-227: 🛠️ Refactor suggestion

Improve error handling for file existence check.

Add context to the error message:

     if (std::filesystem::exists(single_file_archive_path)) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                fmt::format("Single-file archive already exists at: {}", 
+                          single_file_archive_path.string())
+        );
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

auto
write_single_file_archive(std::filesystem::path const& multi_file_archive_path, size_t num_segments)
        -> void {
    FileWriter single_file_archive_writer;
    std::filesystem::path single_file_archive_path{
            multi_file_archive_path.string() + std::string(single_file_archive::cUnstructuredSfaExtension)
    };

    if (std::filesystem::exists(single_file_archive_path)) {
        throw OperationFailed(
                ErrorCode_Failure,
                __FILENAME__,
                __LINE__,
                fmt::format("Single-file archive already exists at: {}", single_file_archive_path.string())
        );
    }

    // ... rest of the function implementation
}
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1)

35-44: Enhance function documentation.

The docstring should provide more details about:

  • The format of the single-file archive being created
  • The meaning and valid range of num_segments
  • What happens to the multi-file archive after conversion
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)

173-178: Optimize string handling.

Cache the string to avoid multiple str() calls:

 auto write_packed_archive_metadata(
         FileWriter& single_file_archive_writer,
         std::stringstream const& packed_metadata
 ) -> void {
-    single_file_archive_writer.write(packed_metadata.str().data(), packed_metadata.str().size());
+    auto const metadata_str = packed_metadata.str();
+    single_file_archive_writer.write(metadata_str.data(), metadata_str.size());
 }

180-196: Improve error handling with descriptive messages.

Add context to error messages:

         }
         if (ErrorCode_Success != error_code) {
-            throw OperationFailed(error_code, __FILENAME__, __LINE__);
+            throw OperationFailed(
+                    error_code,
+                    __FILENAME__,
+                    __LINE__,
+                    fmt::format("Failed to read from file: {}", file_path.string())
+            );
         }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b7d73e and 05377b8.

📒 Files selected for processing (4)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
🧠 Learnings (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Learnt from: davemarco
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/single_file_archive/writer.cpp:87-88
Timestamp: 2025-02-09T21:58:56.178Z
Learning: In C++, function signature mismatches between declaration and definition within an anonymous namespace are internal to the translation unit and don't cause compilation errors, as they are not part of the public API.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (6)
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1)

16-33: LGTM! Well-designed exception class.

The class follows good error handling practices with proper inheritance, move semantics, and const correctness.

components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

121-144: LGTM! Well-structured file info collection.

The function properly handles both static files and segments, with clean path handling using std::filesystem.


146-159: LGTM! Clean metadata packing implementation.

The function is concise and uses modern C++ features effectively.


198-213: LGTM! Clean implementation of archive file writing.

The function properly handles both static and segment files with good path handling.

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

60-60: LGTM! Clean initialization of single-file archive flag.

The change properly initializes the flag from user configuration.


247-252: LGTM! Clean integration of single-file archive creation.

The change properly handles the creation of single-file archives when requested.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

83-91: Enhance error documentation.

The docstring should specify what types of errors can cause OperationFailed to be thrown (e.g., file not found, permission denied, etc.).


107-119: Add specific error code for file size failures.

Instead of using a generic ErrorCode_Failure, consider adding a specific error code for file size retrieval failures to help with debugging.


225-227: Enhance error message for existing archive.

Add more context to the error message when the single-file archive already exists:

-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Single-file archive already exists at: {}", 
+                       single_file_archive_path.string())
+        );
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05377b8 and e0fcefe.

📒 Files selected for processing (1)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
🧠 Learnings (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)
Learnt from: davemarco
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/single_file_archive/writer.cpp:161-171
Timestamp: 2025-02-10T22:36:08.496Z
Learning: When using std::array, size checking between arrays is enforced at compile-time through the type system, making additional static_assert size checks redundant.
Learnt from: davemarco
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/single_file_archive/writer.cpp:87-88
Timestamp: 2025-02-09T21:58:56.178Z
Learning: In C++, function signature mismatches between declaration and definition within an anonymous namespace are internal to the translation unit and don't cause compilation errors, as they are not part of the public API.
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)

1-21: LGTM! Well-organized includes.

The includes are logically grouped and contain all necessary headers for the implementation.


22-26: LGTM! Appropriate namespace structure and constant definition.

The code is well-organized with proper namespace scoping, and the read block size is set to a reasonable value for buffered I/O operations.


181-196: Add buffer overflow protection.

Consider adding a check to ensure that num_bytes_read doesn't exceed cReadBlockSize before writing to the archive.

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Search for similar buffer checks in the codebase
rg "num_bytes_read.*cReadBlockSize" -A 2 -B 2

Length of output: 47


I'll generate an additional shell script to search for the implementation of try_read and verify whether it guarantees that num_bytes_read does not exceed cReadBlockSize. This will help us decide if additional buffer overflow protection is needed.


🏁 Script executed:

#!/bin/bash
# Search for the definition/implementation of try_read to check its safety checks
rg "try_read\(" -A 5 -B 5

Length of output: 71207


Buffer overflow protection check is not necessary.

Our investigation shows that the contract for try_read guarantees that it will not read more than cReadBlockSize bytes. The underlying implementations (e.g. using fread or bounded readers) ensure that num_bytes_read is always less than or equal to the requested number of bytes. No additional check is required in write_archive_file.

Likely an incorrect or invalid review comment.

Contributor

@haiqi96 haiqi96 left a comment

Replied to a few comments.


for (auto const& static_archive_file_name : cStaticArchiveFileNames) {
files.emplace_back(FileInfo{std::string(static_archive_file_name), offset});
get_file_size_and_update_offset(multi_file_archive_path / static_archive_file_name, offset);
Contributor

I feel it is OK. I remember Zhihao gave a talk about error handling, but I wasn't paying close enough attention.

@LinZhihao-723, do you think we can directly throw the exception from std::filesystem?

Contributor Author

@davemarco davemarco left a comment

haiqi review


for (auto const& static_archive_file_name : cStaticArchiveFileNames) {
files.emplace_back(FileInfo{std::string(static_archive_file_name), offset});
get_file_size_and_update_offset(multi_file_archive_path / static_archive_file_name, offset);
Contributor Author

I made the change for now.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (5)

27-96: Enhance error documentation in function declarations.

While the documentation is generally good, consider documenting all possible error conditions for each function. For example:

  • write_archive_header should document potential write failures
  • write_packed_archive_metadata should document potential write failures
  • write_archive_files should document potential file access or write failures

97-120: Add overflow protection for offset calculations.

The offset accumulation could potentially overflow uint64_t. Consider adding bounds checking:

-        offset += std::filesystem::file_size(multi_file_archive_path / static_archive_file_name);
+        auto const size = std::filesystem::file_size(multi_file_archive_path / static_archive_file_name);
+        if (size > std::numeric_limits<uint64_t>::max() - offset) {
+            throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "File size would cause offset overflow"
+            );
+        }
+        offset += size;

137-147: Consider using safer serialization for header.

The current implementation using reinterpret_cast could be unsafe if struct padding changes. Consider using a serialization method that explicitly handles each field:

-    single_file_archive_writer.write(reinterpret_cast<char const*>(&header), sizeof(header));
+    single_file_archive_writer.write(reinterpret_cast<char const*>(&header.magic), sizeof(header.magic));
+    single_file_archive_writer.write(reinterpret_cast<char const*>(&header.version), sizeof(header.version));
+    single_file_archive_writer.write(reinterpret_cast<char const*>(&header.metadata_size), sizeof(header.metadata_size));
+    single_file_archive_writer.write(reinterpret_cast<char const*>(&header.unused), sizeof(header.unused));
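
An alternative to field-by-field writes is to keep the single write but guard the layout at compile time, so a padding change fails the build. The sketch below is only illustrative: the field names follow the suggestion above, while the field types, sizes and the `serialize_header` helper are assumptions, not the PR's actual header definition.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical header layout (types and sizes are assumptions).
struct Header {
    std::array<uint8_t, 4> magic;
    uint32_t version;
    uint64_t metadata_size;
    std::array<uint64_t, 6> unused;
};

// Fails to compile if the compiler inserts padding or the struct stops being
// safe to copy byte-for-byte.
static_assert(std::is_trivially_copyable_v<Header>);
static_assert(std::has_unique_object_representations_v<Header>);
static_assert(sizeof(Header) == 4 + 4 + 8 + 6 * 8, "Header must not contain padding");

// Copies the header's object representation into a byte buffer.
auto serialize_header(Header const& header) -> std::array<char, sizeof(Header)> {
    std::array<char, sizeof(Header)> bytes{};
    std::memcpy(bytes.data(), &header, sizeof(Header));
    return bytes;
}
```

Neither this nor the field-by-field version addresses byte order: multi-byte fields are still written in the host's endianness.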

156-172: Enhance error messages in write_archive_file.

Consider providing more context in error messages:

-            throw OperationFailed(error_code, __FILENAME__, __LINE__);
+            throw OperationFailed(
+                error_code,
+                __FILENAME__,
+                __LINE__,
+                fmt::format("Failed to read from file: {}", file_path.string())
+            );

192-223: Improve error message for existing archive.

Provide more context when the target archive already exists:

     if (std::filesystem::exists(single_file_archive_path)) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+            ErrorCode_Failure,
+            __FILENAME__,
+            __LINE__,
+            fmt::format("Single-file archive already exists at: {}", single_file_archive_path.string())
+        );
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0fcefe and d8a3c70.

📒 Files selected for processing (1)
  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
🧠 Learnings (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)
Learnt from: davemarco
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/single_file_archive/writer.cpp:161-171
Timestamp: 2025-02-10T22:36:08.496Z
Learning: When using std::array, size checking between arrays is enforced at compile-time through the type system, making additional static_assert size checks redundant.
Learnt from: davemarco
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/single_file_archive/writer.cpp:87-88
Timestamp: 2025-02-09T21:58:56.178Z
Learning: In C++, function signature mismatches between declaration and definition within an anonymous namespace are internal to the translation unit and don't cause compilation errors, as they are not part of the public API.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (5)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (5)

1-21: LGTM! Well-organized includes.

The includes are logically grouped and provide all necessary dependencies for the implementation.


22-26: LGTM! Appropriate namespace and block size constant.

The code is properly organized in namespaces, and the read block size of 4096 bytes is a good choice for file I/O operations.


122-135: LGTM! Clean metadata packing implementation.

The function correctly serializes the metadata using msgpack.


149-154: LGTM! Clean metadata writing implementation.

The function correctly writes the serialized metadata to the archive.


174-189: LGTM! Proper archive file writing implementation.

The function correctly handles both static archive files and segment files in the expected order.

@davemarco davemarco requested a review from haiqi96 February 11, 2025 15:33
4 participants