feat: add sysinfo metrics #1139
base: main
Conversation
Pull Request Test Coverage Report for Build 13025834726 (Details)

💛 - Coveralls
Force-pushed e67c355 to 8910431
Force-pushed 8910431 to 6f96079
Walkthrough

This set of changes introduces a comprehensive system metrics collection and reporting framework. Disk, memory, and CPU usage metrics are now gathered periodically using a scheduler and exposed via Prometheus-compatible metrics. The codebase is refactored to support these new metrics, with additional data structures and metric types added. Metric collection is integrated into server initialization routines, ensuring that system and cluster metrics are always gathered and available. Some internal APIs are streamlined, such as the removal of the …
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Scheduler
    participant MetricsModule
    participant Prometheus
    participant Disk
    participant System
    loop Every minute
        Scheduler->>MetricsModule: collect_all_metrics()
        MetricsModule->>Disk: collect_disk_metrics()
        Disk-->>MetricsModule: Disk usage stats
        MetricsModule->>System: collect_system_metrics()
        System-->>MetricsModule: Memory & CPU stats
        MetricsModule->>Prometheus: Update metrics (disk, memory, CPU)
    end
```

```mermaid
sequenceDiagram
    participant Server
    participant MetricsModule
    Server->>MetricsModule: init_system_metrics_scheduler().await
    Server->>MetricsModule: init_cluster_metrics_scheduler().await
    MetricsModule-->>Server: Scheduler tasks started
```
Actionable comments posted: 1
🧹 Nitpick comments (9)
src/parseable/streams.rs (2)
`133-155`: Consider returning a more descriptive error instead of panicking.

This block is generally correct and well-structured; however, you call `.expect("File and RecordBatch both are checked")` instead of returning a more descriptive `StagingError`. In production, this might abruptly terminate the application instead of gracefully propagating the error. As a follow-up, you could replace the `.expect(...)` with a result-based error handling approach to facilitate better error diagnosis.

```diff
- let mut writer = DiskWriter::try_new(file_path, &record.schema(), range)
-     .expect("File and RecordBatch both are checked");
+ let mut writer = DiskWriter::try_new(file_path, &record.schema(), range)
+     .map_err(|_| StagingError::Create)?;
```
`1047-1047`: Use a more descriptive `.expect(...)` in tests.

Using `.unwrap()` in a test is acceptable, but for clarity, a more descriptive message can help diagnose failures:

```diff
- staging.push("abc", &batch, time, &HashMap::new()).unwrap();
+ staging.push("abc", &batch, time, &HashMap::new())
+     .expect("Failed to push record batch to staging during test");
```

src/handlers/http/cluster/mod.rs (1)
`874-874`: Include error handling for querier metrics if necessary.

Adding `all_metrics.push(Metrics::querier_prometheus_metrics().await);` is a solid enhancement. Consider wrapping it if there's any chance of error or unavailability from `querier_prometheus_metrics()` so that the entire metrics collection doesn't silently fail if the querier is unreachable.

src/cli.rs (3)
`442-451`: Provide a user-friendly fallback or logs on invalid config.

The logic in `get_url` is concise, routing between `get_endpoint` and `build_url`. However, it uses a `panic!` in `get_endpoint` for invalid input. Consider returning a `Result<Url, SomeConfigError>` or logging more details to help users correct their configurations in production deployments.
`468-482`: Parsing by splitting on ":" alone may limit IPv6 support.

The code splits on ":" to separate the hostname from the port, which won't handle IPv6 addresses gracefully. If you foresee IPv6 usage, consider an established parser or additional checks to handle bracketed IPv6 addresses, as sketched below.
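A minimal sketch of one IPv6-tolerant alternative, assuming the `url` crate (already used for URL handling in this codebase) is acceptable here; `split_host_port` is an illustrative helper, not code from the PR:

```rust
use url::Url;

// Hypothetical helper: route "host:port" through `url::Url`, which accepts
// hostnames, IPv4, and bracketed IPv6 literals such as "[::1]:8000".
fn split_host_port(addr: &str) -> Option<(String, u16)> {
    let url = Url::parse(&format!("http://{addr}")).ok()?;
    let host = url.host_str()?.to_string(); // IPv6 hosts keep their brackets here
    let port = url.port_or_known_default()?; // 80 for the dummy http scheme if omitted
    Some((host, port))
}
```

For example, `split_host_port("[::1]:8000")` recovers the bracketed host and port 8000, which a plain `split(':')` cannot do.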
`507-516`: Fail-fast parsing with a helpful panic message is appropriate.

`build_url` clarifies errors for misconfigured addresses. This is fine as a default, but if you anticipate frequently needing dynamic reconfiguration, consider returning a `Result<Url, ConfigError>` to handle misconfigurations more gracefully at runtime.

src/metrics/prom_utils.rs (2)
`65-69`: Ensure testing of the newly introduced fields.

The newly introduced fields for disk, memory, and CPU usage in the `Metrics` struct are clear and well-defined. Ensure they're fully covered in unit tests to verify the integration logic, especially verifying default values and changes upon metric updates.

Also applies to: 98-119
`124-131`: Consider adding doc comments.

The `MetricType` enum and its `from_metric` implementation are well-structured. To aid maintainability, consider adding Rust doc comments explaining each variant's purpose and usage, as these mappings play a crucial role in the metrics pipeline.

Also applies to: 134-174
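For illustration, the doc comments could look like the sketch below; the variant names and label handling are invented for the example and need not match the real `MetricType` definition:

```rust
use std::collections::HashMap;

/// Category assigned to a scraped Prometheus sample so it can be routed to the
/// right field of `Metrics`. (Variants here are illustrative only.)
pub enum MetricType {
    /// A gauge that maps directly onto a single numeric field of `Metrics`.
    SimpleGauge(String),
    /// A disk-usage gauge such as `parseable_total_disk`; the volume
    /// (data, staging, hot-tier) is carried in the sample's labels.
    DiskUsage,
    /// A per-core CPU usage sample, keyed by the `cpu_usage` label.
    CpuUsage,
}

impl MetricType {
    /// Maps a metric name and its labels to a category; returns `None` for
    /// metrics this module does not track.
    pub fn from_metric(name: &str, _labels: &HashMap<String, String>) -> Option<Self> {
        match name {
            "parseable_total_disk" | "parseable_used_disk" | "parseable_available_disk" => {
                Some(Self::DiskUsage)
            }
            _ => None, // remaining mappings elided in this sketch
        }
    }
}
```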
src/metrics/mod.rs (1)
`195-230`: Suggest verifying naming consistency.

The newly declared disk, memory, and CPU gauges and counters are logically grouped under the same namespace. As a nitpick, ensure consistent naming conventions (e.g., "_disk" vs. "_usage") to minimize confusion in dashboards.

Also applies to: 280-294
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- src/cli.rs (1 hunks)
- src/event/mod.rs (0 hunks)
- src/handlers/http/cluster/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (2 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
- src/metrics/mod.rs (4 hunks)
- src/metrics/prom_utils.rs (4 hunks)
- src/parseable/streams.rs (2 hunks)
💤 Files with no reviewable changes (1)
- src/event/mod.rs
🧰 Additional context used
🧬 Code Definitions (8)
src/handlers/http/modal/ingest_server.rs (1)
- src/metrics/mod.rs (1): init_system_metrics_scheduler (386-406)

src/handlers/http/modal/server.rs (1)
- src/metrics/mod.rs (1): init_system_metrics_scheduler (386-406)

src/handlers/http/modal/query_server.rs (2)
- src/handlers/http/cluster/mod.rs (1): init_cluster_metrics_scheduler (878-927)
- src/metrics/mod.rs (1): init_system_metrics_scheduler (386-406)

src/parseable/streams.rs (2)
- src/utils/time.rs (2): granularity_range (267-282), new (59-61)
- src/parseable/staging/writer.rs (1): try_new (57-72)

src/cli.rs (5)
- src/option.rs (1): mode (127-135)
- src/storage/object_storage.rs (1): get_endpoint (74-74)
- src/storage/localfs.rs (1): get_endpoint (83-85)
- src/storage/azure_blob.rs (1): get_endpoint (192-194)
- src/storage/s3.rs (1): get_endpoint (322-324)

src/metrics/prom_utils.rs (1)
- src/metrics/mod.rs (2): get_system_metrics (509-542), get_volume_disk_usage (454-480)

src/handlers/http/cluster/mod.rs (2)
- src/metrics/mod.rs (1): collect_all_metrics (409-417)
- src/metrics/prom_utils.rs (2): querier_prometheus_metrics (203-252), new (79-121)

src/metrics/mod.rs (1)
- src/metrics/prom_utils.rs (1): new (79-121)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: coverage
🔇 Additional comments (21)
src/handlers/http/modal/ingest_server.rs (2)
`31-31`: LGTM: Added import for system metrics scheduler.

Added import for the new system metrics scheduling functionality that will be used in the init method.

`114-114`: Added system metrics collection to the ingest server.

This adds the ability to collect and monitor system metrics (CPU, memory, disk usage) on the ingestor node, which aligns with the PR's objective of enhancing monitoring capabilities. This metrics collection is initialized during server startup after metadata is stored.

The implementation properly follows an asynchronous pattern and handles potential errors by propagating them upward with the `?` operator, which ensures the server won't start if metrics initialization fails.

src/handlers/http/modal/server.rs (2)
`33-33`: LGTM: Added import for system metrics scheduler.

Added import for the system metrics scheduling functionality that will be used in the init method.

`138-139`: Added system metrics collection to the main server.

This adds the ability to collect and monitor system metrics (CPU, memory, disk usage) on the server node, which aligns with the PR's objective of enhancing monitoring capabilities. The metrics collection is properly initialized during server startup after analytics initialization.

The implementation follows an asynchronous pattern with proper error handling via the `?` operator, ensuring the server won't start if metrics initialization fails. It's appropriately placed after the analytics initialization but before spawning the server tasks.

src/handlers/http/ingest.rs (2)
`31-31`: Updated import to include LogSourceEntry.

Updated import for format-related types to match their usage in the code.

`127-127`: Simplified internal stream processing.

Replaced the previous schema-based event processing with a more streamlined approach using `flatten_and_push_logs`. This simplification eliminates the need for schema retrieval and additional transformation steps when ingesting internal stream data.

This approach is more consistent with how other stream ingestion is handled throughout the codebase and reduces the complexity of the `ingest_internal_stream` function. Since this is used for internal pmeta streams, which include the system metrics data being added in this PR, the simplified approach should make it easier to reliably capture the metrics.

src/handlers/http/modal/query_server.rs (3)
`22-22`: Fixed typo in cluster metrics scheduler import.

Corrected the function name from "schedular" to "scheduler" in the import statement.

`28-28`: Added import for system metrics scheduler.

Added import for the system metrics scheduling functionality that will be used in the init method.

`118-120`: Improved metrics initialization sequence.

Enhanced the query server initialization by adding system metrics collection and improving the cluster metrics initialization flow. The previous conditional approach has been replaced with a more robust sequential initialization pattern.

This implementation:
- Initializes system metrics collection (CPU, memory, disk usage) on the querier node
- Initializes cluster metrics collection from all ingestors
- Uses proper error handling with the `?` operator to ensure the server won't start if either metrics initialization fails

This change aligns perfectly with the PR's objective of adding sysinfo metrics from the querier node to both pmeta and cluster metrics API, enhancing the overall monitoring capabilities of the system.
src/handlers/http/cluster/mod.rs (3)
`42-42`: Imports look good.

Introducing `collect_all_metrics` here properly ties together the metric collection logic used later in this file.

`878-878`: Asynchronous scheduler initialization appears sound.

Converting `init_cluster_metrics_scheduler` into an async function aligns it well with asynchronous tasks. This helps ensure non-blocking operation when scheduling metrics collection.

`885-887`: Good error logging addition.

Capturing and logging errors from `collect_all_metrics()` provides better visibility. Ensure sensitive data is not included in any custom error messages returned from `collect_all_metrics()` to avoid accidental PII leakage.

src/cli.rs (2)
`453-466`: Panic-based validation for endpoints is acceptable but strict.

The new `get_endpoint` function panics if the endpoint string includes "http" or if environment variables are malformed. This is a valid fail-fast strategy, but you might consider a more robust error mechanism if there are advanced use cases (e.g., IPv6 or over-sanitized environment variables).

`484-505`: Environment variable resolution strategy looks suitable.

Resolving the `$VARNAME` pattern is convenient. The panic path is again a valid approach if environment variables are strictly required at startup. If you plan to allow optional environment variables or partial expansions, you might expand the logic accordingly.
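A self-contained sketch of the `$VARNAME` resolution idea discussed above; the function name and panic message are illustrative, not taken from the PR:

```rust
use std::env;

// Resolve a "$VARNAME" placeholder from the environment, keeping the
// fail-fast behaviour for variables that are strictly required at startup.
fn resolve_env_placeholder(value: &str) -> String {
    match value.strip_prefix('$') {
        Some(name) => env::var(name)
            .unwrap_or_else(|_| panic!("environment variable {name} is not set")),
        None => value.to_string(),
    }
}
```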
src/metrics/prom_utils.rs (3)

`19-21`: Imports look good.

No issues observed with introducing `HashMap` and `Path` here, as they align well with the new disk and memory usage collection logic.

`22-45`: No immediate concerns.

These added imports from `crate::about::current`, along with system metrics and disk usage utilities, seem necessary and correctly scoped.

`254-361`: Increase test coverage for metric processing.

The `build_metrics_from_samples`, `process_gauge_metric`, and related helper functions correctly process Prometheus samples into `Metrics`. However, there's a significant gap in test coverage. Implement targeted tests confirming each metric maps correctly (especially disk and memory usage) to ensure reliability.

Would you like me to generate a specialized test module for these functions?
src/metrics/mod.rs (4)
`21-29`: Imports and constants are fine.

No issues with the newly introduced imports and constants for scheduling intervals and HTTP integration.

Also applies to: 34-35, 39-39

`358-383`: Struct definitions are clear.

`DiskMetrics`, `SystemMetrics`, `MemoryMetrics`, and `CpuMetrics` are well-defined. Using simple numeric fields is efficient for direct usage in Prometheus metrics.

`436-451`: Robust disk and system metrics collection.

The functions for collecting disk usage, memory, and CPU metrics are well-structured and straightforward. Current usage of the `sysinfo` crate is appropriate. Just ensure you handle large disk sizes and high CPU usage gracefully when parsing into `i64` or `f64`.

Please confirm that large disk or CPU usage values are tested to avoid overflow or unexpected truncation issues.

Also applies to: 453-480, 482-506, 508-542
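One way to address the truncation concern above is a saturating conversion helper; the sketch below is illustrative only (note that Rust's own float-to-integer `as` casts have been saturating since 1.45, so a helper like this mainly makes the intent explicit and handles NaN):

```rust
/// Convert an f64 byte count to u64 without wrapping: NaN and negative inputs
/// become 0; values at or beyond u64::MAX (including +inf) saturate at the maximum.
fn clamp_to_u64(val: f64) -> u64 {
    if val.is_nan() || val <= 0.0 {
        0
    } else if val >= u64::MAX as f64 {
        u64::MAX
    } else {
        val as u64
    }
}

// e.g. clamp_to_u64(-1.0) == 0, clamp_to_u64(f64::NAN) == 0, clamp_to_u64(1e30) == u64::MAX
```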
`385-406`: ❓ Verification inconclusive

Potential concurrency consideration: `init_system_metrics_scheduler` spawns a loop sleeping every 10 seconds but uses an interval of 1 minute. Confirm that the more frequent loop wake-up won't cause concurrency or performance overhead.

Also applies to: 408-417, 419-434

🏁 Script executed:

```bash
#!/bin/bash
# We will check if there are any references to concurrency control or repeated scheduling calls
rg "init_system_metrics_scheduler" -A 10
```

Length of output: 5324

Concurrency Consideration – Verify Scheduler Wake-Up Frequency

In `src/metrics/mod.rs`, the `init_system_metrics_scheduler` function spawns an async loop that calls `scheduler.run_pending()` every 10 seconds, while the metrics collection is scheduled using `SYSTEM_METRICS_INTERVAL_SECONDS` (typically a one-minute interval). Please confirm that this more frequent wake-up does not introduce any concurrency-related overhead or performance degradation. The same pattern appears in the code blocks at lines 408–417 and 419–434, so verifying that these intervals remain lightweight under load is advisable.
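For reference, the wake-up pattern under discussion looks roughly like the sketch below, written against clokwerk's `AsyncScheduler`; the function name and the job body are simplified stand-ins, not the PR's code:

```rust
use clokwerk::{AsyncScheduler, TimeUnits};
use std::time::Duration;

// Register a once-a-minute job, then poll for due jobs every 10 seconds.
// The poll is cheap when nothing is due; the job itself only runs each minute.
fn start_metrics_scheduler() {
    let mut scheduler = AsyncScheduler::new();
    scheduler.every(1.minutes()).run(|| async {
        // the real code would call collect_all_metrics().await here
        println!("collecting system metrics");
    });

    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await; // runs any job whose interval has elapsed
            tokio::time::sleep(Duration::from_secs(10)).await;
        }
    });
}
```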
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/handlers/http/cluster/mod.rs (1)
`902-911`: Well-implemented async scheduler with proper error handling.

The scheduler function has been properly updated to be asynchronous, and error handling for system metrics collection has been added. However, there's a spelling inconsistency in the log message.

```diff
- info!("Setting up schedular for cluster metrics ingestion");
+ info!("Setting up scheduler for cluster metrics ingestion");
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/handlers/http/cluster/mod.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/handlers/http/modal/server.rs
- src/handlers/http/modal/query_server.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/handlers/http/cluster/mod.rs (2)
- src/metrics/mod.rs (1): collect_all_metrics (409-417)
- src/metrics/prom_utils.rs (2): querier_prometheus_metrics (203-252), new (79-121)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (2)
src/handlers/http/cluster/mod.rs (2)
`43-43`: Appropriate import for system metrics collection.

The import of the `collect_all_metrics` function is correctly added to support the new system metrics collection functionality.

`898-898`: Good enhancement adding querier metrics to cluster metrics.

This addition ensures that system metrics from the querier node are included along with ingestor and indexer metrics, improving the overall monitoring capabilities of the system.
Force-pushed faf3e66 to 0b2d1a7
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/metrics/prom_utils.rs (1)
`209-227`: `expect()` will bring the querier down on every I/O failure – propagate errors instead.

All three resource-collection calls (`get_system_metrics`, `get_volume_disk_usage` for staging & hot-tier) unwrap with `expect(...)`. Any transient OS error (e.g. `/proc` not readable, volume temporarily unmounted) will panic the whole node and break the metrics endpoint.

Please replace `expect` with graceful handling, e.g.:

```diff
- let system_metrics = get_system_metrics().expect("Failed to get system metrics");
+ let system_metrics = match get_system_metrics() {
+     Ok(m) => m,
+     Err(e) => {
+         warn!("failed to get system metrics: {e}");
+         return metrics; // or keep partial data & return
+     }
+ };
```

Repeat for the two `get_volume_disk_usage` calls.

This comment was previously raised on earlier commits and is still applicable.

Also applies to: 239-246
🧹 Nitpick comments (1)
src/metrics/prom_utils.rs (1)
`355-361`: Hash-map insert silently discards duplicate CPU samples.

`process_cpu_usage` overwrites any existing key with the latest sample:

```rust
metrics.parseable_cpu_usage.insert(cpu_name.to_string(), val);
```

If multiple samples for the same CPU (e.g. different modes) are scraped in one batch, earlier values are lost. Consider aggregating (e.g. max/avg) or using a compound key (`cpu` + `mode`) to retain all series, as sketched below.
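A small sketch of the compound-key option (types simplified; the `cpu` and `mode` label names are assumptions for illustration):

```rust
use std::collections::HashMap;

// Keep one entry per (cpu, mode) pair instead of overwriting per cpu, so
// samples for different modes in the same scrape batch are all retained.
fn record_cpu_sample(
    usage: &mut HashMap<(String, String), f64>,
    cpu: &str,
    mode: &str,
    val: f64,
) {
    usage.insert((cpu.to_string(), mode.to_string()), val);
}
// Aggregation (e.g. max or average per cpu) would instead fold samples into one entry.
```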
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- src/event/mod.rs (0 hunks)
- src/handlers/http/cluster/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (2 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
- src/metrics/mod.rs (4 hunks)
- src/metrics/prom_utils.rs (4 hunks)
- src/parseable/streams.rs (2 hunks)
💤 Files with no reviewable changes (1)
- src/event/mod.rs
🚧 Files skipped from review as they are similar to previous changes (7)
- src/handlers/http/modal/server.rs
- src/handlers/http/modal/ingest_server.rs
- src/handlers/http/modal/query_server.rs
- src/handlers/http/cluster/mod.rs
- src/handlers/http/ingest.rs
- src/parseable/streams.rs
- src/metrics/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (1)
src/metrics/prom_utils.rs (1)
`329-343`: Potential precision loss when casting `f64` disk gauges to `u64`.

Prometheus exports disk sizes in bytes; large volumes can exceed 2⁵³ (≈9 PiB). An `f64` can't represent every integer above that threshold exactly, so the cast `val as u64` may truncate.

If you control the exporter and values will stay < 9 PiB you're fine; otherwise, store the raw `f64` or round before casting:

```diff
- "parseable_total_disk" => disk_usage.total = val as u64,
+ "parseable_total_disk" => disk_usage.total = val.round() as u64,
```

Or keep `f64` in `DiskMetrics`.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
src/metrics/prom_utils.rs (1)
`318-358`: 🛠️ Refactor suggestion: Consider updating existing methods to handle errors gracefully.

The `get_daily_stats_from_samples` method uses multiple `.expect()` calls which could cause panics if the expected values are not present.

Consider replacing these with more graceful error handling:

```diff
- if sample.labels.get("stream").expect("stream name is present") == stream_name
-     && sample.labels.get("date").expect("date is present") == date
+ if sample.labels.get("stream").unwrap_or(&"".to_string()) == stream_name
+     && sample.labels.get("date").unwrap_or(&"".to_string()) == date
```

Apply this pattern to all similar occurrences in the method.
♻️ Duplicate comments (1)
src/metrics/prom_utils.rs (1)
`169-193`: Consider more robust error handling.

The method relies on `.expect(...)` in the called `from_about_api_response` method for retrieving metadata. While the error is mapped to a `PostError`, a failure could still impact server stability.

Consider implementing fallback or graceful degradation when metadata cannot be retrieved, rather than propagating the error upward. This would ensure metrics collection continues even when certain components fail.
🧹 Nitpick comments (2)
src/metrics/prom_utils.rs (2)
`284-298`: Check for potential precision loss in disk usage metrics.

The code casts floating-point values (`f64`) to unsigned integers (`u64`) without handling potential precision loss or negative values. Consider adding validation or using a safer conversion method:

```diff
- "parseable_total_disk" => disk_usage.total = val as u64,
- "parseable_used_disk" => disk_usage.used = val as u64,
- "parseable_available_disk" => disk_usage.available = val as u64,
+ "parseable_total_disk" => disk_usage.total = val.max(0.0) as u64,
+ "parseable_used_disk" => disk_usage.used = val.max(0.0) as u64,
+ "parseable_available_disk" => disk_usage.available = val.max(0.0) as u64,
```

This ensures negative values are handled gracefully, preventing unexpected behavior.

`300-308`: Apply the same safe conversion to memory metrics.

Similar to disk metrics, the memory usage metrics also cast floating-point values to unsigned integers without validation. Apply the same safe conversion approach:

```diff
- "total_memory" => metrics.parseable_memory_usage.total = val as u64,
- "used_memory" => metrics.parseable_memory_usage.used = val as u64,
- "total_swap" => metrics.parseable_memory_usage.total_swap = val as u64,
- "used_swap" => metrics.parseable_memory_usage.used_swap = val as u64,
+ "total_memory" => metrics.parseable_memory_usage.total = val.max(0.0) as u64,
+ "used_memory" => metrics.parseable_memory_usage.used = val.max(0.0) as u64,
+ "total_swap" => metrics.parseable_memory_usage.total_swap = val.max(0.0) as u64,
+ "used_swap" => metrics.parseable_memory_usage.used_swap = val.max(0.0) as u64,
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/handlers/http/cluster/mod.rs (3 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/metrics/prom_utils.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/handlers/http/modal/ingest_server.rs
- src/handlers/http/cluster/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (6)
src/metrics/prom_utils.rs (6)
`57-61`: Good addition of system resource metrics.

The new fields for disk, memory, and CPU usage metrics align well with the PR objectives to enhance monitoring capabilities. This structured approach will make it easier to track and visualize system resource utilization across different volumes.

`90-111`: Proper initialization of system resource metrics.

The initialization of the new system metrics fields with sensible default values is well-implemented. This ensures these metrics start in a clean state before being populated with actual values.

`116-167`: Well-designed metric type categorization.

The new `MetricType` enum provides a clean approach to categorize different types of metrics. The `from_metric` implementation effectively maps metric names and their labels to the appropriate enum variant, creating a maintainable foundation for metric processing.

`230-231`: Fixed lifetime & deleted storage size metrics.

The implementation now correctly passes the metric name to `process_storage_size`, addressing the previous issue where lifetime and deleted storage metrics were not being populated.

`262-282`: Properly implemented storage metrics processing.

The refactored `process_storage_size` method now correctly targets different storage structures based on the metric name, ensuring all metrics are properly updated.

`310-316`: Well-structured CPU usage metrics collection.

The implementation of CPU usage metrics collection is clean and effective, storing metrics in a HashMap keyed by CPU identifier.
Force-pushed e3efb1b to dde2e49
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/metrics/mod.rs (1)
`398-403`: Background task runs forever with no graceful shutdown hook.

The spawned loop never checks for a cancellation signal. In unit tests or during controlled shutdown the task will live until the process exits, potentially delaying the shutdown sequence.

Consider wiring in a shutdown channel / `tokio::select!` so the scheduler exits cleanly:

```rust
let shutdown = shutdown_signal.clone();
tokio::spawn(async move {
    loop {
        tokio::select! {
            _ = scheduler.run_pending() => {},
            _ = tokio::time::sleep(Duration::from_secs(10)) => {},
            _ = shutdown.recv() => break, // <- new graceful-exit branch
        }
    }
});
```

src/metrics/prom_utils.rs (2)
`90-111`: Leverage `Default` to avoid verbose zero-initialisation.

Manual field-by-field zeroing is error-prone and obscures intent. All of these structs already derive `Default`, so you can simplify:

```diff
- parseable_data_disk_usage: DiskMetrics {
-     total: 0,
-     used: 0,
-     available: 0,
- },
- parseable_staging_disk_usage: DiskMetrics {
-     total: 0,
-     used: 0,
-     available: 0,
- },
- parseable_hot_tier_disk_usage: DiskMetrics {
-     total: 0,
-     used: 0,
-     available: 0,
- },
- parseable_memory_usage: MemoryMetrics {
-     total: 0,
-     used: 0,
-     total_swap: 0,
-     used_swap: 0,
- },
+ parseable_data_disk_usage: DiskMetrics::default(),
+ parseable_staging_disk_usage: DiskMetrics::default(),
+ parseable_hot_tier_disk_usage: DiskMetrics::default(),
+ parseable_memory_usage: MemoryMetrics::default(),
```
`284-297`: Loss of precision when casting floating-point bytes to `u64`.

Prometheus samples arrive as `f64`; directly casting with `as u64` truncates toward zero. For byte counts this usually means dropping the fractional part, but in rare cases a negative gauge would wrap to an enormous positive integer.

Safer conversion:

```diff
- "parseable_total_disk" => disk_usage.total = val as u64,
- "parseable_used_disk" => disk_usage.used = val as u64,
- "parseable_available_disk" => disk_usage.available = val as u64,
+ "parseable_total_disk" => disk_usage.total = val.round() as u64,
+ "parseable_used_disk" => disk_usage.used = val.round() as u64,
+ "parseable_available_disk" => disk_usage.available = val.round() as u64,
```

If negative values are impossible you might also add `debug_assert!(val >= 0.0)` to surface logic errors early.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- src/event/mod.rs (0 hunks)
- src/handlers/http/cluster/mod.rs (3 hunks)
- src/handlers/http/ingest.rs (2 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
- src/metrics/mod.rs (4 hunks)
- src/metrics/prom_utils.rs (4 hunks)
- src/parseable/streams.rs (2 hunks)
💤 Files with no reviewable changes (1)
- src/event/mod.rs
✅ Files skipped from review due to trivial changes (1)
- src/handlers/http/modal/ingest_server.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- src/handlers/http/ingest.rs
- src/handlers/http/modal/server.rs
- src/handlers/http/cluster/mod.rs
- src/handlers/http/modal/query_server.rs
- src/parseable/streams.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/metrics/mod.rs (2)
- src/handlers/http/mod.rs (1): metrics_path (64-66)
- src/parseable/mod.rs (1): hot_tier_dir (230-232)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: coverage
```rust
const SYSTEM_METRICS_INTERVAL_SECONDS: Interval = clokwerk::Interval::Minutes(1);
```
`Interval::Minutes(1)` cannot be used in a `const` context – compilation will fail.

`clokwerk::Interval::Minutes` is not a `const fn`, therefore the compiler rejects the assignment to a `const`.

Either inline the interval where it is used or switch to a plain `Duration` constant:

```diff
-const SYSTEM_METRICS_INTERVAL_SECONDS: Interval = clokwerk::Interval::Minutes(1);
+// capture metrics every minute
+// (cannot be const - we construct the value at runtime)
```

and later:

```diff
- .every(SYSTEM_METRICS_INTERVAL_SECONDS)
+ .every(Interval::Minutes(1))
```

Failing to fix this will stop the crate from compiling.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
--- a/src/metrics/mod.rs
+++ b/src/metrics/mod.rs
@@ -38,7 +38,9 @@
-const SYSTEM_METRICS_INTERVAL_SECONDS: Interval = clokwerk::Interval::Minutes(1);
+// capture metrics every minute
+// (cannot be const - we construct the value at runtime)
+
 /// Sets up the scheduler to collect system metrics.
 pub fn setup_metrics_scheduler() -> Scheduler {
     let mut scheduler = Scheduler::new();
@@ -80,7 +90,7 @@
-        .every(SYSTEM_METRICS_INTERVAL_SECONDS)
+        .every(Interval::Minutes(1))
         .run(|| {
             collect_system_metrics();
         });
```
```rust
let mount_point = disk.mount_point().to_str().unwrap();

if path.starts_with(mount_point) {
    return Ok(DiskMetrics {
        total: disk.total_space(),
```
🛠️ Refactor suggestion

`unwrap()` on mount-point path may panic for non-UTF-8 filesystems.

`disk.mount_point().to_str()` can return `None`, e.g. on exotic mount points with invalid UTF-8. A single such disk would crash metric collection and poison the scheduler.

```diff
-let mount_point = disk.mount_point().to_str().unwrap();
+let Some(mount_point) = disk.mount_point().to_str() else { continue };
```

This keeps the loop robust while still collecting the remaining disks.
- collect CPU usage, memory usage of the server
- collect disk usage of the volume - data, staging, hot-tier
- add these metrics to Prometheus Metrics
- export these metrics to cluster metrics API
- add the metrics to pmeta stream
- add the querier node's sysinfo metrics to pmeta and cluster metrics API
Force-pushed a88cddc to 6d67039
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/metrics/mod.rs (2)
`39-39`: ⚠️ Potential issue

`Interval::Minutes(1)` cannot be used in a `const` context - compilation will fail.

The `clokwerk::Interval::Minutes` function is not a `const fn`, so it cannot be used in a const context. This will cause a compilation error. Change this to either:

```diff
-const SYSTEM_METRICS_INTERVAL_SECONDS: Interval = clokwerk::Interval::Minutes(1);
+// Capture metrics every minute (cannot be const - constructed at runtime)
```

And then use `Interval::Minutes(1)` directly where needed (line 391).

`465-465`: ⚠️ Potential issue

`unwrap()` on mount point path may panic for non-UTF-8 filesystems.

Using `unwrap()` on the result of `to_str()` can cause a panic if a disk has a mount point with invalid UTF-8 characters.

```diff
- let mount_point = disk.mount_point().to_str().unwrap();
+ let Some(mount_point) = disk.mount_point().to_str() else { continue };
```
🧹 Nitpick comments (6)
src/metrics/mod.rs (2)
`385-406`: Properly implemented async scheduler for metrics collection.

The scheduler implementation with error handling and a Tokio task is good. However, consider adding some telemetry to track scheduler performance. Consider adding logging when the scheduler actually runs:

```diff
  .run(move || async {
+     info!("Running scheduled system metrics collection");
      if let Err(err) = collect_all_metrics().await {
          error!("Error in capturing system metrics: {:#}", err);
      }
+     info!("System metrics collection completed");
  });
```

`508-542`: Consider handling potential sysinfo refresh failures.

The `sysinfo` refresh operation is not checked for errors, which could lead to inaccurate metrics if the refresh fails silently. While sysinfo doesn't return errors from the refresh operation, it's good practice to log when refreshing system information:

```diff
  let mut sys = System::new_all();
+ info!("Refreshing system information for metrics collection");
  sys.refresh_all();
```
src/metrics/prom_utils.rs (4)
`169-193`: Refactored to async metric processing with proper error handling.

The function now properly handles asynchronous metric collection and error cases, but relies on unwrap within `from_about_api_response`.

In the additional metadata retrieval section, consider making the error handling more robust in case of metadata retrieval failure:

```diff
- let (commit_id, staging) =
-     Self::from_about_api_response(metadata)
-         .await
-         .map_err(|err| {
-             error!("Fatal: failed to get ingestor info: {:?}", err);
-             PostError::Invalid(err.into())
-         })?;
+ let (commit_id, staging) = match Self::from_about_api_response(metadata).await {
+     Ok((commit, staging)) => (commit, staging),
+     Err(err) => {
+         error!("Failed to get ingestor info: {:?}", err);
+         // Use default values but continue
+         (String::new(), String::new())
+     }
+ };
```

This makes the metadata retrieval non-fatal to the overall metrics collection.
`284-298`: Ensure numeric type safety in disk usage processing.

Converting `f64` to `u64` without bounds checking could cause information loss for very large values. Consider adding a helper method for safe numeric conversion:

```diff
+ /// Safely converts f64 to u64, handling potential overflow
+ fn safe_f64_to_u64(val: f64) -> u64 {
+     if val < 0.0 {
+         0
+     } else if val > u64::MAX as f64 {
+         u64::MAX
+     } else {
+         val as u64
+     }
+ }

  fn process_disk_usage(metrics: &mut Metrics, volume_type: &str, val: f64, metric_name: &str) {
      // ...
      match metric_name {
-         "parseable_total_disk" => disk_usage.total = val as u64,
-         "parseable_used_disk" => disk_usage.used = val as u64,
-         "parseable_available_disk" => disk_usage.available = val as u64,
+         "parseable_total_disk" => disk_usage.total = Self::safe_f64_to_u64(val),
+         "parseable_used_disk" => disk_usage.used = Self::safe_f64_to_u64(val),
+         "parseable_available_disk" => disk_usage.available = Self::safe_f64_to_u64(val),
          _ => {}
      }
```

Apply this pattern to all `f64` to `u64` conversions in the file.
`310-316`: Robust CPU metrics handling with HashMap.

The CPU usage processing with a HashMap is a good approach to handle a variable number of CPUs. However, consider adding a null check for the CPU usage value.

Consider adding a guard against potential NaN or negative CPU usage values:

```diff
  fn process_cpu_usage(metrics: &mut Metrics, val: f64, sample: PromSample) {
      if let Some(cpu_name) = sample.labels.get("cpu_usage") {
+         // Ensure CPU usage is a valid non-negative number
+         if val.is_nan() || val.is_infinite() || val < 0.0 {
+             return;
+         }
          metrics
              .parseable_cpu_usage
              .insert(cpu_name.to_string(), val);
      }
  }
```
`319-358`: Consider updating metrics sample processing with the new modular approach.

The `get_daily_stats_from_samples` method still uses the old direct matching approach rather than the new modular pattern established for other metrics. This function still uses direct matching of metric strings and unwraps with `expect()`. Consider refactoring it to match the new modular design pattern used elsewhere in the file:

```diff
  pub fn get_daily_stats_from_samples(
      samples: Vec<PromSample>,
      stream_name: &str,
      date: &str,
  ) -> (u64, u64, u64) {
      let mut events_ingested: u64 = 0;
      let mut ingestion_size: u64 = 0;
      let mut storage_size: u64 = 0;
      for sample in samples {
          if let PromValue::Gauge(val) = sample.value {
-             match sample.metric.as_str() {
-                 "parseable_events_ingested_date" => {
-                     if sample.labels.get("stream").expect("stream name is present")
-                         == stream_name
-                         && sample.labels.get("date").expect("date is present") == date
-                     {
-                         events_ingested = val as u64;
-                     }
-                 }
+             let stream_label = sample.labels.get("stream");
+             let date_label = sample.labels.get("date");
+
+             // Only process if we have the expected labels
+             if let (Some(stream), Some(date_val)) = (stream_label, date_label) {
+                 if stream != stream_name || date_val != date {
+                     continue;
+                 }
+
+                 match sample.metric.as_str() {
+                     "parseable_events_ingested_date" => {
+                         events_ingested = Self::safe_f64_to_u64(val);
+                     }
```

This approach avoids the `expect` calls which could panic if labels are missing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- src/event/mod.rs (0 hunks)
- src/handlers/http/cluster/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (2 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
- src/metrics/mod.rs (4 hunks)
- src/metrics/prom_utils.rs (4 hunks)
- src/parseable/streams.rs (2 hunks)
💤 Files with no reviewable changes (1)
- src/event/mod.rs
✅ Files skipped from review due to trivial changes (1)
- src/handlers/http/modal/ingest_server.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- src/handlers/http/modal/server.rs
- src/handlers/http/cluster/mod.rs
- src/handlers/http/modal/query_server.rs
- src/handlers/http/ingest.rs
- src/parseable/streams.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/metrics/mod.rs (2)
- src/handlers/http/mod.rs (1): metrics_path (64-66)
- src/parseable/mod.rs (1): hot_tier_dir (230-232)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (5)
src/metrics/mod.rs (3)
`195-229`: New Prometheus metrics are well-structured and aligned with system monitoring goals.

The system metrics (disk, memory, CPU) are properly defined with appropriate labels and namespaces, following the existing pattern in the codebase.
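For readers unfamiliar with the pattern, a labeled gauge of this kind is typically declared with the `prometheus` crate roughly as follows; the metric name, help text, and namespace are illustrative, not the PR's exact definitions:

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

// Declare and register a disk-space gauge labeled by volume.
fn register_disk_gauge(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let gauge = IntGaugeVec::new(
        Opts::new("total_disk", "Total disk space in bytes").namespace("parseable"),
        &["volume"], // e.g. "data", "staging", "hot-tier"
    )?;
    registry.register(Box::new(gauge.clone()))?;
    Ok(gauge)
}

// Usage: gauge.with_label_values(&["data"]).set(total_bytes as i64);
```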
`358-383`: Well-designed data structures for system metrics.

These data structures are well-organized, clearly represent the respective system resources, and have good field naming.
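As a rough picture of their shape (field names echo those referenced elsewhere in this review; the derives, including `Serialize`, are assumptions rather than details confirmed from the PR):

```rust
use serde::Serialize;

// Plain numeric snapshots: easy to feed into Prometheus gauges and to
// serialize when the same data is pushed into the pmeta stream.
#[derive(Debug, Default, Clone, Serialize)]
pub struct DiskMetrics {
    pub total: u64,
    pub used: u64,
    pub available: u64,
}

#[derive(Debug, Default, Clone, Serialize)]
pub struct MemoryMetrics {
    pub total: u64,
    pub used: u64,
    pub total_swap: u64,
    pub used_swap: u64,
}
```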
`419-434`: Disk metrics collection handles multiple storage configurations well.

The implementation correctly handles different storage configurations, including checking for local storage mode and an optional hot tier directory.

src/metrics/prom_utils.rs (2)

`57-61`: Good implementation of system metrics fields in the Metrics struct.

The addition of disk, memory, and CPU usage fields to the `Metrics` struct properly extends the existing structure to include the new system metrics.

`116-167`: Well-designed enum for categorizing metrics.

The `MetricType` enum provides a clean abstraction for different metric types, making the code more maintainable and extensible.
- collect CPU usage, memory usage of the server
- collect disk usage of the volume - data, staging, hot-tier
- add these metrics to Prometheus Metrics
- export these metrics to cluster metrics API
- add the metrics to pmeta stream
- add the querier node's sysinfo metrics to pmeta and cluster metrics API