feat(new sink): add Apache Doris sink support #23117

Open. Wants to merge 21 commits into base: master.

bingquanzhao

Summary

This PR introduces a new Apache Doris sink for Vector, enabling users to send log data directly to Apache Doris databases using the Stream Load API. The implementation includes:

  • Complete Doris sink implementation with Stream Load API integration
  • Comprehensive configuration options (endpoints, authentication, batching, custom headers)
  • Full documentation generation using CUE
  • Health check functionality with proper error handling
  • Support for Doris-specific Stream Load parameters via custom HTTP headers

Apache Doris is a modern MPP analytical database that provides sub-second query response times on large datasets, making it ideal for real-time data warehouses and log analysis scenarios.
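
For readers unfamiliar with Stream Load: Doris reports the outcome of a load in a JSON response body whose `Status` field is one of `Success`, `Publish Timeout`, `Label Already Exists`, or `Fail`. The sketch below shows one way a client might map that field to an action. The function name and the returned action strings are hypothetical, and the mapping is an assumption for illustration, not necessarily what this PR's sink does.

```python
import json


def classify_stream_load_response(body):
    """Map a Stream Load response body to a coarse action (illustrative only)."""
    try:
        result = json.loads(body)
    except ValueError:
        # Body was not valid JSON; assume the request can be retried.
        return "retry"

    status = result.get("Status") if isinstance(result, dict) else None
    if status in ("Success", "Publish Timeout"):
        # "Publish Timeout" means the data was committed but may become
        # visible with a delay, so this sketch treats it as accepted.
        return "ok"
    if status == "Label Already Exists":
        # A load with the same label was already submitted; this sketch
        # does not resend the batch.
        return "duplicate"
    # "Fail" or anything unexpected is surfaced as a failure.
    return "failed"
```

A caller would typically log the `Message` field alongside the returned action when the result is not `"ok"`.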

Change Type

  • New feature
  • Bug fix
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Local Testing

  1. Unit Tests: All unit tests pass with cargo test
  2. Configuration Validation: Verified config parsing with vector validate
  3. Documentation Generation: Successfully generated docs with make generate-component-docs
  4. CUE Validation: All CUE files pass format and validation checks
  5. Changelog Validation: Changelog fragment passes validation with ./scripts/check_changelog_fragments.sh

Test Configuration Used

sources:
  demo:
    type: demo_logs
    format: json
    interval: 1

sinks:
  doris:
    type: doris
    inputs: ["demo"]
    
    # Target configuration
    endpoints: 
      - "http://doris-fe1:8030"
      - "http://doris-fe2:8030"
    database: "analytics_db"
    table: "user_events"
    
    # Authentication configuration
    auth:
      strategy: basic
      user: "admin"
      password: "admin123"
    
    # Batch configuration
    batch:
      max_events: 100000        # Maximum events per batch
      timeout_secs: 30          # Batch timeout in seconds
      max_bytes: 1073741824     # Maximum bytes per batch (1GB)
    
    # Custom HTTP headers for Doris Stream Load
    headers:
      format: "json"
      strip_outer_array: "false"
      read_json_by_line: "true"
    
    # Additional configuration
    label_prefix: "vector"
    log_request: true
    log_progress_interval: 10
    buffer_bound: 1
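
To make the mapping from this configuration to an HTTP request concrete, here is a minimal Python sketch of what a single Stream Load call for one batch might look like. It relies on Doris's documented `PUT /api/{db}/{table}/_stream_load` endpoint with basic authentication. The function name, the returned dictionary shape, and the label format are hypothetical and are not taken from this PR's Rust implementation. The request is only built, never sent.

```python
import base64
import json
import uuid


def build_stream_load_request(endpoint, database, table, user, password,
                              events, headers, label_prefix="vector",
                              label_suffix=None):
    """Build (but do not send) a Stream Load request for one batch."""
    # One JSON object per line, matching read_json_by_line: "true".
    body = "\n".join(
        json.dumps(event, separators=(",", ":")) for event in events
    ).encode("utf-8")

    url = f"{endpoint.rstrip('/')}/api/{database}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    # Hypothetical label format: Doris only needs it to be unique per load.
    suffix = label_suffix or uuid.uuid4().hex
    request_headers = {
        "Authorization": f"Basic {token}",
        # Doris's curl examples for Stream Load send this header.
        "Expect": "100-continue",
        "label": f"{label_prefix}_{database}_{table}_{suffix}",
    }
    # Entries from the sink's `headers` section pass through unchanged.
    request_headers.update(headers)

    return {"method": "PUT", "url": url, "headers": request_headers, "body": body}
```

With the values from the test configuration, the URL comes out as `http://doris-fe1:8030/api/analytics_db/user_events/_stream_load`, and the `format`, `strip_outer_array`, and `read_json_by_line` entries travel as plain HTTP headers.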

Environment Setup

  • Tested configuration validation against Vector's validation system
  • Verified health check functionality (attempts connection to configured endpoints)
  • All documentation generation and validation checks pass
  • CUE v0.7.0 used for documentation generation

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Notes

Implementation Details

  • Stream Load API: Uses Doris's native Stream Load API for optimal performance and compatibility
  • Authentication: Supports basic authentication with username/password
  • Batching: Configurable batching with event count, byte size, and timeout limits
  • Custom Headers: Support for Doris-specific Stream Load parameters via HTTP headers including:
    • format: Data format specification (json, csv, etc.)
    • read_json_by_line: JSON line-by-line reading mode
    • strip_outer_array: Array handling configuration
    • columns: Column mapping specification
  • Error Handling: Comprehensive error handling with configurable retry logic
  • Health Checks: Validates connectivity and basic authentication
  • Rate Limiting: Built-in rate limiting and adaptive concurrency control

Documentation

  • Added complete CUE documentation for the sink configuration
  • Generated reference documentation automatically using Vector's documentation system
  • Updated service definitions and URL references
  • All documentation validation checks pass (CI=true make check-docs)

Dependencies

  • No new external dependencies added
  • Uses existing Vector HTTP client infrastructure
  • Leverages standard Vector authentication, batching, and request frameworks
  • Follows Vector's established patterns for sink implementation

Code Quality

  • All code formatted with cargo fmt
  • Follows Vector's coding standards and patterns
  • Proper error handling and logging throughout
  • Comprehensive configuration validation

Testing Strategy

  • Configuration validation ensures all options are properly parsed
  • Health check functionality verified through connection attempts
  • Documentation generation confirms all metadata is correctly defined
  • Follows Vector's established testing patterns for sinks

References

@bingquanzhao requested review from a team as code owners on May 28, 2025 16:18
@bits-bot

bits-bot commented May 28, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions bot added labels on May 28, 2025: domain: sinks (Anything related to the Vector's sinks), domain: ci (Anything related to Vector's CI environment), domain: external docs (Anything related to Vector's external, public documentation)
@drichards-87
Contributor

Created Jira card for Docs Team review.

Contributor

@maycmlee left a comment

Some small suggestions

Labels: domain: ci (Anything related to Vector's CI environment), domain: external docs (Anything related to Vector's external, public documentation), domain: sinks (Anything related to the Vector's sinks), editorial review
4 participants