Skip to content

fix(config): prohibit multiline interpolation [WIP] #22995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

pront
Copy link
Member

@pront pront commented May 5, 2025

Summary

The envinronment variable (and the secrets) resolution now happens after the config string is parsed into a TOML table. The interpolation now happens on the table keys and values. As a side effect, this fixes a bug where comments referring to env vars (or secrets) that don't exist caused a config build error.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Injecting whole blocks now results in error e.g.

export SOURCES_BLOCK="sources:\"
  demo:
    type: demo_logs
    format: json
    interval: 1
${SOURCES_BLOCK}

The config above will fail to load.

How did you test this PR?

Vector config:

secret:
  backend_1:
    type: file
    path: /Users/pavlos.rontidis/CLionProjects/vector/pront/data/top_secret.json

# ${IN_COMMENT_BUT_DOES_NOT_EXIST}

sources:
  demo_logs_1:
    type: demo_logs
    format: json
    interval: 60
    # noop: "SECRET[i_dont_exist]"

transforms:
  t0:
    inputs:
      - demo_logs_1
    type: "remap"
    source: |
      .host = "${HOSTNAME}"
      .environment = "${ENV:?emv must be supplied}"
      .tenant = "${TENANT:-undefined}"
      .day = "SECRET[backend_1.day]"


sinks:
  s0:
    type: "SECRET[backend_1.type]"
    inputs: [ "t0" ]
    encoding:
      codec: json
      json:
        pretty: true

Output:

{
  "day": "monday",
  "environment": "${ENV:?emv must be supplied}",
  "host": "${HOSTNAME}",
  "message": "{\"host\":...",
  "service": "vector",
  "source_type": "demo_logs",
  "tenant": "${TENANT:-undefined}",
  "timestamp": "2025-05-05T20:52:42.024064Z"
}

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • The CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
      • ./scripts/check_changelog_fragments.sh
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

References

@datadog-vectordotdev
Copy link

datadog-vectordotdev bot commented May 5, 2025

Datadog Report

Branch report: pront/intepolation-after-parsing
Commit report: 7edd5ec
Test service: vector

❌ 2 Failed (0 Known Flaky), 2398 Passed, 0 Skipped, 1m 32s Total Time

❌ Failed Tests (2)

  • config::loading::tests::load_directory_globals - vector - Details

    Expand for error
     thread 'config::loading::tests::load_directory_globals' panicked at src/config/loading/mod.rs:355:43
     
     thread 'config::loading::tests::load_directory_globals' panicked at src/config/loading/mod.rs:355:43:
     called \`Result::unwrap()\` on an \`Err\` value: ["Expected array at 'proxy.no_proxy', found 'string'"]
     stack backtrace:
        0:     0x5630ce25bbaa - std::backtrace_rs::backtrace::libunwind::trace::h2740d05102fd9881
                                    at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/std/src/../../backtrace/src/backtrace/libunwind.rs:116:5
        1:     0x5630ce25bbaa - std::backtrace_rs::backtrace::trace_unsynchronized::h93ae2edf130065c8
                                    at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
        2:     0x5630ce25bbaa - std::sys::backtrace::_print_fmt::h7660a544e6110dd9
     ...
    
  • config::loading::tests::load_directory_globals_duplicates - vector - Details

    Expand for error
     thread 'config::loading::tests::load_directory_globals_duplicates' panicked at src/config/loading/mod.rs:365:43
     
     thread 'config::loading::tests::load_directory_globals_duplicates' panicked at src/config/loading/mod.rs:365:43:
     called \`Result::unwrap()\` on an \`Err\` value: ["Expected array at 'proxy.no_proxy', found 'string'", "Expected array at 'proxy.no_proxy', found 'string'"]
     stack backtrace:
        0:     0x55adc8203baa - std::backtrace_rs::backtrace::libunwind::trace::h2740d05102fd9881
                                    at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/std/src/../../backtrace/src/backtrace/libunwind.rs:116:5
        1:     0x55adc8203baa - std::backtrace_rs::backtrace::trace_unsynchronized::h93ae2edf130065c8
                                    at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
        2:     0x55adc8203baa - std::sys::backtrace::_print_fmt::h7660a544e6110dd9
     ...
    

@pront pront changed the title fix(config): disallow multiline interpolation fix(config): prohibit multiline interpolation May 6, 2025
@@ -0,0 +1,297 @@
use std::{collections::HashMap, sync::LazyLock};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved vars.rs to this location. But kept the same use/export.

deserialize_table(table)
}

pub fn resolve_environment_variables(table: Table) -> Result<Table, Vec<String>> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was extracted from prepare_input

@pront pront marked this pull request as ready for review May 6, 2025 21:19
@pront pront requested a review from a team as a code owner May 6, 2025 21:19
@pront pront requested a review from jszwedko May 6, 2025 21:20
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I'm glad to see us finally addressing this rough edge.

I think this works for string fields, but it doesn't seem to interpolate non-string fields correctly. For example:

sources:
  source0:
    type: demo_logs
    count: ${COUNT}
    format: json
    decoding:
      codec: json
    framing:
      method: bytes
    interval: 1.0
sinks:
  sink0:
    inputs:
    - source0
    type: console
    encoding:
      codec: json

With COUNT=1 gives:

2025-05-06T21:40:58.830829Z  INFO vector::app: Log level is enabled. level="info"
2025-05-06T21:40:58.831986Z  INFO vector::app: Loading configs. paths=["/tmp/tmp.yaml"]
2025-05-06T21:40:58.838663Z ERROR vector::cli: Configuration error. error=invalid type: string "1", expected usize



in `sources.source0`

@@ -0,0 +1,20 @@
The env var (and secrets) resolution now happens after the config string is parsed into a TOML table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the config always parsed into a TOML table? This explains the weird behavior I've seen before where I get TOML errors for YAML configs. 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, internally we call format::deserialize which supports all formats and then we always convert to a TOML table after this step. I would expect this to translate to a config builder (out of scope for this PR).


pub fn resolve_environment_variables(table: Table) -> Result<Table, Vec<String>> {
let mut vars = std::env::vars().collect::<HashMap<_, _>>();
if !vars.contains_key("HOSTNAME") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL this special case was here.

@@ -88,11 +89,10 @@ impl SecretBackendLoader {
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Currently (on master) we don't support multiple types for secrets.

2025-05-07T14:38:00.776983Z ERROR vector::cli: Configuration error. error=Error while retrieving secret from backend "backend_1": invalid type: integer `1`, expected a string at line 1 column 10.

However, interpolate_secrets supports multiple types. I consider this out of scope for this PR but we could go further and allow e.g. integer secrets.

@pront
Copy link
Member Author

pront commented May 7, 2025

With COUNT=1 gives:

2025-05-06T21:40:58.830829Z  INFO vector::app: Log level is enabled. level="info"
2025-05-06T21:40:58.831986Z  INFO vector::app: Loading configs. paths=["/tmp/tmp.yaml"]
2025-05-06T21:40:58.838663Z ERROR vector::cli: Configuration error. error=invalid type: string "1", expected usize

Fixed and added tests to cover this.

Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. This fixes my first example. I think the optimistic parsing as integers, though, isn't quite right. It doesn't seem to handle quoted strings where the value is parsable as an integer. For example, this doesn't work:

sources:
  source0:
    type: demo_logs
    count: ${COUNT}
    format: shuffle
    lines: ["${LINE}"]
    framing:
      method: bytes
    interval: 1.0
sinks:
  sink0:
    inputs:
    - source0
    type: console
    encoding:
      codec: json

It gives:

COUNT=1 LINE=1234 ./target/debug/vector -c /tmp/tmp.yaml
2025-05-07T15:31:36.925469Z  INFO vector::app: Log level is enabled. level="info"
2025-05-07T15:31:36.926534Z  INFO vector::app: Loading configs. paths=["/tmp/tmp.yaml"]
2025-05-07T15:31:36.932941Z ERROR vector::cli: Configuration error. error=invalid type: integer `1234`, expected a string

in `sources.source0`

@pront
Copy link
Member Author

pront commented May 7, 2025

Nice, thanks. This fixes my first example. I think the optimistic parsing as integers, though, isn't quite right. It doesn't seem to handle quoted strings where the value is parsable as an integer. For example, this doesn't work:

sources:
  source0:
    type: demo_logs
    count: ${COUNT}
    format: shuffle
    lines: ["${LINE}"]
    framing:
      method: bytes
    interval: 1.0
sinks:
  sink0:
    inputs:
    - source0
    type: console
    encoding:
      codec: json

It gives:

COUNT=1 LINE=1234 ./target/debug/vector -c /tmp/tmp.yaml
2025-05-07T15:31:36.925469Z  INFO vector::app: Log level is enabled. level="info"
2025-05-07T15:31:36.926534Z  INFO vector::app: Loading configs. paths=["/tmp/tmp.yaml"]
2025-05-07T15:31:36.932941Z ERROR vector::cli: Configuration error. error=invalid type: integer `1234`, expected a string

in `sources.source0`

Taking a look.This all goes back to my previous comment that this intermediate TOML table representation is problematic because it doesn't use type information from the config types.

@pront pront changed the title fix(config): prohibit multiline interpolation fix(config): prohibit multiline interpolation [WIP] May 12, 2025
@pront
Copy link
Member Author

pront commented May 13, 2025

So there are some tests failures:

  1. existing e2e regression
  2. this PR broke some tests (no_proxy is set to a string but schema specifies array of strings)

Will fix (1) in a separate PR
Will fix (2) in this PR (after 1 is fixed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Secrets in commented-out lines in vector.toml cause config validation to fail Commented environment variables are still attempted to be interpolated
2 participants