Export an array of all tokens from ct_token_map #577


Merged (1 commit, May 24, 2025)

Conversation

taminomara
Contributor

@taminomara commented May 23, 2025

This helps with writing structured input adapters for fuzzing. When fuzzing a parser specifically (as opposed to fuzzing the lexer and parser at the same time), we'd like to supply it with an array of valid lexemes. This export lets us build such an array without manually listing every token in the fuzzing entry point.

Note that I didn't implement this functionality for generated lexers because there's already a way to get all tokens via mod_l::lexerdef().iter_rules().
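For context, the exported array might look roughly like this. This is a hand-written sketch, not the actual generated output: the module name `token_map` and the token constants below are made up for illustration (`ct_token_map` generates the real ones from the grammar).

```rust
// Hypothetical sketch of a ct_token_map-style module with the new TOKENS
// array; the module name `token_map` and these token IDs are illustrative.
mod token_map {
    pub const T_INT: u32 = 0;
    pub const T_PLUS: u32 = 1;
    pub const T_ID: u32 = 2;
    // New in this PR: a single array listing every token constant,
    // so a fuzzer can draw valid token IDs without listing them by hand.
    pub const TOKENS: &[u32] = &[T_INT, T_PLUS, T_ID];
}

fn main() {
    // A fuzzing entry point can now pick IDs straight from the array.
    assert!(token_map::TOKENS.contains(&token_map::T_PLUS));
    assert_eq!(token_map::TOKENS.len(), 3);
}
```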

Example of a fuzzing implementation after this PR:

use libfuzzer_sys::fuzz_target;
use libfuzzer_sys::arbitrary::{Arbitrary, Unstructured};
use lrlex::DefaultLexeme;

#[derive(Debug)]
struct Token(u32, String);

impl<'a> Arbitrary<'a> for Token {
    fn arbitrary(u: &mut Unstructured<'a>) -> libfuzzer_sys::arbitrary::Result<Self> {
        // Pick a valid token ID from the exported array, plus arbitrary text.
        Ok(Token(*u.choose(token_map::TOKENS)?, u.arbitrary()?))
    }
}

fuzz_target!(|data: Vec<Token>| {
    let mut text = String::new();
    let lexemes: Vec<_> = data.into_iter().map(|tok| {
        let lexeme = DefaultLexeme::new(
            tok.0,
            text.len(),
            tok.1.len(),
        );
        text.push_str(&tok.1);
        lexeme
    }).collect();

    // Run parser...
});

@ltratt
Member

ltratt commented May 23, 2025

This is a part of the system I haven't thought about for a while. Is it possible to do the same thing with mod_l::lexerdef().iter_rules().map(|x| x.name()).collect() or similar? [Warning: untried!]

@taminomara
Contributor Author

> Is it possible to do the same thing with mod_l::lexerdef().iter_rules().map(|x| x.name()).collect() or similar?

Yes, this seems to work.
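For the generated-lexer route, the shape of that call can be illustrated with a self-contained mock. The `Rule` and `LexerDef` types below only imitate the relevant slice of lrlex's API for demonstration purposes; they are assumptions, not the real definitions.

```rust
// Mock of the generated-lexer approach discussed above: collect every
// named token via iter_rules(). These types imitate lrlex's API shape.
struct Rule {
    name: Option<String>,
}

impl Rule {
    // lrlex rules may be anonymous, hence the Option.
    fn name(&self) -> Option<&str> {
        self.name.as_deref()
    }
}

struct LexerDef {
    rules: Vec<Rule>,
}

impl LexerDef {
    fn iter_rules(&self) -> impl Iterator<Item = &Rule> {
        self.rules.iter()
    }
}

fn main() {
    let lexerdef = LexerDef {
        rules: vec![
            Rule { name: Some("INT".to_string()) },
            Rule { name: Some("PLUS".to_string()) },
        ],
    };
    // Mirrors mod_l::lexerdef().iter_rules().map(|x| x.name()).collect(),
    // skipping anonymous rules.
    let names: Vec<&str> = lexerdef.iter_rules().filter_map(|r| r.name()).collect();
    assert_eq!(names, ["INT", "PLUS"]);
}
```

This covers a generated lexer; as noted below, a custom lexer that only uses ct_token_map has no equivalent, which is what motivates the exported array.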

@ltratt
Member

ltratt commented May 23, 2025

OK, then I think we don't need to generate the array?

@taminomara
Contributor Author

> OK, then I think we don't need to generate the array?

That will only work when the user has a generated lexer. If there's a custom lexer with ct_token_map, then there's no way to get a full array of tokens.

@ltratt
Member

ltratt commented May 23, 2025

> If there's a custom lexer with ct_token_map, then there's no way to get a full array of tokens.

I take your point.

@taminomara force-pushed the master branch 4 times, most recently from e8356d9 to 49ba5e3 on May 24, 2025
@ratmice
Collaborator

ratmice commented May 24, 2025

It took a bit of head scratching until I grokked it (building a token stream directly rather than an intermediate vector!), but once it clicked, it all seemed fine to me.

Seems fine to me now, unless Laurence has any further comments.

@ltratt
Member

ltratt commented May 24, 2025

@ratmice Thanks for the review!

@taminomara Thanks for the PR!

@ltratt ltratt added this pull request to the merge queue May 24, 2025
Merged via the queue into softdevteam:master with commit 7831d2d May 24, 2025
2 checks passed