Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental parse error from #93 #99

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from
Draft

Conversation

chanced
Copy link
Owner

@chanced chanced commented Nov 20, 2024

@asmello experimental parse error from #93 Sorry this took me so long to get back around to doing.

Its not as ergonomic as I had hoped, I forgot you can't provide defaults to generics of functions. Honestly, it may honestly be more hassle than its worth. I don't know if multi-error reporting is even remotely useful to folks for something as simple as a json pointer.

The miette output of the multi-error is a mess at the moment. I'm guessing its due to the related errors but I dont know how to get a list to show otherwise.

let err: ParseError<Complete> = PointerBuf::parse("hello-world/~3/##~~/~3/~") // or .parse_complete
    .unwrap_err() // ParseError<WithInput>
    .into(); // ParseError<Complete>

println!("{:?}", miette::Report::from(err));
Screenshot 2024-11-20 at 10 48 42 AM

chanced and others added 21 commits October 21, 2024 14:09
* Adds type alias `crate::resolve::ResolveError` for `crate::resolve::Error`
* Renames `crate::assign::AssignError` to `crate::assign::Error`
* Adds type alias `crate::assign::AssignError` for crate::assign::Error`
* Adds `position` (token index) to variants of `assign::Error` & `resolve::Error`
* improves ParseIndexError::InvalidCharacter
Co-authored-by: André Mello <[email protected]>
@asmello
Copy link
Collaborator

asmello commented Nov 20, 2024

I don't think related errors is a bad idea intrinsically — one of the things I wanted to explore in follow up PRs is this. This seems to be the right approach with miette for displaying multiple errors. Most of the noise in the output is due to that bug I flagged on #93, which we can just fix, and maybe we can tune the labels a bit for the multi-errors case, but it doesn't look terrible.

That said, I think we should leave it for a later step, after we've figured out the basic interface for rich errors.

Assuming we do this, my take is we shouldn't store the full offsets in the original error, and instead should compute them on-demand when the rich error is constructed (or even when .related_errors() is called!). It'd look like this from a usage perspective:

let err: Report<ParseError, Cow<'_, str>> = PointerBuf::parse("hello-world/~3/##~~/~3/~")
    .unwrap_err();
let err: FullReport<ParseError, Cow<'_, str>> = err.diagnose_fully();

// needed for conversion to `miette::Report`
let err: FullReport<ParseError, Cow<'static, str>> = err.into_owned();

println!("{:?}", miette::Report::from(err));

The need to convert to miette::Report to get the nice output is sad, I didn't think of that. And I guess that means we always end up needing an owned error, so the nice lifetime flexibility I came up with is moot... 😞 Some related discussion here: zkat/miette#305.

If we let go of the Cow, that doesn't change much, though:

let err: Report<ParseError, String> = PointerBuf::parse("hello-world/~3/##~~/~3/~")
    .unwrap_err();
let err: FullReport<ParseError, String> = err.diagnose_fully();

println!("{:?}", miette::Report::from(err));

The subject (SUB) generic I added to Report is still useful so that we can handle String and PointerBuf subjects without the complexity of the Subject enum you added on #93. I'll make adjustments to #98 later.

I'll give some more thought into the stuff in parse_error.rs, but my gut feeling is this is backwards. One generic wrapper that can extend many concrete types makes a lot of sense, it's applying composability correctly, but I don't immediately see the point of using generics with a marker trait here. That usually goes for making types stateful, so that the internal data representation can be the same but you get a different set of methods available for each concrete marker choice. So it can make sense for a builder, for example, and bon takes advantage of that to guarantee statically that all mandatory fields are set before you can build. Here it seems like we'd be inverting that paradigm, and defining the internal data based on the marker trait, at which point it seems like it's just implementing two distinct ParseError types with extra steps.

I don't have a complete alternative to offer yet, other than doing the straightforward thing and defining ParseError and ParserErrors as two separate types, but I suspect we can do something smarter with the Report. Simply adding a field of Vec<dyn miette::Diagnostic> doesn't work because miette::Diagnostic is not object-safe, and Vec<E> is probably not feasible either (although we might be able to assume E = SRC, that is, all errors are the same type as the originator), but I'll give it some thought.

@chanced
Copy link
Owner Author

chanced commented Nov 21, 2024

n@asmello I'm sorry, github is back to marking messages I haven't seen as read. I'm not sure what that's about.

Most of the noise in the output is due to that bug I flagged on #93, which we can just fix, and maybe we can tune the labels a bit for the multi-errors case, but it doesn't look terrible.

Yep, this was due to me delegating again. I created an enum InvalidEncoding and attached it to EncodingError (renamed from InvalidEncodingError).

Single error display:
Screenshot 2024-11-21 at 11 02 00 AM

multi-error display (slightly different from above as I omitted the leading slash:
Screenshot 2024-11-21 at 11 02 37 AM

There's still a bug that trailing ~ are not displayed. The problem is the span length, i believe.

That said, I think we should leave it for a later step, after we've figured out the basic interface for rich errors.

Heh, I should have kept reading before investigating and writing code.

Assuming we do this, my take is we shouldn't store the full offsets in the original error, and instead should compute them on-demand when the rich error is constructed (or even when .related_errors() is called!). It'd look like this from a usage perspective:

That random function attached to Pointer that didn't seem connected to the previous changeset was for this purpose. I started with computing the offsets from the positions, under the expectation that we would be removing the offset field eventually.

I ended up just using the offsets already on the errors and it made things a good bit cleaner. I'm not opposed to removing them from assign::Error and resolve::Error. The offsets are easy enough to figure out from the position (so long as you have the source, obviously).

When it comes to ParseError, I think it has gotten better now that the errors are being sourced properly. The verbiage of the errors themselves could use improvement and perhaps using complete_offset() instead of giving the offset of the token makes more sense.

That usually goes for making types stateful, so that the internal data representation can be the same but you get a different set of methods available for each concrete marker choice.

There are a different set of methods based upon the associated types.

impl<S> ParseError<S> where S: Structure<Cause = Causes> {
    pub fn causes(&self) -> & [Cause]
}
impl<S> ParseError<S> where S: Structure<Cause = Cause> {
    pub fn cause(&self) -> &Cause
}
impl<S> ParseError<S> where S: Structure<Subject = String> {
    pub fn subject(&self) -> &str
}

cause and subject were S::Cause and S::Subject but they evaluated to the same (well, sort &String but close enough).

@chanced
Copy link
Owner Author

chanced commented Nov 21, 2024

I had really hoped this would be possible:

let Result<_, ParseError<Complete>> = Pointer::parse("/ptr");

where

impl Pointer {
    pub fn parse<S: AsRef<str> + ?Sized, E: Structure>(s: &S) -> Result<&Self, ParseError<E>>
}

but not being able to set defaults on generics of functions put an end to that.

The remaining advantage to using this structure is that the api of the error remains consistent between Pointer::parse and PointerBuf::parse. I haven't put any effort into going from ParseError<WithoutInput> -> ParseError<WithInput> but ParseError<WithInput> -> ParseError<Complete> is implemented (poorly). ParseError<WithInput> -> ParseError<WithoutInput> should also be implemented.

Report or whatever would still need to exist for {assign, resolve}::Error but we can feature flag it. Then we can impl From<ParseError<Subject=String>> for Report and the same machinery to go from {assign, resolve}::Error to Report will be possible for ParseError<String>

@chanced
Copy link
Owner Author

chanced commented Nov 21, 2024

There are 3 shapes that ParseError can be in (i'm not crazy about the names):

  • WithoutInput, returned from Pointer::parse: { subject: Empty, cause: Cause }
  • WithInput, returned from PointerBuf::parse: { subject: String, cause: Cause }
  • Complete, bikeshed on where it is returned: { subject: String, cause: Causes }

where Cause is formerly ParseError and Causes is Box<[Cause]>.

pub enum Cause {
    NoLeadingBackslash,
    InvalidEncoding {
        offset: usize,
        source: EncodingError,
    },
}

Empty exists so that it can impl From<&str> and impl From<String>, which are just thrown away.

@chanced
Copy link
Owner Author

chanced commented Nov 21, 2024

If we had specialization, the API could be identical across variations of ParseError:

impl<S> ParseError<S> where S: Structure<Cause = Causes> {
    pub fn cause(&self) -> &Cause { &self.cause[0] }
}

@chanced
Copy link
Owner Author

chanced commented Nov 21, 2024

Nevermind, I was able to get fn cause(&self) -> &Cause to be universal without it.

pub trait Causative:
    PartialEq + Sized + std::error::Error + fmt::Display + fmt::Debug + miette::Diagnostic
+    fn first(&self) -> &Cause;
impl<S: Structure> ParseError<S> {
    pub fn cause(&self) -> &Cause {
        self.cause.first()
    }
}

@chanced
Copy link
Owner Author

chanced commented Nov 21, 2024

An alternative approach would be something like

struct SansInput { cause: Cause }
struct WithInput { cause: Cause, input: String }
struct Complete { causes: Vec<Cause>, input: String }

enum ParseError {
    SansInput(SansInput),
    WithInput(WithInput),
    Complete(Complete),
}

but methods which operate on the input would need to return Option. Maybe that's okay in exchange for much lower complexity.

@asmello
Copy link
Collaborator

asmello commented Nov 21, 2024

I'm still skeptical about this marker trait system, because in my experience this pattern has some serious ergonomics implications. One you kinda noticed already - lack of specialisation means we often have to implement the same function multiple times, once for each concrete marker value. This has a viral effect, because each function that is implemented concretely can't be called in generic contexts. That is, this doesn't work:

use std::marker::PhantomData;

trait State {}
struct S1;
impl State for S1 {}
struct S2;
impl State for S2 {}
struct Foo<S: State> {
    _phantom: PhantomData<S>
}

impl Foo<S1> {
    fn do_it() { }
}

impl Foo<S2> {
    fn do_it() { }
}

fn frobnicate<S: State>(foo: Foo<S>) {
    // error[E0599]: no method named `do_it` found for struct `Foo<S>`
    foo.do_it(); 
}

Playground

I'm not actually sure how your first method in Causative solved this issue? But even if it did, you still went from implementing cause 3x to implementing first 3x, which doesn't seem like much of an improvement. What if you needed last as well? Doesn't seem to scale well.

I'm also worried that all this work is just to support ParseError, and I'm not sure how much can be reused for other error types that would similarly benefit from an enrichment system.

@asmello
Copy link
Collaborator

asmello commented Nov 21, 2024

To be clear though, you are not crazy, there is some credit to this system. The state machine is a nice model for error enrichment. My concerns are just over implications for ergonomics, as well as complexity.

I'm also scared of running into edge cases with generics and associated types. Fought those hard last Sunday getting .into_owned to work when I had a very similar design to ParseError (where a field's type was an associated type). The solution ended up being to be more loose with the generic bounds on the type definition (introducing a totally decoupled second parameter) and then constraining it just in the places where the association was relevant. That led to many simplifications. I wonder if something similar might be done here.

@asmello
Copy link
Collaborator

asmello commented Nov 21, 2024

The remaining advantage to using this structure is that the api of the error remains consistent between Pointer::parse and PointerBuf::parse.

I wonder if the answer isn't to just have 3 totally separate types, but have them implement a trait where common functionality makes sense? This would give callers a common API to operate with where it matters, while still providing a way to have different methods on each stage of enrichment.

What's telling to me here is that the three flavours of ParseError end up having both distinct data types as well as a unique subset of methods. I really don't see what the generics are buying us here.

@chanced
Copy link
Owner Author

chanced commented Nov 21, 2024

I'm still skeptical about this marker trait system, because in my experience this pattern has some serious ergonomics implications.

Ah, I hadn't really considered too much the implications of infecting generics on people. Hmm.

One you kinda noticed already - lack of specialisation means we often have to implement the same function multiple times, once for each concrete marker value. This has a viral effect, because each function that is implemented concretely can't be called in generic contexts

yea, I had no idea you couldn't shadow fns like that. I discovered that through this - where initially my intention was to have cause return &[Cause] or &Cause.

I'm not actually sure how your first method in Causative solved this issue?

Oh, in this case I wanted cause to return either itself (in the case of Cause) or the first, in the case of Causes. The reasoning is that if it were not Complete, the first Cause would be returned. Adding it to the trait made it so that I didn't need to constrain on Structure::Cause.

What if you needed last as well? Doesn't seem to scale well.

I realize this is a contrived example but we wouldn't need last. Most of the distinctions between these types are whether or not there exists the input string. I think I have most of that covered but and what's remaining of finishing it shouldn't depend too much on the markers. I'm not dismissing the concern. Its valid - just giving context on the current state.

I'm also worried that all this work is just to support ParseError, and I'm not sure how much can be reused for other error types that would similarly benefit from an enrichment system.

This is true. It is confusing to have two separate mechanisms for enrichment when we have so few errors.

To be clear though, you are not crazy, there is some credit to this system. The state machine is a nice model for error enrichment. My concerns are just over implications for ergonomics, as well as complexity.

Understood and thanks. I wasn't sure if you outright hated it or not :).

I've been going around and around trying to find the right combination of complexity, ergonomics, and utility. The dual types, while awesome, definitely spike the complexity around errors.

gotta run, i'll reply to the rest when I get back.

@asmello
Copy link
Collaborator

asmello commented Nov 21, 2024

Understood and thanks. I wasn't sure if you outright hated it or not :).

Oh, I don't hate it. I'm just scared of it! 😄

Ah, I hadn't really considered too much the implications of infecting generics on people. Hmm.

I think this is probably ok for users, because they have to specify which concrete type they want out of .parse, and once you've got a concrete type it's all good. It's only if they decide to be generic themselves over S that they will start to run into trouble, but I think this would be rare. I'm mainly concerned about the cost of internal evolution.

@chanced
Copy link
Owner Author

chanced commented Nov 22, 2024

I wonder if the answer isn't to just have 3 totally separate types, but have them implement a trait where common functionality makes sense? This would give callers a common API to operate with where it matters, while still providing a way to have different methods on each stage of enrichment.

That may work. If we go that route, I'm thinking we eliminate the third option so it'll just be 2, with and without input.

What's telling to me here is that the three flavours of ParseError end up having both distinct data types as well as a unique subset of methods. I really don't see what the generics are buying us here.

Yea, we have 2 or 3 distinct structures depending on whether we keep the list. The generics keep it to one descriptive type. That's about it. Unfortunately the luster of this approach is severely diminished without being able to pick and choose the reporting style from a single set of methods (parse on both Pointer and PointerBuf) by specifying the generic.

Granted, we can still allow for picking of reporting style on parse. Ergonomics would take a serious hit in situations where the error is ambiguous though:

impl PointerBuf {
    pub fn parse<E: Structure>(s: impl Into<String>) -> Result<Self, ParseError<E>> {
        let s = s.into();
        match Validator::validate::<E::Cause>(&s) {
            Ok(()) => Ok(Self(s)),
            Err(cause) => Err(ParseError::<E>::new(cause, s)),
        }
    }
}

fn example () {
    let ptr = PointerBuf::parse("/").unwrap();
}
error[E0283]: type annotations needed
    --> src/pointer.rs:1301:15
     |
1301 |     let ptr = PointerBuf::parse("/").unwrap();
     |               ^^^^^^^^^^^^^^^^^^^^^^ cannot infer type for type parameter `E` declared on the associated function `parse`
     |
     = note: cannot satisfy `_: Structure`
     = help: the following types implement trait `Structure`:
               SansInput
               WithInput
               parse_error::Complete
note: required by a bound in `PointerBuf::parse`

Oh, I don't hate it. I'm just scared of it! 😄

Hah, I understand that. grill has become a swamp of associated types and generics. Its the only other project I have which is infested with associated types to any noticable degree. I've come to love them (associated types) but booting back up on that project takes time.

Having said all of that, I don't think this particular arrangement is horrible, especially if we seal up Causitive (with a better name) but its still mental overhead. Also, there are so few variants that it seems a bit overkill.

@chanced
Copy link
Owner Author

chanced commented Nov 22, 2024

I think I'd favor a distinct type for the ParseError with the input string. Report<ParseError> may be too generic for an explicit action.

edit: Hmm, maybe not. I'm going to go back to your PR and play with it. Perhaps just one enriched generic error is sufficient. It may help a good bit if we eliminate the possibility of multiple errors too.

@asmello
Copy link
Collaborator

asmello commented Nov 22, 2024

I think I'd favor a distinct type for the ParseError with the input string. Report<ParseError> may be too generic for an explicit action.

I'm not opposed to this, but as long as we have two versions for every type (one encoding just the cause and the other also encoding a position), I think the generic scales better. That's about it, really.

My initial jerk reaction to the generic wrapper was because in my experience generic errors tend to be painful to work with. Especially for users, as they often end up doing things we didn't foresee. But what I ended up realising is that the public API can expose only concrete types, so this isn't a problem (Report<ParseError> is a concrete type). That's why I changed my mind about the wrapper (also because the solution I came up with was relatively simple).

edit: Hmm, maybe not. I'm going to go back to your PR and play with it. Perhaps just one enriched generic error is sufficient. It may help a good bit if we eliminate the possibility of multiple errors too.

I'll try to set apart some time this weekend to try and make it work with multiple errors too. But like I said, I wouldn't oppose just sticking with concrete types either.

And I haven't fully ruled out the associated types approach, I'm just dubious of the net benefit. If you think of another motivation for it I may change my stance.

@chanced
Copy link
Owner Author

chanced commented Nov 22, 2024

I'll be awol this weekend but I think I'm leaning toward the wrapper as well. The alternative is going to be basically the equivalent of it anyway.

Basically the two options are:

enum ParseError {}
struct EnrichedParseError { input: String, error: ParseError }

or something like this to get the same API across types:

enum Cause {}
struct ParseError { cause: Cause }
struct EnrichedParseError { cause: Cause, input: String }

Both of which don't make a whole lot of sense if we have a wrapper type.

@asmello
Copy link
Collaborator

asmello commented Nov 23, 2024

Taking a step back to re-evaluate the overall approach with a fresher perspective.

For rich errors (including multi-errors) in the parsing API, I think the right approach would be to introduce separate parsers for each level of error richness. This way we keep the core API simple and avoid doing extra work without giving up the option to produce the enhanced errors when it matters.

impl Pointer {
    pub fn parse(s: &str) -> Result<Self, ParseError> {...}
}

impl PointerBuf {
    pub fn parse(s: impl Into<Cow<'_, str>>) -> Result<Self, ParseError> {...}
}

trait Parser {
    type Error;
    pub fn parse(s: impl Into<Cow<'_, str>>) -> Result<Cow<'_, Pointer>, Self::Error>;
}

let err: ParseError = SimpleParser::parse("foo").unwrap_err();
let err: RichParseError = RichParser::parse("foo").unwrap_err();
let err: FullParseError = FullParser::parse("foo").unwrap_err();

I'm keeping the Cow signature in PointerBuf for the ergonomics. It'd be more explicit to require Pointer::parse("/foo").to_buf() as the way to go from reference to owned, but alas it's annoying to add the .to_buf() which is why we moved to Into<String> in the first place.

For the Assign error, it doesn't really make sense to have multi-errors, as far as I can see. So a helper struct like Assigner makes a bit less sense. That said, it also gives us an opportunity to also allow users to configure how they want to deal with missing nodes along the path. I know you've explored this in the past, but unless my memory fails we haven't tried this as a separate struct yet, right? I'm thinking of HTTP clients here — where you configure your request parameters, then execute the request —, except we configure the assignment strategy and then execute it. The details there are for a separate discussion, but wanted to bring up the side benefit.

For resolve, it makes even less sense to have a helper struct, since there's nothing really to configure, but may still be worth it for the rich error interface. Multi-errors also don't make sense here.

Another option for assign and resolve is to just condition the error types on the miette flag, and have it always be rich when miette integration is enabled. There isn't much of a runtime cost, really, it's just a difference in API. Doing it like this is more annoying for us from a code organisation and testing perspective, but it's the most straightforward for users.

Multi-errors could make sense in Index parsing, but not sure it's worth it. Otherwise it's the same as the resolve errors.

You mentioned being away this weekend, so no pressure to reply. I'll probably come back to all of this next week(end?) after I've given it all some more thought.

@asmello
Copy link
Collaborator

asmello commented Nov 23, 2024

Also worth mentioning, the one issue with the wrapper type, even if we make it work with multi-errors, is the potential of mismatch between the subject and the error. It's even weirder with multi-errors because, to generate them, we'd need to essentially run another full scan and may find entirely different errors if given a different source. This is what made me feel like it's the wrong direction to take. Ideally we don't need to revisit the subject other than for display purposes, once an error has been generated.

I said before the mismatch is a low risk given typically you'd use the wrapper immediately like Pointer::parse(s).diagnose(s), but it smells, and I don't like it. The Parser approach seems safer without compromising ergonomics, and I think it's a bit easier to evolve, though it requires a bit more boilerplate. That said, I may still change my mind if other factors surface.

@chanced
Copy link
Owner Author

chanced commented Nov 26, 2024

For rich errors (including multi-errors) in the parsing API, I think the right approach would be to introduce separate parsers for each level of error richness. This way we keep the core API simple and avoid doing extra work without giving up the option to produce the enhanced errors when it matters.

Ohh! Now that's a clever way to handle it! I really like the approach of having different traits. It solves what this design failed to do. That is, allow for structuring the error by specifying a call without cluttering up the API with numerous methods.

impl PointerBuf {
    pub fn parse(s: impl Into<Cow<'_, str>>) -> Result<Self, ParseError> {...}
}

What if we always allocate for rich and full parser? It seems like if you want an error with the input, you are on an error path and allocating a string is fine, like you were saying earlier.

Digesting the rest.

edit:
Sorry if this sent off two notifications. I sent the previous version before I intended and deleted without thinking.

@chanced
Copy link
Owner Author

chanced commented Nov 26, 2024

For the Assign error, it doesn't really make sense to have multi-errors, as far as I can see. So a helper struct like Assigner makes a bit less sense.

Right, {assign, resolve}::Error really don't need a wrapper error and definitely dont need multiple errors right now. The wrapper class did provide a means to report the specific error and underline the token.

That said, it also gives us an opportunity to also allow users to configure how they want to deal with missing nodes along the path. I know you've explored this in the past, but unless my memory fails we haven't tried this as a separate struct yet, right? I'm thinking of HTTP clients here — where you configure your request parameters, then execute the request —, except we configure the assignment strategy and then execute it. The details there are for a separate discussion, but wanted to bring up the side benefit.

That sounds like an awesome idea. I'm game for trying it in the future. I would love something like it.

Edit

Missed a reply.

Also worth mentioning, the one issue with the wrapper type, even if we make it work with multi-errors, is the potential of mismatch between the subject and the error.

Yea, I kept going back and forth on this. Its especially weird with the first Report type I wrote as it uses the enum rather than a generic. If the error type didn't align with the subject, it just failed silently. I considered asserts on lengths, unwraps, and such but figured the last thing someone would want is an error to panic.

It's even weirder with multi-errors because, to generate them, we'd need to essentially run another full scan and may find entirely different errors if given a different source

This is an incredibly small nit, but we don't have to do a full scan - we can start from the last encoding error's offset + 1. Basically nothing, but yea. I get your point.

I said before the mismatch is a low risk given typically you'd use the wrapper immediately like Pointer::parse(s).diagnose(s), but it smells, and I don't like it.

Oh, I know. I really don't like it either. We are just sort of bound by the language.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants