Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Streams 🌊] Enrichment state management improvements #211686

Open
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

tonyghiani
Copy link
Contributor

@tonyghiani tonyghiani commented Feb 19, 2025

📓 Summary

Closes https://github.com/elastic/streams-program/issues/102

This re-work of the enrichment state management introduces XState as state library to prepare scaling the enrichment part for more processors and improve performance reducing unnecessary side effects.

🤓 Reviewers note

There is a lot to digest on this PR, I'm open to any suggestion and I left some notes around to guide the review.
This is also far from perfect as there is margin for other minor DX improvements for consuming the state machines, but it will all come in follow-up work after we resolve prioritized work such as integrating the Schema Editor.

Most of the changes on this PR are about the state management for the stream enrichment, but it touches also some other areas to integrate the event-based flow.

Stream Detail machine

This machine is responsible for the whole stream detail view initialization.
At the moment, it only handle the fetching of the stream definition we are navigating to, but it'll work as the entry point for restoring url state or handle more complex initialization flows, such as determining the stream type flow.

Screenshot 2025-02-27 at 17 01 30

Stream enrichment machine

This machine handles the complexity around updating/promoting/deleting processors, and the available simulation states.
It's a root level machine that spawns and manages its children machine, one for the simulation behaviour and one for each processor instantiated.

Screenshot 2025-02-27 at 17 10 03

Simulation machine

This machine handle the flow around sampling -> simulating, handling debouncing and determining once a simulation can run or should refresh. It also spawn a child date range machine to react to the observable time changes and reloads.
It also derives all the required table configurations (columns, filters, documents) centralizing the parsing and reducing the cases for re-computing, since we don't rely anymore on the previous live processors copy.

Screenshot 2025-02-27 at 17 33 40

Processor machine

A processor can be in different states depending on the changes, not this tracks each of them independently and send events to the parent machine to react accordingly. It provide a boost in performance compared to the previous approach, as we don't have to rerender the whole page tree since the changes are encapsulated in the machine state.

Screenshot 2025-02-27 at 17 23 01

🔜 Follow Ups

  • Refactor Grok AI suggestion with a smaller machine and fix known issues.
  • Integrate usage of existing state machine with existing stream detail sections.
  • Improve machines communication event bus.

@tonyghiani tonyghiani added backport:version Backport to applied version labels Feature:Streams This is the label for the Streams Project v9.1.0 v8.19.0 labels Feb 19, 2025
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Just moved it into this shared package from https://github.com/elastic/kibana/pull/211686/files#diff-588ef3f03c3164d4bc7833514b6f1d8fe7605c998d730517547a2fcc4332f2de, it gives a typed placeholder for required services.

const dynamicImageImport = colorMode === 'LIGHT' ? light() : dark();

dynamicImageImport.then((module) => setImageSrc(module.default));
dynamicImageImport.then((module) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Fixes an async state update warning for unmounted images.

@@ -101,7 +101,9 @@ export const ProcessorErrors = ({ metrics }: { metrics: ProcessorMetrics }) => {
const shouldDisplayErrorToggle = remainingCount > 0;

const getCalloutProps = (type: ProcessorMetrics['errors'][number]['type']): EuiCallOutProps => {
const isWarningError = type === 'generic_processor_failure' && success_rate > 0;
const isWarningError =
type === 'non_additive_processor_failure' ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Make non-additive changes a warning, as we'll now allow them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This is a small re-usable actor to invoke when we need a confirmation prompt. A couple of presets used in the simulation machine are also defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This small actor is in charge of the samples data fetching and mapping to flatten records.
It will handle request abortion with the built-in signal in case the fetching state is left before resolving.

Copy link
Contributor Author

@tonyghiani tonyghiani Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

These are a bunch of utilities moved from ...lic/components/data_management/stream_detail_enrichment/hooks/use_processing_simulator.ts, there is no new logic to review so reviewers can skip this part.


const consoleInspector = createConsoleInspector();

const StreamEnrichmentContext = createActorContext(streamEnrichmentMachine);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Using the createActorContext utility from xstate, we get a context already set with strong typing and the utility we need exposed from the context.

Comment on lines +81 to +92
const ListenForDefinitionChanges = ({
children,
definition,
}: React.PropsWithChildren<StreamEnrichmentInput>) => {
const service = StreamEnrichmentContext.useActorRef();

useEffect(() => {
service.send({ type: 'stream.received', definition });
}, [definition, service]);

return children;
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This is (not very elegant) workaround to make the machine react to definition changes until I will connect them to react by event, which will be a clean-up task.

@@ -16,7 +16,6 @@ import {
export type WithUIAttributes<T extends ProcessorDefinition> = T & {
id: string;
type: ProcessorTypeOf<T>;
status: 'draft' | 'saved' | 'updated';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Not necessary anymore, the processorMachine will track exactly in which status a processor is.

Comment on lines +23 to +26
const Content = ({ definition, refreshDefinition }: Required<SchemaEditorProps>) => {
const isLoadingDefinition = useStreamDetailSelector((state) =>
state.matches({ initialized: 'loadingDefinition' })
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Removes prop drilling for the loading state of a definition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This is a simple and re-usable state machine that replicates the behaviour of useDateRange.
To implement it a part of the enrichment flow and communicate with its parent machine, I put together this machine, that can be used with useActor to replicate the same behaviour the other hook, while the other way around was not possible to re-use the logic in the simulation machine.

@tonyghiani
Copy link
Contributor Author

/ci

@tonyghiani tonyghiani marked this pull request as ready for review February 27, 2025 16:52
@tonyghiani tonyghiani requested review from a team as code owners February 27, 2025 16:52
@tonyghiani tonyghiani added the release_note:skip Skip the PR/issue when compiling release notes label Feb 27, 2025
Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, this is a massive body of work! I really like how we don't need opaque files with tons of hooks in them to define state transitions, but have a clear model of how transitions work.

I left a couple comments, but they all go into the same direction - I think we shouldn't put all logic into xstate, only where we can actually benefit from it.

In a lot of cases, xstate strikes me as overly complex and the clean modeling of state transitions is not actually as valuable as the development speed that's lost with it.

My high level opinion: Let's use xstate in central places that get real messy without, but let's try to split out local state into simpler alternatives where possible. I think the ideal of having all state within an xstate machine is the wrong one to pursue.

Basically the way to approach a feature would be trying to solve it with local state. Once

  • that becomes hard to maintain introduce a sub machine for it to manage state transitions better.
  • you need to lift state up, add it to the higher level xstate machine

I really worry we lose a lot of development speed by putting too much into this setup. Maybe a rule of thumb could be "if you can solve it with less or equal three hooks, do it locally, otherwise turn it into xstate".

I know that there is a learning curve aspect to it - I didn't work a lot with xstate before, so I need to get used to the abstractions. But it's not just that - the modeling approach of xstate is a very verbose one. Adding something equivalent to a useState hook requires changes in 5 different places instead of 1. Also when reading the code it's not easy to follow the control flow because things are in multiple places - actor definition in one file, event types in another file, guards in one place and all is assembled in yet another place. There is value in this for some things, especially when local state updates and side effects are mixed in a single control flow, but I think the hurdle for moving into this model should be really high.

Refactor Grok AI suggestion with a smaller machine

I'm bit on the edge whether a state machine makes sense for the suggestions or not. They seem relatively simple, so not sure. Maybe it's worth it if we lift the state up so the state is shared across different processors (e.g. for the block list that would be nice)

input: ({ context }) => ({
condition: context.samplingCondition,
streamName: context.streamName,
absoluteTimeRange: context.dateRangeRef.getSnapshot().context.absoluteTimeRange,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are doing a ton of work to integrate the timefilter service with xstate by wrapping it in a sub machine with multiple actors, just to get the time range here.

I don't think this is necessary, we should just pass data to createSamplesFetchActor so this can be an implementation detail of the actor instead of modelling it in xstate.

If we one day need to react to time updates, we can add a simple fromObservable actor for it - no need to have a whole state machine just to wrap some other state somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this logic into a smaller state machine (which can be re-used easily as useDateRange, but it's not coupled to the react interface and can be spawned like here if necessary) because we already need to programmatically react to events related to time changes.

We don't only need to get the absolute time range, but we want this transition to happen only when a time change occurs and we are in the simulation state.

If I keep it separate, I would still need to use the date range hook, which is not synced with the state of this machine, leaving room to undesired side effects. The dateRange machine is a pretty small and re-usable piece, I don't see much need to maintain it once it's done and integrates well not only with other machines but can be used also independently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want this transition to happen only when a time change occurs and we are in the simulation state

Isn't the update coming from here in practice?

And even if we do need it, why not hooking the actor directly into the place that needs it?

The dateRange machine is a pretty small and re-usable piece, I don't see much need to maintain it once it's done and integrates well not only with other machines but can be used also independently.

It's still code the next dev needs to read and understand, which comes at a cost. Let's not put abstractions in place because we might need them later, this is a classic death-by-a-thousand-cuts issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And even if we do need it, why not hooking the actor directly into the place that needs it?

I left a note here regarding this here.

It's still code the next dev needs to read and understand.

Forgive me here, but I don't see any difference with next dev having to read and understands the useTimefilter hook, expect here the state changes can be visualized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this relates, this comment is about changes of the definition, not the time range, right?

I don't see any difference with next dev having to read and understands the useTimefilter hook, expect here the state changes can be visualized

Why do we need useTimefilter, we would just need a fromObservable(data.search.timefilter.getTimefilter$()) as an actor in the state machine that needs access, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would just need a fromObservable(data.search.timefilter.getTimefilter$()) as an actor in the state machine that needs access

Correct, adding a context value to hold the observable output, an initial value for this context value and an action to set it.
It reduces to what the date range machine does, but I don't see the need to couple it all together it we could re-use this, for example, for the routing.

Copy link
Contributor

@flash1293 flash1293 Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do it twice, then generalize out - we only do it once here, it's too early for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, it would be too early if we were not aware of use cases where we need it. We have those use cases already implemented, where this 30 lines machine would be very handy.
But mostly the main reason I split it is to keep simpler the simulation machine, so that it doesn't have to worry about knowing how the timefilter work.
I don't see that as too early given the point we are at with the stream project and what is coming next.

import { createLoadDefinitionActor } from './load_definition_actor';
import { StreamDetailContext, StreamDetailEvent, StreamDetailInput } from './types';

export const streamDetailMachine = setup({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned this in your description, but this state machine is basically doing nothing and a simple hook could do the same job. Things might get more complex later, but this smells a lot like YAGNI. Let's introduce the complexity we need right now, not the one we might need at some point.

For example, I'm not sure whether we even want to introduce more complex URL state syncing and I still have high hope we can keep the context-awareness logic much simpler.

I think we are going too far here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair concerns Joe, and agree it might be premature, a couple of reasons why I went this way:

  • Props drilling: passing down and down the definition and the refresh function has been the default pattern so far. It works well sure, but it makes every component much less maintainable, and trigger a rerender of the whole tree on every change, which is probably unnecessary as it breaks memoization in many places.
    Consuming this from the top-level context makes it easier to access it and provide updates. I didn't change any usage site here, but there is good margin of usage for this context in the whole client app.
  • useAbortableSync, which is used behind the scene, is purely side-effect based, which limits in reacting sequentially when we update the value. We want to deterministically trigger an update and know when this is done, using that hook doesn't help with that. Simply said, we cannot just await refreshDefinition() and then do something, we need to put another useEffect on the definition changes, which is unlikely what we want to do for some cases + it adds more side effects burden to the app.

The other points about the initialization flow are probably preventive yeah, but I feel the above already gives some good reasons to have a stricter root level machine. Happy to discuss more and pull in more folks to discuss the trade-off here.

Copy link
Contributor

@flash1293 flash1293 Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the props drilling - the definition doesn't change all the time, so I'm not too worried about the rerenders. We could also add a normal context for this, that might be helpful anyway because the definition is such a central thing to access (especially for the name).

I'm not sure I follow your point about useAbortableSync, how does the state machine change that we need to await a new definition? Where in the existing code would we need an additional hook?

We anyway have this here, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, is about the note here regarding this #211686 (comment).

The issue with useAbortableSync is that it treats the refetch not as an event, but as a side-effect, hence it makes difficult to wait and know when the refetch is done. With this instead we can notified by the stream.update anyway, and we can wire it directly to the machine by event as I mentioned in the note.

storeSimulation: assign((_, params: { simulation: Simulation }) => ({
simulation: params.simulation,
})),
derivePreviewColumns: assign(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert in xstate, but storing derived state in the context like this strikes me as pretty weird - why not using a simple useMemo in the component where they are needed?

This way it's possible to forget to derive them at some state condition, and I don't see an upside. You could argue business logic and UI are separated this way, but we can still keep the logic in a separate function that's easily unit-testable in isolation.

Xstate has its strong suits, but derived data is not one of them, I think we should keep it out of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, we don't strictly need to keep the derive columns here, I initially put them here because a memoization with useMemo() won't check for the content and we'll need to use something like a useEffect that checks deeply for changes, but it shouldn't be a big change, I'll give it a try.

@tonyghiani
Copy link
Contributor Author

@flash1293 thanks for the in-depth review, much appreciated!

I addressed all of you comments, happy to jump in a call if you feel something needs more discussion and we can pull in the team for a wider audience.

To give a wider view of where I decided there was a need for a machine, it reduces to having a state that reflects what the UI allows the user to do. The complex states we manage here have subscriptions to time changes, sequential data fetching and particularly, reactivity to recurrent events. All of this need to occur under strict conditions, I added machines only where I thought it would bring benefits.

I also agree with you that we shouldn't go with Xstate for anything has a state, and in the enrichment section there is still plenty of local states that don't need to update, it's about finding the balance.

Maybe a rule of thumb could be "if you can solve it with less or equal three hooks, do it locally, otherwise turn it into xstate"

WRT this statement, IMO it's not about the amount of logic or custom hooks that we have, but what they do and they interact with each other.

If we have a page with 100s of independent toggles, there is no need for a machine.
But if we get even 2 hooks with multiple dependencies (e.g. fetch samples, fetch simulation) that react to each other consistently and only under certain circumstancies, the event-based control given by a finite machine looks more solid to me. I guess it always depends on the case 🙂

I'm bit on the edge whether a state machine makes sense for the suggestions or not.

There is no urgency in updating this part yet, although as we mentioned offline I spotted a couple of bugs (not addressed here) due to the several conditions we have to decide when suggestions should be available or not. We can post-pone this if there are doubts and evaluate the change if it gets more complex.

@flash1293
Copy link
Contributor

flash1293 commented Mar 3, 2025

WRT this statement, IMO it's not about the amount of logic or custom hooks that we have, but what they do and they interact with each other.

Yeah, it's just a rule of thumb, I tried to make it quantifiable somehow :) Maybe it's fine to say "it needs to be decided on a case-by-case basis" - IT, Depends ;)

Added some follow-up comments to the threads above.

@elasticmachine
Copy link
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
datasetQuality 289 290 +1
observabilityOnboarding 230 231 +1
streamsApp 316 380 +64
total +66

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/xstate-utils 14 16 +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
logsShared 202.0KB 202.0KB -1.0B
streamsApp 307.6KB 369.3KB +61.7KB
total +61.7KB
Unknown metric groups

API count

id before after diff
@kbn/xstate-utils 14 16 +2

History

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:version Backport to applied version labels Feature:Streams This is the label for the Streams Project release_note:skip Skip the PR/issue when compiling release notes v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants