spark integration #242

geoHeil · 2020-09-07T10:36:44Z

Have you integrated STUMPY with Apache Spark?
I would be interested in using STUMPY with:

- Apache Spark 3.x
- streaming (Structured Streaming)

The use case is multiple time series, i.e. processing a grouped window of ordered events, while maintaining state so that only changes are recomputed, as outlined in https://stumpy.readthedocs.io/en/latest/Tutorial_Matrix_Profiles_For_Streaming_Data.html

geoHeil (Author) · 2020-09-07T11:10:32Z

It looks like mapGroupsWithState (see https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) would be required, but that is only available in the JVM-based APIs.

seanlaw (Maintainer) · 2020-09-07T12:41:14Z

@geoHeil I know what Spark is but have limited experience with it so I don't know what integration would entail. When you say "integration", do you mean:

1. Re-writing STUMPY (which is essentially pure Python) in Java so that it can be distributed across a Spark cluster and access data that is local to each server?
2. Extracting a single (streaming) data point from a Spark cluster, updating the matrix profile, and sending the resulting matrix profile back to Spark?
3. Neither (please provide more information)

As you can see from the tutorial, there's really nothing particularly tricky about calling the stumpy.stumpi function. Perhaps you can provide some more context on what you expect the end-to-end process to be. I can't promise anything, as it may be beyond the scope of STUMPY, but I want to help.

geoHeil (Author) · 2020-09-08T11:19:37Z

I think the easiest way to integrate would be the following:

Take a data frame grouped by a sliding window, using the SQL window function in the GROUP BY, with

df = spark.sql("SELECT device_id, window(time, '4 hours', '1 hour') AS time_window, sort_array(collect_list(struct(time, device_id, device_group, metric1, metric2, ..., metricn))) AS events FROM measurements GROUP BY device_id, window(time, '4 hours', '1 hour')")

as the aggregation (measurements is a placeholder table name).
This leaves us with an array of all the values in that window (a 4-hour window, for example), and we could easily apply STUMPY to each grouped window in parallel using a UDF.

But this also means that with a sliding window (sliding every hour) the distances are recalculated over and over again for the overlapping parts of consecutive windows.
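To put a rough number on that recomputation: with a 4-hour window sliding every hour, each event falls into four overlapping windows, so a per-window batch job touches every point four times. A minimal sketch in plain Python (no Spark involved; the helper name `windows_containing` is made up for illustration, and window starts are assumed to be aligned to multiples of the slide since the Unix epoch):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def windows_containing(ts, size, slide):
    """Return the [start, end) sliding windows that contain timestamp ts.

    Assumption: window starts are aligned to multiples of `slide` since the epoch.
    """
    # Latest aligned window start that is <= ts.
    start = ts - ((ts - EPOCH) % slide)
    windows = []
    while start + size > ts:  # this window still covers ts
        windows.append((start, start + size))
        start -= slide        # step back to the previous overlapping window
    return windows[::-1]      # oldest window first


ts = datetime(2020, 9, 7, 10, 36, 44, tzinfo=timezone.utc)
wins = windows_containing(ts, size=timedelta(hours=4), slide=timedelta(hours=1))
print(len(wins))  # 4
```

In general an event lands in size / slide windows, so the batch approach does roughly that many times the work per point compared with an incremental update.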

Instead, with the stumpi approach on a stream, Structured Streaming would need to keep the state of the current matrix profile and the last few events seen for each group. For this arbitrary state handling, mapGroupsWithState seems to be required.
But as far as I can tell, this is only part of the JVM (Java, Scala) API for now.
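For readers unfamiliar with the pattern, per-key state of this kind can be sketched in plain Python. This is not Spark code and does not use mapGroupsWithState; `KeyedStreams`, `make_stream`, and `LastValueDelta` are hypothetical names. The only assumption about the stream object is that it exposes an `update(value)` method, which is the shape of the object returned by stumpy.stumpi. A trivial stand-in is used here so the sketch runs without STUMPY.

```python
from collections import defaultdict


class KeyedStreams:
    """Keep one incremental stream object per key (e.g. per device_id)."""

    def __init__(self, make_stream, warmup):
        self._make_stream = make_stream    # factory: list of initial values -> stream object
        self._warmup = warmup              # points needed before a stream can be created
        self._buffers = defaultdict(list)  # history for keys that have no stream yet
        self._streams = {}                 # key -> stream object (the per-key state)

    def process(self, key, value):
        """Route one event to its key's state; return the stream once it exists, else None."""
        stream = self._streams.get(key)
        if stream is None:
            buf = self._buffers[key]
            buf.append(value)
            if len(buf) < self._warmup:
                return None
            # Enough history: build the state once from the buffered window.
            stream = self._streams[key] = self._make_stream(buf)
            del self._buffers[key]
            return stream
        stream.update(value)  # incremental update; old windows are not recomputed
        return stream


class LastValueDelta:
    """Stand-in for a real incremental model: tracks the last value and the latest change."""

    def __init__(self, initial):
        self.last = initial[-1]
        self.delta = 0.0
        self.n_updates = 0

    def update(self, value):
        self.delta = value - self.last
        self.last = value
        self.n_updates += 1


state = KeyedStreams(make_stream=LastValueDelta, warmup=3)
events = [("a", 1.0), ("b", 5.0), ("a", 2.0), ("a", 4.0), ("a", 7.0)]
results = [state.process(key, value) for key, value in events]
```

With STUMPY installed, the factory could instead wrap stumpy.stumpi(T, m), with the buffered values converted to a float array and m chosen for the data. The hard part in Spark is where this dictionary of state lives and how it is checkpointed, which is what the JVM-only API provides.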

So one possibility would be to rewrite STUMPY for the JVM or to have the JVM call out to Python, but this is rather complex and convoluted. Ideally, arbitrary state handling would be made simpler for the Python API on the Spark side.
Or is this a misunderstanding, and a STUMPY library for the JVM already exists?

seanlaw (Maintainer) · 2020-09-08T14:44:05Z

STUMPY is written exclusively in Python and is JIT-compiled via Numba for the local hardware. I am not aware of any Java/Scala matrix profile implementation. Since we try very hard to ensure 100% code coverage/unit testing in STUMPY, it isn't clear to me how we could maintain that standard for Spark (I don't have access to a Spark cluster), so this is likely beyond the scope of this package.

Having said that, I am curious why the matrix profile computation needs to happen on the JVM side. Unfortunately, stumpi needs to be stateful since it needs to maintain some history. This is what allows it to be fast instead of, as you pointed out, recomputing everything as more data is streamed in.

Instead of having the JVM call out to Python, would it be possible to set up a Flask app with a RESTful stumpi endpoint, send over one window of data to instantiate the streaming object, and then update the matrix profile one data point at a time thereafter? Of course, this would depend on the initial size of your data, but local data transfer within the same network should, in theory, be fast. The subsequent updates should be performant since each one only processes a single data point. Again, I'm not trying to dictate here, just throwing some quick ideas out there.
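A rough sketch of that service shape, under stated assumptions: it uses Python's standard-library http.server rather than Flask so that it runs without extra packages, the `/init` and `/update` routes and JSON field names are invented for illustration, and `RunningMean` is a stand-in for the streaming object (a real service would hold the object returned by stumpy.stumpi and return its matrix profile).

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class RunningMean:
    """Stand-in for the streaming object; a real service would hold stumpy.stumpi here."""

    def __init__(self, window):
        self.n = len(window)
        self.total = float(sum(window))

    def update(self, value):
        self.n += 1
        self.total += value

    def summary(self):
        return {"n": self.n, "mean": self.total / self.n}


STREAMS = {}             # stream id -> streaming object (the state lives in the service)
LOCK = threading.Lock()  # requests may be handled on different threads


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        with LOCK:
            if self.path == "/init":      # one full window creates the state
                STREAMS[body["id"]] = RunningMean(body["window"])
            elif self.path == "/update":  # afterwards, one data point per call
                STREAMS[body["id"]].update(body["value"])
            else:
                self.send_error(404)
                return
            payload = json.dumps(STREAMS[body["id"]].summary()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the example quiet
        pass


def post(port, path, data):
    """Send one JSON POST to the local service and decode the JSON reply."""
    conn = HTTPConnection("127.0.0.1", port)
    conn.request("POST", path, body=json.dumps(data),
                 headers={"Content-Type": "application/json"})
    result = json.load(conn.getresponse())
    conn.close()
    return result


server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

first = post(port, "/init", {"id": "dev-1", "window": [1.0, 2.0, 3.0]})
second = post(port, "/update", {"id": "dev-1", "value": 6.0})
server.shutdown()
server.server_close()
```

The sketch leaves out error handling (for example an `/update` for an unknown id) and persistence: if the service restarts, the state is lost and each stream has to be re-initialised from a full window.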

geoHeil (Author) · 2020-09-08T14:58:41Z

I will need to think about this, and potentially also about switching to Dask for this task instead. I will close this ticket for now. Thanks.

seanlaw (Maintainer) · 2020-09-08T16:08:40Z

I had assumed that you had some exceptionally large data set distributed across an immovable cluster (or backed by HDFS). I didn't realize that Dask was an option. Let me know if you want to provide more background on the problem you are trying to solve in terms of pre-processing and aggregating the data with Dask. I would be happy to talk through that, as I have some experience with Dask.

geoHeil (Author) · 2020-09-08T19:27:40Z

Indeed, our default infrastructure and processes are built around HDFS, YARN, and Spark. This is why I would prefer a tight integration. So far, I have had bad experiences with Dask as soon as even medium-sized data was shuffled, whereas Spark performed just fine.

The data is not exactly movable ;) but it is also far from petabytes.

The idea of using Dask was not fully thought out.
I will most likely start with the wasteful PySpark batch approach outlined above and then see whether performance needs to be improved to reduce latency.

Many thanks for offering this. Let me take some first steps in both directions - and maybe I will come back with questions.

geoHeil (Author) · 2020-09-12T13:07:05Z

FYI and for reference: Flink plans to support this kind of arbitrary state handling for Python from the next major release, 1.12, using a KeyedProcessFunction.

hmcoservit · 2020-10-13T15:40:27Z

> FYI and for reference: Flink plans to support this kind of arbitrary state handling for Python from the next major release, 1.12, using a KeyedProcessFunction.

@geoHeil Could you please point me to a link regarding such state handling functionality in Flink?
