Replies: 9 comments
-
It looks like a https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
-
@geoHeil I know what Spark is but have limited experience with it so I don't know what integration would entail. When you say "integration", do you mean:
As you can see from the tutorial, there's really nothing particularly tricky about calling the |
-
I think the easiest way of integration would be a: On a data frame grouped by a window using SQL
as aggregation. But this also means that, in the case of a sliding window (slide every hour), the distance metrics are recalculated over and over again. Instead in the So one possibility would be to rewrite for the JVM or have the JVM call out to Python, but this is rather complex and convoluted. I guess, ideally, arbitrary state handling should be simplified for the Python API from the Spark side.
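To make the contrast concrete, here is a minimal sketch of keyed incremental state for a sliding window, using only the Python standard library. It is not Spark or STUMPY code: `WindowState` and `process` are hypothetical names, and the running mean is only a stand-in for whatever incremental update the real computation needs (for matrix profiles, that would be the update described in the streaming tutorial linked in the original post). The point is just that each key keeps its own small state, and each new event adjusts that state instead of re-scanning the whole window.

```python
from collections import defaultdict, deque


class WindowState:
    """Hypothetical per-key state: a bounded window plus a running sum."""

    def __init__(self, size):
        self.size = size
        self.values = deque()
        self.total = 0.0

    def update(self, value):
        # Add the new point, then drop the oldest one once the window is
        # over capacity, adjusting the running sum rather than re-summing.
        self.values.append(value)
        self.total += value
        if len(self.values) > self.size:
            self.total -= self.values.popleft()
        return self.total / len(self.values)


def process(events, size):
    """Consume ordered (key, value) events, keeping one state object per key."""
    states = defaultdict(lambda: WindowState(size))
    latest = {}
    for key, value in events:
        latest[key] = states[key].update(value)
    return latest


# Two interleaved series, keyed "a" and "b", with a window of 3 points.
events = [("a", 1.0), ("b", 10.0), ("a", 2.0), ("a", 3.0), ("b", 20.0), ("a", 4.0)]
result = process(events, size=3)
print(result)
```

A windowed aggregation would instead hand the full window to the function on every slide, which is where the repeated recalculation comes from.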
-
STUMPY is written exclusively in Python and auto-magically JIT compiled via Having said that, I am curious why the matrix profile computation needs to happen on the JVM side? Unfortunately, Instead of having the JVM call out to Python, would it be possible to set up a Flask app with a RESTful |
-
I will need to think about this, and potentially also about switching to Dask for this task instead. I will close this ticket here for now. Thanks.
-
I had assumed that you had some exceptionally large data set distributed across an immovable cluster (or backed by HDFS). I didn't realize that |
-
Indeed, our default infrastructure and processes are built around HDFS, YARN, and Spark. This is why I would prefer a tight integration. So far, I have had bad experiences with Dask as soon as even medium-sized data was shuffled, whereas Spark was performing just fine. The data is not exactly movable ;) but it is also nowhere near petabytes. The idea of using Dask was just a not fully thought-out approach. Many thanks for offering this. Let me take some first steps in both directions, and maybe I will come back with questions.
-
FYI and for reference: Flink plans to support such arbitrary state handling functionality for Python from the next big release, 1.12, using a
-
@geoHeil Could you please point me to a link regarding such state handling functionality in Flink?
-
Have you integrated STUMPY with Apache Spark?
I would be interested in working with STUMPY on:
For multiple time series, i.e. processing a grouped window of ordered events
But maintaining state to only recompute changes, as outlined in the https://stumpy.readthedocs.io/en/latest/Tutorial_Matrix_Profiles_For_Streaming_Data.html