Ideas for examples, recipes, and utils
The data items are "table rows" (could be `tuple`, `dict`, `pandas.Series`, `namedtuple`...) and you are saving them in some table format (e.g. numpy 2d array, pandas DataFrame, csv, xlsx). In all these cases, you don't want to be opening and closing a file every time you write a row. So you cumul, aggreg, and write (flush). (Maybe I should call it `CumulAggregFlush`...) A minimal sketch of the pattern follows.
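Here is one way such a cumul/aggreg/flush writer could look, sketched for the CSV case. The class name `BufferedRowWriter` and the `flush_every` parameter are made up for illustration; this is not py2store's API.

```python
import csv


class BufferedRowWriter:
    """Accumulate rows in memory and flush them to a CSV file in batches.

    Hypothetical illustration of the cumul/aggreg/flush pattern; not py2store API.
    """

    def __init__(self, filepath, fieldnames, flush_every=1000):
        self.filepath = filepath
        self.fieldnames = fieldnames
        self.flush_every = flush_every  # flush once this many rows have accumulated
        self._buffer = []

    def write_row(self, row):
        """row: a dict mapping fieldnames to values."""
        self._buffer.append(row)
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        with open(self.filepath, 'a', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=self.fieldnames)
            if f.tell() == 0:  # file is empty: write the header first
                writer.writeheader()
            writer.writerows(self._buffer)
        self._buffer = []

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.flush()  # make sure the tail of the buffer gets written


# Usage: the file is opened once per 1000 rows, not once per row
with BufferedRowWriter('data.csv', ['t', 'value'], flush_every=1000) as w:
    for t in range(5000):
        w.write_row({'t': t, 'value': t * t})
```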
Most DBs have a "cached writes" functionality. In Mongo and SQL this is called "bulk insert". This is especially useful when the DB is hosted remotely: you don't want to incur the communication overhead for every small write. Examples: Mongo: https://docs.mongodb.com/manual/reference/method/Bulk.insert/, SQL Server: https://codingsight.com/sql-server-bulk-insert-part-1/. In fact, these are cases that will probably find their way into actual utils in the py2store core. A sketch of the idea with pymongo is below.
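The same buffering idea, sketched against a Mongo collection using pymongo's `insert_many` (pymongo's bulk-insert entry point). The `BufferedMongoWriter` name and the `flush_every` threshold are illustrative assumptions, not an existing util.

```python
from pymongo import MongoClient


class BufferedMongoWriter:
    """Buffer documents and send them to Mongo in one insert_many call,
    paying the network round trip once per batch instead of once per doc.

    Hypothetical illustration; not a py2store util.
    """

    def __init__(self, collection, flush_every=1000):
        self.collection = collection
        self.flush_every = flush_every
        self._buffer = []

    def write(self, doc):
        self._buffer.append(doc)
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self._buffer:
            self.collection.insert_many(self._buffer)  # one bulk insert
            self._buffer = []


# Usage: 10 round trips for 10,000 documents instead of 10,000
client = MongoClient('mongodb://localhost:27017')
writer = BufferedMongoWriter(client['mydb']['mycoll'], flush_every=1000)
for i in range(10_000):
    writer.write({'i': i, 'square': i * i})
writer.flush()  # flush whatever is left in the buffer
```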
Abstract use case: You have a commutative and associative operation that you use to make hierarchical aggregations. Concrete use case: You aggregate a seconds-granularity numerical time series at minute, hour, day, and week levels, recording for each bucket the count, sum, and sum of squares (for variance); all three merge by plain addition, which is what lets coarser buckets be built from finer ones. It's easier to assume that data will come ordered, but one could also handle the case where it is "mostly ordered" (that is, where we allow a limited number of small-range derangements). See https://github.com/thorwhalen/ut/blob/master/dpat/buffer.py. A sketch of the ordered case is below.
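A minimal sketch of the hierarchical roll-up for the ordered case, assuming (epoch-seconds, value) pairs as input. The `Bucket` class and `hierarchical_agg` function are made-up illustration names; the point is that the per-bucket statistics merge by addition, so the operation is commutative and associative.

```python
from collections import defaultdict


class Bucket:
    """Per-bucket statistics that merge by plain addition."""

    def __init__(self):
        self.count = 0
        self.sum = 0.0
        self.sumsq = 0.0  # sum of squares, for variance

    def add(self, x):
        self.count += 1
        self.sum += x
        self.sumsq += x * x

    def merge(self, other):  # commutative and associative
        self.count += other.count
        self.sum += other.sum
        self.sumsq += other.sumsq


LEVELS = {'minute': 60, 'hour': 3600, 'day': 86400, 'week': 604800}  # seconds per bucket


def hierarchical_agg(items):
    """Aggregate (epoch_seconds, value) pairs into buckets at every level.

    Returns {level: {bucket_start: Bucket}}. A streaming version would only
    fill minute buckets and merge() closed ones upward into coarser levels.
    """
    out = {level: defaultdict(Bucket) for level in LEVELS}
    for t, x in items:
        for level, size in LEVELS.items():
            out[level][t - t % size].add(x)
    return out


# Usage: two hours of 1 Hz data; variance = sumsq/count - (sum/count)**2
data = [(t, float(t % 7)) for t in range(7200)]
agg = hierarchical_agg(data)
b = agg['hour'][0]
mean = b.sum / b.count
print(b.count, mean, b.sumsq / b.count - mean ** 2)
```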