Description
Currently, models are not persisted, because jsonpickle breaks when interacting with NumPy and when the code changes. Models therefore require custom persistence in a custom .json format (most likely with weights and coefficients stored as base64, plus JSON fields for the size and dtype, so that arrays load correctly regardless of code or version changes). A minimal sketch of this idea follows below.
Other options (ONNX, PyTorch pickling) are all either hopelessly non-portable, unstable, or simply do not work: we want persisted models to move across machines and still load correctly.
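A minimal sketch of the base64-plus-metadata idea (the helper names here are hypothetical, not existing df-analyze code):

```python
import base64
import json

import numpy as np


def array_to_json(arr: np.ndarray) -> dict:
    """Encode a NumPy array as base64 bytes plus dtype/shape metadata."""
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
    }


def array_from_json(obj: dict) -> np.ndarray:
    """Rebuild the array from the base64 payload using the stored dtype/shape."""
    raw = base64.b64decode(obj["data"])
    return np.frombuffer(raw, dtype=obj["dtype"]).reshape(obj["shape"]).copy()


coefs = np.random.default_rng(0).normal(size=(3, 4))
blob = json.dumps(array_to_json(coefs))       # survives any code/version change
restored = array_from_json(json.loads(blob))
assert np.array_equal(coefs, restored)
```

Because the payload is plain JSON and raw bytes, loading it does not depend on pickling any df-analyze class.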
We must also decide which trained models to persist:
- Tuned models trained on the full training set
- Tuned models trained on the full dataset (training and test) for applying to new data
- Both of the above
Further Thoughts
Persistence / serialization must be done with the consumer in mind. Currently, df-analyze models are DfAnalyzeModel objects, but no one can use a DfAnalyzeModel unless df-analyze also exposes a public API or library. Ideally, if ONNX actually worked for more than a handful of models, one could export the wrapped models to ONNX, but in practice it does not.
The correct solution is to make a separate library (say, df-persist) which users can install with minimal dependencies. The df-persist library would specify a serialization format, e.g. with schema:
```
{
    "model_source": one of ["scikit-learn", "LightGBM", "PyTorch"],
    "model_class": one of ["KNN", "SGDLinear", "..."],
    "hyperparams": {dict of arguments for the model_class constructor, generally},
    "params": {different for each source, but, generally, the weights},
    "meta": {whatever other junk turns out to be needed}
}
```
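As a rough illustration of the writing side, a fitted scikit-learn model could be dumped into this schema roughly as follows (`to_df_persist` and `encode_array` are hypothetical names, and the exact set of fitted attributes differs per model class):

```python
import base64
import json

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier


def encode_array(arr: np.ndarray) -> dict:
    """Base64 + dtype/shape, as sketched earlier."""
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
    }


def to_df_persist(model: SGDClassifier) -> dict:
    """Write a fitted estimator into the proposed df-persist schema."""
    return {
        "model_source": "scikit-learn",
        "model_class": "SGDLinear",
        "hyperparams": model.get_params(),
        "params": {
            "coef_": encode_array(model.coef_),
            "intercept_": encode_array(model.intercept_),
            "classes_": encode_array(model.classes_),
        },
        "meta": {"sklearn_version": sklearn.__version__},
    }


X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SGDClassifier(random_state=0).fit(X, y)
payload = to_df_persist(model)
print(json.dumps(payload)[:80])  # the whole thing is plain JSON
```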
Then df-analyze writes custom serialization routines that output the base models contained inside each DfAnalyzeModel in the df-persist format, and users can use df-persist to load a model back into an actual scikit-learn, PyTorch, or other model that can be manipulated with Python code. Updates to df-analyze will then never break a persisted model except in very limited cases, and only very significant updates to the model-provider libraries (e.g. scikit-learn changing their internals) would break df-persist.
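Continuing the sketch above, the loading side could look roughly like this (`from_df_persist` is hypothetical, and it reuses `payload` and `X` from the previous block). Whether a bare estimator with only these attributes set behaves fully like a fitted one depends on scikit-learn internals, which is exactly the breakage surface df-persist would have to track:

```python
def from_df_persist(payload: dict) -> SGDClassifier:
    """Rebuild a usable scikit-learn estimator from a df-persist payload."""
    assert payload["model_source"] == "scikit-learn"
    model = SGDClassifier(**payload["hyperparams"])
    # Restore fitted state directly; no re-training needed.
    for name, blob in payload["params"].items():
        raw = base64.b64decode(blob["data"])
        arr = np.frombuffer(raw, dtype=blob["dtype"]).reshape(blob["shape"]).copy()
        setattr(model, name, arr)
    return model


restored = from_df_persist(payload)
preds = restored.predict(X)  # uses the restored coef_/intercept_/classes_
```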
Note, of course, that this also ignores the fact that models trained by df-analyze expect data to be pre-processed using the df-analyze processing pipeline: technically, this means we would also need to expose some kind of preprocessing API...
So: why do we want persistence in the first place? If we only care about persisting models for reuse by df-analyze within a single version, then sure, we can use pickles. But beyond checkpointing, I am not sure how useful that is. If we want df-analyze to serialize models that other people can use, we basically need a df-persist (or we must limit persistence to ONNX-compatible models, which will mostly exclude GANDALF and any future deep-learning models, and certainly all custom models).