Description
Currently, models are not persisted, because jsonpickle breaks when interacting with NumPy and when the code changes. Models therefore require custom persistence in a custom .json format (most likely with weights and coefficients stored as base64, plus JSON fields for the size and dtype, so that arrays load correctly regardless of code or version changes). A minimal sketch of this idea follows below.
Other options (ONNX, PyTorch pickling) are all either hopelessly non-portable, unstable, or simply do not work: we want persisted models to move across machines and still load correctly.
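A minimal sketch of the base64-plus-metadata idea (the helper names here are hypothetical, not existing df-analyze code):

```python
import base64
import json

import numpy as np


def array_to_json(arr: np.ndarray) -> dict:
    """Encode a NumPy array as base64 bytes plus dtype/shape metadata."""
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
    }


def array_from_json(obj: dict) -> np.ndarray:
    """Rebuild the array from the base64 payload using the stored dtype/shape."""
    raw = base64.b64decode(obj["data"])
    return np.frombuffer(raw, dtype=obj["dtype"]).reshape(obj["shape"]).copy()


coefs = np.random.default_rng(0).normal(size=(3, 4))
blob = json.dumps(array_to_json(coefs))       # survives any code/version change
restored = array_from_json(json.loads(blob))
assert np.array_equal(coefs, restored)
```

Because the payload is plain JSON and raw bytes, loading it does not depend on pickling any df-analyze class.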
We must also decide which trained models to persist:
- Tuned models trained on the full training set
- Tuned models trained on the full dataset (training and test) for applying to new data
- Both of the above
Further Thoughts
Persistence / serialization must be done with the consumer in mind. Currently, df-analyze models are DfAnalyzeModel objects, but no one can use a DfAnalyzeModel unless df-analyze also exposes a public API or library. Ideally, if ONNX actually worked for more than a handful of models, one could export the wrapped models to ONNX, but in practice it does not.
The correct solution is to make a separate library (say, df-persist) which users can install with minimal dependencies. The df-persist library would specify a serialization format, e.g. with schema:
```
{
    "model_source": one of ["scikit-learn", "LightGBM", "PyTorch"],
    "model_class": one of ["KNN", "SGDLinear", "..."],
    "hyperparams": {dict of arguments for the model_class constructor, generally},
    "params": {different for each source, but, generally, the weights},
    "meta": {whatever other junk turns out to be needed}
}
```
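As a rough illustration of the writing side, a fitted scikit-learn model could be dumped into this schema roughly as follows (`to_df_persist` and `encode_array` are hypothetical names, and the exact set of fitted attributes differs per model class):

```python
import base64
import json

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier


def encode_array(arr: np.ndarray) -> dict:
    """Base64 + dtype/shape, as sketched earlier."""
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
    }


def to_df_persist(model: SGDClassifier) -> dict:
    """Write a fitted estimator into the proposed df-persist schema."""
    return {
        "model_source": "scikit-learn",
        "model_class": "SGDLinear",
        "hyperparams": model.get_params(),
        "params": {
            "coef_": encode_array(model.coef_),
            "intercept_": encode_array(model.intercept_),
            "classes_": encode_array(model.classes_),
        },
        "meta": {"sklearn_version": sklearn.__version__},
    }


X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SGDClassifier(random_state=0).fit(X, y)
payload = to_df_persist(model)
print(json.dumps(payload)[:80])  # the whole thing is plain JSON
```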
Then df-analyze writes custom serialization routines that output the base models contained inside each DfAnalyzeModel in the df-persist format, and users can use df-persist to load a model back into an actual scikit-learn, PyTorch, or other model that can be manipulated with Python code. Updates to df-analyze will then never break a persisted model except in very limited cases, and only very significant updates to the model-provider libraries (e.g. scikit-learn changing their internals) would break df-persist.
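Continuing the sketch above, the loading side could look roughly like this (`from_df_persist` is hypothetical, and it reuses `payload` and `X` from the previous block). Whether a bare estimator with only these attributes set behaves fully like a fitted one depends on scikit-learn internals, which is exactly the breakage surface df-persist would have to track:

```python
def from_df_persist(payload: dict) -> SGDClassifier:
    """Rebuild a usable scikit-learn estimator from a df-persist payload."""
    assert payload["model_source"] == "scikit-learn"
    model = SGDClassifier(**payload["hyperparams"])
    # Restore fitted state directly; no re-training needed.
    for name, blob in payload["params"].items():
        raw = base64.b64decode(blob["data"])
        arr = np.frombuffer(raw, dtype=blob["dtype"]).reshape(blob["shape"]).copy()
        setattr(model, name, arr)
    return model


restored = from_df_persist(payload)
preds = restored.predict(X)  # uses the restored coef_/intercept_/classes_
```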
Note, of course, that this also ignores the fact that models trained by df-analyze expect data to be pre-processed using the df-analyze processing pipeline: technically, this means we would also need to expose some kind of preprocessing API...
So: why do we want persistence in the first place? If we only care about persisting models for reuse by df-analyze within a single version, then sure, we can use pickles. But beyond checkpointing, I am not sure how useful that is. If we want df-analyze to serialize models that other people can use, we basically need a df-persist (or we must limit persistence to ONNX-compatible models, which will mostly exclude GANDALF and any future deep-learning models, and certainly all custom models).