Commit d0da4bd

Authored by bcm-at-zama, osanseviero, and pcuenca

Add a new blog post: How to use HF endpoints to run Concrete-ML privacy-preserving ML models. (huggingface#1885)

Co-authored-by: Omar Sanseviero <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>

1 parent b6d9a24 commit d0da4bd
File tree

3 files changed: +226 −1 lines changed

_blog.yml  (+13 −1)

@@ -3767,7 +3767,7 @@
   - multimodal
   - guide
   - trl
-
+
 - local: idefics2
   title: "Introducing Idefics2: A Powerful 8B Vision-Language Model for the community"
   author: Leyo
@@ -3787,3 +3787,15 @@
   date: April 16, 2024
   tags:
   - case-studies
+
+- local: fhe-endpoints
+  title: "Running Privacy-Preserving Inference on Hugging Face Endpoints"
+  author: binoua
+  guest: true
+  thumbnail: /blog/assets/fhe-endpoints/thumbnail.png
+  date: April 17, 2024
+  tags:
+  - guide
+  - privacy
+  - research
+  - FHE
assets/fhe-endpoints/thumbnail.png  (312 KB, binary file)

fhe-endpoints.md  (+213, new file)

---
title: "Running Privacy-Preserving Inferences on Hugging Face Endpoints"
thumbnail: /blog/assets/fhe-endpoints/thumbnail.png
authors:
- user: binoua
  guest: true
---

# Running Privacy-Preserving Inferences on Hugging Face Endpoints

> [!NOTE] This is a guest blog post by the Zama team. Zama is an open source cryptography company building state-of-the-art FHE solutions for blockchain and AI.

Eighteen months ago, Zama started [Concrete ML](https://github.com/zama-ai/concrete-ml), a privacy-preserving ML framework with bindings to traditional ML frameworks such as scikit-learn, ONNX, PyTorch, and TensorFlow. To ensure privacy for users' data, Zama uses Fully Homomorphic Encryption (FHE), a cryptographic tool that allows computations to be performed directly over encrypted data, without ever knowing the private key.

From the start, we wanted to pre-compile some FHE-friendly networks and make them available somewhere on the internet, so that users could run them trivially. We are ready today! And not in a random place on the internet, but directly on Hugging Face.

More precisely, we use Hugging Face [Endpoints](https://huggingface.co/docs/inference-endpoints/en/index) and [custom inference handlers](https://huggingface.co/docs/inference-endpoints/en/guides/custom_handler) to store our Concrete ML models and let users deploy them on HF machines in one click. By the end of this blog post, you will understand how to use pre-compiled models and how to prepare your own. This post can also be considered another tutorial on custom inference handlers.

## Deploying a pre-compiled model

Let's start by deploying an FHE-friendly model (prepared by Zama or third parties; see the [Preparing your pre-compiled model](#preparing-your-pre-compiled-model) section below to learn how to prepare yours).

First, look for the model you want to deploy: We have pre-compiled a [bunch of models](https://huggingface.co/zama-fhe?#models) on Zama's HF page (or you can [find them](https://huggingface.co/models?other=concrete-ml) with tags). Let's suppose you have chosen [concrete-ml-encrypted-decisiontree](https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree): As explained in the description, this pre-compiled model allows you to detect spam without looking at the message content in the clear.

As with any other model available on the Hugging Face platform, select _Deploy_ and then _Inference Endpoint (dedicated)_:

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/fhe-endpoints/inference_endpoint.png" alt="Inference Endpoint (dedicated)" style="width: 90%; height: auto;"><br>
<em>Inference Endpoint (dedicated)</em>
</p>

Next, choose the Endpoint name and the region, and, most importantly, the CPU (Concrete ML models do not use GPUs for now; we are [working](https://www.zama.ai/post/tfhe-rs-v0-5) on it) as well as the best machine available; in the example below, we chose eight vCPUs. Now click on _Create Endpoint_ and wait for the initialization to finish.

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/fhe-endpoints/create_endpoint.png" alt="Create Endpoint" style="width: 90%; height: auto;"><br>
<em>Create Endpoint</em>
</p>

After a few seconds, the Endpoint is deployed, and your privacy-preserving model is ready to operate.

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/fhe-endpoints/endpoint_is_created.png" alt="Endpoint is created" style="width: 90%; height: auto;"><br>
<em>Endpoint is created</em>
</p>

> [!NOTE] Don’t forget to delete the Endpoint (or at least pause it) when you are no longer using it, or else it will cost more than anticipated.

## Using the Endpoint

### Installing the client side

The goal is not only to deploy your Endpoint but also to let your users play with it. For that, they need to clone the repository to their computer. This is done by selecting _Clone Repository_ in the dropdown menu:

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/fhe-endpoints/clone_repository.png" alt="Clone Repository" style="width: 90%; height: auto;"><br>
<em>Clone Repository</em>
</p>

They will be given a short command that they can run in their terminal:

```bash
git clone https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree
```

Once the command is done, they can go to the `concrete-ml-encrypted-decisiontree` directory and open `play_with_endpoint.py` with their editor. Here, they will find the line with `API_URL = …`, which they should replace with the URL of the Endpoint created in the previous section.

```python
API_URL = "https://vtx9w974oxrq54ff.us-east-1.aws.endpoints.huggingface.cloud"
```

Of course, fill it in with _your_ Endpoint’s URL. Also, define an [access token](https://huggingface.co/docs/hub/en/security-tokens) and store it in an environment variable:

```bash
export HF_TOKEN=[your token hf_XX..XX]
```

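The script reads this token to authenticate its calls to the Endpoint. For intuition, a query helper along these lines would do the job; this is a minimal sketch assuming the standard `requests` library, not the exact code of `play_with_endpoint.py`:

```python
import os

import requests

API_URL = "https://vtx9w974oxrq54ff.us-east-1.aws.endpoints.huggingface.cloud"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}


def query(payload):
    # POST the JSON payload (which contains the encrypted inputs) to the
    # Endpoint and return the JSON response: the server only sees ciphertexts
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()
```
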
Lastly, your user machines need to have Concrete ML installed locally: Make a virtual environment, source it, and install the necessary dependencies:

```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install -U setuptools pip wheel
pip install -r requirements.txt
```

> [!NOTE] Remark that we currently force the use of Python 3.10 (which is also the default Python version used in Hugging Face Endpoints). This is because our development files currently depend on the Python version. We are working on making them independent; this should be available in a future version.

### Running inferences

Now, your users can run inference on the Endpoint by launching the script:

```bash
python play_with_endpoint.py
```

It should generate some logs similar to the following:

```bash
Sending 0-th piece of the key (remaining size is 71984.14 kbytes)
Storing the key in the database under uid=3307376977
Sending 1-th piece of the key (remaining size is 0.02 kbytes)
Size of the payload: 0.23 kilobytes
for 0-th input, prediction=0 with expected 0 in 3.242 seconds
for 1-th input, prediction=0 with expected 0 in 3.612 seconds
for 2-th input, prediction=0 with expected 0 in 4.765 seconds

(...)

for 688-th input, prediction=0 with expected 1 in 3.176 seconds
for 689-th input, prediction=1 with expected 1 in 4.027 seconds
for 690-th input, prediction=0 with expected 0 in 4.329 seconds
Accuracy on 691 samples is 0.8958031837916064
Total time: 2873.860 seconds
Duration per inference: 4.123 seconds
```

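The first lines show the client generating its FHE keys and uploading the (large) evaluation key to the Endpoint in pieces. Conceptually, the client side of this step looks as follows; this is a sketch based on Concrete ML's `FHEModelClient` API (the directory path is the one from the cloned repository), and the exact code in `play_with_endpoint.py` may differ:

```python
from concrete.ml.deployment import FHEModelClient

# Build the client from the files of the cloned repository and generate keys:
# the private key stays on the user's machine; only the evaluation key
# (needed by the server to compute over ciphertexts) is serialized and sent
fhemodel_client = FHEModelClient("compiled_model", key_dir="keys")
fhemodel_client.generate_private_and_evaluation_keys()
evaluation_keys = fhemodel_client.get_serialized_evaluation_keys()

# `evaluation_keys` (tens of megabytes in the logs above) is then uploaded
# to the Endpoint in pieces, using the handler's `save_key` and `append_key`
# methods described in "Under the hood" below
print(f"Evaluation key size: {len(evaluation_keys) / 1024:.2f} kbytes")
```
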
### Adapting to your application or needs

If you edit `play_with_endpoint.py`, you'll see that we iterate over different samples of the test dataset and run encrypted inferences directly on the Endpoint.

```python
for i in range(nb_samples):

    # Quantize the input and encrypt it
    encrypted_inputs = fhemodel_client.quantize_encrypt_serialize(X_test[i].reshape(1, -1))

    # Prepare the payload
    payload = {
        "inputs": "fake",
        "encrypted_inputs": to_json(encrypted_inputs),
        "method": "inference",
        "uid": uid,
    }

    if is_first:
        print(f"Size of the payload: {sys.getsizeof(payload) / 1024:.2f} kilobytes")
        is_first = False

    # Run the inference on HF servers
    duration -= time.time()
    duration_inference = -time.time()
    encrypted_prediction = query(payload)
    duration += time.time()
    duration_inference += time.time()

    encrypted_prediction = from_json(encrypted_prediction)

    # Decrypt the result and dequantize
    prediction_proba = fhemodel_client.deserialize_decrypt_dequantize(encrypted_prediction)[0]
    prediction = np.argmax(prediction_proba)

    if verbose:
        print(
            f"for {i}-th input, {prediction=} with expected {Y_test[i]} in {duration_inference:.3f} seconds"
        )

    # Measure accuracy
    nb_good += Y_test[i] == prediction
```

Of course, this is just an example of the Endpoint's usage. Developers are encouraged to adapt this example to their own use-case or application.

### Under the hood

Please note that all of this is done thanks to the flexibility of [custom handlers](https://huggingface.co/docs/inference-endpoints/en/guides/custom_handler), and we express our gratitude to the Hugging Face developers for offering such flexibility. The mechanism is defined in `handler.py`. As explained in the Hugging Face documentation, you can define the `__call__` method of `EndpointHandler` pretty much as you want: In our case, we have defined a `method` parameter, which can be `save_key` (to save FHE evaluation keys), `append_key` (to save FHE evaluation keys piece by piece if the key is too large to be sent in one single call), and finally `inference` (to run FHE inferences). These methods are used to set the evaluation key once and then run all the inferences, one by one, as seen in `play_with_endpoint.py`.

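To make this concrete, here is a simplified sketch of what such a dispatch can look like; the field names and storage details are illustrative, not the exact contents of Zama's `handler.py`:

```python
from typing import Any, Dict


class EndpointHandler:
    def __init__(self, path: str = ""):
        # In-memory store for the clients' FHE evaluation keys. As discussed
        # in "Limits" below, RAM storage is lost on restart and is not shared
        # between machines
        self.key_database: Dict[str, bytes] = {}
        # The compiled FHE circuit would be loaded from `path` here,
        # e.g., with Concrete ML's FHEModelServer

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        method = data["method"]
        uid = data["uid"]
        if method == "save_key":
            # Store the first (possibly only) piece of the evaluation key
            self.key_database[uid] = bytes.fromhex(data["evaluation_keys"])
            return {"uid": uid}
        if method == "append_key":
            # Append a further piece of a key too large for a single call
            self.key_database[uid] += bytes.fromhex(data["evaluation_keys"])
            return {"uid": uid}
        if method == "inference":
            # Run the FHE computation on the encrypted input with the stored
            # evaluation key; the plaintext is never visible on the server
            encrypted_input = bytes.fromhex(data["encrypted_inputs"])
            encrypted_result = encrypted_input  # placeholder for the FHE run
            return {"encrypted_prediction": encrypted_result.hex()}
        raise ValueError(f"Unsupported method: {method}")
```
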
### Limits

Note, however, that keys are stored in the RAM of the Endpoint, which is not convenient for a production environment: At each restart, the keys are lost and need to be re-sent. Plus, when you have several machines to handle massive traffic, this RAM is not shared between the machines. Finally, the available CPU machines only provide eight vCPUs at most for Endpoints, which could be a limitation for high-load applications.

## Preparing your pre-compiled model

Now that you know how easy it is to deploy a pre-compiled model, you may want to prepare yours. For this, you can fork [one of the repositories we have prepared](https://huggingface.co/zama-fhe?#models). All the model categories supported by Concrete ML ([linear](https://docs.zama.ai/concrete-ml/built-in-models/linear) models, [tree-based](https://docs.zama.ai/concrete-ml/built-in-models/tree) models, built-in [MLP](https://docs.zama.ai/concrete-ml/built-in-models/neural-networks), [PyTorch](https://docs.zama.ai/concrete-ml/deep-learning/torch_support) models) have at least one example that can be used as a template for new pre-compiled models.

Then, edit `creating_models.py`, and change the ML task to be the one you want to tackle in your pre-compiled model: For example, if you started with [concrete-ml-encrypted-decisiontree](https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree), change the dataset and the model kind.

As explained earlier, you must have installed Concrete ML to prepare your pre-compiled model. Remark that you may have to use the same Python version as Hugging Face Endpoints use by default (3.10 when this blog is written), or else your models may require users to deploy with a container matching your Python version.

Now you can launch `python creating_models.py`. This will train the model and create the necessary development files (`client.zip`, `server.zip`, and `versions.json`) in the `compiled_model` directory. As explained in the [documentation](https://docs.zama.ai/concrete-ml/deployment/client_server), these files contain your pre-compiled model. If you have any issues, you can get support on the [fhe.org discord](http://discord.fhe.org).

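In essence, such a script trains an FHE-friendly model, compiles it to an FHE circuit, and saves the deployment files with Concrete ML's dedicated API. A condensed sketch (the actual scripts in the repositories are more complete, and `versions.json` is handled separately):

```python
from concrete.ml.deployment import FHEModelDev
from concrete.ml.sklearn import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Train a built-in, FHE-friendly model on your task (synthetic data here)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Compile the model to an FHE circuit, using the training data
# to calibrate quantization
model.compile(X_train)

# Write client.zip and server.zip into the compiled_model directory
FHEModelDev(path_dir="compiled_model", model=model).save()
```
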
The last step is to modify `play_with_endpoint.py` to also deal with the same ML task as in `creating_models.py`: Set the dataset accordingly.

Now, you can push this directory, with the `compiled_model` directory and files as well as your modifications to `creating_models.py` and `play_with_endpoint.py`, to Hugging Face models. You will likely need to run some tests and make slight adjustments for it to work. Do not forget to add the `concrete-ml` and `FHE` tags, such that your pre-compiled model appears easily in [searches](https://huggingface.co/models?other=concrete-ml).

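For instance, one possible way to publish (assuming you have already created a model repository on the Hub and set up git-lfs for the zipped artifacts):

```bash
# Track the large compiled artifacts with git-lfs, then push everything
git lfs track "*.zip"
git add .gitattributes compiled_model creating_models.py play_with_endpoint.py
git commit -m "Add my pre-compiled Concrete ML model"
git push
```
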
## Pre-compiled models available today

For now, we have prepared a few pre-compiled models as examples, hoping the community will extend this soon. Pre-compiled models can be found by searching for the [concrete-ml](https://huggingface.co/models?other=concrete-ml) or [FHE](https://huggingface.co/models?other=FHE) tags.

| Model kind | Dataset | Execution time on HF Endpoint |
|---|---|---|
| [Logistic Regression](https://huggingface.co/zama-fhe/concrete-ml-encrypted-logreg) | Synthetic | 0.4 sec |
| [DecisionTree](https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree) | Spam | 2.0 sec |
| [QNN](https://huggingface.co/zama-fhe/concrete-ml-encrypted-qnn) | Iris | 3.7 sec |
| [CNN](https://huggingface.co/zama-fhe/concrete-ml-encrypted-deeplearning) | MNIST | 24 sec |

Keep in mind that there's a limited set of configuration options in Hugging Face for CPU-backed Endpoints (up to 8 vCPUs with 16 GB of RAM today). Depending on your production requirements and model characteristics, execution times could be faster on more powerful cloud instances. Hopefully, more powerful machines will soon be available on Hugging Face Endpoints to improve these timings.

## Additional resources

- Check out Zama's libraries [Concrete](https://github.com/zama-ai/concrete) and [Concrete-ML](https://github.com/zama-ai/concrete-ml) and start using FHE in your own applications.
- Check out [Zama's Hugging Face profile](https://huggingface.co/zama-fhe) to read more blog posts and try practical FHE demos.
- Check out [@zama_fhe](https://twitter.com/zama_fhe) on Twitter to get our latest updates.

## Conclusion and next steps

In this blog post, we have shown that custom Endpoints are easy to use yet powerful. What we do in Concrete ML is quite different from the regular workflow of ML practitioners, but we are still able to accommodate the custom Endpoints to deal with most of our needs. Kudos to the Hugging Face engineers for developing such a generic solution.

We explained how:

- Developers can create their own pre-compiled models and make them available on Hugging Face models.
- Companies can deploy developers' pre-compiled models and make them available to their users via HF Endpoints.
- End users can use these Endpoints to run their ML tasks over encrypted data.

To go further, it would be useful to have more powerful machines available on Hugging Face Endpoints to make inferences faster. Also, we could imagine that Concrete ML becomes more integrated into Hugging Face's interface and has a _Privacy-Preserving Inference Endpoint_ button, simplifying developers' lives even more. Finally, for integration in several server machines, it could be helpful to have a way to share a state between machines and keep this state non-volatile (FHE evaluation keys would be stored there).
