
Commit 24a0de2

Hopsworks Python library installation documentation improvements (#431)
* Hopsworks Python library installation documentation improvements
  - Remove references to `pip install hsfs` and `hsfs.connection()`
  - Improve the documentation for the installation of the Python library (including profiles)
  - Add documentation for the installation of the Java library
* Typo
* Fix for review
1 parent 3ea3f5d commit 24a0de2

20 files changed: +223 −468 lines changed
6 binary files changed (not shown).

docs/user_guides/client_installation/index.md

+87 −28
@@ -1,56 +1,115 @@
---
-description: Documentation on how to install the Hopsworks and HSFS Python libraries, including the specific requirements for Mac OSX and Windows.
+description: Documentation on how to install the Hopsworks Python and Java libraries.
---

# Client Installation Guide

-## Hopsworks (including Feature Store and MLOps)
-The Hopsworks client library is required to connect to the Hopsworks Feature Store and MLOps services from your local machine or any other Python environment such as Google Colab or AWS Sagemaker. Execute the following command to install the full Hopsworks client library in your Python environment:
+## Hopsworks Python library
+
+The Hopsworks Python client library is required to connect to Hopsworks from your local machine or any other Python environment such as Google Colab or AWS SageMaker. Execute the following command to install the Hopsworks client library in your Python environment:

!!! note "Virtual environment"
    It is recommended to use a virtual Python environment instead of the system environment used by your operating system, in order to avoid side effects from interfering dependencies.

-```bash
-pip install hopsworks
-```
-Supported versions of Python: 3.8, 3.9, 3.10, 3.11, 3.12 ([PyPI ↗](https://pypi.org/project/hopsworks/))
-
-!!! attention "OSX Installation"
-    Hopsworks latest version should work on OSX systems without any additional requirements. However if installing an older version of the Hopsworks SDK you might need to install `librdkafka` manually. Checkout the documentation for the specific version you are installing.
-
!!! attention "Windows/Conda Installation"

    On Windows systems you might need to install twofish manually before installing hopsworks, if you don't have the Microsoft Visual C++ Build Tools installed. In that case, it is recommended to use a conda environment and run the following commands:

    ```bash
    conda install twofish
-    pip install hopsworks
+    pip install hopsworks[python]
    ```

-## Feature Store only
-To only install the Hopsworks Feature Store client library, execute the following command:
+```bash
+pip install hopsworks[python]
+```
+Supported versions of Python: 3.8, 3.9, 3.10, 3.11, 3.12 ([PyPI ↗](https://pypi.org/project/hopsworks/))
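+
+To quickly verify that the installation succeeded, you can, for example, print the package metadata:
+
+```bash
+# Shows the installed hopsworks version and its dependencies
+pip show hopsworks
+```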
+
+### Profiles
+
+The Hopsworks library has several installation profiles that bring in additional dependencies and enable additional functionality:
+
+| Profile Name         | Description |
+| -------------------- | ----------- |
+| No Profile           | This is the base installation. Supports interacting with the feature store metadata, model registry and deployments. It also supports reading and writing from the feature store from PySpark environments. |
+| `python`             | This profile enables reading from and writing to the feature store from a Python environment |
+| `great-expectations` | This profile installs the [Great Expectations](https://greatexpectations.io/) Python library and enables data validation on feature pipelines |
+| `polars`             | This profile installs the [Polars](https://pola.rs/) library and enables reading and writing Polars DataFrames |
+
+You can install all of the above profiles with the following command:

```bash
-pip install hsfs[python]
-# or if using zsh
-pip install 'hsfs[python]'
+pip install hopsworks[python,great-expectations,polars]
```
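+
+Note that some shells, such as zsh, treat the square brackets as a globbing pattern, so the argument must be quoted:
+
+```bash
+# zsh: quote the extras to avoid "no matches found" errors
+pip install 'hopsworks[python,great-expectations,polars]'
+```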
-Supported versions of Python: 3.8, 3.9, 3.10, 3.11, 3.12 ([PyPI ↗](https://pypi.org/project/hsfs/))

-!!! attention "OSX Installation"
-    Hopsworks latest version should work on OSX systems without any additional requirements. However if installing an older version of the Hopsworks SDK you might need to install `librdkafka` manually. Checkout the documentation for the specific version you are installing.
+## HSFS Java library

-!!! attention "Windows/Conda Installation"
+If you want to interact with the Hopsworks Feature Store from environments such as Spark, Flink or Beam, you can use the Hopsworks Feature Store (HSFS) Java library.

-    On Windows systems you might need to install twofish manually before installing hsfs, if you don't have the Microsoft Visual C++ Build Tools installed. In that case, it is recommended to use a conda environment and run the following commands:
-
-    ```bash
-    conda install twofish
-    pip install hsfs[python]
-    ```
+!!! note "Feature Store Only"
+
+    The Java library only allows interaction with the Feature Store component of the Hopsworks platform. Additionally, each environment may restrict which API operations are supported; you can see which API operations each environment supports [here](../fs/compute_engines).
+
+The HSFS library is available in the Hopsworks Maven repository. If you are using Maven as your build tool, you can add the following to your `pom.xml` file:
+
+```xml
+<repositories>
+    <repository>
+        <id>Hops</id>
+        <name>Hops Repository</name>
+        <url>https://archiva.hops.works/repository/Hops/</url>
+        <releases>
+            <enabled>true</enabled>
+        </releases>
+        <snapshots>
+            <enabled>true</enabled>
+        </snapshots>
+    </repository>
+</repositories>
+```
+
+The library has different builds targeting different environments:
+
+### Spark
+
+The `artifactId` for the Spark build is `hsfs-spark-spark{spark.version}`. If you are using Maven as your build tool, you can add the following dependency:
+
+```xml
+<dependency>
+    <groupId>com.logicalclocks</groupId>
+    <artifactId>hsfs-spark-spark3.1</artifactId>
+    <version>${hsfs.version}</version>
+</dependency>
+```
+
+Hopsworks provides builds for Spark 3.1, 3.3 and 3.5. The builds are also provided as JAR files which can be downloaded from the [Hopsworks repository](https://repo.hops.works/master/hsfs).
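+
+If you use the pre-built JARs directly, you can attach them to a Spark job with, for example, the `--jars` flag of `spark-submit` (the file and script names below are hypothetical; use the JAR matching your Spark version):
+
+```bash
+# Hypothetical names: attach a downloaded HSFS JAR to a Spark job
+spark-submit --jars hsfs-spark-spark3.5.jar my_feature_pipeline.py
+```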
+
+### Flink
+
+The `artifactId` for the Flink build is `hsfs-flink`. If you are using Maven as your build tool, you can add the following dependency:
+
+```xml
+<dependency>
+    <groupId>com.logicalclocks</groupId>
+    <artifactId>hsfs-flink</artifactId>
+    <version>${hsfs.version}</version>
+</dependency>
+```
+
+### Beam
+
+The `artifactId` for the Beam build is `hsfs-beam`. If you are using Maven as your build tool, you can add the following dependency:
+
+```xml
+<dependency>
+    <groupId>com.logicalclocks</groupId>
+    <artifactId>hsfs-beam</artifactId>
+    <version>${hsfs.version}</version>
+</dependency>
+```
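+
+The dependency snippets above reference a `${hsfs.version}` Maven property. A minimal sketch of defining it in your `pom.xml` (the version shown is hypothetical; pick the release matching your Hopsworks cluster):
+
+```xml
+<properties>
+    <!-- Hypothetical version; align it with your Hopsworks deployment -->
+    <hsfs.version>3.8.0</hsfs.version>
+</properties>
+```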

## Next Steps

-If you are using a local python environment and want to connect to the Hopsworks Feature Store, you can follow the [Python Guide](../integrations/python.md#generate-an-api-key) section to create an API Key and to get started.
+If you are using a local Python environment and want to connect to Hopsworks, you can follow the [Python Guide](../integrations/python.md#generate-an-api-key) section to create an API key and get started.

## Other environments

docs/user_guides/fs/sharing/sharing.md

+4 −4

@@ -64,12 +64,12 @@ To access features from a shared feature store you need to first retrieve the handle
To retrieve the handle, use the `get_feature_store()` method and provide the name of the shared feature store:

```python
-import hsfs
+import hopsworks

-connection = hsfs.connection()
+project = hopsworks.login()

-project_feature_store = connection.get_feature_store()
-shared_feature_store = connection.get_feature_store(name="name_of_shared_feature_store")
+project_feature_store = project.get_feature_store()
+shared_feature_store = project.get_feature_store(name="name_of_shared_feature_store")
```
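+
+Features can then be fetched from either handle in the same way; for example (the feature group name below is hypothetical):
+
+```python
+# Retrieve a feature group from the shared feature store
+fg = shared_feature_store.get_feature_group("transactions", version=1)
+```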

### Step 2: Fetch feature groups

docs/user_guides/fs/storage_connector/usage.md

+3 −4

@@ -14,11 +14,10 @@ We retrieve a storage connector simply by its unique name.

=== "PySpark"
    ```python
-    import hsfs
+    import hopsworks
    # Connect to the Hopsworks feature store
-    hsfs_connection = hsfs.connection()
-    # Retrieve the metadata handle
-    feature_store = hsfs_connection.get_feature_store()
+    project = hopsworks.login()
+    feature_store = project.get_feature_store()
    # Retrieve storage connector
    connector = feature_store.get_storage_connector('connector_name')
    ```
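+
+    The connector handle can then be used to read data into a DataFrame; a minimal sketch (the `data_format` and `path` values are hypothetical):
+
+    ```python
+    # Read data through the storage connector
+    df = connector.read(data_format="csv", path="path/to/data.csv")
+    ```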
docs/user_guides/integrations/databricks/api_key.md
@@ -1,6 +1,6 @@
# Hopsworks API key

-In order for the Databricks cluster to be able to communicate with the Hopsworks Feature Store, the clients running on Databricks need to be able to access a Hopsworks API key.
+In order for the Databricks cluster to be able to communicate with Hopsworks, clients running on Databricks need to be able to access a Hopsworks API key.

## Generate an API key

@@ -15,127 +15,19 @@ For instructions on how to generate an API key follow this [user guide](../../pr

!!! hint "API key as Argument"
    To get started quickly, without saving the Hopsworks API key in a secret storage, you can simply supply it as an argument when instantiating a connection:
-```python hl_lines="6"
-import hsfs
-conn = hsfs.connection(
-    host='my_instance', # DNS of your Feature Store instance
-    port=443, # Port to reach your Hopsworks instance, defaults to 443
-    project='my_project', # Name of your Hopsworks Feature Store project
-    api_key_value='apikey', # The API key to authenticate with Hopsworks
-    hostname_verification=True # Disable for self-signed certificates
-)
-fs = conn.get_feature_store() # Get the project's default feature store
-```

-## Store the API key

-### AWS
-
-#### Step 1: Create an instance profile to attach to your Databricks clusters
-
-Go to the *AWS IAM* choose *Roles* and click on *Create Role*. Select *AWS Service* as the type of trusted entity and *EC2* as the use case as shown below:
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/create-instance-profile.png" alt="Create an instance profile">
-    <figcaption>Create an instance profile</figcaption>
-  </figure>
-</p>
-
-Click on *Next: Permissions*, *Next:Tags*, and then *Next: Review*. Name the instance profile role and then click *Create role*.
-
-#### Step 2: Storing the API Key
-
-**Option 1: Using the AWS Systems Manager Parameter Store**
-
-In the AWS Management Console, ensure that your active region is the region you use for Databricks.
-Go to the *AWS Systems Manager* choose *Parameter Store* and select *Create Parameter*.
-As name enter `/hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key` replacing `[MY_DATABRICKS_ROLE]` with the name of the AWS role you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters). Select *Secure String* as type and create the parameter.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_parameter_store.png" alt="Storing the Feature Store API key in the Parameter Store">
-    <figcaption>Storing the Feature Store API key in the Parameter Store</figcaption>
-  </figure>
-</p>
-
-
-Once the API Key is stored, you need to grant access to it from the AWS role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-In the AWS Management Console, go to *IAM*, select *Roles* and then search for the role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-Select *Add inline policy*. Choose *Systems Manager* as service, expand the *Read* access level and check *GetParameter*.
-Expand Resources and select *Add ARN*.
-Enter the region of the *Systems Manager* as well as the name of the parameter **WITHOUT the leading slash** e.g. *hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key* and click *Add*.
-Click on *Review*, give the policy a name and click on *Create policy*.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_parameter_store_policy.png" alt="Configuring the access policy for the Parameter Store">
-    <figcaption>Configuring the access policy for the Parameter Store</figcaption>
-  </figure>
-</p>
-
-
-**Option 2: Using the AWS Secrets Manager**
-
-In the AWS management console ensure that your active region is the region you use for Databricks.
-Go to the *AWS Secrets Manager* and select *Store new secret*. Select *Other type of secrets* and add *api-key*
-as the key and paste the API key created in the previous step as the value. Click next.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_step_1.png" alt="Storing a Feature Store API key in the Secrets Manager Step 1">
-    <figcaption>Storing a Feature Store API key in the Secrets Manager Step 1</figcaption>
-  </figure>
-</p>
-
-As secret name, enter *hopsworks/role/[MY_DATABRICKS_ROLE]* replacing [MY_DATABRICKS_ROLE] with the AWS role you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters). Select next twice and finally store the secret.
-Then click on the secret in the secrets list and take note of the *Secret ARN*.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_step_2.png" alt="Storing a Feature Store API key in the Secrets Manager Step 2">
-    <figcaption>Storing a Feature Store API key in the Secrets Manager Step 2</figcaption>
-  </figure>
-</p>
-
-Once the API Key is stored, you need to grant access to it from the AWS role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-In the AWS Management Console, go to *IAM*, select *Roles* and then the role that that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-Select *Add inline policy*. Choose *Secrets Manager* as service, expand the *Read* access level and check *GetSecretValue*.
-Expand Resources and select *Add ARN*. Paste the ARN of the secret created in the previous step.
-Click on *Review*, give the policy a name and click on *Create policy*.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_policy.png" alt="Configuring the access policy for the Secrets Manager">
-    <figcaption>Configuring the access policy for the Secrets Manager</figcaption>
-  </figure>
-</p>
-
-#### Step 3: Allow Databricks to use the AWS role created in Step 1
-
-First you need to get the AWS role used by Databricks for deployments as described in [this step](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-3-note-the-iam-role-used-to-create-the-databricks-deployment). Once you get the role name, go to *AWS IAM*, search for the role, and click on it. Then, select the *Permissions* tab, click on *Add inline policy*, select the *JSON* tab, and paste the following snippet. Replace *[ACCOUNT_ID]* with your AWS account id, and *[MY_DATABRICKS_ROLE]* with the AWS role name created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-
-```json
-{
-    "Version": "2012-10-17",
-    "Statement": [
-        {
-            "Sid": "PassRole",
-            "Effect": "Allow",
-            "Action": "iam:PassRole",
-            "Resource": "arn:aws:iam::[ACCOUNT_ID]:role/[MY_DATABRICKS_ROLE]"
-        }
-    ]
-}
+```python hl_lines="6"
+import hopsworks
+project = hopsworks.login(
+    host='my_instance', # DNS of your Feature Store instance
+    port=443, # Port to reach your Hopsworks instance, defaults to 443
+    project='my_project', # Name of your Hopsworks Feature Store project
+    api_key_value='apikey', # The API key to authenticate with Hopsworks
+)
+fs = project.get_feature_store() # Get the project's default feature store
```
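+
+If you prefer not to hardcode the key in a notebook, a minimal sketch of reading it from a file instead (the file name `featurestore.key` is just an example):
+
+```python
+import hopsworks
+
+# Read the API key from a local file rather than hardcoding it
+with open("featurestore.key") as f:
+    api_key = f.read().strip()
+
+project = hopsworks.login(
+    host='my_instance',
+    project='my_project',
+    api_key_value=api_key,
+)
+```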

-Click *Review Policy*, name the policy, and click *Create Policy*. Then, go to your Databricks workspace and follow [this step](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-5-add-the-instance-profile-to-databricks) to add the instance profile to your workspace. Finally, when launching Databricks clusters, select *Advanced* settings and choose the instance profile you have just added.
-
-
-### Azure
-
-On Azure we currently do not support storing the API key in a secret storage. Instead just store the API key in a file in your Databricks workspace so you can access it when connecting to the Feature Store.
-
## Next Steps

Continue with the [configuration guide](configuration.md) to finalize the configuration of the Databricks Cluster to communicate with the Hopsworks Feature Store.

docs/user_guides/integrations/databricks/configuration.md

+10 −32
@@ -90,38 +90,16 @@ When a cluster is configured for a specific project user, all the operations wit
At the end of the configuration, Hopsworks will start the cluster.
Once the cluster is running, users can establish a connection to the Hopsworks Feature Store from Databricks:

-!!! note "API key on Azure"
-    Please note, for Azure it is necessary to store the Hopsworks API key locally on the cluster as a file. As we currently do not support storing the API key on an Azure Secret Management Service as we do for AWS. Consult the [API key guide for Azure](api_key.md#azure), for more information.
-
-=== "AWS"
-
-    ```python
-    import hsfs
-    conn = hsfs.connection(
-        'my_instance', # DNS of your Feature Store instance
-        443, # Port to reach your Hopsworks instance, defaults to 443
-        'my_project', # Name of your Hopsworks Feature Store project
-        secrets_store='secretsmanager', # Either parameterstore or secretsmanager
-        hostname_verification=True # Disable for self-signed certificates
-    )
-    fs = conn.get_feature_store() # Get the project's default feature store
-    ```
-
-=== "Azure"
-
-    ```python
-    import hsfs
-    conn = hsfs.connection(
-        'my_instance', # DNS of your Feature Store instance
-        443, # Port to reach your Hopsworks instance, defaults to 443
-        'my_project', # Name of your Hopsworks Feature Store project
-        secrets_store='local',
-        api_key_file="featurestore.key", # For Azure, store the API key locally
-        secrets_store = "local",
-        hostname_verification=True # Disable for self-signed certificates
-    )
-    fs = conn.get_feature_store() # Get the project's default feature store
-    ```
+```python
+import hopsworks
+project = hopsworks.login(
+    host='my_instance', # DNS of your Hopsworks instance
+    port=443, # Port to reach your Hopsworks instance, defaults to 443
+    project='my_project', # Name of your Hopsworks project
+    api_key_value='apikey', # The API key to authenticate with Hopsworks
+)
+fs = project.get_feature_store() # Get the project's default feature store
+```

## Next Steps