[FSTORE-1008] enable interacting with java client to hopsworks - 3.9 #432

Merged
32 changes: 18 additions & 14 deletions docs/user_guides/fs/compute_engines.md
@@ -4,31 +4,31 @@ In order to execute a feature pipeline to write to the Feature Store, as well as
Hopsworks Feature Store APIs are built around dataframes: feature data is inserted into the Feature Store from a Dataframe and, likewise, data read from the Feature Store is returned
as a Dataframe.

As such, Hopsworks supports three computational engines:
As such, Hopsworks supports five computational engines:

1. [Apache Spark](https://spark.apache.org): Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments.
2. [Python](https://www.python.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/) and [Polars Dataframes](https://pola.rs/).
3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments.
3. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments.
4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments.
5. [Java](https://www.java.com): For pure Java environments without dependencies on Spark, Hopsworks supports writing data as a list of POJO objects.

Hopsworks supports running [compute on the platform itself](../../concepts/dev/inside.md) in the form of [Jobs](../projects/jobs/pyspark_job.md) or in [Jupyter Notebooks](../projects/jupyter/python_notebook.md).
Alternatively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity.

## Functionality Support

Hopsworks is aiming to provide funtional parity between the computational engines, however, there are certain Hopsworks functionalities which are exclusive to the engines.
Hopsworks aims to provide functional parity between the computational engines; however, certain Hopsworks functionalities are exclusive to specific engines.

| Functionality | Method | Spark | Python | Flink | Beam | Comment |
| ----------------------------------------------------------------- | ------ | ----- | ------ | ------ | ------ | ------- |
| Feature Group Creation from dataframes | [`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) | :white_check_mark: | :white_check_mark: | - | - | Currently Flink/Beam doesn't support registering feature group metadata. Thus it needs to be pre-registered before you can write real time features computed by Flink/Beam.|
| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | Functionality was deprecated in version 3.0 |
| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | `insert_stream` does not perform any data validation even when a expectation suite is attached. |
| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. |
| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. |
| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam only write operations are supported |
| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. |
| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. |
| Functionality | Method | Spark | Python | Flink | Beam | Java | Comment |
| ----------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------------ | ---------------------- | ------------------ | ------------------ |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Feature Group Creation from dataframes | [`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) | :white_check_mark: | :white_check_mark: | - | - | - | Currently, Flink/Beam/Java do not support registering feature group metadata. The feature group therefore needs to be pre-registered before you can write real-time features computed by Flink/Beam/Java. |
| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | - | Functionality was deprecated in version 3.0 |
| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) <br/> [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | - | `insert_stream` does not perform any data validation, even when an expectation suite is attached. |
| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars currently has no notion of streaming. |
| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | - | Python/Pandas/Polars currently has no notion of streaming. For Flink/Beam/Java, only write operations are supported. |
| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | - | Training data that was written to external storage using a Storage Connector other than S3 cannot currently be read using HSFS APIs; instead, you will have to use the storage's native client. |
| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported; however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported; however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read the data into a Pandas/Polars Dataframe. |

## Python

@@ -77,3 +77,7 @@ Apache Beam integration with Hopsworks feature store was only tested using Dataf

For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/beam).

## Java
It is also possible to interact with the Hopsworks Feature Store from pure Java environments without dependencies on Spark, Flink or Beam.

For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/java).
3 changes: 2 additions & 1 deletion docs/user_guides/integrations/index.md
@@ -3,9 +3,10 @@
Hopsworks is an open platform aiming to be accessible from a variety of tools. Learn in this section how to connect to Hopsworks from

- [Python](python)
- [Java](java)
- [Databricks](databricks/networking)
- [AWS SageMaker](sagemaker)
- [AWS EMR](emr/networking)
- [Azure HDInsight](hdinsight)
- [Azure Machine Learning](mlstudio_designer)
- [Apache Spark](spark)
- [Apache Spark](spark)
100 changes: 100 additions & 0 deletions docs/user_guides/integrations/java.md
@@ -0,0 +1,100 @@
---
description: Documentation on how to connect to Hopsworks from a Java client.
---

# Java client

Starting from version 3.9.0-RC13, HSFS provides a pure Java client. This guide explains how to use the client to connect to Hopsworks and read or write feature data.

## Generate an API key

For instructions on how to generate an API key, follow this [user guide](../projects/api_key/create_api_key.md). For the Java client to work correctly, make sure you add the following scopes to your API key:

1. featurestore
2. project
3. job
4. kafka

## Add the HSFS dependency to your project

The HSFS library is available from the Hopsworks Maven repository. If you are using Maven as your build tool, add the following repository to your `pom.xml` file:

```xml
<repositories>
    <repository>
        <id>Hops</id>
        <name>Hops Repository</name>
        <url>https://archiva.hops.works/repository/Hops/</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>
```

The artifactId for the HSFS Java build is `hsfs`. With Maven, add the following dependency:

```xml
<dependency>
    <groupId>com.logicalclocks</groupId>
    <artifactId>hsfs</artifactId>
    <version>${hsfs.version}</version>
</dependency>
```

!!! note "Java Version"

    The Java client has been tested with Java versions up to Java 17.
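
If you want an application to notice when it runs on a newer JVM than the one mentioned in the note, a check along the following lines is one option. This is only a sketch: the class `JavaVersionCheck` and its methods are hypothetical helpers and not part of HSFS, and the threshold of 17 is taken from the note above.

```java
// Hypothetical helper, not part of HSFS: reports whether the running JVM is
// within the range this guide says was tested (up to Java 17).
final class JavaVersionCheck {
  static final int MAX_TESTED_FEATURE_VERSION = 17;

  // Converts a java.specification.version value to a feature number:
  // "1.8" becomes 8, while "11" or "17" are parsed as they are.
  static int featureVersion(String specVersion) {
    String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
    return Integer.parseInt(v);
  }

  static boolean isTested(String specVersion) {
    return featureVersion(specVersion) <= MAX_TESTED_FEATURE_VERSION;
  }

  public static void main(String[] args) {
    String spec = System.getProperty("java.specification.version");
    if (!isTested(spec)) {
      System.err.println("Warning: Java " + spec + " is newer than the tested versions");
    }
  }
}
```

The check only warns; a newer Java version may still work, it simply falls outside what the guide reports as tested.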

## Connecting to the Feature Store

You are now ready to connect to the Hopsworks Feature Store from a Java client:

```Java
// Import necessary classes
import com.logicalclocks.hsfs.FeatureStore;
import com.logicalclocks.hsfs.FeatureView;
import com.logicalclocks.hsfs.HopsworksConnection;

import java.util.HashMap;
import java.util.List;

// Establish a connection with Hopsworks
HopsworksConnection hopsworksConnection = HopsworksConnection.builder()
    .host("my_instance")          // DNS of your Feature Store instance
    .port(443)                    // Port to reach your Hopsworks instance, defaults to 443
    .project("my_project")        // Name of your Hopsworks Feature Store project
    .apiKeyValue("api_key")       // The API key to authenticate with the feature store
    .hostnameVerification(false)  // Disable for self-signed certificates
    .build();

// Get the feature store handle
FeatureStore fs = hopsworksConnection.getFeatureStore();

// Get the feature view handle; fvName and fvVersion are placeholders
// for the name and version of an existing feature view
FeatureView fv = fs.getFeatureView(fvName, fvVersion);

// Get a single feature vector for the primary key value id = 100
List<Object> singleVector = fv.getFeatureVector(new HashMap<String, Object>() {{
    put("id", 100);
}});
```
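
The double-brace `HashMap` initializer used for the primary key above creates an anonymous subclass each time it is used. As a sketch of two plainer alternatives, the following builds the same map with `Map.of` (Java 9 or later) or with an ordinary `HashMap`. The key name `id` and the value `100` come from the example above; the class `PrimaryKeyMaps` and its methods are hypothetical and not part of HSFS.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of HSFS: builds the primary-key map that is
// passed to getFeatureVector in the connection example.
final class PrimaryKeyMaps {

  // Immutable map; Map.of requires Java 9+ and rejects null keys and values.
  static Map<String, Object> withMapOf(int id) {
    return Map.of("id", id);
  }

  // Mutable map that also compiles on Java 8.
  static Map<String, Object> withHashMap(int id) {
    Map<String, Object> key = new HashMap<>();
    key.put("id", id);
    return key;
  }

  public static void main(String[] args) {
    System.out.println(withMapOf(100));
    System.out.println(withHashMap(100));
  }
}
```

Either map should be usable wherever the example passes the double-brace `HashMap`, assuming `getFeatureVector` accepts a `Map<String, Object>`; check the JavaDoc for the exact signature.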

### Update feature data

The Java client allows you to update data on existing feature groups using the streaming interface. You can provide a list of POJO objects to the [insertStream](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/javadoc/com/logicalclocks/hsfs/StreamFeatureGroup.html#insertStream-java.util.List-) method.

The feature group must already exist (it can be created using the Python client), and the POJO objects must be serializable with the feature group's Avro schema.

Please see the [tutorial](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/java) for a code example on how to write data.
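
To make the POJO requirement more concrete, here is a minimal sketch of a row class and a small batch of rows. The class `TransactionRow`, its fields `id` and `amount`, the feature group name `transactions` and the helper `sampleBatch` are all hypothetical; a real POJO needs fields whose names and types match the Avro schema of your existing feature group. The HSFS calls appear only in comments because they need a live connection: `insertStream` is the method linked above, while the `getStreamFeatureGroup` lookup is an assumption that should be checked against the JavaDoc.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJO: one instance represents one row. Field names and types
// are assumed to mirror the features of an existing feature group.
final class TransactionRow {
  private long id;
  private double amount;

  TransactionRow() { }

  TransactionRow(long id, double amount) {
    this.id = id;
    this.amount = amount;
  }

  public long getId() { return id; }
  public void setId(long id) { this.id = id; }
  public double getAmount() { return amount; }
  public void setAmount(double amount) { this.amount = amount; }
}

final class PojoBatchExample {

  // Builds a small batch of rows with ids 0..count-1.
  static List<TransactionRow> sampleBatch(int count) {
    List<TransactionRow> rows = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      rows.add(new TransactionRow(i, 10.0 * i));
    }
    return rows;
  }

  public static void main(String[] args) {
    List<TransactionRow> rows = sampleBatch(3);
    System.out.println("rows to write: " + rows.size());
    // With a live FeatureStore handle `fs`, the write would look roughly
    // like this (assumed lookup call, see the JavaDoc):
    //   StreamFeatureGroup fg = fs.getStreamFeatureGroup("transactions", 1);
    //   fg.insertStream(rows);
  }
}
```

Keeping the row type a plain class with a no-argument constructor and accessors is a conservative choice; the tutorial shows the exact shape the client expects.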

### Limitations

Currently, retrieving feature vectors with the Java client has the following limitations:

* Only the SQL interface is supported. It is not possible to retrieve feature vectors using the REST API interface.
* Model-dependent transformations attached to a Feature View are not applied. If your feature view has model-dependent transformations, please use the Python client.

## Next Steps

You can find more information on how to interact with Hopsworks from the Java client in the [JavaDoc](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/javadoc/) or in this [tutorial](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/java).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -119,6 +119,7 @@ nav:
- Apache Spark: user_guides/integrations/spark.md
- Apache Flink: user_guides/integrations/flink.md
- Apache Beam: user_guides/integrations/beam.md
- Java: user_guides/integrations/java.md
- Sharing: user_guides/fs/sharing/sharing.md
- Tags: user_guides/fs/tags/tags.md
- Provenance: user_guides/fs/provenance/provenance.md