`docs/user_guides/projects/jobs/notebook_job.md` (29 additions, 3 deletions)
@@ -82,7 +82,7 @@ It is possible to also set following configuration settings for a `PYTHON` job.

* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Jupyter Notebook script
* `Container cores`: The number of cores to be allocated for the Jupyter Notebook script
- * `Additional files`: List of files that will be locally accessible by the application
+ * `Additional files`: List of files that will be locally accessible in the working directory of the application. Only recommended to use if project datasets are not mounted under `/hopsfs`.

You can always modify the arguments in the job settings.

<p align="center">
@@ -142,7 +142,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

|`type`| string | Type of the job configuration |`"pythonJobConfiguration"`|
|`appPath`| string | Project path to notebook (e.g `Resources/foo.ipynb`) |`null`|
|`environmentName`| string | Name of the python environment |`"pandas-training-pipeline"`|
|`resourceConfig.cores`| number (float) | Number of CPU cores to be allocated |`1.0`|
|`resourceConfig.memory`| number (int) | Number of MBs to be allocated |`2048`|
|`resourceConfig.gpus`| number (int) | Number of GPUs to be allocated |`0`|
|`logRedirection`| boolean | Whether logs are redirected |`true`|
|`jobType`| string | Type of job |`"PYTHON"`|
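The table above maps directly onto the configuration dictionary returned by the jobs API. As a rough sketch (the job name, the notebook path, and the assumption that `create_job` accepts the modified dictionary are illustrative, not taken from this page), overriding a default could look like:

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_job_api()

# Fetch the default PYTHON configuration described in the table above
notebook_config = jobs_api.get_configuration("PYTHON")
notebook_config["appPath"] = "Resources/foo.ipynb"  # notebook to run

# "notebook_job_example" is a hypothetical job name
job = jobs_api.create_job("notebook_job_example", notebook_config)
```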
## Accessing project data

!!! notice "Recommended approach if `/hopsfs` is mounted"
    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section instead of the `Additional files` property to reference file resources.

### Absolute paths

The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.

### Relative paths

The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
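As a minimal sketch of both styles, assuming the `/hopsfs` mount is present and that `data.csv` exists in the `Resources` dataset, a notebook cell could read and write like this:

```python
import pandas as pd

# Absolute path through the /hopsfs mount
df = pd.read_csv("/hopsfs/Resources/data.csv")

# Relative path, if this notebook itself lives in the Resources dataset
# df = pd.read_csv("data.csv")

# A file written with a relative path ends up in the same dataset as the notebook
df.head().to_csv("output.txt", index=False)
```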
|`type`| string | Type of the job configuration |`"sparkJobConfiguration"`|
|`appPath`| string | Project path to script (e.g `Resources/foo.py`) |`null`|
|`environmentName`| string | Name of the project spark environment |`"spark-feature-pipeline"`|
|`spark.driver.cores`| number (float) | Number of CPU cores allocated for the driver |`1.0`|
|`spark.driver.memory`| number (int) | Memory allocated for the driver (in MB) |`2048`|
|`spark.executor.instances`| number (int) | Number of executor instances |`1`|
|`spark.executor.cores`| number (float) | Number of CPU cores per executor |`1.0`|
|`spark.executor.memory`| number (int) | Memory allocated per executor (in MB) |`4096`|
|`spark.dynamicAllocation.enabled`| boolean | Enable dynamic allocation of executors |`true`|
|`spark.dynamicAllocation.minExecutors`| number (int) | Minimum number of executors with dynamic allocation |`1`|
|`spark.dynamicAllocation.maxExecutors`| number (int) | Maximum number of executors with dynamic allocation |`2`|
|`spark.dynamicAllocation.initialExecutors`| number (int) | Initial number of executors with dynamic allocation |`1`|
|`spark.blacklist.enabled`| boolean | Whether executor/node blacklisting is enabled |`false`|
## Accessing project data

### Read directly from the filesystem (recommended)

To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
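A minimal PySpark sketch of such a read, assuming a `SparkSession` named `spark` is provided to the job and using the `/Projects/<project>/<dataset>` path layout:

```python
# `spark` is assumed to be the SparkSession provided to the PySpark job
df = (
    spark.read
    .option("header", "true")       # CSV has a header row
    .option("inferSchema", "true")  # infer column types
    .csv("/Projects/my_project/Resources/data.csv")
)

df.show()
```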
Different file types can be attached to the spark job and made available in the `/srv/hops/artifacts` folder when the PySpark job is started. This configuration is mainly useful when you need to add additional setup, such as jars that need to be added to the CLASSPATH.

When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
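As a small sketch of using an attached file from the artifacts folder (the file name `settings.json` is purely illustrative, not something this page prescribes):

```python
import json

# Files attached via `Additional files` are placed in /srv/hops/artifacts;
# settings.json is a hypothetical attached file
with open("/srv/hops/artifacts/settings.json") as f:
    settings = json.load(f)

print(settings)
```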
`docs/user_guides/projects/jobs/python_job.md` (30 additions, 3 deletions)
@@ -81,7 +81,8 @@ It is possible to also set following configuration settings for a `PYTHON` job.

* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Python script
* `Container cores`: The number of cores to be allocated for the Python script
- * `Additional files`: List of files that will be locally accessible by the application
+ * `Additional files`: List of files that will be locally accessible in the working directory of the application. Only recommended to use if project datasets are not mounted under `/hopsfs`.
+ You can always modify the arguments in the job settings.

<p align="center">
<figure>
@@ -129,7 +130,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

|`type`| string | Type of the job configuration |`"pythonJobConfiguration"`|
|`appPath`| string | Project path to script (e.g `Resources/foo.py`) |`null`|
|`environmentName`| string | Name of the project python environment |`"pandas-training-pipeline"`|
|`resourceConfig.cores`| number (float) | Number of CPU cores to be allocated |`1.0`|
|`resourceConfig.memory`| number (int) | Number of MBs to be allocated |`2048`|
|`resourceConfig.gpus`| number (int) | Number of GPUs to be allocated |`0`|
|`logRedirection`| boolean | Whether logs are redirected |`true`|
|`jobType`| string | Type of job |`"PYTHON"`|
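Once such a job exists, a rough sketch of launching it with arguments could look like the following (the job name and argument string are illustrative, and the exact `run` signature should be checked against the jobs API reference):

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_job_api()

job = jobs_api.get_job("python_job_example")  # illustrative job name
execution = job.run(args="--rows 100", await_termination=True)

print(execution.success)  # True if the execution finished without errors
```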
## Accessing project data

!!! notice "Recommended approach if `/hopsfs` is mounted"
    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section instead of the `Additional files` property to reference file resources.

### Absolute paths

The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your script.

### Relative paths

The script's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
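A minimal sketch with plain Python file handling, assuming the `/hopsfs` mount and a script that lives in the `Resources` dataset:

```python
# Absolute path through the /hopsfs mount
with open("/hopsfs/Resources/data.csv") as f:
    header = f.readline()

# Relative path: written next to the script, i.e. into the Resources dataset
with open("output.txt", "w") as f:
    f.write(header)
```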
`docs/user_guides/projects/jobs/ray_job.md` (8 additions, 3 deletions)
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Ray job on Hopsworks

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

- - Python (*Hopsworks Enterprise only*)
+ - Python
  - Apache Spark
  - Ray
@@ -168,7 +168,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python
jobs_api = project.get_job_api()  # previously: project.get_jobs_api()

ray_config = jobs_api.get_configuration("RAY")
```
@@ -203,7 +203,12 @@ print(f_err.read())

- ### API Reference
+ ## Accessing project data

The project datasets are mounted under `/home/yarnapp/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv` in your script.
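A minimal sketch of a Ray job script using that mount; how the script attaches to the cluster can vary, so `ray.init(address="auto")` and the task below are illustrative rather than prescribed by this page:

```python
import ray
import pandas as pd

ray.init(address="auto")  # attach to the Ray cluster started for this job

@ray.remote
def count_rows(path: str) -> int:
    return len(pd.read_csv(path))

print(ray.get(count_rows.remote("/home/yarnapp/hopsfs/Resources/data.csv")))
```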
|`type`| string | Type of the job configuration |`"sparkJobConfiguration"`|
|`appPath`| string | Project path to spark program (e.g `Resources/foo.jar`) |`null`|
|`mainClass`| string | Name of the main class to run (e.g `org.company.Main`) |`null`|
|`environmentName`| string | Name of the project spark environment |`"spark-feature-pipeline"`|
|`spark.driver.cores`| number (float) | Number of CPU cores allocated for the driver |`1.0`|
|`spark.driver.memory`| number (int) | Memory allocated for the driver (in MB) |`2048`|
|`spark.executor.instances`| number (int) | Number of executor instances |`1`|
|`spark.executor.cores`| number (float) | Number of CPU cores per executor |`1.0`|
|`spark.executor.memory`| number (int) | Memory allocated per executor (in MB) |`4096`|
|`spark.dynamicAllocation.enabled`| boolean | Enable dynamic allocation of executors |`true`|
|`spark.dynamicAllocation.minExecutors`| number (int) | Minimum number of executors with dynamic allocation |`1`|
|`spark.dynamicAllocation.maxExecutors`| number (int) | Maximum number of executors with dynamic allocation |`2`|
|`spark.dynamicAllocation.initialExecutors`| number (int) | Initial number of executors with dynamic allocation |`1`|
|`spark.blacklist.enabled`| boolean | Whether executor/node blacklisting is enabled |`false`|
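As a rough sketch of adjusting a few of these defaults before creating a job; the job name and values are illustrative, and it is an assumption that the configuration dictionary uses the dotted keys exactly as listed above:

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_job_api()

spark_config = jobs_api.get_configuration("SPARK")
spark_config["appPath"] = "Resources/foo.jar"
spark_config["mainClass"] = "org.company.Main"
spark_config["spark.executor.memory"] = 8192              # MB per executor
spark_config["spark.dynamicAllocation.maxExecutors"] = 4  # allow further scale-out

job = jobs_api.create_job("spark_job_example", spark_config)  # hypothetical job name
```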
## Accessing project data

### Read directly from the filesystem (recommended)

To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:

```java
Dataset<Row> df = spark.read()
    .option("header", "true")       // CSV has header
    .option("inferSchema", "true")  // Infer data types
    .csv("/Projects/my_project/Resources/data.csv");

df.show();
```
### Additional files

Different file types can be attached to the spark job and made available in the `/srv/hops/artifacts` folder when the Spark job is started. This configuration is mainly useful when you need to add additional configuration, such as jars that need to be added to the CLASSPATH.

When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
`docs/user_guides/projects/jupyter/python_notebook.md` (12 additions, 1 deletion)
@@ -5,7 +5,7 @@

Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop.

* Supports JupyterLab and the classic Jupyter front-end
- * Configured with Python and PySpark kernels
+ * Configured with Python3, PySpark and Ray kernels

## Step 1: Jupyter dashboard
@@ -82,6 +82,17 @@ Start the Jupyter instance by clicking the `Run Jupyter` button.

</figure>
</p>

## Accessing project data

!!! notice "Recommended approach if `/hopsfs` is mounted"
    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section.
    If the file system is not mounted, then project files can be localized into the current working directory using the [download api](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#download).

### Absolute paths

The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.

### Relative paths

The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
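A minimal sketch of the fallback when `/hopsfs` is not mounted; the exact `download` signature may differ, so treat this as an assumption to verify against the linked download api reference:

```python
import hopsworks
import pandas as pd

project = hopsworks.login()
dataset_api = project.get_dataset_api()

# Localize Resources/data.csv into the notebook's current working directory
local_path = dataset_api.download("Resources/data.csv", overwrite=True)

df = pd.read_csv(local_path)
```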
`docs/user_guides/projects/jupyter/ray_notebook.md` (5 additions, 1 deletion)
@@ -139,4 +139,8 @@ In the Ray Dashboard, you can monitor the resources used by code you are running

<img src="../../../../assets/images/guides/jupyter/ray_jupyter_notebook_session.png" alt="Access Ray Dashboard">
<figcaption>Access Ray Dashboard for Jupyter Ray session</figcaption>
</figure>
</p>

## Accessing project data

The project datasets are mounted under `/home/yarnapp/hopsfs` in the Ray containers, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv`.
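A small sketch of reading that file from a Ray notebook cell; using Ray Data here is an illustrative choice rather than something this page prescribes:

```python
import ray

# The mount is available inside the Ray containers, so workers can read the same path
ds = ray.data.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")
print(ds.count())
```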
`docs/user_guides/projects/jupyter/spark_notebook.md` (17 additions, 0 deletions)
@@ -135,6 +135,23 @@ Navigate back to Hopsworks and a Spark session will have appeared, click on the

</figure>
</p>

## Accessing project data

### Read directly from the filesystem (recommended)

To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
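A minimal PySpark sketch for a notebook cell, assuming the `spark` session created by the PySpark kernel and the same path layout as above; the output path in the write step is illustrative:

```python
# `spark` is the SparkSession created by the PySpark kernel
df = (
    spark.read
    .option("header", "true")       # CSV has a header row
    .option("inferSchema", "true")  # infer column types
    .csv("/Projects/my_project/Resources/data.csv")
)

df.show()

# Writing back into the project works the same way (illustrative output path)
df.write.mode("overwrite").parquet("/Projects/my_project/Resources/data_parquet")
```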
Different files can be attached to the jupyter session and made available in the `/srv/hops/artifacts` folder when the PySpark kernel is started. This configuration is mainly useful when you need to add additional configuration, such as jars that need to be added to the CLASSPATH.

When reading data in your Spark application, it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
## Going Further
You can learn how to [install a library](../python/python_install.md) so that it can be used in a notebook.