To access data in your storage account, Azure Machine Learning offers datastores and datasets. Create an Azure Machine Learning dataset to interact with data in your datastores and package that data into a consumable object for machine learning tasks. Register the dataset to your workspace to share and reuse it across different experiments without data ingestion complexities.
Datasets can be created from local files, public URLs, Azure Open Datasets, or specific files in your datastores. To create a dataset from an in-memory pandas DataFrame, write the data to a local file, such as a CSV, and create your dataset from that file. Datasets aren't copies of your data; they are references that point to the data in your storage service, so no extra storage cost is incurred.
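The DataFrame route described above can be sketched as follows. This is a minimal illustration, assuming pandas and the Azure Machine Learning Python SDK v1 (`azureml-core`) are available. The sample values, the file name, and the helper `dataset_from_local_csv` are hypothetical, and the helper is defined but not called because it needs a real workspace.

```python
import os
import tempfile

import pandas as pd

# Hypothetical in-memory data; the column names mirror the lab's CSV,
# but the values are made up for illustration.
df = pd.DataFrame({
    "passengerCount": [1, 2],
    "tripDistance": [1.5, 3.2],
    "totalAmount": [9.8, 15.3],
})

# Write the DataFrame to a local CSV file, leaving out the pandas index.
csv_path = os.path.join(tempfile.mkdtemp(), "taxi-sample.csv")
df.to_csv(csv_path, index=False)


def dataset_from_local_csv(ws, local_path, target_dir="taxi-sample"):
    """Upload a local CSV to the default datastore and reference it as a dataset.

    Assumes `ws` is an azureml.core.Workspace. Not called in this sketch.
    """
    # Imported lazily so the rest of the sketch runs without the SDK installed.
    from azureml.core import Dataset

    datastore = ws.get_default_datastore()
    datastore.upload_files(files=[local_path], target_path=target_dir, overwrite=True)
    remote_path = f"{target_dir}/{os.path.basename(local_path)}"
    # The dataset is a reference to the uploaded file, not a second copy of it.
    return Dataset.Tabular.from_delimited_files(path=[(datastore, remote_path)])
```

With a workspace object available, `dataset_from_local_csv(ws, csv_path)` would return a tabular dataset that can then be registered.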
In this lab, we use a subset of the NYC Taxi & Limousine Commission green taxi trip records, available from Azure Open Datasets, to show how you can register and version a dataset using the AML designer interface. In the first exercise, we use a modified version of the original CSV file that includes records collected over five months (January through May). The second exercise demonstrates how to create a new version of the initial dataset when new data is collected (in this case, the CSV file also includes records collected in June).
-
In the Azure portal, open the available machine learning workspace.
-
Select Launch now under the Try the new Azure Machine Learning studio message.
-
When you first launch the studio, you may need to set the directory and subscription. If so, you will see this screen:
For the directory, select Udacity and for the subscription, select Azure Sponsorship. For the machine learning workspace, you may see multiple options listed. Select any of these (it doesn't matter which) and then click Get started.
-
From the studio, select Datasets, + Create dataset, From web files. This opens the Create dataset from web files dialog on the right.
-
Provide the following information and then select Next:
- Web URL: https://introtomlsampledata.blob.core.windows.net/data/nyc-taxi/nyc-taxi-sample-data-5months.csv
- Name: nyc-taxi-sample-dataset
-
On the Settings and preview panel, set the Column headers drop-down to All files have same headers.
-
Scroll the data preview to the right to observe the target column: totalAmount. After you are done reviewing the data, select Next.
-
Select the columns from the dataset to include as part of your training data. Leave the default selections and select Next.
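For comparison, the studio steps of this first exercise have a rough code equivalent. The sketch below assumes the Azure Machine Learning Python SDK v1 (`azureml-core`); the helper name `register_from_web_file` is hypothetical, and the function is only defined here because calling it requires a real workspace.

```python
# URL and name are the values entered in the studio dialog above.
DATA_URL = (
    "https://introtomlsampledata.blob.core.windows.net"
    "/data/nyc-taxi/nyc-taxi-sample-data-5months.csv"
)
DATASET_NAME = "nyc-taxi-sample-dataset"


def register_from_web_file(ws, url=DATA_URL, name=DATASET_NAME):
    """Create a tabular dataset from a public CSV URL and register it.

    Assumes `ws` is an azureml.core.Workspace.
    """
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from azureml.core import Dataset

    # The dataset only references the file at `url`; nothing is copied here.
    dataset = Dataset.Tabular.from_delimited_files(path=url)
    return dataset.register(workspace=ws, name=name)


# Usage, assuming a config.json for the workspace is available locally:
#   from azureml.core import Workspace
#   ws = Workspace.from_config()
#   taxi_v1 = register_from_web_file(ws)
```

The studio route and the SDK route should produce the same kind of registered dataset, so either can be used for the first version.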
-
From the Azure Machine Learning studio, select Datasets and select the nyc-taxi-sample-dataset dataset created in the first exercise. This opens the Dataset details page.
-
Select New version, From web files to open the same Create dataset from web files dialog you used in the first exercise.
-
This time, the Name and Dataset version fields are already filled in for you. Provide the following information and select Next to move on to the next step:
- Web URL: https://introtomlsampledata.blob.core.windows.net/data/nyc-taxi/nyc-taxi-sample-data-6months.csv
-
Select All files have same headers in the Column headers drop-down and move on to the schema selection step.
-
On the Schema page, suppose you decide to exclude some columns from your dataset. Exclude the columns snowDepth, precipTime, and precipDepth. Select Next to move on to the final step.
-
Notice the Dataset version value in the basic info section. Select Create to close the new version confirmation page.
-
Back on the Datasets page, in the Registered datasets list, notice the version value for the nyc-taxi-sample-dataset dataset.
-
Select the nyc-taxi-sample-dataset dataset link to open the dataset details page, where Version 2 (latest) is automatically selected. Go to the Explore section to observe the structure and content of the new version. Notice the column and row counts in the dataset preview pane:
- Number of columns: 11
- Number of rows: 10000
- Scroll right to check that the three excluded columns are missing (snowDepth, precipTime, precipDepth)
-
Select Version 1 from the drop-down near the dataset name title and notice the changing values for:
- Number of columns: 14 (since the previous version still contains the three excluded columns)
- Number of rows: 9776 (since the previous version contains only data for 5 months)
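The same versioning flow and comparison can be approximated in code. The sketch below assumes the Azure Machine Learning Python SDK v1 (`azureml-core`); the helper names `removed_columns`, `register_new_version`, and `compare_versions` are hypothetical, and the two SDK helpers are only defined, not called, because they need a real workspace. The short column lists in the demo are an abbreviated illustration, not the full schema.

```python
def removed_columns(old_columns, new_columns):
    """Return the columns present in the old version but missing from the new one."""
    new = set(new_columns)
    return [c for c in old_columns if c not in new]


def register_new_version(ws, url, name, excluded):
    """Register a new version of `name` from `url` without the `excluded` columns.

    Assumes `ws` is an azureml.core.Workspace.
    """
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from azureml.core import Dataset

    dataset = Dataset.Tabular.from_delimited_files(path=url).drop_columns(excluded)
    # create_new_version=True adds a version under the existing dataset name.
    return dataset.register(workspace=ws, name=name, create_new_version=True)


def compare_versions(ws, name, old=1, new=2):
    """Load two versions as pandas DataFrames and summarise how they differ."""
    from azureml.core import Dataset

    old_df = Dataset.get_by_name(ws, name=name, version=old).to_pandas_dataframe()
    new_df = Dataset.get_by_name(ws, name=name, version=new).to_pandas_dataframe()
    return {
        "old_shape": old_df.shape,  # (rows, columns) of the older version
        "new_shape": new_df.shape,  # (rows, columns) of the newer version
        "removed": removed_columns(list(old_df.columns), list(new_df.columns)),
    }


# Local demo of the comparison logic with abbreviated, illustrative column lists.
v1_cols = ["tripDistance", "snowDepth", "precipDepth", "totalAmount"]
v2_cols = ["tripDistance", "totalAmount"]
print(removed_columns(v1_cols, v2_cols))  # ['snowDepth', 'precipDepth']
```

Against the lab's dataset, `compare_versions` would be expected to report 14 columns for version 1 and 11 for version 2, matching the counts shown in the studio.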
Congratulations! You have now explored a first simple scenario for dataset versioning using the Azure Machine Learning studio. You found out how to create and version a simple dataset when new training data becomes available. You can continue to experiment in the environment, or close the lab environment tab and return to the Udacity portal to continue with the lesson.