In this lab, you learn how to import your own data into the designer to create custom solutions. There are two ways to import data into the designer in Azure Machine Learning studio:
- Azure Machine Learning datasets: Register datasets in Azure Machine Learning to enable advanced features that help you manage your data.
- Import Data module: Use the Import Data module to directly access data from online data sources.
The first approach is covered in the next lab, which focuses on registering and versioning a dataset in Azure Machine Learning studio.
While datasets are the recommended way to import data, you can also use the Import Data module from the designer. Data comes into the designer from either a Datastore or from Tabular Datasets. Datastores are covered later in this course; in brief, a Datastore lets you access your storage without hard-coding connection information in your scripts. For Tabular Datasets, the designer supports the following data sources: delimited files, JSON files, Parquet files, and SQL queries.
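As a rough illustration of what "tabular" means here, the sketch below parses the same two records from a delimited file and from a JSON file into identical rows, using only the Python standard library. The column names and values are invented for illustration and are not taken from the lab data, and this is not how the designer itself reads files.

```python
import csv
import io
import json

# Hypothetical sample content; the column names and values are invented.
DELIMITED_TEXT = "ID;Primary Type;Arrest\n1;THEFT;false\n2;BATTERY;true\n"
JSON_TEXT = (
    '[{"ID": "1", "Primary Type": "THEFT", "Arrest": "false"},'
    ' {"ID": "2", "Primary Type": "BATTERY", "Arrest": "true"}]'
)


def rows_from_delimited(text, delimiter=","):
    """Parse delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


def rows_from_json(text):
    """Parse a JSON array of flat objects into a list of dicts."""
    return [dict(record) for record in json.loads(text)]


delimited_rows = rows_from_delimited(DELIMITED_TEXT, delimiter=";")
json_rows = rows_from_json(JSON_TEXT)

# Both sources yield the same table: a list of rows keyed by column name.
print(delimited_rows == json_rows)
print(list(delimited_rows[0].keys()))
```

Note that the delimiter is passed explicitly, which plays a role similar to the Delimiter setting you will see under Parsing options in the Import Data module.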
The following exercise focuses on using the Import Data module to load data into a machine learning pipeline from several datasets that will be merged and restructured. We will use some sample data from the UCI dataset repository to demonstrate how you can perform basic data import and transformation steps with the modules available in Azure Machine Learning designer.
- In the Azure portal, open the available machine learning workspace.
- Select Launch now under the Try the new Azure Machine Learning studio message.
- When you first launch the studio, you may be prompted to set the directory and subscription. If so, select Udacity for the directory and Azure Sponsorship for the subscription. For the machine learning workspace, you may see multiple options listed. Select any of these (it doesn't matter which) and then click Get started.
- From the studio, select Designer, +. This will open a visual pipeline authoring editor.
- In the settings panel on the right, select Select compute target.
- In the Set up compute target editor, select the existing compute target, and then select Save.
Note: If you are facing difficulties in accessing pop-up windows or buttons in the user interface, please refer to the Help section in the lab environment.
- Select the Data Input and Output section in the left navigation. Next, select Import Data and drag and drop the module onto the canvas.
- In the Import data panel on the right, select the URL via HTTP option in the Data Source drop-down and provide the following Data source URL for the first CSV file you will import in your pipeline: https://introtomlsampledata.blob.core.windows.net/data/crime-data/crime-dirty.csv
- Select Preview schema to filter the columns you want to include. You can also define advanced settings such as Delimiter in Parsing options. Select Save to close the dialog.
- Back on the pipeline canvas, select Submit in the top right corner to open the Setup pipeline run editor.
- In the Setup pipeline run editor, select Experiment, Create new and provide the New experiment name: designer-data-import, and then select Submit. Note that the button name in the UI has changed from Run to Submit.
- Wait for the pipeline run to complete. It will take around 10 minutes.
- Select the Import Data module on the canvas and then select Outputs in the right pane. Click the Visualize icon to open the Import Data result visualization dialog.
- In the Import Data result visualization dialog, take a few moments to explore the metadata that is now available to you: the number of rows and columns, a preview of the data, and, for each column you select, the Mean, Median, Min, and Max, along with the number of Unique Values and Missing Values. Data profiles give you a quick view of the column types and summary statistics of a dataset. Scroll right and select the X Coordinate column. Notice the NaN value on the third row in the preview table and check the Missing values number in the Statistics section.
- Select Close to return to the pipeline designer canvas, where you can continue the data import phase.
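To make the statistics in the visualization dialog more concrete, here is a minimal sketch of how such a column profile could be computed in plain Python. The function name `profile_column` and the sample values are hypothetical and are not taken from the lab data or from the designer's implementation; the sketch assumes that both `None` and NaN count as missing.

```python
import math
import statistics


def profile_column(values):
    """Summarize a numeric column, treating None and NaN as missing."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "mean": statistics.mean(present) if present else None,
        "median": statistics.median(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical values standing in for a coordinate column; not the lab data.
x_coordinate = [1150.0, 1170.0, float("nan"), 1150.0, 1190.0]

profile = profile_column(x_coordinate)
for name, value in profile.items():
    print(f"{name}: {value}")
```

The missing values are excluded before the mean and median are computed, which is why a single NaN in the column does not turn every statistic into NaN.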
- Select the Data Input and Output section in the left navigation. Next, drag and drop two Import Data modules onto the canvas as demonstrated in the first exercise and fill in the Web URLs as follows:
  - for the first one, Data source URL: https://introtomlsampledata.blob.core.windows.net/data/crime-data/crime-spring.csv
  - for the second one, Data source URL: https://introtomlsampledata.blob.core.windows.net/data/crime-data/crime-winter.csv
- For each of the three Import Data modules, select Preview schema and ensure that the data type for FBI Code and Location is String, and then select Save.
- Select the Data Transformation section in the left navigation. Drag and drop the Add Rows module onto the canvas and connect its inputs to the outputs of the two Import Data modules you just added.
- Repeat the same step to add a second Add Rows module, and connect its inputs to the output of the first Import Data module and the output of the first Add Rows module.
- Drag the Clean Missing Data module from the Data Transformation section in the left navigation onto the canvas.
- Select Edit column in the right pane to configure the list of columns to be cleaned. Select Column names from the available include options and type the names of the columns you intend to clean at this step: X Coordinate and Y Coordinate. Select Save to close the dialog.
- Set the Minimum missing value ratio to 0.1 and the Maximum missing value ratio to 0.5. Select Replace with mean in the Cleaning mode field.
- Select Submit to open the Setup pipeline run editor.
- In the Setup pipeline run editor, select Select existing, designer-data-import for Experiment, and then select Submit. Note that the button name in the UI has changed from Run to Submit.
- Wait for the pipeline run to complete. It will take around 8 minutes.
- Select the Clean Missing Data module you created on the canvas and then select Outputs + logs in the right pane. Click the Save icon under the Cleaned dataset section to open the Save as dataset dialog.
- Check the option to create a new dataset and enter crime-all in the dataset name field. Select Save to close the dialog.
- From the left navigation, select Datasets. This will open the Registered datasets page, where your registered dataset appears among the other datasets you used during this lesson.
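For readers who want to see the logic of this pipeline outside the designer, the sketch below mimics the two Add Rows modules and the Clean Missing Data module in plain Python. It is an approximation under stated assumptions, not the designer's implementation: the helper names `add_rows` and `replace_with_mean` are hypothetical, the sample rows are invented, the input order of the second Add Rows module is a guess, and the sketch assumes a column is cleaned only when its share of missing values lies between the minimum and maximum ratio.

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def add_rows(first, second):
    """Append the rows of `second` to `first`; both must share the same columns."""
    if first and second and first[0].keys() != second[0].keys():
        raise ValueError("Both inputs must have the same columns")
    return first + second


def replace_with_mean(rows, columns, min_ratio=0.0, max_ratio=1.0):
    """Fill missing values with the column mean when the missing ratio is in range."""
    cleaned = [dict(row) for row in rows]  # copy so the input stays unchanged
    if not cleaned:
        return cleaned
    for column in columns:
        present = [row[column] for row in cleaned if not is_missing(row[column])]
        ratio = (len(cleaned) - len(present)) / len(cleaned)
        if not present or not (min_ratio <= ratio <= max_ratio):
            continue  # assumption: outside the range, the column is left as-is
        mean = statistics.mean(present)
        for row in cleaned:
            if is_missing(row[column]):
                row[column] = mean
    return cleaned


# Invented rows standing in for the three imported files; not the lab data.
NAN = float("nan")
dirty = [{"X Coordinate": 10.0, "Y Coordinate": 1.0},
         {"X Coordinate": NAN, "Y Coordinate": 2.0}]
spring = [{"X Coordinate": 20.0, "Y Coordinate": NAN},
          {"X Coordinate": 30.0, "Y Coordinate": 3.0}]
winter = [{"X Coordinate": 40.0, "Y Coordinate": 4.0}]

# First Add Rows joins spring and winter; the second adds the first file.
combined = add_rows(dirty, add_rows(spring, winter))
crime_all = replace_with_mean(
    combined, ["X Coordinate", "Y Coordinate"], min_ratio=0.1, max_ratio=0.5
)

print(len(crime_all))
print(crime_all[1]["X Coordinate"], crime_all[2]["Y Coordinate"])
```

In this toy data each column has one missing value out of five rows, a ratio of 0.2, so both columns fall inside the 0.1 to 0.5 range and are filled with their column mean.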
Congratulations! You completed a few basic steps of the data exploration and transformation process, using the prebuilt modules available in the visual editor provided by Azure Machine Learning studio. You can continue to experiment in the environment, or you can close the lab environment tab and return to the Udacity portal to continue with the lesson.