Scheduling a Data Job for automatic execution
In this example, we will use a local installation of the Versatile Data Kit Control Service to create and schedule a Data Job that runs automatically on a fixed schedule. The job itself will merely print a message in the logs.
This guide consists of 3 parts:
- Part 1: Data Job
- Part 2: Deployment
- Part 3: Execution
To follow this guide, you need to have the Control Service installed. To install the Versatile Data Kit Control Service locally, follow the Installation guide.
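If you are using the vdk-server plugin, a local Control Service can typically be installed with a single command. Treat this as a sketch; the exact mechanism may differ between VDK versions:

pip install vdk-server
vdk server --install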
After the Control Service is installed, you can create a new Data Job by running the vdk create command.
Run vdk create --help to see all available options and examples.
If you run
vdk create
it will prompt you for all the necessary information. The rest of this example assumes that the selected job name is hello-world.
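To skip the interactive prompts, the job name and team can be passed directly. The short flags below are an assumption based on common vdk usage; verify them with vdk create --help:

vdk create -n hello-world -t my-team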
To verify that the job was indeed created in the Control Service, list all jobs:
vdk list --all
This should produce the following output:
job_name     job_team  status
-----------  --------  ------------
hello-world  my-team   NOT_DEPLOYED
You can also observe the code of the newly created Data Job by inspecting the content of the hello-world folder in the current directory. The code will be organized in the following structure:
hello-world/
├── 10_python_step.py
├── 20_sql_step.sql
├── config.ini
├── README.md
├── requirements.txt
You can modify this Data Job sample to customize it to your needs. For more information on the structure of Data Jobs, please check the Data-Job page.
For the purpose of this example, let's delete the SQL step (20_sql_step.sql) and keep a single Python step file, 10_python_step.py, with the following content:
def run(job_input):
    print("\n============ HELLO WORLD! ============\n")
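A real step would typically use Python's logging module instead of print, so that its output carries timestamps and severity in the job logs. A minimal sketch of the same step (nothing beyond the print above is required for this example):

import logging

log = logging.getLogger(__name__)

def run(job_input):
    # VDK invokes run() once for this step; job_input exposes the job APIs.
    log.info("============ HELLO WORLD! ============")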
Finally, modify the schedule_cron property inside the config.ini file as follows:
schedule_cron = */2 * * * *
This property specifies the execution schedule for the Data Job when it is deployed. */2 * * * * indicates that the Data Job will be executed every 2 minutes.
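The schedule uses standard five-field cron syntax:

*/2 * * * *
 │  │ │ │ │
 │  │ │ │ └─ day of week (0-6)
 │  │ │ └─── month (1-12)
 │  │ └───── day of month (1-31)
 │  └─────── hour (0-23)
 └────────── minute (*/2 = every 2nd minute)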
After the changes we have the following file structure:
hello-world/
├── 10_python_step.py
├── config.ini
├── README.md
├── requirements.txt
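Before the job can run on a schedule, it must be deployed to the Control Service. This is done with the vdk deploy command; the exact flags below are an assumption based on common vdk usage, so consult vdk deploy --help for the authoritative options:

vdk deploy -p hello-world -n hello-world -t my-team -r "Initial deployment"

Here -p points to the job directory, -n and -t identify the job in the Control Service, and -r records a reason for the deployment.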
After the deployment is complete, the job will be automatically executed by the Control Service as per its schedule. The list of executions can be verified at any point by using the following command:
vdk execute --list -n hello-world -t my-team
This should show details about the last executions of the Data Job:
id                           job_name     status    type       start_time                 end_time                   started_by  message  op_id                        job_version
---------------------------  -----------  --------  ---------  -------------------------  -------------------------  ----------  -------  ---------------------------  ----------------------------------------
hello-world-latest-27193696  hello-world  finished  scheduled  2021-09-14 12:16:00+00:00  2021-09-14 12:16:51+00:00              Success  hello-world-latest-27193696  d9eedb67fc8d52301dbb61c6d9db4397c3f9a9ec
hello-world-latest-27193698  hello-world  finished  scheduled  2021-09-14 12:18:00+00:00  2021-09-14 12:18:57+00:00              Success  hello-world-latest-27193698  d9eedb67fc8d52301dbb61c6d9db4397c3f9a9ec
hello-world-latest-27193700  hello-world  finished  scheduled  2021-09-14 12:20:00+00:00  2021-09-14 12:20:53+00:00              Success  hello-world-latest-27193700  d9eedb67fc8d52301dbb61c6d9db4397c3f9a9ec
hello-world-latest-27193702  hello-world  finished  scheduled  2021-09-14 12:22:00+00:00  2021-09-14 12:22:58+00:00              Success  hello-world-latest-27193702  d9eedb67fc8d52301dbb61c6d9db4397c3f9a9ec
hello-world-latest-27193704  hello-world  running   scheduled  2021-09-14 12:24:00+00:00                                                  hello-world-latest-27193704  d9eedb67fc8d52301dbb61c6d9db4397c3f9a9ec
A new execution can be started manually at any time by using the following command:
vdk execute --start -n hello-world -t my-team
This command may fail if an execution of the hello-world job is already running, because parallel executions of the same job are currently not allowed; this restriction exists to ensure data integrity.
For the curious: what is going on behind the scenes?
Every execution is carried out by a Kubernetes pod. You can see the execution by listing the pods in the cluster:
kubectl get pods
The names of the pods corresponding to our Data Job start with the Data Job name (e.g. hello-world-latest-27193734--1-gb8t2). Find one such pod and show its details by running:
kubectl describe pod hello-world-latest-27193734--1-gb8t2
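You can also fetch the pod's raw output directly with kubectl, using the same example pod name:

kubectl logs hello-world-latest-27193734--1-gb8t2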
Finally, to check the logs of a Data Job Execution use:
vdk execute --logs -n hello-world -t my-team --execution-id [execution-id-printed-from-vdk-execute-start]
Keep in mind that logs are kept only for the last few executions of a Data Job, so looking too far into the past is not possible.