ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include it directly in the README file of your repository.
If your answer doesn't match any option exactly, select the closest one.
For the homework, we'll be working with the green taxi dataset located here:
https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download
To get a `wget`-able link, use this prefix (note that the link itself gives 404):
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/
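As a rough illustration of how the prefix above turns into a downloadable link, here is a minimal Python sketch. The helper name `green_file_url` is hypothetical, and the file-name pattern and the `.csv.gz` suffix are assumptions, so check the release page for the exact asset names before relying on them.

```python
# Prefix taken from the homework text; opening it directly returns 404.
GREEN_PREFIX = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/"


def green_file_url(year: int, month: int, suffix: str = ".csv.gz") -> str:
    """Build a download URL for one month of green taxi data.

    The file-name pattern and the ".csv.gz" suffix are assumptions,
    not something the homework text guarantees.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Zero-pad the month so that 1 becomes "01".
    return f"{GREEN_PREFIX}green_tripdata_{year}-{month:02d}{suffix}"


print(green_file_url(2021, 1))
```

The printed URL can then be passed to `wget` or any other download tool.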
So far in the course, we have processed data for the years 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021.
As a hint, Kestra makes that process really easy:
- You can leverage the backfill functionality in the scheduled flow to backfill the data for the year 2021. Just make sure to select the time period for which data exists, i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).
- Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using a `ForEach` task which triggers the flow for each combination using a `Subflow` task.
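To make the size of the job concrete, the combinations that either approach has to cover can be listed with a short Python sketch. This is not Kestra syntax; it only enumerates the `taxi`, `year` and `month` values you would feed into the flow, and the function name `backfill_combinations` is hypothetical.

```python
from itertools import product

TAXI_TYPES = ["yellow", "green"]
YEAR = 2021
MONTHS = range(1, 8)  # January through July, the months with data in 2021


def backfill_combinations() -> list[dict[str, str]]:
    """Return one dict of flow inputs per (taxi type, month) pair."""
    return [
        {"taxi": taxi, "year": str(YEAR), "month": f"{month:02d}"}
        for taxi, month in product(TAXI_TYPES, MONTHS)
    ]


combos = backfill_combinations()
print(len(combos))  # 2 taxi types x 7 months = 14
for combo in combos:
    print(combo)
```

Each dict corresponds to one execution of the flow, whether it is started by a backfill, by hand, or by a loop.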
Complete the Quiz shown below. It’s a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra and ETL pipelines for data lakes and warehouses.
- Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?
  - 128.3 MB
  - 134.5 MB
  - 364.7 MB
  - 692.6 MB
- What is the value of the variable `file` when the input `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?
  - `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv`
  - `green_tripdata_2020-04.csv`
  - `green_tripdata_04_2020.csv`
  - `green_tripdata_2020.csv`
- How many rows are there for the `Yellow` Taxi data for the year 2020?
  - 13,537,299
  - 24,648,499
  - 18,324,219
  - 29,430,127
- How many rows are there for the `Green` Taxi data for the year 2020?
  - 5,327,301
  - 936,199
  - 1,734,051
  - 1,342,034
- How many rows are there for the `Yellow` Taxi data for March 2021?
  - 1,428,092
  - 706,911
  - 1,925,152
  - 2,561,031
- How would you configure the timezone to New York in a Schedule trigger?
  - Add a `timezone` property set to `EST` in the `Schedule` trigger configuration
  - Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration
  - Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration
  - Add a `location` property set to `New_York` in the `Schedule` trigger configuration
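If you want to sanity-check the file-size and row-count questions above outside of Kestra, one option is to inspect a downloaded file locally. The sketch below is only an illustration under stated assumptions: the helper name `gzip_csv_stats` is hypothetical, and it assumes the file is a gzip-compressed, UTF-8 CSV with a single header row.

```python
import csv
import gzip


def gzip_csv_stats(path: str) -> tuple[int, int]:
    """Return (uncompressed_bytes, data_rows) for a gzipped CSV.

    Assumes a UTF-8 file with exactly one header row, which is not counted.
    """
    # First pass: sum the decompressed size in 1 MiB chunks.
    size = 0
    with gzip.open(path, "rb") as raw:
        for chunk in iter(lambda: raw.read(1 << 20), b""):
            size += len(chunk)

    # Second pass: count records with the csv module so that quoted
    # fields containing newlines are not counted as extra rows.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as text:
        reader = csv.reader(text)
        next(reader, None)  # skip the header row
        rows = sum(1 for _ in reader)

    return size, rows
```

For example, `gzip_csv_stats("yellow_tripdata_2020-12.csv.gz")` would return the size in bytes and the number of data rows for that file, assuming it has been downloaded to the current directory. Sizes shown in a UI may be rounded or use different units, so pick the closest option.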
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw2
- Check the link above to see the due date
The solution will be added after the due date.