🎉 engineering: add prefect engine when running ETL #3029
base: master
Quick links (staging server):
Login: chart-diff: ✅ No charts for review.
data-diff:
= Dataset garden/agriculture/2024-03-26/attainable_yields
= Table attainable_yields
= Dataset garden/agriculture/2024-03-26/long_term_crop_yields
= Table long_term_crop_yields
= Dataset garden/agriculture/2024-03-26/long_term_wheat_yields
= Table long_term_wheat_yields
= Dataset garden/agriculture/2024-03-26/uk_long_term_yields
= Table uk_long_term_yields
= Dataset garden/agriculture/2024-05-23/daily_calories_per_person
= Table daily_calories_per_person
= Dataset garden/animal_welfare/2023-08-08/farmed_finfishes_used_for_food
= Table farmed_finfishes_used_for_food
= Dataset garden/animal_welfare/2023-08-14/number_of_farmed_fish
= Table number_of_farmed_fish
= Dataset garden/animal_welfare/2023-08-15/number_of_farmed_decapod_crustaceans
= Table number_of_farmed_decapod_crustaceans
= Dataset garden/animal_welfare/2023-08-16/number_of_wild_fish_killed_for_food
= Table number_of_wild_fish_killed_for_food
= Dataset garden/animal_welfare/2024-05-20/animals_used_for_food
= Table animals_used_for_food
= Dataset garden/artificial_intelligence/2023-06-14/ai_national_strategy
= Table ai_national_strategy
= Dataset garden/artificial_intelligence/2023-07-25/cset
= Table cset
= Dataset garden/artificial_intelligence/2024-06-28/ai_strategies
= Table ai_strategies
= Dataset garden/artificial_intelligence/2024-07-16/cset
= Table cset
= Dataset garden/bgs/2024-07-09/world_mineral_statistics
= Table world_mineral_statistics_flat
= Table world_mineral_statistics
= Dataset garden/climate/2024-02-19/monthly_burned_area
= Table monthly_burned_area
= Dataset garden/climate/2024-02-19/monthly_fire_emissions
= Table monthly_fire_emissions
= Dataset garden/climate_watch/2023-10-31/emissions_by_sector
= Table carbon_dioxide_emissions_by_sector
= Table methane_emissions_by_sector
= Table nitrous_oxide_emissions_by_sector
= Table greenhouse_gas_emissions_by_sector
= Table fluorinated_gas_emissions_by_sector
= Dataset garden/countries/2023-09-25/gleditsch
= Table gleditsch_countries
= Table gleditsch_regions
= Table gleditsch
= Dataset garden/countries/2023-09-25/isd
= Table isd
= Table isd_regions
= Table isd_countries
= Dataset garden/countries/2023-09-29/cow_ssm
= Table cow_ssm_majors
= Table cow_ssm_system
= Table cow_ssm_countries
= Table cow_ssm_regions
= Table cow_ssm_states
= Dataset garden/cow/2024-07-26/national_material_capabilities
= Table national_material_capabilities
= Dataset garden/democracy/2024-03-07/bmr
= Table population_regime
= Table bmr
= Table num_countries_regime
= Table population_regime_years
= Table num_countries_regime_years
= Dataset garden/democracy/2024-03-07/eiu
= Table avg_pop
= Table num_countries
= Table num_people
= Table eiu
= Dataset garden/democracy/2024-03-07/ert
= Table region_aggregates
= Table ert
= Dataset garden/democracy/2024-03-07/fh
= Table fh_regions
= Table fh
= Dataset garden/democracy/2024-03-07/lexical_index
= Table region_aggregates
= Table lexical_index
= Dataset garden/democracy/2024-03-07/polity
= Table avg_pop
= Table num_countries
= Table num_people
= Table polity
= Dataset garden/democracy/2024-03-07/vdem
= Table vdem_population
= Table vdem_num_countries
= Table vdem_multi_with_regions
= Table vdem_multi_without_regions
= Table vdem
= Dataset garden/demography/2023-03-31/population
= Table population
= Table population_original
= Dataset garden/demography/2023-06-27/world_population_comparison
= Table world_population_comparison
= Dataset garden/demography/2024-07-15/population
= Table population_density
= Table population_original
= Table historical
= Table projections
= Table population
= Table population_growth_rate
= Dataset garden/demography/2024-07-18/population_doubling_times
= Table population_doubling_times
= Dataset garden/education/2023-07-17/education_barro_lee_projections
= Table education_barro_lee_projections
= Dataset garden/education/2023-07-17/education_lee_lee
= Table education_lee_lee
= Dataset garden/eia/2023-12-12/energy_consumption
= Table energy_consumption
= Dataset garden/ember/2024-05-08/yearly_electricity
= Table yearly_electricity
= Dataset garden/emdat/2024-04-11/natural_disasters
= Table natural_disasters_yearly_deaths
= Table natural_disasters_yearly
= Table natural_disasters_yearly_impact
= Table natural_disasters_decadal_deaths
= Table natural_disasters_decadal_impact
= Table natural_disasters_decadal
= Dataset garden/emissions/2024-04-08/national_contributions
= Table national_contributions
= Dataset garden/emissions/2024-06-20/gdp_and_co2_decoupling
= Table gdp_and_co2_decoupling
= Dataset garden/energy/2024-05-08/photovoltaic_cost_and_capacity
= Table photovoltaic_cost_and_capacity
= Dataset garden/energy/2024-06-20/electricity_mix
= Table electricity_mix
= Dataset garden/energy/2024-06-20/energy_mix
= Table energy_mix
= Dataset garden/energy/2024-06-20/fossil_fuel_production
= Table fossil_fuel_production
= Dataset garden/energy/2024-06-20/fossil_fuel_reserves_production_ratio
= Table fossil_fuel_reserves_production_ratio
= Dataset garden/energy/2024-06-20/global_primary_energy
= Table global_primary_energy
= Dataset garden/energy/2024-06-20/primary_energy_consumption
= Table primary_energy_consumption
= Dataset garden/energy/2024-06-20/uk_historical_electricity
= Table uk_historical_electricity
= Dataset garden/energy_institute/2024-06-20/statistical_review_of_world_energy
= Table statistical_review_of_world_energy_prices
= Table statistical_review_of_world_energy
= Table statistical_review_of_world_energy_price_index
= Dataset garden/ess/2023-08-02/ess_trust
= Table ess_trust
= Dataset garden/eth/2023-03-15/ethnic_power_relations
= Table ethnic_power_relations
= Dataset garden/faostat/2024-03-14/additional_variables
= Table macronutrient_compositions
= Table vegetable_oil_yields
= Table fertilizer_exports
= Table arable_land_per_crop_output
= Table hypothetical_meat_consumption
= Table cereal_allocation
= Table maize_and_wheat
= Table fertilizers
= Table area_used_per_crop_type
= Table food_available_for_consumption
= Table land_spared_by_increased_crop_yields
= Table share_of_sustainable_and_overexploited_fish
= Table agriculture_land_use_evolution
= Dataset garden/faostat/2024-03-14/faostat_cahd
= Table faostat_cahd
= Table faostat_cahd_flat
= Dataset garden/faostat/2024-03-14/faostat_ei
= Table faostat_ei_flat
= Table faostat_ei
= Dataset garden/faostat/2024-03-14/faostat_ek
= Table faostat_ek
= Table faostat_ek_flat
= Dataset garden/faostat/2024-03-14/faostat_emn
= Table faostat_emn_flat
= Table faostat_emn
= Dataset garden/faostat/2024-03-14/faostat_esb
= Table faostat_esb
= Table faostat_esb_flat
= Dataset garden/faostat/2024-03-14/faostat_fa
= Table faostat_fa
= Table faostat_fa_flat
= Dataset garden/faostat/2024-03-14/faostat_fbsc
= Table faostat_fbsc_flat
= Table faostat_fbsc
= Dataset garden/faostat/2024-03-14/faostat_fo
= Table faostat_fo
= Table faostat_fo_flat
= Dataset garden/faostat/2024-03-14/faostat_food_explorer
= Table faostat_food_explorer
= Dataset garden/faostat/2024-03-14/faostat_fs
= Table faostat_fs_flat
= Table faostat_fs
= Dataset garden/faostat/2024-03-14/faostat_ic
= Table faostat_ic_flat
= Table faostat_ic
= Dataset garden/faostat/2024-03-14/faostat_lc
= Table faostat_lc_flat
= Table faostat_lc
= Dataset garden/faostat/2024-03-14/faostat_qcl
= Table faostat_qcl
= Table faostat_qcl_flat
= Dataset garden/faostat/2024-03-14/faostat_qi
= Table faostat_qi_flat
= Table faostat_qi
= Dataset garden/faostat/2024-03-14/faostat_qv
= Table faostat_qv_flat
= Table faostat_qv
= Dataset garden/faostat/2024-03-14/faostat_rfb
= Table faostat_rfb_flat
= Table faostat_rfb
= Dataset garden/faostat/2024-03-14/faostat_rfn
= Table faostat_rfn_flat
= Table faostat_rfn
= Dataset garden/faostat/2024-03-14/faostat_rl
= Table faostat_rl_flat
= Table faostat_rl
= Dataset garden/faostat/2024-03-14/faostat_rp
= Table faostat_rp
= Table faostat_rp_flat
= Dataset garden/faostat/2024-03-14/faostat_rt
= Table faostat_rt
= Table faostat_rt_flat
= Dataset garden/faostat/2024-03-14/faostat_scl
= Table faostat_scl
= Table faostat_scl_flat
= Dataset garden/faostat/2024-03-14/faostat_sdgb
= Table faostat_sdgb_flat
= Table faostat_sdgb
= Dataset garden/faostat/2024-03-14/faostat_tcl
= Table faostat_tcl
= Table faostat_tcl_flat
= Dataset garden/faostat/2024-03-14/faostat_ti
= Table faostat_ti_flat
= Table faostat_ti
= Dataset garden/forests/2024-05-08/ifl
= Table ifl
= Dataset garden/forests/2024-07-10/tree_cover_loss_by_driver
= Table tree_cover_loss_by_driver
= Dataset garden/gcp/2024-06-20/global_carbon_budget
= Table global_carbon_budget
= Dataset garden/ggdc/2022-11-28/penn_world_table
= Table penn_world_table
= Dataset garden/happiness/2024-06-09/happiness
= Table happiness
= Dataset garden/harvard/2023-09-18/colonial_dates_dataset
= Table colonial_dates_dataset
= Dataset garden/harvard/2024-07-22/global_military_spending_dataset
= Table global_military_spending_dataset
= Dataset garden/health/2023-04-18/wgm_mental_health
= Table wgm_mental_health
= Dataset garden/health/2023-04-25/wgm_2018
= Table wgm_2018
= Dataset garden/health/2023-08-09/unaids
= Table unaids
= Dataset garden/health/2023-08-14/avian_influenza_h5n1_kucharski
= Table avian_influenza_h5n1_kucharski
= Dataset garden/health/2023-08-16/deaths_karlinsky
= Table deaths_karlinsky
= Dataset garden/health/2024-04-02/organ_donation_and_transplantation
= Table organ_donation_and_transplantation
= Dataset garden/health/2024-04-12/polio_free_countries
= Table polio_free_countries
= Dataset garden/hyde/2024-01-02/all_indicators
= Table all_indicators
= Dataset garden/irena/2023-12-12/renewable_electricity_capacity
= Table renewable_electricity_capacity
= Dataset garden/irena/2023-12-12/renewable_energy_patents
= Table renewable_energy_patents
= Table renewable_energy_patents_by_technology
= Dataset garden/lgbt_rights/2023-04-27/lgbti_policy_index
= Table lgbti_policy_index
= Dataset garden/lgbt_rights/2024-06-03/equaldex
= Table equaldex
= Dataset garden/lgbt_rights/2024-06-11/criminalization_mignot
= Table criminalization_mignot
= Dataset garden/lis/2024-06-13/luxembourg_income_study
= Table lis_percentiles
= Table luxembourg_income_study
= Table luxembourg_income_study_adults
= Table lis_percentiles_adults
= Dataset garden/maternal_mortality/2024-07-08/maternal_mortality
= Table maternal_mortality
= Dataset garden/minerals/2024-07-15/minerals
= Table minerals
= Dataset garden/missing_data/2024-03-26/children_out_of_school
= Table children_out_of_school
= Dataset garden/missing_data/2024-03-26/who_md_suicides
= Table who_md_suicides
= Dataset garden/missing_data/2024-03-26/who_neuropsychiatric_conditions
= Table neuropsychiatric_conditions
= Dataset garden/neglected_tropical_diseases/2024-05-02/lymphatic_filariasis
= Table lymphatic_filariasis_national
= Table lymphatic_filariasis
= Dataset garden/neglected_tropical_diseases/2024-05-02/schistosomiasis
= Table schistosomiasis
= Dataset garden/news/2024-05-08/guardian_mentions
= Table guardian_mentions
= Table avg_10y
= Dataset garden/noaa_ncei/2024-05-09/natural_hazards
= Table natural_hazards
= Dataset garden/oecd/2024-07-01/road_accidents
= Table road_accidents
= Dataset garden/owid/latest/key_indicators
= Table land_area
= Table population_density
= Table population
= Dataset garden/pew/2024-06-03/same_sex_marriage
= Table same_sex_marriage
= Dataset garden/regions/2023-01-01/regions
= Table regions
= Dataset garden/research_development/2024-05-20/patents_articles
= Table patents_articles
= Dataset garden/shift/2023-12-12/energy_production_from_fossil_fuels
= Table energy_production_from_fossil_fuels
= Dataset garden/smoking/2024-05-30/cigarette_sales
= Table cigarette_sales
= Dataset garden/state_capacity/2023-10-19/state_capacity_dataset
= Table state_capacity_dataset
= Dataset garden/state_capacity/2023-11-10/information_capacity_dataset
= Table information_capacity_dataset
= Dataset garden/survey/2023-08-04/trust_surveys
= Table trust_surveys
= Dataset garden/technology/2022/internet
= Table users
= Dataset garden/terrorism/2023-07-20/global_terrorism_database
= Table global_terrorism_database
= Dataset garden/tuberculosis/2023-11-27/budget
= Table budget
= Dataset garden/tuberculosis/2023-11-27/burden_disaggregated
= Table burden_disaggregated
= Table burden_disaggregated_rate
= Dataset garden/tuberculosis/2023-11-27/burden_estimates
= Table burden_estimates
= Dataset garden/tuberculosis/2023-11-27/drug_resistance_surveillance
= Table drug_resistance_surveillance
= Dataset garden/tuberculosis/2023-11-27/laboratories
= Table laboratories
= Dataset garden/tuberculosis/2023-11-27/notifications
= Table notifications
= Dataset garden/tuberculosis/2023-11-27/outcomes_disagg
= Table outcomes_disagg
= Dataset garden/un/2023-08-02/comtrade_pandemics
= Table comtrade_pandemics
= Dataset garden/un/2023-10-09/plastic_waste
= Table plastic_waste
= Dataset garden/un/2023-10-30/un_members
= Table un_members
= Dataset garden/un/2024-01-17/urbanization_urban_rural
= Table urbanization_urban_rural
= Dataset garden/un/2024-07-08/maternal_mortality
= Table maternal_mortality
= Dataset garden/un/2024-07-25/refugee_data
= Table refugee_data
= Dataset garden/un/2024-07-25/resettlement
= Table resettlement
= Dataset garden/unep/2023-03-17/consumption_controlled_substances
= Table consumption_controlled_substances
= Dataset garden/unicef/2024-07-30/child_migration
= Table child_migration
= Dataset garden/urbanization/2024-01-26/ghsl_degree_of_urbanisation
= Table ghsl_degree_of_urbanisation
= Dataset garden/war/2023-09-21/brecke
= Table brecke
= Dataset garden/war/2023-09-21/cow
= Table cow_country
= Table cow_locations
= Table cow
= Dataset garden/war/2023-09-21/cow_mid
= Table cow_mid_country
= Table cow_mid
= Dataset garden/war/2023-09-21/mars
= Table mars
= Table mars_country
= Dataset garden/war/2023-09-21/mie
= Table mie
= Table mie_country
= Dataset garden/war/2023-09-21/prio_v31
= Table prio_v31
= Table prio_v31_country
= Dataset garden/war/2023-09-21/ucdp
= Table ucdp
= Table ucdp_locations
= Table ucdp_country
= Dataset garden/war/2023-09-21/ucdp_prio
= Table ucdp_prio
= Dataset garden/war/2023-09-27/peace_diehl
= Table peace_diehl
= Table peace_diehl_agg
= Dataset garden/war/2024-01-11/nuclear_weapons_proliferation
= Table nuclear_weapons_proliferation_counts
= Table nuclear_weapons_proliferation
= Dataset garden/war/2024-01-23/nuclear_weapons_treaties
= Table nuclear_weapons_treaties_country_counts
= Table nuclear_weapons_treaties
= Dataset garden/wash/2024-01-06/who
= Table who
= Dataset garden/wb/2024-07-29/income_groups
= Table income_groups
= Table income_groups_latest
= Dataset garden/who/2022-09-30/ghe
= Table ghe_suicides_ratio
= Table ghe
= Dataset garden/who/2023-06-01/cholera
= Table cholera
2024-08-06 09:44:15 [error ] Traceback (most recent call last):
File "/home/owid/etl/.venv/lib/python3.10/site-packages/requests/models.py", line 974, in json
return complexjson.loads(self.text, **kwargs)
File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/__init__.py", line 514, in loads
return _default_decoder.decode(s)
File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/decoder.py", line 386, in decode
obj, end = self.raw_decode(s)
File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/decoder.py", line 416, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/owid/etl/etl/datadiff.py", line 423, in cli
lines = future.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/owid/etl/etl/datadiff.py", line 416, in func
differ.summary()
File "/home/owid/etl/etl/datadiff.py", line 254, in summary
self._diff_tables(self.ds_a, self.ds_b, table_name)
File "/home/owid/etl/etl/datadiff.py", line 122, in _diff_tables
table_a = future_a.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 330, in wrapped_f
return self(f, *args, **kw)
File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 467, in __call__
do = self.iter(retry_state=retry_state)
File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 368, in iter
result = action(retry_state)
File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 390, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 470, in __call__
result = fn(*args, **kwargs)
File "/home/owid/etl/etl/datadiff.py", line 837, in get_table_with_retry
return ds[table_name]
File "/home/owid/etl/etl/datadiff.py", line 278, in __getitem__
return tables.load()
File "/home/owid/etl/lib/catalog/owid/catalog/catalogs.py", line 312, in load
return self.iloc[0].load() # type: ignore
File "/home/owid/etl/lib/catalog/owid/catalog/catalogs.py", line 363, in load
return Table.read(uri)
File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 177, in read
table = cls.read_feather(path, **kwargs)
File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 349, in read_feather
cls._add_metadata(df, path, **kwargs)
File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 321, in _add_metadata
metadata = cls._read_metadata(path)
File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 383, in _read_metadata
return cast(Dict[str, Any], requests.get(metadata_path).json())
File "/home/owid/etl/.venv/lib/python3.10/site-packages/requests/models.py", line 978, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
= Dataset garden/who/2024-02-14/gho_suicides
= Table gho_suicides
= Table gho_suicides_ratio
= Dataset garden/who/2024-04-08/polio
= Table polio
= Dataset garden/who/2024-05-20/vehicles
= Table vehicles
= Dataset garden/who/latest/avian_influenza_ah5n1
= Table avian_influenza_ah5n1_month
= Table avian_influenza_ah5n1_year
= Dataset garden/wid/2024-05-24/world_inequality_database
= Table world_inequality_database
= Table world_inequality_database_distribution
= Table world_inequality_database_fiscal
= Dataset garden/wvs/2023-06-25/longitudinal_wvs
= Table longitudinal_wvs
⚠ Found errors, create an issue please
Legend: +New ~Modified -Removed =Identical Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet
Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included
Edited: 2024-11-06 10:24:54 UTC
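The traceback in the data-diff output above ends in `requests.get(metadata_path).json()` raising "Expecting value: line 1 column 1 (char 0)", which is the message a JSON parser gives for an empty or non-JSON body. A minimal sketch of a more descriptive failure, using only the standard `json` module; `parse_metadata` and its `source` argument are hypothetical names for illustration and are not the code in `owid/catalog/tables.py`:

```python
import json
from typing import Any, Dict


def parse_metadata(body: str, source: str) -> Dict[str, Any]:
    """Parse a metadata response body, naming the source if it is not JSON.

    Hypothetical helper: `body` is the raw response text and `source`
    is whatever identifies where it came from (for example a URL).
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as e:
        # Include the start of the body so an empty reply or an HTML
        # error page is visible directly in the log.
        preview = body[:80]
        raise ValueError(
            f"Metadata at {source} is not valid JSON ({e}); "
            f"body starts with {preview!r}"
        ) from e
```

With an empty body this raises a `ValueError` that still carries the original decoder message, plus the source, which may make it easier to tell which table's metadata was unavailable.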
355077c to 2fcb696
This looks great, thanks for doing this, Mojmir. This would be an alternative to Buildkite, where error logs are more legible?
I tried checking the Prefect UI on Wizard, and realised that one needs to first run
Also, I tried running
Some entry in the docs could be of great use if we want data managers / engineers to use this tool.
This is more of a complement. We would run the Prefect web UI on every staging server for you, and it would be an additional way of seeing individual ETL runs or changes made to your staging server and their logs.
Hm, weird. Could you try running
Oh right 🤦 Moved it there.
Good idea! Fixed that with nicer error message.
The plan is to "dark launch" this first. Once we confirm that it has value, we should let data managers know and add it to docs.
@Marigold Marcel and I were trying this now, and overall it's quite impressive, though there's a ton of errors of this form littered through the run, preventing it from executing cleanly:
💡 If we got this working and integrated this, one cool thing is that we could potentially provide step-level and run-level analytics on all our ETL steps and production runs, basically telling you how things are changing over time and what's most expensive.
Overall it looks super nice, mainly the many task failures when executing it would need to be solved.
etl/command.py
Outdated
    task_futures: Dict[str, PrefectFuture] = {}

    for step in steps:
        # task = prefect.task(name=str(step), on_failure=[on_failure_hook])
I am noticing that the flows hang around forever on failure. Is there a reason this bit is commented out?
Hmm, when does it happen? When I run it locally with an error, it doesn't hang forever, but it fails after all tasks are either completed or failed.
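The `task_futures` dictionary in the snippet under review suggests a pattern where each step is submitted once and waits on the futures of its dependencies. A minimal standard-library analogue of that pattern, using `concurrent.futures` rather than Prefect; the step names, the `DAG` mapping and the `run_step` helper are hypothetical and only illustrate the idea:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List

# Hypothetical DAG: step name -> names of the steps it depends on,
# listed in topological order (dependencies first).
DAG: Dict[str, List[str]] = {
    "garden/regions": [],
    "garden/population": ["garden/regions"],
    "grapher/population": ["garden/population"],
}


def run_step(name: str, upstream: List[Future]) -> str:
    # Block until every dependency has finished. Future.result()
    # re-raises an upstream exception, so a failed dependency makes
    # this step fail as well instead of waiting forever.
    for fut in upstream:
        fut.result()
    return f"done: {name}"


def run_dag(dag: Dict[str, List[str]], workers: int = 4) -> Dict[str, str]:
    task_futures: Dict[str, Future] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submitting in topological order means a step's dependencies
        # are always queued before the step itself.
        for step, deps in dag.items():
            upstream = [task_futures[d] for d in deps]
            task_futures[step] = pool.submit(run_step, step, upstream)
        return {step: fut.result() for step, fut in task_futures.items()}
```

Calling `run_dag(DAG)` returns one result per step. If any step raises, the final `fut.result()` re-raises it, which is roughly the "fails after all tasks are either completed or failed" behaviour described in the reply.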
@larsyencken how did you get the error? I can't replicate it... EDIT: I fixed one serious bug which might have caused it.
@Marigold Thanks for the suggestions. Removing Tried
@lucasrodes is the dashboard running on http://0.0.0.0:4200? I read that the timeout thing is more of a warning than an actual error.
@Marigold after removing
I get the
@larsyencken how did you get the error, please? I can't replicate it.
This is ready to be merged, but I'm rethinking whether it's a good idea to introduce more complexity to ETL. Setting it as blocked until we find the right moment.
@Marigold that makes sense. I've unset myself from the reviewers list. Feel free to add me back once you want me to review it! Thanks <3
Adds option --engine for specifying which scheduler to use (default is --engine etl). Using --engine prefect orchestrates ETL with Prefect and creates a SQLite file that can be inspected with the Prefect UI.
The Prefect UI runs on http://staging-site-prefect:4200/flow-runs (there's a new link from Wizard). Here's an example of a run after changing regions.
It runs steps concurrently with a single worker (default) and uses Dask with multiple workers (flag --workers).
Comparison to --engine etl
structlog.info adds colour to output, but Prefect can't decode it and prints characters like [0m [�[32m�[1minfo. This should soon be fixed in Enhancement: ANSI color support in logs PrefectHQ/prefect-ui-library#2582
My 2 cents
I'd find it very helpful for inspecting ETL runs on staging servers and in production (where I find searching through logs really annoying). We could give it a try and see if it was worth it in a few weeks. Prefect could also be useful for automatic dataset updates, which currently exist as bash scripts and are run by Buildkite. It works, but as @lucasrodes suggested, we might need more flexibility.
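On the colour issue mentioned in the comparison: until the UI renders ANSI codes, one possible workaround would be to strip them from log lines before they are displayed. This is a minimal sketch using only the standard library, not something this PR does; `strip_ansi` is a hypothetical helper and the pattern only covers CSI sequences such as colour and style codes:

```python
import re

# Matches CSI escape sequences such as "\x1b[0m" or "\x1b[32m",
# which is what coloured log output typically uses.
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def strip_ansi(text: str) -> str:
    """Return `text` with ANSI colour/style sequences removed."""
    return ANSI_CSI.sub("", text)


print(strip_ansi("\x1b[32m\x1b[1minfo\x1b[0m step finished"))
# prints: info step finished
```

Configuring the logger to not emit colours in the first place would be the cleaner route where that is an option; the regex is only a fallback for text that already contains escape codes.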
TODO before merging