Ingest data from an API with Data Load Tool (DLT) via a Rust PyO3 plugin.
Data ingestion is the component of data engineering that involves receiving data from an outside source and loading it into one's own environment.
Common use cases for ingestion, in my experience in an enterprise setting, are threefold:
- Ingestion of data from a data supplier outside one's own organisation.
- Ingestion of data from an upstream team or environment in the data lifecycle.
- Migration of data between platforms or, less commonly, environments (dev/prod).
This definition lets us consider three components of complexity in ingestion:
One's own environment (the ingestion destination) is likely to differ significantly from the diverse ingestion sources.
Upstream data can arrive in many forms, including but not limited to: APIs (as sketched below), flat files of varied kinds (Excel, Parquet, .wav), databases, and message streams.
These sources can also vary widely in latency and schema.
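To make the API case concrete, here is a minimal sketch of pulling a paginated JSON API into a local DuckDB table with dlt. The endpoint URL and response fields (`results`, `next`) are hypothetical; the dlt calls themselves (`dlt.resource`, `dlt.pipeline`, `pipeline.run`) are the library's documented basics.

```python
import dlt
from dlt.sources.helpers import requests  # dlt's requests wrapper with built-in retries


@dlt.resource(name="events", write_disposition="append")
def events(api_url="https://api.example.com/v1/events"):  # hypothetical endpoint
    """Yield pages of records from a paginated JSON API."""
    url = api_url
    while url:
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()
        yield payload["results"]       # hypothetical response shape
        url = payload.get("next")      # follow pagination until the API is exhausted


pipeline = dlt.pipeline(
    pipeline_name="api_ingest",
    destination="duckdb",              # local destination, easy to swap later
    dataset_name="raw",
)

if __name__ == "__main__":
    print(pipeline.run(events()))
```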
One's own destination should be more consistent: within a data engineering team, it is industry best practice to store data in an open table format (Delta/Iceberg) in cloud file storage.
Data catalogs, which are effectively the previous pattern with more built-in metadata capabilities, are becoming more common but not universal.
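As a sketch of that destination pattern, recent dlt versions can write a resource as a Delta table through the filesystem destination. This assumes dlt's `table_format` parameter and the optional deltalake dependency; the bucket URL is hypothetical and credentials would live in secrets or environment variables.

```python
import dlt

BUCKET_URL = "s3://my-data-lake/raw"  # hypothetical bucket; credentials via secrets/env


@dlt.resource(name="orders", write_disposition="append", table_format="delta")
def orders():
    # Stand-in records; in practice these come from an upstream source.
    yield [
        {"order_id": 1, "status": "shipped"},
        {"order_id": 2, "status": "open"},
    ]


pipeline = dlt.pipeline(
    pipeline_name="lake_ingest",
    destination=dlt.destinations.filesystem(bucket_url=BUCKET_URL),
    dataset_name="sales",
)

if __name__ == "__main__":
    print(pipeline.run(orders()))
```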
Nonetheless, different teams can work with:
- Different clouds
- Different networking security
- Different data models
- Different standards of code, or inherited legacy cloud/code components
Even where best practice is followed, this diversity of systems means complexity remains.
Technical components are important, but team structures are often the largest source of complexity. In any organisation of reasonable size and geographic dispersal, ingestion between sources and destinations becomes increasingly complex. Team and communication interfaces that create complexity for ingestion include:
- Communication surrounding source/destination authentication (see the secrets sketch after this list).
- Communication about, and dependencies on, source availability.
- Verifying source data quality, and communicating with the source team to resolve issues.
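On the authentication point, dlt's approach is to externalise credentials into `.dlt/secrets.toml` or environment variables and inject them at runtime, so the team interface reduces to exchanging secret values rather than code changes. A minimal sketch, with a hypothetical endpoint and secret name:

```python
import dlt
from dlt.sources.helpers import requests


@dlt.resource(name="accounts")
def accounts(api_key: str = dlt.secrets.value):
    # dlt resolves `api_key` from .dlt/secrets.toml or the environment, e.g.
    # (section naming is indicative):
    #   [sources.accounts]
    #   api_key = "..."
    response = requests.get(
        "https://api.example.com/v1/accounts",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {api_key}"},
    )
    response.raise_for_status()
    yield response.json()
```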
---
Data ingestion can sound simple: move data from one place to another. However, the components above produce complex patterns; add a large number of diverse data sources, and data ingestion becomes a hard problem in need of common patterns for simplification.
"Frameworks"
- YAML engineering (declarative, configuration-driven pipelines; see the sketch after the criteria list below):
- No code:
  - Fivetran
  - Matillion
  - Databricks(?)
Criteria for comparing these frameworks:
- Ease of use
- Flexibility
- High in-built feature support
- Plugins/extensibility
- High performance
- Low/no cost
- In-built metadata/modelling
- (Nominal) DQ checks
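To illustrate the "YAML engineering" style, dlt itself ships a declarative REST API source where the pipeline is mostly configuration. This is a sketch assuming dlt's `rest_api_source` helper (present in recent releases); the base URL and resource names are hypothetical, and real configs usually add pagination and auth settings.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# The whole source is data, not code: one table per listed endpoint.
source = rest_api_source({
    "client": {"base_url": "https://api.example.com/v1/"},
    "resources": ["events", "users"],
})

pipeline = dlt.pipeline(
    pipeline_name="declarative_ingest",
    destination="duckdb",
    dataset_name="raw",
)

if __name__ == "__main__":
    print(pipeline.run(source))
```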
Data Load Tool (DLT)
DLT has great potential beyond simple ingestion. Within a data platform's total cost of ownership, storage is often the most cost-effective component, which makes it affordable to land and keep raw data in full.
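A minimal sketch of why that matters: if storage is the cheap part, one can afford to land every extracted batch append-only and defer deduplication downstream. The table and fields here are hypothetical; the append disposition and the `_dlt_load_id` column dlt stamps on each row are standard library behaviour.

```python
import dlt


@dlt.resource(name="sensor_readings", write_disposition="append")
def sensor_readings():
    # Hypothetical batch; in practice this comes from an API or file drop.
    yield [
        {"sensor": "a1", "reading": 21.4},
        {"sensor": "b7", "reading": 19.9},
    ]


# Every run appends a new batch; dlt stamps rows with _dlt_load_id,
# so full history is retained cheaply and deduplication can happen downstream.
pipeline = dlt.pipeline(
    pipeline_name="raw_history",
    destination="duckdb",
    dataset_name="raw",
)

if __name__ == "__main__":
    print(pipeline.run(sensor_readings()))
```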