
Data Ops



Introduction to DataOps

DataOps is a methodology that applies DevOps principles and practices from traditional software engineering to data analytics and data management workflows. It's designed to improve the speed, quality, and reliability of data analytics by fostering better collaboration between data teams and automating many of the processes involved in getting data from source systems to analytical insights.

At Hack for LA, DataOps is not just about moving data efficiently — it’s about ensuring the sustainability, accuracy, and usefulness of civic tech data pipelines in a volunteer-driven environment. While the core principles align with industry practices like automation, version control, and cross-functional collaboration, our implementation reflects the unique constraints and opportunities of a civic tech ecosystem.

Organizations adopting DataOps typically see faster time-to-insight, improved data quality, better collaboration between teams, and more reliable data systems. However, implementation can be challenging, as it requires cultural change, significant tooling investment, and often a restructuring of how data teams operate.

DataOps is typically built on four core components:

  1. Data Orchestration: the automation of the data pipeline, from ingestion through transformation to analysis.

  2. Data Governance: ensures that data is not only accurate and consistent, but also compliant with data security laws and protocols.

  3. Continuous Integration and Continuous Deployment (CI/CD): the rapid, iterative development and integration of data automation processes.

  4. Data Monitoring and Observability: checking the health and status of the data itself, as well as the pipelines that move it.

Implementing DataOps

Data Pipeline and Version Control

  • Assess current processes to identify pain points and set clear improvement goals

  • Implement version control for all data assets (scripts, queries, configurations) using Git

  • Automate pipelines with orchestration tools like Airflow to replace manual workflows (a minimal orchestration sketch follows this list)
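
As one illustration of orchestration, the sketch below shows a minimal Airflow DAG that chains ingestion, cleaning, and loading into a daily run. It assumes Airflow 2.4+; the DAG id, task names, and function bodies are placeholders rather than an actual HfLA pipeline.

```python
# Minimal Airflow DAG sketch: ingest -> clean -> load, run daily.
# All names here are placeholders for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    """Pull raw records from the source system (e.g., an open data API)."""
    ...

def clean():
    """Validate and standardize the raw records."""
    ...

def load():
    """Write the cleaned records to the analytics data store."""
    ...

with DAG(
    dag_id="example_civic_data_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",   # replaces a manual, ad hoc refresh
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The orchestrator enforces ordering and retries instead of a person.
    ingest_task >> clean_task >> load_task
```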

Quality & Reliability

  • Add automated data testing and monitoring, with alerts for anomalies or failures; common tools include Great Expectations and dbt tests

  • Establish CI/CD pipelines to automatically test and deploy data changes when code is updated (an example data test follows below)
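
As a sketch of what automated data testing can look like in CI, the example below uses pytest and pandas to check a resource listing before it is deployed. The file path, column names, and bounding box are illustrative assumptions, not an actual HfLA dataset.

```python
# test_food_resources.py -- illustrative pytest checks a CI job could run on each PR.
# The CSV path and column names are placeholders, not an actual HfLA dataset.
import pandas as pd
import pytest


@pytest.fixture(scope="module")
def listings():
    return pd.read_csv("data/food_resource_listings.csv")

def test_required_columns_present(listings):
    required = {"name", "address", "latitude", "longitude", "last_verified"}
    assert required.issubset(listings.columns)

def test_no_missing_coordinates(listings):
    assert listings[["latitude", "longitude"]].notna().all().all()

def test_coordinates_within_la_county(listings):
    # Rough bounding box for Los Angeles County; flags obviously bad geocodes.
    assert listings["latitude"].between(33.3, 34.9).all()
    assert listings["longitude"].between(-119.0, -117.5).all()
```

Wired into a CI workflow (for example, GitHub Actions), checks like these run on every pull request that touches the data or the transformation code and block the merge when an assertion fails.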

Cultural & Scaling

  • Build collaborative culture through regular team communication, shared dashboards, and cross-functional training

  • Iterate and scale gradually by starting with pilot projects, learning from experience, and expanding successful practices across the organization

DataOps at HfLA

How HfLA’s DataOps is Similar to Industry

  • Automation of repetitive tasks — We use scripts, cloud workflows, and dashboards to reduce manual intervention in data ingestion, cleaning, and reporting.
  • Version-controlled data schemas and ETL logic — GitHub repositories act as our source of truth for transformation logic, ensuring changes are auditable.
  • Cross-functional workflows — Data engineers, analysts, product managers, and designers work together, using agile-like sprints to align deliverables.

How HfLA’s DataOps Differs from Industry

  • Volunteer-driven execution — Contributors often onboard mid-project, so workflows must be well-documented and easy to learn.
  • Multi-project shared data governance — Centralized assets like the PeopleDepot schema support multiple teams, reducing duplication but requiring stricter change control.
  • Public-sector and open data integration — Our data sources often come from government agencies, requiring additional cleaning, validation, and ethical review before use.
  • Sustainability over speed — Unlike startups chasing rapid iteration, we balance innovation with long-term maintainability, ensuring future volunteers can continue the work.

Examples of DataOps in Action at HfLA

  • PeopleDepot — Centralized, governed data schema that consolidates people, program, and project data for multiple HfLA projects.
  • Food Oasis — Automated processes to validate, clean, and update food resource listings, blending automated checks with human review.
  • TDM Calculator — Structured datasets integrated into a web tool for the City of Los Angeles, with automated testing to ensure data integrity.
  • AI Skills Assessor Pipeline — Processing GitHub issue data through an AI-assisted classification pipeline, applying standardized skill labels and maintaining auditable logs for human-in-the-loop review.
  • Volunteer Engagement Dashboards — Pulling contribution, meeting attendance, and project activity data into Looker dashboards to support product managers in decision-making and resource planning.
  • Shared Drive File Deletion Monitor — Daily automated reporting on file deletions across shared drives, comparing current to prior snapshots and alerting stakeholders when thresholds are exceeded (a simplified sketch of the snapshot comparison appears after this list).
  • Product Board & Issue Health Dashboard — Aggregates GitHub issue activity across projects, surfacing backlog health, stale issues, and contributor load to help product managers prioritize work.
  • 311 Data Visualization — Ingests the City of Los Angeles 311 service request dataset from the public open data portal, cleans and standardizes the data, and powers a web app that visualizes requests on an interactive map. Users can filter by neighborhood council boundaries or by service category (e.g., streetlights, graffiti, illegal dumping, bulky item pickup). The DataOps workflow ensures the dataset stays up-to-date, supports performant filtering in the UI, and maintains geographic accuracy for civic engagement and advocacy.
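
As an example of the kind of lightweight automation behind these projects, here is a minimal sketch of the snapshot comparison used by a deletion monitor. The snapshot paths, CSV layout, and alert threshold are assumptions for illustration; the actual monitor works from shared drive listings and notifies stakeholders through its own channels.

```python
# Illustrative snapshot-diff for a file deletion monitor.
# Snapshot file names, CSV layout, and the alert threshold are assumptions;
# the real monitor works from shared drive listings rather than local CSVs.
import csv

DELETION_ALERT_THRESHOLD = 25  # assumed threshold for escalating to stakeholders

def load_snapshot(path: str) -> dict[str, str]:
    """Map file ID -> file name from a snapshot CSV with 'id' and 'name' columns."""
    with open(path, newline="") as f:
        return {row["id"]: row["name"] for row in csv.DictReader(f)}

def report_deletions(previous_path: str, current_path: str) -> list[str]:
    previous = load_snapshot(previous_path)
    current = load_snapshot(current_path)
    # A file present in the prior snapshot but missing now counts as deleted.
    deleted = sorted(name for file_id, name in previous.items() if file_id not in current)

    print(f"{len(deleted)} file(s) deleted since the last snapshot")
    if len(deleted) > DELETION_ALERT_THRESHOLD:
        print("ALERT: deletion count exceeds threshold; notify stakeholders")
    return deleted

if __name__ == "__main__":
    report_deletions("snapshots/yesterday.csv", "snapshots/today.csv")
```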

Key Tools & Practices We Use

  • GitHub for code, schema, and ETL version control
  • Google Sheets & Google Apps Script for lightweight data manipulation and reporting
  • Looker for dashboards and data visualization
  • Documented onboarding guides to enable new volunteers to pick up work with minimal disruption
  • Data validation scripts to catch anomalies early in the pipeline

Definitions

  • CI/CD (Continuous Integration/Continuous Deployment): the practice of iteratively building, testing, and deploying the automation around data, its models, and analytics
  • ETL (Extract, Transform, Load): the process by which raw data is extracted from its source, transformed into the appropriate format, and then loaded into a target data store (a minimal sketch appears after this list)
  • Schema governance: a process that ensures the structure and format of data stays consistent and accurate as it evolves across updates
  • Human-in-the-loop: the practice of embedding human judgement into an automated workflow, ensuring that data remains relevant and understandable to end users while retaining the speed of automation
  • Version control: the practice of tracking the different iterations and updates a codebase goes through; a tool that accomplishes this, such as Git (commonly hosted on GitHub), is called a Version Control System (VCS)
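
To make the ETL definition concrete, below is a minimal extract/transform/load sketch in Python. The source URL, column names, and SQLite target are placeholders for illustration only.

```python
# Minimal ETL sketch: extract a CSV, transform it, load it into SQLite.
# The source URL, column names, and target table are placeholders.
import sqlite3

import pandas as pd

# Extract: pull raw data from the source system.
raw = pd.read_csv("https://example.org/open-data/service_requests.csv")

# Transform: standardize names and types, and drop unusable rows.
clean = (
    raw.rename(columns=str.lower)
       .dropna(subset=["request_type", "created_date"])
       .assign(created_date=lambda df: pd.to_datetime(df["created_date"]))
)

# Load: write the cleaned data to the target data store.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("service_requests", conn, if_exists="replace", index=False)
```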
