**pages/docs/data-pipelines.mdx** (30 additions, 25 deletions)

import { Callout } from "nextra/components";

# Data Pipelines Overview

<Callout type="warning">
If you previously set up an older version of pipelines via the API and want to
manage it, see the docs [here](/docs/data-pipelines/old-pipelines).
</Callout>

<Callout type="info">
Customers on an Enterprise or Growth plan can access Data Pipeline as an
add-on package. See our [pricing page](https://mixpanel.com/pricing/) for
more details.
</Callout>

Data Pipelines is a feature that continuously exports data from your Mixpanel project to a Cloud Storage bucket or Data Warehouse of your choice. This feature is ideal for those who wish to perform SQL analysis on Mixpanel data within their own environment.
JSON pipelines export data as JSON files to a cloud storage bucket, providing a…

For specific configuration instructions, see our guides for each storage destination:

- [AWS S3](/docs/data-pipelines/integrations/aws-s3)
- [Google Cloud Storage](/docs/data-pipelines/integrations/gcp-gcs)
- [Azure Blob Storage](/docs/data-pipelines/integrations/azure-blob-storage)

Data is exported to the following structured paths in your bucket:

- Events: `<BUCKET_NAME>/<MIXPANEL_PROJECT_ID>/mp_master_event/<YEAR>/<MONTH>/<DAY>/`
- User profiles: `<BUCKET_NAME>/<MIXPANEL_PROJECT_ID>/mp_people_data/`
- Identity mappings: `<BUCKET_NAME>/<MIXPANEL_PROJECT_ID>/mp_identity_mappings_data/`
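
For example, here is a minimal sketch that builds the object-key prefix for one day's events. The bucket name and project ID are placeholders, and zero-padded `<MONTH>`/`<DAY>` parts are an assumption; check the actual keys in your bucket.

```python
from datetime import date

def event_export_prefix(bucket: str, project_id: int, day: date) -> str:
    """Build the object-key prefix for one day's exported events.

    Zero-padded month/day is an assumption; verify against your bucket.
    """
    return (
        f"{bucket}/{project_id}/mp_master_event/"
        f"{day.year}/{day.month:02d}/{day.day:02d}/"
    )

# Hypothetical bucket name and project ID, for illustration only.
print(event_export_prefix("my-export-bucket", 123456, date(2024, 5, 22)))
# -> my-export-bucket/123456/mp_master_event/2024/05/22/
```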

### Data Warehouse

JSON Pipelines also facilitate data export into tables, creating schemas that are inferred from your event data.

For detailed setup guides per destination, see:

- [BigQuery](/docs/data-pipelines/integrations/bigquery)
- [Redshift Spectrum](/docs/data-pipelines/integrations/redshift-spectrum)
- [Snowflake](/docs/data-pipelines/integrations/snowflake)

## Step 2: Creating the Pipeline

To check a pipeline's configuration:

1. Go to the **Integrations** page
2. Either:
   - Click on the pipeline name to view configuration at the top of the page, or
   - Click the **3-dot** menu and select **View Configuration**

### Why does the number of events in Mixpanel not match the number of exported events to my destination?

Discrepancies between the event counts in Mixpanel and those exported to your destination can occur for several reasons:

- **Data Sync**: If [Events Data Sync](/docs/data-pipelines/json-pipelines#events-data-sync) is not enabled or is unsupported for your pipeline, this could prevent some data from being exported.
- **Data Delay**: Late-arriving data may take up to one day to sync from Mixpanel to your destination, leading to temporary discrepancies.
- **Hidden Events**: Mixpanel exports all events, including those hidden in the Mixpanel UI via Lexicon. To reconcile differences in counts, check if the events in your destination include those hidden in the Mixpanel UI.

### How can I count events exported by Mixpanel in the warehouse?
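
As a hedged sketch, you might compare daily counts like this, assuming a BigQuery destination where events land in a table named `mp_master_event` with a Unix-seconds `time` column; your project, dataset, and column names may differ.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical project/dataset/table names; match them to your actual destination.
# Remember that exports include events hidden via Lexicon (see above).
QUERY = """
SELECT DATE(TIMESTAMP_SECONDS(time)) AS day, COUNT(*) AS events
FROM `my-gcp-project.my_dataset.mp_master_event`
GROUP BY day
ORDER BY day
"""

for row in client.query(QUERY).result():
    print(row.day, row.events)
```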

Mixpanel offers a 30-day free trial of the Data Pipelines, allowing you to create…

**Trial limitations**:

- Exports are scheduled on a daily basis only.
- The data synchronization feature is not available.
- Only one pipeline can be created per data source per project.
- Backfilled data is limited to one day prior to the creation date of the pipeline.

### Why can't I delete the trial pipeline?

You can’t delete a trial pipeline in Mixpanel — this is intentional. Each project is limited to one trial pipeline per data source, and keeping it prevents deleting/recreating trials to bypass limits. It also serves as a record that the trial has been used. Even after you upgrade to a paid Data Pipelines package, the trial pipeline remains visible and cannot be removed. This is a policy decision to preserve the integrity of the trial program, not a technical constraint.

### What is the Active Pipeline Limit?

Each project can have 2 recurring pipelines and 1 date-ranged backfill pipeline active.

To maintain optimal performance across our services, we limit the number of concurrently running pipeline steps to one per project. This approach ensures that each job, including those involving substantial backfills, waits its turn, preventing any single project from monopolizing resources and thus promoting fair scheduling among all customers.

### Is it possible to specify when pipeline exports run?

No. Hourly pipelines are scheduled to run approximately 30 minutes after the hour being exported ends, and daily pipelines are scheduled to run at approximately 12:30 AM (00:30) in the project’s timezone. For example, a project in the Pacific timezone with a daily events pipeline will start the export for data from 5/22 at 00:30 PT on 5/23.
> **Contributor review comment:** hey @xyn1nja, I think if we are going into more detail here, it's also important to call out that these are reference times, but there's still a 24 hour SLA. I see your point that they are approximate, but that's up to interpretation. I'd phrase it as: those are the target times, but there's still a 24 hour SLA, and we point to the SLA part of the docs.

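To make the schedule concrete, here is a small sketch computing the approximate start of a daily export in the project's timezone. This is an illustration only: these are target times, and exports are still covered by the Service Level Agreement below.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

def daily_export_start(day: date, project_tz: str) -> datetime:
    """Approximate start of the daily export for `day`:
    00:30 the following day, in the project's timezone."""
    nxt = day + timedelta(days=1)
    return datetime(nxt.year, nxt.month, nxt.day, 0, 30, tzinfo=ZoneInfo(project_tz))

# The example from the text: data for 5/22 starts exporting on 5/23 at 00:30 PT.
print(daily_export_start(date(2024, 5, 22), "America/Los_Angeles"))
```
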
### What is the Service Level Agreement?

Data arriving late is handled during a daily sync process the following day after…

### What should I set for the Account Name and Storage Integration when creating a Snowflake pipeline?

The Account Name should be set to your unique account identifier (e.g., `blah2321.us-west-2`), while the Storage Integration should be set to the name of the storage integration you created in Snowflake (e.g., `MIXPANEL_EXPORT_STORAGE_INTEGRATION`).

**pages/docs/data-pipelines/json-pipelines.mdx** (19 additions, 32 deletions)

import { Callout } from "nextra/components";

# JSON Pipelines

<Callout type="info">
Customers on an Enterprise or Growth plan can access Data Pipeline as an
add-on package. See our [pricing page](https://mixpanel.com/pricing/) for
more details.
</Callout>

JSON Pipelines are designed to export your Mixpanel data to supported data warehouses or object storage solutions. We maintain all properties in a high-level JSON format under the `properties` key for both events and user profile data.

Follow the instructions in the [Overview](/docs/data-pipelines).

## Destination and Date Range Restrictions

To prevent data duplication and conflicts, the system enforces the following rule: **you cannot create multiple event pipelines that export to the same destination with overlapping date ranges**.

For example, if you already have a pipeline exporting to BigQuery dataset "my_dataset" for dates January 1-31, you cannot create another pipeline exporting to the same dataset with dates January 15 - February 15, as the January 15-31 period would overlap.
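
As an illustration of the rule (a sketch of the check, not Mixpanel's actual validation code), two inclusive date ranges overlap exactly when each one starts on or before the other ends:

```python
from datetime import date

def ranges_overlap(a_start: date, a_end: date, b_start: date, b_end: date) -> bool:
    """Inclusive date ranges overlap iff each starts on or before the other ends."""
    return a_start <= b_end and b_start <= a_end

# The example from the text: Jan 1-31 vs Jan 15 - Feb 15 overlap on Jan 15-31.
print(ranges_overlap(date(2024, 1, 1), date(2024, 1, 31),
                     date(2024, 1, 15), date(2024, 2, 15)))  # True
```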

Event data stored in Mixpanel’s datastore and in the export destination can fall out of sync.

The discrepancy can be attributed to several different causes:

- Late data can arrive multiple days later due to a mobile client being offline.
- The import API can add data to previous days.
- Delete requests related to GDPR can cause deletion of events and event properties.

Mixpanel can detect changes in your data at the granularity of a day and replaces the old data with the latest version in both object storage and the data warehouse, where applicable. Data sync helps keep the data fresh and minimizes missing data points.

Note: Use the `resolved_distinct_id` from the identity mappings table instead of the `distinct_id` on the event.

Examples of querying the identity mapping table are available for [BigQuery](/docs/data-pipelines/integrations/bigquery#query-identity-mappings) and [Snowflake](/docs/data-pipelines/integrations/snowflake#query-identity-mappings).
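
The linked guides show the destination-specific SQL; the general shape (a sketch with hypothetical table names) is a left join that falls back to the event's own `distinct_id` when no mapping row exists:

```python
# Hypothetical table names; see the BigQuery/Snowflake guides above for the
# destination-specific versions.
RESOLVE_USERS_SQL = """
SELECT
  COALESCE(m.resolved_distinct_id, e.distinct_id) AS unified_id,
  COUNT(*) AS events
FROM mp_master_event AS e
LEFT JOIN mp_identity_mappings_data AS m
  ON e.distinct_id = m.distinct_id
GROUP BY unified_id
"""
```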

## Incremental Pipelines

As of 10 September 2025, all JSON pipelines in all regions (US/EU/IN) have been migrated to the improved incremental export system.

**What is affected?**

- **Events Pipelines with Sync Enabled Only**: This improvement only affects event pipelines that have sync enabled. People and identity mapping pipelines remain unchanged.

**Benefits**

- **Elimination of data sync delays**: No more waiting for daily sync processes to detect and fix data discrepancies.
- **Complete data export**: All events are exported without the risk of missing late-arriving data. Late-arriving events are automatically exported regardless of how late they arrive, eliminating the previous 10-day sync window restriction.

**Changes You May Notice**

- **Event Count Display**: The event count shown per task in the UI now represents the total events processed per batch rather than events exported per day or per hour. Since each batch can span multiple days, this number may appear different from before.
- **Backfill Process**: When a new pipeline is created, it will complete the full historical backfill first before starting regular processing. For example, if you create a pipeline on January 15th at 11 AM with a backfill from January 1st, the system will first export all events that arrived in Mixpanel before around January 15th 11 AM as the initial backfill, then begin processing any new events that arrive after around January 15th 11 AM, regardless of which date those events are for. Existing pipelines will have the last 10 days backfilled as part of the migration, and then the new incremental behavior will start.
- **Storage Location File Structure Changes**: Previously, sync would replace the files for a day whenever that day was re-synced. Without sync, Mixpanel no longer coalesces files for a day, so existing files are no longer updated or removed. Incremental pipelines instead add a new file with the events seen for each day on each run of the pipeline, so expect more small files (see the sketch after this list).
- **Pipeline Logs Reset**: Once your pipeline is migrated, the logging available in the UI will be reset, so log lines from past jobs will no longer be available. Only the new incremental jobs will be visible going forward.
- **Predictable Deletion Behavior**: In rare cases, the sync functionality meant that Mixpanel could re-sync days for which data was deleted, allowing the pipeline to also remove that data from your data warehouse. However, sync keeping your warehouse in line with deletions was never guaranteed behavior. With sync removed, this unreliable behavior is gone, and warehouse data owners are responsible for the deletion of all data on the warehouse side.
- **More Pre-shuffled Distinct IDs in Data**: The faster export and removal of late syncs can lead to more events exported with their original `distinct_id` as opposed to the resolved identifier seen in Mixpanel after we’ve shuffled the data. These discrepancies are expected in pipelines on both the old and new behavior and can be resolved using the ID mappings table exported from identity pipelines, as outlined in [our docs here](/docs/data-pipelines/json-pipelines#user-identity-resolution).
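
If the growing number of small files becomes a problem for downstream readers, one option is to coalesce them yourself on a schedule. A sketch, assuming you have synced one day's export locally and the files are JSON lines; all paths are placeholders:

```python
import glob

def coalesce_jsonl(day_dir: str, out_path: str) -> None:
    """Merge one day's small JSON-lines export files into a single file."""
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(f"{day_dir}/*.json")):
            with open(path) as src:
                for line in src:
                    out.write(line)

# Hypothetical local mirror of the export prefix scheme shown in the Overview.
coalesce_jsonl("exports/123456/mp_master_event/2024/05/22", "merged-2024-05-22.json")
```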