
DOCS-3198: offline data pipelines, SDK docs, hot data store fixes #4440

Open
wants to merge 4 commits into
base: main

Conversation

nathan-contino
Member

  • adds information about offline data pipelines (docs/data-ai/data/data-pipelines.md)
    • previously had no page in the docs besides some minimal CLI doc discussing pipelines; this introduces that missing page
    • includes examples in all supported languages (Python, Go, TypeScript) for basic data pipeline tasks
      • note that Go snippets link directly to Go API reference -- see SDK docs notes for more info
  • updates generated SDK documentation to include Python and TypeScript data pipelines APIs (no Flutter yet)
    • no Go because our data page doesn't seem to have or support any Go snippets (I'm guessing there's an out-of-scope story here)
  • yanked hot data store out into its own page (docs/data-ai/data/hot-data-store.md), since:
    • it has a lot in common with data pipelines, which could easily lead to user confusion
    • it was very buried (increasing the likelihood of user confusion)
    • the recent API improvements for data pipelines broke our existing hot data store examples
  • slight reorder of 'Advanced data capture and sync configurations' since some short-but-useful sections were buried all the way at the end of a very long page of complex, niche examples
  • note that the alias for hot data store doesn't work -- leaving it for now in the hopes that someone can suggest a better alternative for relocating a single section of a still-existing page to another page


netlify bot commented Jul 2, 2025

Deploy Preview for viam-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | b475d83 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/viam-docs/deploys/68659cd6aee1e900084fe33f |
| 😎 Deploy Preview | https://deploy-preview-4440--viam-docs.netlify.app |
Lighthouse
1 path audited
Performance: 53 (🔴 down 3 from production)
Accessibility: 100 (no change from production)
Best Practices: 100 (no change from production)
SEO: 92 (no change from production)
PWA: 70 (no change from production)

@viambot viambot added the safe to build This pull request is marked safe to build from a trusted zone label Jul 2, 2025
Collaborator

@JessamyT JessamyT left a comment


Started reviewing, but this is a philosophical one about the extent to which we want to redundantly document the APIs and CLI commands. It feels like a new precedent, since each item (for example, "Delete a pipeline") has a 1:1 matching section in the API docs as well as an example on the CLI page, so I'll leave it to Naomi to review.

@@ -7,7 +7,7 @@
| [`TabularDataBySQL`](/dev/reference/apis/data-client/#tabulardatabysql) | Obtain unified tabular data and metadata, queried with SQL. Make sure your API key has permissions at the organization level in order to use this. |
| [`TabularDataByMQL`](/dev/reference/apis/data-client/#tabulardatabymql) | Obtain unified tabular data and metadata, queried with MQL. |
| [`BinaryDataByFilter`](/dev/reference/apis/data-client/#binarydatabyfilter) | Retrieve optionally filtered binary data from Viam. |
| [`BinaryDataByIDs`](/dev/reference/apis/data-client/#binarydatabyids) | Retrieve binary data from Viam by `BinaryID`. |
| [`BinaryDataByIDs`](/dev/reference/apis/data-client/#binarydatabyids) | Retrieve binary data from the Viam by `BinaryID`. |
Collaborator


Suggested change
| [`BinaryDataByIDs`](/dev/reference/apis/data-client/#binarydatabyids) | Retrieve binary data from the Viam by `BinaryID`. |
| [`BinaryDataByIDs`](/dev/reference/apis/data-client/#binarydatabyids) | Retrieve binary data from Viam by `BinaryID`. |

Can you also update static/include/app/apis/overrides/protos/data.BinaryDataByIDs.md to fix this?

Viam stores the output of these pipelines in a cache so that you can access complex aggregation results more efficiently.
When late-arriving data syncs to Viam, pipelines automatically re-run to keep summaries accurate.

For example, you could use a data pipeline to pre-calculate results like "average temperature per hour".
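As a hedged sketch of what that pre-calculation could look like, here are MQL stages (expressed as plain Python dicts, the form the PR's snippets pass to `bson.encode`) for an hourly average temperature. The field names `component_name`, `time_received`, and `data.readings.temperature` are illustrative assumptions, not confirmed by this PR:

```python
# Sketch of MQL stages a pipeline could run to pre-calculate
# "average temperature per hour". Field names are assumptions.
hourly_avg_temperature = [
    # Keep only readings from the temperature sensor component.
    {"$match": {"component_name": "temperature-sensor"}},
    {
        "$group": {
            # Truncate each reading's timestamp to the hour, so all
            # readings within the same hour collapse into one summary
            # document per pipeline run.
            "_id": {"$dateTrunc": {"date": "$time_received", "unit": "hour"}},
            "avg_temp": {"$avg": "$data.readings.temperature"},
        }
    },
]
```

Each stage dict would then be `bson.encode`d and passed when creating the pipeline.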
Collaborator


Suggested change
For example, you could use a data pipeline to pre-calculate results like "average temperature per hour".
For example, you could use a data pipeline to pre-calculate results like "average temperature per hour."

or better yet

Suggested change
For example, you could use a data pipeline to pre-calculate results like "average temperature per hour".
For example, you could use a data pipeline to pre-calculate results such as average temperature per hour.

@@ -2587,7 +2622,7 @@ User-defined metadata is billed as data.

**Parameters:**

- `robot_id` ([str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)) (required): The ID of the robot with which to associate the user-defined metadata. You can obtain your robot ID from your machine's page.
- `robot_id` ([str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)) (required): The ID of the robot with which to associate the user-defined metadata. You can obtain your robot ID from the machine page.
Collaborator


This seems less helpful

@@ -0,0 +1 @@
Get the configuration for multiple data pipelines.
Collaborator

@JessamyT JessamyT Jul 2, 2025


Suggested change
Get the configuration for multiple data pipelines.
Get a list of configurations of all data pipelines for an organization.

Member


Maybe "for" instead of "of": Get a list of configurations for all data pipelines for an organization.

Collaborator


Sure -- edited my suggestion!

@@ -0,0 +1 @@
Get the configuration of a data pipeline.
Collaborator


Consider linking to your new page on [data pipeline]

@JessamyT
Collaborator

JessamyT commented Jul 2, 2025

When you address the merge conflicts with /generated/app.md, note that we changed a couple things manually in #4431 but I created this upstream PR to get them to stick.

Member

@vijayvuyyuru vijayvuyyuru left a comment


Thanks Nathan!


## Query

Queries typically execute on blog storage.
Member


Suggested change
Queries typically execute on blog storage.
Queries typically execute on blob storage.


### Query limitations

You cannot use the following MongoDB aggregation operators when querying your hot data store:
Member


So this is true for both standard storage and hot data. But there are a lot more operators we don't support. This link contains the list of the ones we allow: https://github.com/viamrobotics/app/blob/e706a2e3ea57a252f102b37e0ab2b9d6eeed51e0/datamanagement/tabular_data_by_query.go#L64

@@ -23,3 +23,8 @@
| [`ConfigureDatabaseUser`](/dev/reference/apis/data-client/#configuredatabaseuser) | Configure a database user for the Viam organization’s MongoDB Atlas Data Federation instance. |
| [`AddBinaryDataToDatasetByIDs`](/dev/reference/apis/data-client/#addbinarydatatodatasetbyids) | Add the `BinaryData` to the provided dataset. |
| [`RemoveBinaryDataFromDatasetByIDs`](/dev/reference/apis/data-client/#removebinarydatafromdatasetbyids) | Remove the BinaryData from the provided dataset. |
| [`GetDataPipeline`](/dev/reference/apis/data-client/#getdatapipeline) | Get the configuration of a data pipeline. |
Member


Suggested change
| [`GetDataPipeline`](/dev/reference/apis/data-client/#getdatapipeline) | Get the configuration of a data pipeline. |
| [`GetDataPipeline`](/dev/reference/apis/data-client/#getdatapipeline) | Get the configuration for a data pipeline. |

| [`GetDataPipeline`](/dev/reference/apis/data-client/#getdatapipeline) | Get the configuration of a data pipeline. |
| [`ListDataPipelines`](/dev/reference/apis/data-client/#listdatapipelines) | Get the configuration for multiple data pipelines. |
| [`CreateDataPipeline`](/dev/reference/apis/data-client/#createdatapipeline) | Create a data pipeline. |
| [`DeleteDataPipeline`](/dev/reference/apis/data-client/#deletedatapipeline) | Delete a data pipeline. |
Member


(And all of its data)

| [`ListDataPipelines`](/dev/reference/apis/data-client/#listdatapipelines) | Get the configuration for multiple data pipelines. |
| [`CreateDataPipeline`](/dev/reference/apis/data-client/#createdatapipeline) | Create a data pipeline. |
| [`DeleteDataPipeline`](/dev/reference/apis/data-client/#deletedatapipeline) | Delete a data pipeline. |
| [`ListDataPipelineRuns`](/dev/reference/apis/data-client/#listdatapipelineruns) | List the statuses of individual executions of a data pipeline. |
Member


Not just status. So maybe List the individual executions of a data pipeline? Something like that?

@@ -0,0 +1 @@
List the statuses of individual executions of a data pipeline.
Member


same comment

@@ -0,0 +1 @@
Get the configuration for multiple data pipelines.
Member


Maybe "for" instead of "of": Get a list of configurations for all data pipelines for an organization.

{{% /tab %}}
{{< /tabs >}}

### Update a pipeline
Member


I'm tempted to not include this in the documentation. We don't want people changing pipeline schedules or queries after a pipeline has started inserting query results. Might end up just being the name that we allow them to update. Is it ok to leave this out?


### Disable a pipeline

Disabling a data pipeline lets you pause data pipeline execution without fully deleting the pipeline configuration from your organization.
Member


Maybe note that any time windows that pass while the pipeline is disabled will not contain data and will not be backfilled if the pipeline is enabled again

{{% /tab %}}
{{< /tabs >}}

### Delete a pipeline
Member


Deleting a pipeline will also delete its run history and the pipeline results collection.

{{< tabs >}}
{{% tab name="Python" %}}

Use [`DataClient.ListDataPipelineRuns`](/dev/reference/apis/data-client/#listdatapipelineruns) to view the statuses of past executions of a pipeline:
Member


I wouldn't say past. A run might not be complete yet. Also like I mentioned elsewhere, shows more than status

bson.encode({"$match": {"component_name": "temperature-sensor"}}),
bson.encode({
"$group": {
"_id": "$location_id",
Member


Do not do this. In fact, I think we should probably call out that you should not specify an id if your last stage is a group stage unless the id is guaranteed to be unique for every pipeline run. Otherwise there will be duplicate id errors, and only the first pipeline result will save successfully
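A hedged sketch of the reviewer's point, reusing the field names from the snippet above: including the time window in the `$group` `_id` makes the id unique per pipeline run, so successive runs insert fresh documents instead of colliding with earlier results. The `time_received` field and hourly window are assumptions for illustration:

```python
# Sketch: make the final $group stage's _id unique per run by combining
# the grouping key with the run's time window. A constant or reused _id
# (e.g. just "$location_id") would produce duplicate-key errors, and
# only the first run's results would save successfully.
group_stage = {
    "$group": {
        "_id": {
            "location_id": "$location_id",
            # Assumed field: the start of the window the run covers.
            "window_start": {
                "$dateTrunc": {"date": "$time_received", "unit": "hour"}
            },
        },
        "avg_temp": {"$avg": "$data.readings.temperature"},
    }
}
```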

@npentrel npentrel self-requested a review July 4, 2025 14:42
@npentrel
Collaborator

npentrel commented Jul 4, 2025

(I'll hold off on review until comments are resolved)

6 participants