[Hold] Partition Endpoint with REST: new recommended asynchronous calling pattern #621

Draft: wants to merge 4 commits into base: main
25 changes: 20 additions & 5 deletions api-reference/partition/examples.mdx
@@ -20,7 +20,7 @@ Here's how you can modify partition strategy for a PDF file, and select an alter
<Accordion title="POST">
<UseIngestOrPlatformInstead />
```bash POST
- curl -X 'POST' $UNSTRUCTURED_API_URL \
+ curl -X 'POST' "$UNSTRUCTURED_API_URL/v1/partition_async" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
@@ -29,6 +29,9 @@ Here's how you can modify partition strategy for a PDF file, and select an alter
-F 'vlm_model_provider=openai' \
-F 'vlm_model=gpt-4o'
```

To get the results of this request, you must make a follow-up request with the job ID that is returned in the response. See the `/v1/partition_async/<job_id>` example
in [Process an individual file by making a direct POST request](/api-reference/partition/post-requests).
</Accordion>
<Accordion title="Python SDK">
<UseIngestOrPlatformInstead />
@@ -191,7 +194,7 @@ For better OCR results, you can specify what languages your document is in using
<Accordion title="POST">
<UseIngestOrPlatformInstead />
```bash POST
- curl -X 'POST' $UNSTRUCTURED_API_URL \
+ curl -X 'POST' "$UNSTRUCTURED_API_URL/v1/partition_async" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
@@ -200,6 +203,9 @@ For better OCR results, you can specify what languages your document is in using
-F 'vlm_model_provider=openai' \
-F 'vlm_model=gpt-4o' \
-F 'languages=kor'
```

To get the results of this request, you must make a follow-up request with the job ID that is returned in the response. See the `/v1/partition_async/<job_id>` example
in [Process an individual file by making a direct POST request](/api-reference/partition/post-requests).
</Accordion>
<Accordion title="Python SDK">
<UseIngestOrPlatformInstead />
@@ -359,14 +365,17 @@ Set the `coordinates` parameter to `true` to add this field to the elements in t
<Accordion title="POST">
<UseIngestOrPlatformInstead />
```bash POST
- curl -X 'POST' $UNSTRUCTURED_API_URL \
+ curl -X 'POST' "$UNSTRUCTURED_API_URL/v1/partition_async" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
-F 'files=@sample-docs/layout-parser-paper.pdf' \
-F 'coordinates=true' \
-F 'strategy=hi_res'
```

To get the results of this request, you must make a follow-up request with the job ID that is returned in the response. See the `/v1/partition_async/<job_id>` example
in [Process an individual file by making a direct POST request](/api-reference/partition/post-requests).
</Accordion>
<Accordion title="Python SDK">
<UseIngestOrPlatformInstead />
@@ -530,7 +539,7 @@ This can be helpful if you'd like to use the IDs as a primary key in a database,
<Accordion title="POST">
<UseIngestOrPlatformInstead />
```bash POST
- curl -X 'POST' $UNSTRUCTURED_API_URL \
+ curl -X 'POST' "$UNSTRUCTURED_API_URL/v1/partition_async" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
@@ -540,6 +549,9 @@ This can be helpful if you'd like to use the IDs as a primary key in a database,
-F 'vlm_model_provider=openai' \
-F 'vlm_model=gpt-4o'
```

To get the results of this request, you must make a follow-up request with the job ID that is returned in the response. See the `/v1/partition_async/<job_id>` example
in [Process an individual file by making a direct POST request](/api-reference/partition/post-requests).
</Accordion>
<Accordion title="Python SDK">
<UseIngestOrPlatformInstead />
@@ -703,7 +715,7 @@ By default, the `chunking_strategy` is set to `None`, and no chunking is perform
<Accordion title="POST">
<UseIngestOrPlatformInstead />
```bash POST
- curl -X 'POST' $UNSTRUCTURED_API_URL \
+ curl -X 'POST' "$UNSTRUCTURED_API_URL/v1/partition_async" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
@@ -714,6 +726,9 @@ By default, the `chunking_strategy` is set to `None`, and no chunking is perform
-F 'vlm_model_provider=openai' \
-F 'vlm_model=gpt-4o'
```

To get the results of this request, you must make a follow-up request with the job ID that is returned in the response. See the `/v1/partition_async/<job_id>` example
in [Process an individual file by making a direct POST request](/api-reference/partition/post-requests).
</Accordion>
<Accordion title="Python SDK">
<UseIngestOrPlatformInstead />
86 changes: 74 additions & 12 deletions api-reference/partition/overview.mdx
@@ -5,7 +5,7 @@ title: Overview
The Unstructured Partition Endpoint, part of the [Unstructured API](/api-reference/overview), is intended for rapid prototyping of Unstructured's
various partitioning strategies, with limited support for chunking. It is designed to work only with processing of local files, one file
at a time. Use the [Unstructured Workflow Endpoint](/api-reference/workflow/overview) for production-level scenarios, file processing in
- batches, files and data in remote locations, generating embeddings, applying post-transform enrichments, using the latest and
+ large batches, files and data in remote locations, generating embeddings, applying post-transform enrichments, using the latest and
highest-performing models, and for the highest quality results at the lowest cost.

## Get started
@@ -52,45 +52,107 @@ import SharedPagesBilling from '/snippets/general-shared-text/pages-billing.mdx'

## Quickstart

- This example uses the [curl](https://curl.se/) utility on your local machine to call the Unstructured Partition Endpoint. It sends a source (input) file from your local machine to the Unstructured Partition Endpoint which then delivers the processed data to a destination (output) location, also on your local machine. Data is processed on Unstructured-hosted compute resources.
+ This example uses the [curl](https://curl.se/) utility on your local machine to call the Unstructured Partition Endpoint. It sends one or more source (input) files from your local machine to the Unstructured Partition Endpoint, which then delivers the processed data to a destination (output) location, also on your local machine. Data is processed on Unstructured-hosted compute resources.

- If you do not have a source file readily available, you could use for example a sample PDF file containing the text of the United States Constitution,
+ If you do not have source files readily available, you could use, for example, a sample PDF file containing the text of the United States Constitution,
available for download from [https://constitutioncenter.org/media/files/constitution.pdf](https://constitutioncenter.org/media/files/constitution.pdf).

<Steps>
<Step title="Set environment variables">
From your terminal or Command Prompt, set the following two environment variables.

- - Replace `<your-unstructured-api-url>` with the Unstructured Partition Endpoint URL, which is `https://api.unstructuredapp.io/general/v0/general`
+ - Replace `<your-unstructured-api-url>` with the Unstructured Partition Endpoint base URL, which is `https://api.unstructuredapp.io`
- Replace `<your-unstructured-api-key>` with your Unstructured API key, which you generated earlier on this page.

```bash
export UNSTRUCTURED_API_URL=<your-unstructured-api-url>
export UNSTRUCTURED_API_KEY="<your-unstructured-api-key>"
```
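
Before moving on, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch rather than part of any Unstructured tooling; the `require_env` helper name is made up for illustration.

```bash
# Hypothetical helper (not part of any Unstructured tooling): succeeds only
# if the variable whose name is passed as $1 is set and non-empty.
require_env() {
  # Copy the value of the named variable into _value (empty if unset).
  eval "_value=\${$1:-}"
  if [ -z "$_value" ]; then
    echo "Missing environment variable: $1" >&2
    return 1
  fi
}

require_env UNSTRUCTURED_API_URL && require_env UNSTRUCTURED_API_KEY \
  && echo "Environment looks ready."
```

If either variable is missing, the helper prints its name to standard error instead of the confirmation message.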
</Step>
- <Step title="Run the curl command">
- Run the following `curl` command, replacing `<path/to/file>` with the path to the source file on your local machine.
+ <Step title="Create a partition job">
+ Run the following `curl` command, replacing `<path/to/file-1>` with the path to the source file on your local machine. To specify
+ multiple files, repeat the `--form 'files=@<path/to/file-N>;type=application/pdf'` option in this command for each additional file.

- If the source file is not a PDF file, then remove `;type=application/pdf` from the final `--form` option in this command.
+ If the source file is not a PDF file, then remove `;type=application/pdf` from the related `--form` option in this command.

```bash
curl --request 'POST' \
- "$UNSTRUCTURED_API_URL" \
+ "$UNSTRUCTURED_API_URL/v1/partition_async" \
--header 'accept: application/json' \
--header "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
--header 'Content-Type: multipart/form-data' \
--form 'content_type=string' \
--form 'strategy=vlm' \
--form 'vlm_model_provider=openai' \
--form 'vlm_model=gpt-4o' \
--form 'output_format=application/json' \
- --form 'files=@<path/to/file>;type=application/pdf'
+ --form 'files=@<path/to/file-1>;type=application/pdf' \
+ --form 'files=@<path/to/file-N>;type=application/pdf'
```

The results are printed to your terminal or Command Prompt with a format similar to the following:

```json
{
"partition_id": "<job-id>",
"partition_status": "scheduled",
"partition_status_message": "Partition job created"
}
```

Make a note of the `<job-id>` value, as you will need it in the next step.
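
Rather than copying the value by hand, you can pull it out of the response in the shell. The following is a minimal sketch, assuming the response has a string-valued `partition_id` field as shown above. The `extract_partition_id` helper name is hypothetical, and the `sed` pattern is only a quick text match, not a JSON parser.

```bash
# Hypothetical helper: reads a JSON response on stdin and prints the value
# of its "partition_id" field (first match only).
extract_partition_id() {
  sed -n 's/.*"partition_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Canned response shaped like the one above; "abc123" is a made-up job ID.
response='{"partition_id":"abc123","partition_status":"scheduled","partition_status_message":"Partition job created"}'
JOB_ID=$(printf '%s\n' "$response" | extract_partition_id)
echo "$JOB_ID"   # prints abc123
```

In practice, `response` would hold the output of the `curl` command from this step instead of the canned string.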
</Step>
<Step title="Check the status of the job">
Run the following `curl` command, replacing `<job_id>` with the `partition_id` value from the previous step.

```bash
curl --request 'GET' \
"$UNSTRUCTURED_API_URL/v1/partition_async/<job_id>" \
--header 'accept: application/json' \
--header "unstructured-api-key: $UNSTRUCTURED_API_KEY"
```

The results are printed to your terminal or Command Prompt with a format similar to the following:

```json
{
"partition_id": "<job-id>",
"partition_status": "in_progress",
"partition_status_message": "Started processing partition request",
"elements": null
}
```

If the job is still in progress, repeat the `curl` command until the job is complete.
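
Repeating the command by hand can be replaced with a small polling loop. The following is a sketch under stated assumptions: `fetch_job` and `wait_for_job` are hypothetical helper names, the delay and attempt limit are arbitrary defaults, and the loop treats only `scheduled` and `in_progress` (the two status values shown on this page) as "not finished yet". Any other status, including an error, ends the loop.

```bash
# Hypothetical helper: prints the job's current JSON response.
# $1 is the job ID; the request is the same GET call shown in this step.
fetch_job() {
  curl --silent --request 'GET' \
    "$UNSTRUCTURED_API_URL/v1/partition_async/$1" \
    --header 'accept: application/json' \
    --header "unstructured-api-key: $UNSTRUCTURED_API_KEY"
}

# Hypothetical helper: repeats the request while the status is "scheduled"
# or "in_progress", then prints the last response.
# $1 = job ID, $2 = delay in seconds (default 5), $3 = max attempts (default 60).
wait_for_job() {
  _attempt=0
  while [ "$_attempt" -lt "${3:-60}" ]; do
    _attempt=$((_attempt + 1))
    _response=$(fetch_job "$1") || return 1
    # Quick text match for the status field; this is not a JSON parser.
    _status=$(printf '%s\n' "$_response" \
      | sed -n 's/.*"partition_status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
      | head -n 1)
    case "$_status" in
      scheduled | in_progress) sleep "${2:-5}" ;;
      *) printf '%s\n' "$_response"; return 0 ;;
    esac
  done
  echo "Gave up after $_attempt attempts; last status: $_status" >&2
  return 1
}

# Usage (replace <job_id> as in the command above):
#   wait_for_job "<job_id>" 10
```

The attempt limit keeps the loop from running forever if a job never leaves the `in_progress` state.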
</Step>
<Step title="Examine the results">
- After you run the `curl` command, the results are printed to your terminal or Command Prompt. The command might take several
- minutes to complete.
+ If the job has completed successfully, running the preceding command prints the processed data to your terminal or Command Prompt within the
+ `elements` array, for example:

```json
{
"partition_id": "<job-id>",
"partition_status": "...",
"partition_status_message": "...",
"elements": [
{
"type": "...",
"element_id": "...",
"text": "...",
"metadata": {
"...": "..."
}
},
{
"type": "...",
"element_id": "...",
"text": "...",
"metadata": {
"...": "..."
}
}
]
}
```

By default, the JSON is printed without indenting or other whitespace. You can pretty-print the JSON output by using utilities such as [jq](https://jqlang.org/tutorial/) in future command runs.
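
As a small illustration, piping JSON through jq's identity filter (`.`) re-indents it. The sketch below assumes jq is installed and otherwise falls back to Python's built-in `json.tool` module; the `pretty_json` helper name and the sample JSON are made up for illustration.

```bash
# Hypothetical helper: pretty-prints JSON from stdin with jq when it is
# installed, otherwise falls back to Python's built-in json.tool module.
pretty_json() {
  if command -v jq >/dev/null 2>&1; then
    jq '.'
  else
    python3 -m json.tool
  fi
}

# Stand-in for a real response from the Partition Endpoint.
printf '%s\n' '{"partition_id":"<job-id>","elements":null}' | pretty_json
```

With a real request, adding `--silent` to the `curl` command and piping its output into the helper keeps curl's progress meter out of the terminal output.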
