Extract Document

Description

This endpoint extracts structured data from document files based on a user-defined prompt. It accepts input as URLs or base64-encoded file content and uses vision-capable Large Language Models (LLMs) to interpret the documents and extract the relevant information.

The following file extensions are supported: .123 .602 .abw .bib .bmp .cdr .cgm .cmx .csv .cwk .dbf .dif .doc .docm .docx .dot .dotm .dotx .dxf .emf .eps .epub .fodg .fodp .fods .fodt .fopd .gif .htm .html .hwp .jpeg .jpg .key .ltx .lwp .mcw .met .mml .mw .numbers .odd .odg .odm .odp .ods .odt .otg .oth .otp .ots .ott .pages .pbm .pcd .pct .pcx .pdb .pdf .pgm .png .pot .potm .potx .ppm .pps .ppt .pptm .pptx .psd .psw .pub .pwp .pxl .ras .rtf .sda .sdc .sdd .sdp .sdw .sgl .slk .smf .stc .std .sti .stw .svg .svm .swf .sxc .sxd .sxg .sxi .sxm .sxw .tga .tif .tiff .txt .uof .uop .uos .uot .vdx .vor .vsd .vsdm .vsdx .webp .wb2 .wk1 .wks .wmf .wpd .wpg .wps .xbm .xhtml .xls .xlsb .xlsm .xlsx .xlt .xltm .xltx .xlw .xml .xpm .zabw

Endpoint

POST https://app.dumplingai.com/api/v1/extract-document

Headers

Content-Type: application/json
Authorization: Bearer <API_KEY> (required)

Request Body

{
  "inputMethod": "string", // Required. Either "url" or "base64".
  "files": ["string"], // Required. Array of URLs or base64-encoded file contents.
  "fileExtension": "string", // Optional. The file extension of the input files (e.g., ".pdf", ".docx"). Default: ".pdf".
  "prompt": "string", // Required. The prompt describing the data to extract.
  "jsonMode": boolean // Optional. Whether to return the result in JSON format. Default: false.
}
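As a rough illustration of the request body above, the following Python sketch assembles the payload and rejects obviously invalid input before anything is sent. The function name build_extract_payload and the client-side checks are assumptions made for this example; only the field names and their required/optional status come from the schema above.

```python
from typing import List, Optional

# Values allowed for "inputMethod" according to the request body schema.
VALID_INPUT_METHODS = {"url", "base64"}


def build_extract_payload(
    input_method: str,
    files: List[str],
    prompt: str,
    file_extension: Optional[str] = None,
    json_mode: bool = False,
) -> dict:
    """Assemble the JSON body for POST /api/v1/extract-document.

    Hypothetical helper: the validation here is a client-side convenience,
    not a description of how the server validates requests.
    """
    if input_method not in VALID_INPUT_METHODS:
        raise ValueError('inputMethod must be "url" or "base64"')
    if not files:
        raise ValueError("files must contain at least one entry")
    if not prompt:
        raise ValueError("prompt must not be empty")

    payload = {
        "inputMethod": input_method,
        "files": list(files),
        "prompt": prompt,
        "jsonMode": json_mode,
    }
    # fileExtension is optional; leaving it out lets the documented
    # default (".pdf") apply.
    if file_extension is not None:
        payload["fileExtension"] = file_extension
    return payload
```

Leaving fileExtension out of the payload, rather than sending a null value, keeps the request within the documented schema.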

Responses

Success (200)

Returns the extracted data based on the provided prompt, along with the original prompt, page and file counts, and the credits used.

{
  "results": "string", // Extracted data based on the prompt
  "prompt": "string", // The original prompt used for extraction
  "pages": number, // Total number of pages processed
  "fileCount": number, // Number of files processed
  "creditUsage": number // Total credits used for this request
}

Response headers:

Content-Type: application/json
X-RateLimit-Limit: The rate limit for the user.
X-RateLimit-Remaining: The remaining number of requests for the user.
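A minimal sketch of reading the success response into a typed object, assuming the body has exactly the five fields listed above. The names ExtractResult and parse_extract_response are hypothetical and exist only for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractResult:
    """Typed view of the 200 response body (hypothetical class name)."""

    results: str
    prompt: str
    pages: int
    file_count: int
    credit_usage: int


def parse_extract_response(body: str) -> ExtractResult:
    """Parse the JSON text of a successful response.

    Raises KeyError if one of the documented fields is missing.
    """
    data = json.loads(body)
    return ExtractResult(
        results=data["results"],
        prompt=data["prompt"],
        pages=data["pages"],
        file_count=data["fileCount"],
        credit_usage=data["creditUsage"],
    )
```

Mapping the camelCase keys onto snake_case attributes is purely a Python style choice; the wire format stays as documented.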

Bad Request (400)

Returned if the request is invalid or the total file size exceeds the limit.

{
  "error": "Error message describing the issue"
}

Unauthorized (401)

Returned if the API key is invalid or missing.

{
  "error": "Invalid or missing Authorization header"
}

Internal Server Error (500)

Returned if there’s an error during the document extraction process.

{
  "error": "Failed to extract document: [error details]"
}

Example Request

Example with PDF (default)

curl -X POST https://app.dumplingai.com/api/v1/extract-document \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "inputMethod": "url",
  "files": ["https://example.com/sample.pdf"],
  "prompt": "Extract the main topics and their descriptions from this PDF.",
  "jsonMode": false
}'

Example with Word document

curl -X POST https://app.dumplingai.com/api/v1/extract-document \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "inputMethod": "url",
  "files": ["https://example.com/report.docx"],
  "fileExtension": ".docx",
  "prompt": "Extract all tables and their data from this Word document.",
  "jsonMode": true
}'
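The curl examples above could be mirrored in Python roughly as follows, using only the standard library. The helper names make_request and send, and the 120-second timeout, are assumptions for this sketch; the URL, method, and headers are taken from the Endpoint and Headers sections.

```python
import json
import urllib.request

API_URL = "https://app.dumplingai.com/api/v1/extract-document"


def make_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON body.

    The timeout is an arbitrary value chosen for this sketch.
    urllib raises urllib.error.HTTPError for 4xx and 5xx statuses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    example = make_request(
        "YOUR_API_KEY",
        {
            "inputMethod": "url",
            "files": ["https://example.com/report.docx"],
            "fileExtension": ".docx",
            "prompt": "Extract all tables and their data from this Word document.",
            "jsonMode": True,
        },
    )
    print(example.get_method(), example.full_url)
```

Separating request construction from sending makes it possible to inspect or log the request without spending credits.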

Notes

The maximum total file size for all documents combined is 100MB.
The maximum file size for a single document is 30MB.
The maximum number of pages that can be processed in a single request is approximately 3,000.
If you are unsure of the file type, set fileExtension to "autodetect".
The maximum output is 8,192 tokens.
Credit usage:
- Base cost: 10 credits
- Additional 1 credit per page processed
The total credit usage is returned in the response as creditUsage.
If using the URL method, ensure the file is publicly accessible.
The jsonMode parameter determines whether the output is formatted as JSON (true) or plain text (false).
We apply vision-based LLMs to all pages of the document for extraction.
The endpoint can process multiple document files in a single request.
Temporary files are created during processing and are deleted after use.
You can get a list of supported file extensions by calling:

GET /api/v1/extract-document
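The size limits and credit formula in the notes above can be checked on the client before a request is sent. The sketch below makes several assumptions that the notes do not confirm: that 1MB means 1024 * 1024 bytes, that the limits apply to the raw file bytes rather than to the base64 text, that the API expects plain standard base64 without a data URI prefix, and that the base cost is charged once per request. All function and constant names are hypothetical.

```python
import base64
from typing import List

MB = 1024 * 1024  # assumption: the notes do not say whether MB is 10**6 or 2**20
MAX_SINGLE_FILE_BYTES = 30 * MB
MAX_TOTAL_BYTES = 100 * MB
BASE_CREDITS = 10
CREDITS_PER_PAGE = 1


def encode_files(raw_files: List[bytes]) -> List[str]:
    """Check the documented size limits, then base64-encode each file.

    Assumption: limits are measured on the raw bytes, before encoding.
    """
    total = 0
    for index, raw in enumerate(raw_files):
        if len(raw) > MAX_SINGLE_FILE_BYTES:
            raise ValueError(f"file {index} exceeds the 30MB single-file limit")
        total += len(raw)
    if total > MAX_TOTAL_BYTES:
        raise ValueError("files exceed the 100MB combined limit")
    # Standard base64 alphabet, returned as ASCII text for the JSON body.
    return [base64.b64encode(raw).decode("ascii") for raw in raw_files]


def estimate_credits(pages: int) -> int:
    """Estimate credits: base cost plus one credit per page.

    Assumption: the base cost applies once per request.
    """
    return BASE_CREDITS + CREDITS_PER_PAGE * pages
```

The estimate is only a planning aid; the creditUsage field in the response remains the authoritative figure.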

Rate Limiting

Rate limit headers (X-RateLimit-Limit and X-RateLimit-Remaining) are included in the response to indicate the user’s current rate limit status.
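A small sketch of reading these two headers from a response, assuming their values are plain integers (the format is not specified here). The function name read_rate_limit is hypothetical.

```python
from typing import Mapping, Optional, Tuple


def read_rate_limit(headers: Mapping[str, str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (limit, remaining) from response headers.

    Header names are matched case-insensitively. A missing or
    non-integer value yields None for that position.
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name)
        if value is None:
            return None
        try:
            return int(value)
        except ValueError:
            return None

    return as_int("x-ratelimit-limit"), as_int("x-ratelimit-remaining")
```

A client might pause or slow down when the second value reaches zero, though the reset behavior is not documented here.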

Error Handling

If a required parameter (files or prompt) is missing, a 400 Bad Request error is returned.
If the total file size exceeds 100MB, a 400 Bad Request error is returned.
If there’s an error during extraction, a 500 Internal Server Error is returned with details about the failure.
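One way a client might turn these responses into exceptions is sketched below. The class ExtractDocumentError, the function error_from_response, and the category labels are all hypothetical; only the status codes and the "error" field come from the Responses section.

```python
import json

# Labels chosen for this sketch; the API itself only defines the status codes.
CATEGORIES = {
    400: "bad_request",    # invalid request or total file size over the limit
    401: "unauthorized",   # invalid or missing API key
    500: "server_error",   # failure during document extraction
}


class ExtractDocumentError(Exception):
    """Raised for a non-200 response (hypothetical class name)."""

    def __init__(self, status: int, category: str, message: str) -> None:
        super().__init__(f"{status} {category}: {message}")
        self.status = status
        self.category = category
        self.message = message


def error_from_response(status: int, body: str) -> ExtractDocumentError:
    """Build an exception from a status code and the raw response body.

    Falls back to the raw body when it is not a JSON object
    with an "error" field.
    """
    try:
        message = json.loads(body).get("error", body)
    except (ValueError, AttributeError):
        message = body
    category = CATEGORIES.get(status, "unexpected")
    return ExtractDocumentError(status, category, str(message))
```

Keeping the status code on the exception lets calling code decide for itself how to react, for example by fixing the request on 400 or checking the API key on 401.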

Security and Privacy

Uploaded files are temporarily stored and then deleted after processing.
