# Extract Document

> Extract structured text and metadata from uploaded documents or document URLs.

## Description

This endpoint extracts structured data from document files based on a user-defined prompt. It accepts input as URLs or base64-encoded file content and uses vision-capable Large Language Models (LLMs) to interpret the documents and extract the relevant information.

The following file extensions are supported:

`.123` `.602` `.abw` `.bib` `.bmp` `.cdr` `.cgm` `.cmx` `.csv` `.cwk` `.dbf` `.dif` `.doc` `.docm` `.docx` `.dot` `.dotm` `.dotx` `.dxf` `.emf` `.eps` `.epub` `.fodg` `.fodp` `.fods` `.fodt` `.fopd` `.gif` `.htm` `.html` `.hwp` `.jpeg` `.jpg` `.key` `.ltx` `.lwp` `.mcw` `.met` `.mml` `.mw` `.numbers` `.odd` `.odg` `.odm` `.odp` `.ods` `.odt` `.otg` `.oth` `.otp` `.ots` `.ott` `.pages` `.pbm` `.pcd` `.pct` `.pcx` `.pdb` `.pdf` `.pgm` `.png` `.pot` `.potm` `.potx` `.ppm` `.pps` `.ppt` `.pptm` `.pptx` `.psd` `.psw` `.pub` `.pwp` `.pxl` `.ras` `.rtf` `.sda` `.sdc` `.sdd` `.sdp` `.sdw` `.sgl` `.slk` `.smf` `.stc` `.std` `.sti` `.stw` `.svg` `.svm` `.swf` `.sxc` `.sxd` `.sxg` `.sxi` `.sxm` `.sxw` `.tga` `.tif` `.tiff` `.txt` `.uof` `.uop` `.uos` `.uot` `.vdx` `.vor` `.vsd` `.vsdm` `.vsdx` `.webp` `.wb2` `.wk1` `.wks` `.wmf` `.wpd` `.wpg` `.wps` `.xbm` `.xhtml` `.xls` `.xlsb` `.xlsm` `.xlsx` `.xlt` `.xltm` `.xltx` `.xlw` `.xml` `.xpm` `.zabw`
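
As a rough illustration of how a client might choose a `fileExtension` value, the sketch below derives the extension from a URL or file name and falls back to `"autodetect"` when it is not recognized. The helper name `infer_file_extension` and the `SUPPORTED_EXCERPT` set are hypothetical; the set holds only a small excerpt of the list above and is not the full list.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: a small excerpt of the supported extensions listed above.
# Extend it, or replace it with the complete list, as needed.
SUPPORTED_EXCERPT = {".pdf", ".doc", ".docx", ".xlsx", ".pptx", ".png", ".jpg", ".txt"}


def infer_file_extension(url_or_path: str) -> str:
    """Return a candidate `fileExtension` value, or "autodetect" if unknown."""
    # urlparse strips any query string or fragment from the path component.
    path = urlparse(url_or_path).path
    suffix = PurePosixPath(path).suffix.lower()
    return suffix if suffix in SUPPORTED_EXCERPT else "autodetect"


print(infer_file_extension("https://example.com/report.DOCX?dl=1"))
print(infer_file_extension("https://example.com/download"))
```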

## Endpoint

```
POST https://app.dumplingai.com/api/v1/extract-document
```

## Headers

* **Content-Type:** `application/json`
* **Authorization:** `Bearer <API_KEY>` (required)

## Request Body

```json
{
  "inputMethod": "string", // Required. Either "url" or "base64".
  "files": ["string"], // Required. Array of publicly accessible URLs or base64-encoded file contents.
  "fileExtension": "string", // Optional. The file extension of the input files (e.g., ".pdf", ".docx") or "autodetect". Default: ".pdf".
  "prompt": "string", // Required. The prompt describing the data to extract.
  "jsonMode": boolean // Optional. Whether to return the result in JSON format. Default: false.
}
```
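
The request body above can be assembled and checked on the client before it is sent. The following is a minimal sketch, not an official client: the function name `build_extract_request` and its checks are assumptions that mirror the required fields described above.

```python
import json

VALID_INPUT_METHODS = {"url", "base64"}


def build_extract_request(input_method, files, prompt,
                          file_extension=None, json_mode=False):
    """Build the JSON body for the request (hypothetical helper)."""
    if input_method not in VALID_INPUT_METHODS:
        raise ValueError('inputMethod must be "url" or "base64"')
    if not files:
        raise ValueError("files must contain at least one entry")
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")

    payload = {
        "inputMethod": input_method,
        "files": list(files),
        "prompt": prompt,
        "jsonMode": json_mode,
    }
    # fileExtension is optional; when omitted, the documented default applies.
    if file_extension is not None:
        payload["fileExtension"] = file_extension
    return payload


body = json.dumps(build_extract_request(
    "url",
    ["https://example.com/sample.pdf"],
    "Extract the main topics and their descriptions from this PDF.",
))
print(body)
```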

## Responses

### Success (200)

Returns the extracted data based on the provided prompt, along with additional information.

```json
{
  "results": "string", // Extracted data based on the prompt
  "prompt": "string", // The original prompt used for extraction
  "pages": number, // Total number of pages processed
  "fileCount": number, // Number of files processed
  "creditUsage": number // Total credits used for this request
}
```

Response headers:

* **Content-Type:** `application/json`
* **X-RateLimit-Limit:** The request limit that applies to the user.
* **X-RateLimit-Remaining:** The number of requests the user has remaining.
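
One way to consume the success response is to map it onto a small typed structure. The sketch below is illustrative only: the `ExtractResult` class and the sample body are made up for this example, and `parsed_results` assumes that `results` contains JSON text when the request used `jsonMode: true`.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ExtractResult:
    results: str
    prompt: str
    pages: int
    file_count: int
    credit_usage: int

    def parsed_results(self) -> Any:
        # Assumption: with jsonMode enabled, `results` is a JSON-encoded string.
        return json.loads(self.results)


def parse_extract_response(body: str) -> ExtractResult:
    """Convert a 200 response body into an ExtractResult."""
    data = json.loads(body)
    return ExtractResult(
        results=data["results"],
        prompt=data["prompt"],
        pages=data["pages"],
        file_count=data["fileCount"],
        credit_usage=data["creditUsage"],
    )


# Made-up sample body, shaped like the response documented above.
sample_body = json.dumps({
    "results": json.dumps({"topics": ["Pricing"]}),
    "prompt": "Extract the main topics.",
    "pages": 2,
    "fileCount": 1,
    "creditUsage": 120,
})
result = parse_extract_response(sample_body)
print(result.pages, result.credit_usage, result.parsed_results())
```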

### Bad Request (400)

Returned if the request is invalid or the total file size exceeds the limit.

```json
{
  "error": "Error message describing the issue"
}
```

### Unauthorized (401)

Returned if the API key is invalid or missing.

```json
{
  "error": "Invalid or missing Authorization header"
}
```

### Internal Server Error (500)

Returned if there's an error during the document extraction process.

```json
{
  "error": "Failed to extract document: [error details]"
}
```

## Example Request

### Example with PDF (default)

```bash
curl -X POST https://app.dumplingai.com/api/v1/extract-document \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "inputMethod": "url",
  "files": ["https://example.com/sample.pdf"],
  "prompt": "Extract the main topics and their descriptions from this PDF.",
  "jsonMode": false
}'
```

### Example with Word document

```bash
curl -X POST https://app.dumplingai.com/api/v1/extract-document \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "inputMethod": "url",
  "files": ["https://example.com/report.docx"],
  "fileExtension": ".docx",
  "prompt": "Extract all tables and their data from this Word document.",
  "jsonMode": true
}'
```
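
### Example with base64 input (Python sketch)

Both examples above use the URL input method. For local files, a base64 request might look like the following sketch. It assumes the API expects plain base64 text without a `data:` URI prefix, which the documentation does not state explicitly. `build_base64_payload` and `post_extract` are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import base64
import json
import urllib.request

API_URL = "https://app.dumplingai.com/api/v1/extract-document"


def build_base64_payload(content: bytes, file_extension: str, prompt: str,
                         json_mode: bool = False) -> dict:
    """Encode raw file bytes and wrap them in a request body."""
    # Assumption: plain base64 text, no "data:...;base64," prefix.
    encoded = base64.b64encode(content).decode("ascii")
    return {
        "inputMethod": "base64",
        "files": [encoded],
        "fileExtension": file_extension,
        "prompt": prompt,
        "jsonMode": json_mode,
    }


def post_extract(payload: dict, api_key: str, timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (not executed here, since it needs a real file and API key):
#   from pathlib import Path
#   payload = build_base64_payload(Path("report.docx").read_bytes(), ".docx",
#                                  "Extract all tables and their data.", True)
#   response = post_extract(payload, "YOUR_API_KEY")
```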

## Notes

* The maximum total file size for all documents combined is 100 MB.
* The maximum file size for a single document is 30 MB.
* A single request can process at most approximately 3,000 pages.
* If you are unsure of the file type, set `fileExtension` to `"autodetect"`.
* The maximum output is 8,192 tokens.
* Credit usage:
  * Base cost: 100 credits
  * Additional 10 credits per page processed
* The total credit usage is returned in the response as `creditUsage`.
* If using the URL method, ensure the file is publicly accessible.
* The `jsonMode` parameter determines whether the output is formatted as JSON (true) or plain text (false).
* We apply vision-based LLMs to all pages of the document for extraction.
* The endpoint can process multiple document files in a single request.
* Temporary files are created during processing and are deleted after use.
* You can get a list of supported file extensions by calling:

```
GET /api/v1/extract-document
```
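
The credit formula and size limits in the notes above can be checked before a request is sent. The sketch below makes two assumptions that the notes do not settle: that 1 MB means 1024 × 1024 bytes, and that the 100-credit base cost is charged once per request rather than once per file. The function names are hypothetical.

```python
BASE_CREDITS = 100
CREDITS_PER_PAGE = 10

MB = 1024 * 1024  # Assumption: binary megabytes.
MAX_SINGLE_FILE_BYTES = 30 * MB
MAX_TOTAL_BYTES = 100 * MB


def estimate_credits(pages: int) -> int:
    """Estimate credit usage: base cost plus a per-page cost."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return BASE_CREDITS + CREDITS_PER_PAGE * pages


def check_size_limits(file_sizes):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    for index, size in enumerate(file_sizes):
        if size > MAX_SINGLE_FILE_BYTES:
            problems.append(f"file {index} exceeds the 30 MB single-file limit")
    if sum(file_sizes) > MAX_TOTAL_BYTES:
        problems.append("combined size exceeds the 100 MB total limit")
    return problems


print(estimate_credits(12))  # 100 + 10 * 12 = 220
print(check_size_limits([31 * MB, 5 * MB]))
```

The actual charge is reported in the `creditUsage` field of the response, so an estimate like this is only useful for budgeting ahead of time.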

## Rate Limiting

Rate limit headers (`X-RateLimit-Limit` and `X-RateLimit-Remaining`) are included in the response to indicate the user's current rate limit status.
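
A client can read these headers to decide whether to pause before sending more requests. The sketch below assumes that both header values are plain integers, which the documentation does not spell out. `RateLimitStatus`, `parse_rate_limit`, and `is_exhausted` are hypothetical names.

```python
from typing import Mapping, NamedTuple, Optional


class RateLimitStatus(NamedTuple):
    limit: Optional[int]
    remaining: Optional[int]


def parse_rate_limit(headers: Mapping[str, str]) -> RateLimitStatus:
    """Read the rate limit headers, tolerating missing or malformed values."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    def to_int(name: str) -> Optional[int]:
        value = lowered.get(name)
        if value is None:
            return None
        try:
            return int(value)
        except ValueError:
            return None

    return RateLimitStatus(
        limit=to_int("x-ratelimit-limit"),
        remaining=to_int("x-ratelimit-remaining"),
    )


def is_exhausted(status: RateLimitStatus) -> bool:
    """True when the headers report that no requests remain."""
    return status.remaining is not None and status.remaining <= 0


status = parse_rate_limit({"X-RateLimit-Limit": "60", "X-RateLimit-Remaining": "0"})
print(status, is_exhausted(status))
```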

## Error Handling

* If a required parameter (`inputMethod`, `files`, or `prompt`) is missing, a 400 Bad Request error is returned.
* If the total file size exceeds 100 MB, a 400 Bad Request error is returned.
* If there's an error during extraction, a 500 Internal Server Error is returned with details about the failure.
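
The cases above could be handled on the client roughly as follows. This is a sketch under assumptions: `ExtractDocumentError` is a hypothetical exception type, and treating 5xx responses as worth retrying is a client-side policy choice, not something the API documentation promises.

```python
import json


class ExtractDocumentError(Exception):
    """Hypothetical error type carrying the HTTP status and API message."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def error_from_response(status: int, body: str) -> ExtractDocumentError:
    """Build an error from a non-200 response, using the `error` field if present."""
    try:
        message = json.loads(body).get("error", body)
    except (ValueError, AttributeError):
        # The body was not JSON, or was JSON but not an object.
        message = body
    return ExtractDocumentError(status, message)


def is_retryable(status: int) -> bool:
    # Assumption: 4xx errors need a corrected request, while 5xx may be transient.
    return status >= 500


error = error_from_response(400, '{"error": "Missing prompt"}')
print(error, is_retryable(error.status))
```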

## Security and Privacy

* Uploaded files are temporarily stored and then deleted after processing.


## OpenAPI

````yaml POST /api/v1/extract-document
openapi: 3.0.3
info:
  title: DumplingAI API
  version: 1.0.0
  description: >
    REST API for DumplingAI's content intelligence and automation platform.

    All endpoints are grouped under `/api/v1`; most are secured via Bearer API
    keys unless an operation explicitly sets `security: []`.
servers:
  - url: https://app.dumplingai.com
    description: Production
security:
  - bearerAuth: []
tags:
  - name: YouTube
    description: Access metadata, search results, and transcripts from YouTube.
  - name: TikTok
    description: Retrieve TikTok profile, video, follower, and transcript data.
  - name: LinkedIn
    description: Programmatically fetch LinkedIn company and profile data.
  - name: Search
    description: Search-orientated endpoints spanning web, news, maps, and autocomplete.
  - name: Google
    description: Integrations with Google business listings and location data.
  - name: Scraping
    description: Webpage capture, crawling, and structured content extraction utilities.
  - name: Documents
    description: Document processing, conversion, and metadata utilities.
  - name: AI
    description: DumplingAI agent and knowledge base endpoints.
  - name: Developer Tools
    description: Utilities for executing sandboxed code via API.
paths:
  /api/v1/extract-document:
    post:
      tags:
        - Documents
      summary: Extract document to structured JSON
      description: >-
        Extract structured text and metadata from uploaded documents or document
        URLs.
      operationId: extractDocument
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ExtractDocumentRequest'
            examples:
              default:
                value:
                  inputMethod: url
                  files:
                    - https://example.com/sample.pdf
                  prompt: >-
                    Extract the main topics and their summaries from this
                    document.
                  jsonMode: false
      responses:
        '200':
          description: Document extraction returned.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ExtractDocumentResponse'
        '400':
          description: Invalid request payload.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '401':
          description: Missing or invalid API key.
        '500':
          description: Unexpected server error.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
components:
  schemas:
    ExtractDocumentRequest:
      type: object
      required:
        - inputMethod
        - files
        - prompt
      properties:
        inputMethod:
          $ref: '#/components/schemas/FileInputMethod'
        files:
          type: array
          minItems: 1
          description: >-
            Array of publicly accessible URLs or base64-encoded document
            contents.
          items:
            type: string
        fileExtension:
          type: string
          description: >-
            File extension for the provided documents (e.g., '.pdf', '.docx', or
            'autodetect'). Defaults to '.pdf'.
        prompt:
          type: string
          description: >-
            Instructions that describe the structured data to extract from the
            documents.
        jsonMode:
          type: boolean
          description: When true, requests the model to respond with JSON-formatted output.
          default: false
        requestSource:
          $ref: '#/components/schemas/RequestSource'
      additionalProperties: false
    ExtractDocumentResponse:
      type: object
      required:
        - results
        - prompt
        - pages
        - fileCount
        - creditUsage
      properties:
        results:
          type: string
          description: Model output returned from the extraction prompt.
        prompt:
          type: string
          description: Prompt that was sent to the extraction model.
        pages:
          type: integer
          description: Total number of document pages processed.
        fileCount:
          type: integer
          description: Number of documents included in the request.
        creditUsage:
          type: integer
          description: Credits consumed while processing the request.
      additionalProperties: false
    ErrorResponse:
      type: object
      properties:
        error:
          type: string
          description: Human-readable description of what went wrong.
      required:
        - error
    FileInputMethod:
      type: string
      description: >-
        Indicates whether binary content is supplied via URL or base64-encoded
        string.
      enum:
        - url
        - base64
    RequestSource:
      type: string
      description: Optional identifier describing where the API request originated.
      enum:
        - API
        - WEB
        - MAKE_DOT_COM
        - ZAPIER
        - N8N
        - PLAYGROUND
        - DEFAULT_AUTOMATION
        - AGENT_PREVIEW
        - AGENT_LIVE
        - AUTOPILOT
        - STUDIO
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: API Key

````