Description

This endpoint extracts structured data from a specified URL according to a user-defined schema. It draws on both visual (screenshot) and textual (HTML) information from the webpage.

Endpoint

POST /api/v1/extract

Headers

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY> (required)

Request Body

{
  "url": "string", // Required. The URL to extract data from.
  "schema": "object" // Required. The schema defining the data to extract.
}
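
A minimal Python sketch of assembling this request with the standard library. The host comes from the Example Request in this document; `build_extract_request` is a hypothetical helper name, the client-side checks are assumptions about what "required" means, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Host taken from the Example Request in this document.
API_URL = "https://app.dumplingai.com/api/v1/extract"


def build_extract_request(url: str, schema: dict, api_key: str) -> urllib.request.Request:
    """Assemble (but do not send) a POST request for the extract endpoint."""
    # Both fields are documented as required; these local checks are an assumption.
    if not url:
        raise ValueError("url is required")
    if not isinstance(schema, dict):
        raise ValueError("schema must be an object")

    body = json.dumps({"url": url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_extract_request(
    "https://example.com",
    {"title": "string", "price": "number"},
    api_key="YOUR_API_KEY",
)
print(request.get_method(), request.full_url)
```

Keeping construction separate from sending makes it possible to inspect the headers and body before any credits are spent.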

Responses

Success (200)

Returns the data extracted according to the provided schema, along with the URL of a screenshot of the page.

{
  "screenshotUrl": "string", // URL of the captured screenshot
  "results": "object" // Extracted data matching the provided schema
}

Response Headers

  • Content-Type: application/json
  • X-RateLimit-Limit: The rate limit for the user.
  • X-RateLimit-Remaining: The remaining number of requests for the user.

Bad Request (400)

Returned if the request is invalid.

{
  "error": "Error message describing the issue"
}

Internal Server Error (500)

Returned if there’s an error during the extraction process.

{
  "error": "Failed to extract URL: [error details]"
}
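
One way a client might tell the three documented outcomes apart, sketched in Python. `ExtractError` and `parse_extract_response` are hypothetical names, and the fallback for a non-JSON body is an assumption, since the document only describes JSON responses.

```python
import json


class ExtractError(Exception):
    """Raised for a non-200 response; carries the status and the error message."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def parse_extract_response(status: int, body: str) -> dict:
    """Return the parsed success payload, or raise ExtractError otherwise."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Assumption: a non-JSON body is treated as an error and reported verbatim.
        raise ExtractError(status, body) from None

    if status == 200:
        return payload

    # Both documented error shapes (400 and 500) carry a single "error" string.
    if isinstance(payload, dict):
        message = payload.get("error", "unknown error")
    else:
        message = str(payload)
    raise ExtractError(status, message)


try:
    parse_extract_response(400, '{"error": "Error message describing the issue"}')
except ExtractError as exc:
    print(exc.status, exc.message)
```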

Example Request

curl -X POST https://app.dumplingai.com/api/v1/extract \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "url": "https://example.com",
  "schema": {
    "title": "string",
    "description": "string",
    "price": "number",
    "rating": "number"
  }
}'

Example Response

{
  "screenshotUrl": "https://storage.example.com/screenshots/abcdef123456.png",
  "results": {
    "title": "Example Product",
    "description": "This is an example product description.",
    "price": 29.99,
    "rating": 4.5
  }
}
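
As a rough illustration, a client could check that `results` lines up with the schema it sent. The sketch below assumes the schema values are the type names used in the example ("string" and "number"); `check_results` is a hypothetical helper, and whether the API can return null or omit fields is not stated in this document.

```python
def check_results(schema: dict, results: dict) -> list:
    """Return human-readable mismatches between the schema and the results."""
    # Assumed mapping from the example's type names to Python types.
    type_map = {"string": str, "number": (int, float)}
    problems = []
    for field, type_name in schema.items():
        if field not in results:
            problems.append(f"missing field: {field}")
            continue
        expected = type_map.get(type_name)
        if expected is None:
            continue  # Unknown type name: nothing to check against.
        value = results[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field}: expected {type_name}, got {type(value).__name__}"
            )
    return problems


schema = {"title": "string", "description": "string", "price": "number", "rating": "number"}
results = {
    "title": "Example Product",
    "description": "This is an example product description.",
    "price": 29.99,
    "rating": 4.5,
}
print(check_results(schema, results))
```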

Notes

  • The extraction process uses both visual (screenshot) and textual (HTML) information from the webpage.
  • The extracted data is limited to a single object. Extracting lists of objects may require separate handling.
  • The maximum allowed content for extraction is 100,000 tokens. If the content exceeds this limit, it will be truncated.
  • If the URL doesn’t include a protocol, https:// will be used by default.
  • This endpoint uses 25 credits per request.
  • The extraction is performed using a combination of web scraping and AI-powered analysis (Claude 3 Haiku model).
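
Two of these notes translate directly into client-side helpers. The Python sketch below approximates the default-protocol rule and estimates credit usage for a batch of URLs; both function names are hypothetical, and the server's actual protocol detection may differ from the regular expression used here.

```python
import re

CREDITS_PER_REQUEST = 25  # From the Notes: 25 credits per request.

# Matches a leading scheme such as "http://" or "https://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def with_default_protocol(url: str) -> str:
    """Prepend https:// when the URL has no protocol, mirroring the documented default."""
    url = url.strip()
    return url if _SCHEME.match(url) else "https://" + url


def credits_needed(request_count: int) -> int:
    """Credits consumed by the given number of extract requests."""
    if request_count < 0:
        raise ValueError("request_count must be non-negative")
    return CREDITS_PER_REQUEST * request_count


urls = ["example.com", "http://example.org/item/1"]
print([with_default_protocol(u) for u in urls], credits_needed(len(urls)))
```

Because each request returns a single object, extracting from several pages means one request per URL, so the credit estimate scales linearly with the number of URLs.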

Rate Limiting

Rate limit headers (X-RateLimit-Limit and X-RateLimit-Remaining) are included in the response to indicate the user’s current rate limit status.
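
A short Python sketch of reading these headers on the client. It assumes the header values are plain integers, which this document does not state; the time window for the limit is also unspecified, so `should_pause` (a hypothetical helper) only reports whether any requests remain.

```python
def rate_limit_status(headers: dict) -> tuple:
    """Return (limit, remaining) as ints, or None for a missing or malformed value."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}

    def as_int(name):
        try:
            return int(lowered.get(name))
        except (TypeError, ValueError):
            return None

    return as_int("x-ratelimit-limit"), as_int("x-ratelimit-remaining")


def should_pause(headers: dict) -> bool:
    """True when the headers report that no requests are left."""
    _, remaining = rate_limit_status(headers)
    return remaining is not None and remaining <= 0


sample = {"X-RateLimit-Limit": "60", "X-RateLimit-Remaining": "0"}
print(rate_limit_status(sample), should_pause(sample))
```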