Web Scraping Tutorial

This tutorial covers the current DumplingAI scraping workflow using the scrape, crawl, and extract endpoints.

Choose the right endpoint

  • Scrape: fetch one URL and return cleaned content in markdown, html, or screenshot format
  • Crawl: fetch multiple pages from one site with configurable depth and limit
  • Extract: return structured JSON from a URL using a schema you define
  • Screenshot: capture a visual snapshot when you need page imagery instead of text

Prerequisites

Before you begin, make sure you have:
  • A DumplingAI account with an API key
  • Basic knowledge of HTTP requests
  • A way to call the API, such as curl, Node.js, or Python

1. Scrape a single page

Start with POST /api/v1/scrape when you want the content of one page.
curl -X POST https://app.dumplingai.com/api/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/article",
    "format": "markdown",
    "cleaned": true,
    "renderJs": true
  }'
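The same request can be sent from Python. This is a minimal sketch using only the standard library; the endpoint and JSON fields mirror the curl call above, and the response is assumed to be a JSON body (scrape is defined but not invoked here, since it needs a real API key).

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your DumplingAI API key

def build_scrape_payload(url, fmt="markdown", cleaned=True, render_js=True):
    """Mirror the JSON body from the curl example above."""
    return {"url": url, "format": fmt, "cleaned": cleaned, "renderJs": render_js}

def scrape(url, **options):
    """POST one URL to /api/v1/scrape and return the parsed JSON response."""
    req = urllib.request.Request(
        "https://app.dumplingai.com/api/v1/scrape",
        data=json.dumps(build_scrape_payload(url, **options)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Build (but do not send) a payload to confirm the defaults.
payload = build_scrape_payload("https://example.com/article")
```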

How to tune scrape

  • Set format to markdown for LLM-friendly text, html when you need original markup, or screenshot for image output
  • Keep cleaned: true unless you specifically need the noisier raw page structure
  • Set renderJs: false for faster, cheaper requests when the page does not require client-side rendering

2. Crawl multiple pages from a site

Use POST /api/v1/crawl when you need more than one page.
curl -X POST https://app.dumplingai.com/api/v1/crawl \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "limit": 10,
    "depth": 2,
    "format": "markdown"
  }'
The response includes crawl metadata plus a results array, where each entry contains the page url, status, and extracted content.
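A typical next step is to walk the results array and keep only the pages that fetched successfully. This sketch assumes field names beyond results, url, and status — in particular "content" and the status value "completed" — based on the description above; check an actual response before relying on them.

```python
# Hypothetical crawl response shaped like the description above;
# the "content" key and "completed" status value are assumptions.
sample_response = {
    "results": [
        {"url": "https://example.com", "status": "completed", "content": "# Home"},
        {"url": "https://example.com/about", "status": "failed", "content": None},
    ]
}

def successful_pages(response):
    """Return (url, content) pairs for entries whose status is completed."""
    return [
        (page["url"], page["content"])
        for page in response.get("results", [])
        if page.get("status") == "completed"
    ]

pages = successful_pages(sample_response)
```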

When to use crawl

  • Use it for site audits, blog ingestion, and multi-page research
  • Tune depth to control how far DumplingAI follows links from the starting URL
  • Tune limit to cap page count and keep usage predictable

3. Extract structured data from a page

Use POST /api/v1/extract when you want JSON output instead of raw content.
curl -X POST https://app.dumplingai.com/api/v1/extract \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/product",
    "schema": {
      "title": "string",
      "price": "number",
      "availability": "string"
    }
  }'
The response returns:
  • results: the extracted JSON object that matches your schema
  • screenshotUrl: a link to the screenshot of the page captured during extraction
This is the right choice when you want structured fields like product data, listings, or page metadata without writing your own parser.
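Before sending extracted fields downstream, it is worth confirming they match the schema you requested. This sketch checks a result against the simple schema from the curl example; the mapping of schema type names to Python types ("string" to str, "number" to int or float) is an assumption for illustration, not a documented DumplingAI contract.

```python
# Assumed mapping from schema type names to Python types.
TYPE_MAP = {"string": str, "number": (int, float)}

def matches_schema(data, schema):
    """True if every schema field is present in data with the expected type."""
    return all(
        key in data and isinstance(data[key], TYPE_MAP[expected])
        for key, expected in schema.items()
    )

schema = {"title": "string", "price": "number", "availability": "string"}
result = {"title": "Example Widget", "price": 19.99, "availability": "in stock"}
ok = matches_schema(result, schema)
```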

Recommended workflow

  1. Start with scrape to confirm the page is reachable and the output is useful.
  2. Move to crawl if you need multiple pages from the same site.
  3. Use extract only when you need structured fields returned as JSON.
  4. Use screenshot when the visual layout matters more than the text.

Best practices

  • Start with a small limit when testing crawls
  • Disable renderJs if a page loads correctly without JavaScript
  • Validate extracted JSON before sending it downstream
  • Be deliberate about which pages you fetch to keep usage efficient
  • Respect the target site’s policies before automating scraping
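One concrete way to honor the last point is to check a site's robots.txt before queuing URLs for scraping or crawling. This sketch uses Python's standard urllib.robotparser; the user-agent string and rules text are placeholders.

```python
from urllib import robotparser

def allowed_by_robots(robots_txt, url, user_agent="my-scraper"):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example rules: everything is allowed except paths under /private/.
rules = """User-agent: *
Disallow: /private/
"""
public_ok = allowed_by_robots(rules, "https://example.com/article")
private_ok = allowed_by_robots(rules, "https://example.com/private/page")
```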

Next steps

  • Review the Scrape API for request and response details
  • Review the Crawl API for multi-page scraping
  • Review the Extract API for schema-based extraction
  • Use MCP Server if you want to access these tools from an AI client