# Web Scraping Tutorial

> Learn how to extract and process web content using DumplingAI

This tutorial covers the current DumplingAI scraping workflow using the `scrape`, `crawl`, and `extract` endpoints.

## Choose the right endpoint

* **Scrape**: fetch one URL and return cleaned content in `markdown`, `html`, or `screenshot` format
* **Crawl**: fetch multiple pages from one site with configurable `depth` and `limit`
* **Extract**: return structured JSON from a URL using a schema you define
* **Screenshot**: capture a visual snapshot when you need page imagery instead of text

## Prerequisites

Before you begin, make sure you have:

* A DumplingAI account with an API key
* Basic knowledge of HTTP requests
* A way to call the API, such as `curl`, Node.js, or Python
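If you call the API from Python, you may prefer to keep the key out of your source files. The sketch below builds the two headers used throughout this tutorial from an environment variable. The variable name `DUMPLINGAI_API_KEY` and the helper `build_headers` are hypothetical names chosen for this example, not part of DumplingAI.

```python
import os


def build_headers(api_key=None):
    """Return the JSON and bearer-token headers used in this tutorial's requests."""
    # DUMPLINGAI_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("DUMPLINGAI_API_KEY")
    if not key:
        raise RuntimeError("Set DUMPLINGAI_API_KEY or pass api_key explicitly")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }


# Passing the key explicitly is handy in tests; in real use, rely on the environment.
print(build_headers("YOUR_API_KEY")["Authorization"])
```

You can pass the returned dictionary as the `headers` argument of whichever HTTP client you use.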

## 1. Scrape a single page

Start with `POST /api/v1/scrape` when you want the content of one page.

<CodeGroup>
  ```bash cURL theme={null}
  curl -X POST https://app.dumplingai.com/api/v1/scrape \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -d '{
      "url": "https://example.com/article",
      "format": "markdown",
      "cleaned": true,
      "renderJs": true
    }'
  ```

  ```javascript Node.js theme={null}
  const axios = require("axios");

  async function scrapePage(url) {
    try {
      const response = await axios.post(
        "https://app.dumplingai.com/api/v1/scrape",
        {
          url,
          format: "markdown",
          cleaned: true,
          renderJs: true,
        },
        {
          headers: {
            "Content-Type": "application/json",
            Authorization: "Bearer YOUR_API_KEY",
          },
        }
      );

      console.log("Page Title:", response.data.title);
      console.log("Format:", response.data.format);
      console.log("Content Preview:", response.data.content.slice(0, 300));
      return response.data;
    } catch (error) {
      console.error(
        "Error:",
        error.response ? error.response.data : error.message
      );
    }
  }

  scrapePage("https://example.com/article");
  ```

  ```python Python theme={null}
  import requests

  def scrape_page(url):
      api_url = "https://app.dumplingai.com/api/v1/scrape"
      headers = {
          "Content-Type": "application/json",
          "Authorization": "Bearer YOUR_API_KEY"
      }
      data = {
          "url": url,
          "format": "markdown",
          "cleaned": True,
          "renderJs": True
      }

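      response = requests.post(api_url, headers=headers, json=data)
      # Raise on HTTP error responses instead of failing later on a missing key
      response.raise_for_status()
      result = response.json()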

      print(f"Page Title: {result['title']}")
      print(f"Format: {result['format']}")
      print(f"Content Preview: {result['content'][:300]}")
      return result

  scrape_page("https://example.com/article")
  ```
</CodeGroup>

### How to tune `scrape`

* Set `format` to `markdown` for LLM-friendly text, `html` when you need original markup, or `screenshot` for image output
* Keep `cleaned: true` unless you specifically need the noisier raw page structure
* Set `renderJs: false` for faster, lighter requests when the page does not require client-side rendering
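The options above can be collected in a small helper so that a mistyped `format` value fails before any request is sent. This is a minimal sketch: `build_scrape_payload` is a hypothetical helper, and the allowed formats are simply the three values named in this tutorial.

```python
# The three format values named in this tutorial.
ALLOWED_FORMATS = {"markdown", "html", "screenshot"}


def build_scrape_payload(url, fmt="markdown", cleaned=True, render_js=True):
    """Build the JSON body for POST /api/v1/scrape, rejecting unknown formats."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")
    return {
        "url": url,
        "format": fmt,
        "cleaned": cleaned,
        "renderJs": render_js,
    }


# A static page that needs no client-side rendering:
print(build_scrape_payload("https://example.com/article", render_js=False))
```

The returned dictionary can be sent as the JSON body of the scrape request shown earlier.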

## 2. Crawl multiple pages from a site

Use `POST /api/v1/crawl` when you need more than one page.

```bash theme={null}
curl -X POST https://app.dumplingai.com/api/v1/crawl \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "limit": 10,
    "depth": 2,
    "format": "markdown"
  }'
```

The response includes crawl metadata plus a `results` array, where each entry contains the page `url`, `status`, and extracted `content`.
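To get a quick overview of a crawl, you can group the returned pages by their `status`. The sketch below assumes only the fields named above (`results`, `url`, `status`, `content`). The sample response is made up for illustration, and treating `status` as an HTTP-style code is an assumption, so check the Crawl API reference for the actual values.

```python
from collections import defaultdict


def group_by_status(crawl_response):
    """Map each status value to the list of page URLs that reported it."""
    groups = defaultdict(list)
    for page in crawl_response.get("results", []):
        groups[page.get("status")].append(page.get("url"))
    return dict(groups)


# Made-up sample data; the real response also carries crawl metadata.
sample_response = {
    "results": [
        {"url": "https://example.com/", "status": 200, "content": "# Home"},
        {"url": "https://example.com/about", "status": 200, "content": "# About"},
        {"url": "https://example.com/missing", "status": 404, "content": ""},
    ]
}

for status, urls in group_by_status(sample_response).items():
    print(status, urls)
```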

### When to use `crawl`

* Use it for site audits, blog ingestion, and multi-page research
* Tune `depth` to control how far DumplingAI follows links from the starting URL
* Tune `limit` to cap page count and keep usage predictable
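If you prefer Python to `curl`, the same crawl request can be prepared with the standard library. This is a sketch rather than an official client: `build_crawl_request` is a hypothetical helper, the small defaults for `limit` and `depth` are a cautious choice for testing, and the requirement that both be at least 1 is an assumption, not a documented API rule.

```python
import json
import urllib.request

CRAWL_URL = "https://app.dumplingai.com/api/v1/crawl"


def build_crawl_request(start_url, api_key, limit=5, depth=1, fmt="markdown"):
    """Prepare (but do not send) a crawl request with conservative defaults."""
    # Assumption for this sketch: both values should be positive integers.
    if limit < 1 or depth < 1:
        raise ValueError("limit and depth must both be at least 1")
    payload = {"url": start_url, "limit": limit, "depth": depth, "format": fmt}
    return urllib.request.Request(
        CRAWL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_crawl_request("https://example.com", "YOUR_API_KEY", limit=3)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Nothing is sent until you pass the prepared request to `urllib.request.urlopen`, which makes it easy to inspect the body while you settle on `limit` and `depth`.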

## 3. Extract structured data from a page

Use `POST /api/v1/extract` when you want JSON output instead of raw content.

```bash theme={null}
curl -X POST https://app.dumplingai.com/api/v1/extract \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/product",
    "schema": {
      "title": "string",
      "price": "number",
      "availability": "string"
    }
  }'
```

The response returns:

* `results`: the extracted JSON object that matches your schema
* `screenshotUrl`: a link to the screenshot of the page that was analyzed during extraction

This is the right choice when you want structured fields like product data, listings, or page metadata without writing your own parser.
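Because the fields come from an automated extraction, it can help to check `results` against your schema before passing the data on. The sketch below covers only the two type names used in this tutorial (`string` and `number`). `find_schema_problems` is a hypothetical helper, the sample data is made up, and whether the API omits a field it cannot find or returns an empty value for it is not stated here, so adjust the checks to what you actually observe.

```python
# Checks for the schema type names used in this tutorial; other names are skipped.
TYPE_CHECKS = {
    "string": lambda value: isinstance(value, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda value: isinstance(value, (int, float)) and not isinstance(value, bool),
}


def find_schema_problems(results, schema):
    """Return a list of human-readable mismatches between results and schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in results:
            problems.append(f"missing field: {field}")
            continue
        check = TYPE_CHECKS.get(type_name)
        if check is not None and not check(results[field]):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems


schema = {"title": "string", "price": "number", "availability": "string"}
sample_results = {"title": "Desk Lamp", "price": "19.99"}  # made-up data

print(find_schema_problems(sample_results, schema))
```

An empty list means every schema field is present with the expected type.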

## Recommended workflow

1. Start with `scrape` to confirm the page is reachable and the output is useful.
2. Move to `crawl` if you need multiple pages from the same site.
3. Use `extract` only when you need structured fields returned as JSON.
4. Use `screenshot` when the visual layout matters more than the text.
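The steps above can be read as a simple decision rule. The sketch below expresses it as a function that returns an endpoint name. `choose_endpoint` is a hypothetical helper, and the order in which the conditions are checked is an assumption made for illustration, not something the API prescribes.

```python
def choose_endpoint(multiple_pages=False, structured=False, visual=False):
    """Pick an endpoint name from the workflow in this tutorial."""
    if visual:
        return "screenshot"  # layout matters more than text
    if structured:
        return "extract"  # structured fields returned as JSON
    if multiple_pages:
        return "crawl"  # several pages from the same site
    return "scrape"  # default: one page, cleaned content


print(choose_endpoint())
print(choose_endpoint(multiple_pages=True))
print(choose_endpoint(structured=True))
```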

## Best practices

* Start with a small `limit` when testing crawls
* Disable `renderJs` if a page loads correctly without JavaScript
* Validate extracted JSON before sending it downstream
* Be deliberate about which pages you fetch to keep usage efficient
* Respect the target site's policies before automating scraping
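One concrete way to check part of a site's policy is to read its `robots.txt` before you crawl. The sketch below uses Python's standard `urllib.robotparser` on rules you have already downloaded. `is_allowed` is a hypothetical helper and the rules are made up. A `robots.txt` file is only one part of a site's policies, so its terms of service still apply.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/blog"))
print(is_allowed(sample_rules, "https://example.com/private/report"))
```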

## Next Steps

* Review the [Scrape API](/api-reference/endpoint/scrape) for request and response details
* Review the [Crawl API](/api-reference/endpoint/crawl) for multi-page scraping
* Review the [Extract API](/api-reference/endpoint/extract) for schema-based extraction
* Use [MCP Server](/mcp-server) if you want to access these tools from an AI client
