Documentation Index
Fetch the complete documentation index at: https://docs.dumplingai.com/llms.txt
Use this file to discover all available pages before exploring further.
Web Scraping Tutorial
This tutorial covers the current DumplingAI scraping workflow using the `scrape`, `crawl`, and `extract` endpoints.
Choose the right endpoint
- Scrape: fetch one URL and return cleaned content in `markdown`, `html`, or `screenshot` format
- Crawl: fetch multiple pages from one site with configurable `depth` and `limit`
- Extract: return structured JSON from a URL using a schema you define
- Screenshot: capture a visual snapshot when you need page imagery instead of text
Prerequisites
Before you begin, make sure you have:
- A DumplingAI account with an API key
- Basic knowledge of HTTP requests
- A way to call the API, such as curl, Node.js, or Python
1. Scrape a single page
Start with `POST /api/v1/scrape` when you want the content of one page.
How to tune scrape
- Set `format` to `markdown` for LLM-friendly text, `html` when you need the original markup, or `screenshot` for image output
- Keep `cleaned: true` unless you specifically need the noisier raw page structure
- Set `renderJs: false` for faster, cheaper requests when the page does not require client-side rendering
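A minimal Python sketch of such a scrape call, using only the standard library. The endpoint path and the `format`, `cleaned`, and `renderJs` options come from this tutorial; the host and the Bearer-token auth header are assumptions, so check your dashboard for the real values:

```python
import json
import urllib.request

API_BASE = "https://app.dumplingai.com"  # assumed host -- verify against your account
API_KEY = "YOUR_API_KEY"

def build_scrape_request(url: str, fmt: str = "markdown",
                         cleaned: bool = True,
                         render_js: bool = True) -> urllib.request.Request:
    """Build a POST /api/v1/scrape request with the tuning options above."""
    payload = {
        "url": url,
        "format": fmt,          # "markdown", "html", or "screenshot"
        "cleaned": cleaned,     # keep True unless you need the raw page structure
        "renderJs": render_js,  # set False for static pages to save time
    }
    return urllib.request.Request(
        f"{API_BASE}/api/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
        },
        method="POST",
    )

req = build_scrape_request("https://example.com", fmt="markdown", render_js=False)
# resp = urllib.request.urlopen(req)  # uncomment to actually send the request
```

Separating payload construction from sending makes the options easy to inspect and test before you spend credits on real calls.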
2. Crawl multiple pages from a site
Use `POST /api/v1/crawl` when you need more than one page. The response includes a `results` array, where each entry contains the page `url`, `status`, and extracted content.
When to use crawl
- Use it for site audits, blog ingestion, and multi-page research
- Tune `depth` to control how far DumplingAI follows links from the starting URL
- Tune `limit` to cap page count and keep usage predictable
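The crawl request and the `results` array described above can be sketched as follows. The path and the `depth`/`limit` parameters are from this tutorial; the host, auth header, and exact response field names beyond `url` and `status` are assumptions:

```python
import json
import urllib.request

API_BASE = "https://app.dumplingai.com"  # assumed host
API_KEY = "YOUR_API_KEY"

def build_crawl_request(url: str, depth: int = 1, limit: int = 5) -> urllib.request.Request:
    """Build a POST /api/v1/crawl request with bounded depth and page count."""
    payload = {"url": url, "depth": depth, "limit": limit}
    return urllib.request.Request(
        f"{API_BASE}/api/v1/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
        },
        method="POST",
    )

def page_summaries(body: dict) -> list[tuple[str, int]]:
    """Pull (url, status) pairs out of the results array."""
    return [(page["url"], page["status"]) for page in body.get("results", [])]

req = build_crawl_request("https://example.com", depth=2, limit=10)
# with urllib.request.urlopen(req) as resp:
#     for url, status in page_summaries(json.load(resp)):
#         print(url, status)
```

Starting with a small `limit` (as the best practices below recommend) keeps a test crawl cheap while you confirm the results look right.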
3. Extract structured data from a page
Use `POST /api/v1/extract` when you want JSON output instead of raw content. The response includes:
- `results`: the extracted JSON object that matches your schema
- `screenshotUrl`: a screenshot of the analyzed page used during extraction
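A sketch of an extract request body, assuming the request takes a `url` plus the schema you define. The schema shown here (a hypothetical product page) and the exact field name for passing it are illustrative assumptions; consult the Extract API reference for the real shape:

```python
import json

def build_extract_payload(url: str, schema: dict) -> str:
    """JSON body for POST /api/v1/extract: the schema defines the fields returned."""
    return json.dumps({"url": url, "schema": schema})

# Hypothetical schema describing the structured fields we want back.
schema = {
    "name": "string",
    "price": "number",
    "inStock": "boolean",
}
body = build_extract_payload("https://example.com/product", schema)

# The response contains `results` (JSON matching your schema) and
# `screenshotUrl` (a snapshot of the analyzed page), e.g.:
# data = json.loads(response_text)
# product = data["results"]
# print(product["name"], data["screenshotUrl"])
```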
Recommended workflow
- Start with `scrape` to confirm the page is reachable and the output is useful.
- Move to `crawl` if you need multiple pages from the same site.
- Use `extract` only when you need structured fields returned as JSON.
- Use `screenshot` when the visual layout matters more than the text.
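The decision flow above can be condensed into a tiny helper; the function name and flags are ours, but the routing mirrors the recommended workflow:

```python
def choose_endpoint(pages: int, structured: bool, visual: bool) -> str:
    """Map a task's needs to the endpoint the workflow above recommends."""
    if visual:
        return "screenshot"  # layout matters more than text
    if structured:
        return "extract"     # you need JSON fields, not raw content
    return "crawl" if pages > 1 else "scrape"

choose_endpoint(pages=1, structured=False, visual=False)  # "scrape"
```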
Best practices
- Start with a small `limit` when testing crawls
- Disable `renderJs` if a page loads correctly without JavaScript
- Validate extracted JSON before sending it downstream
- Be deliberate about which pages you fetch to keep usage efficient
- Respect the target site’s policies before automating scraping
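One way to act on the "validate extracted JSON" advice is a small pre-flight check before the data goes downstream. This validator is our own sketch, not part of the DumplingAI API:

```python
def validate_extracted(record: dict, expected: dict) -> list[str]:
    """Check a record against expected field types; empty list means it passes."""
    problems = []
    for field, typ in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            problems.append(
                f"{field}: expected {typ.__name__}, got {type(record[field]).__name__}"
            )
    return problems

# A clean record passes; a malformed one reports exactly what is wrong.
ok = validate_extracted({"name": "Widget", "price": 9.99},
                        {"name": str, "price": float})
bad = validate_extracted({"name": "Widget"},
                         {"name": str, "price": float})
```

Rejecting bad records at this boundary is usually cheaper than debugging them after they have propagated into your pipeline.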
Next Steps
- Review the Scrape API for request and response details
- Review the Crawl API for multi-page scraping
- Review the Extract API for schema-based extraction
- Use MCP Server if you want to access these tools from an AI client