# Web Scraping Tutorial
This tutorial covers the current DumplingAI scraping workflow using the `scrape`, `crawl`, and `extract` endpoints.
## Choose the right endpoint
- Scrape: fetch one URL and return cleaned content in `markdown`, `html`, or `screenshot` format
- Crawl: fetch multiple pages from one site with configurable `depth` and `limit`
- Extract: return structured JSON from a URL using a schema you define
- Screenshot: capture a visual snapshot when you need page imagery instead of text
## Prerequisites
Before you begin, make sure you have:

- A DumplingAI account with an API key
- Basic knowledge of HTTP requests
- A way to call the API, such as `curl`, Node.js, or Python
## 1. Scrape a single page
Start with `POST /api/v1/scrape` when you want the content of one page.
### How to tune scrape
- Set `format` to `markdown` for LLM-friendly text, `html` when you need the original markup, or `screenshot` for image output
- Keep `cleaned: true` unless you specifically need the noisier raw page structure
- Set `renderJs: false` for faster, cheaper requests when the page does not require client-side rendering
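The options above can be sketched as a request builder. This is a minimal sketch, not the official client: the base URL and Bearer-token auth header are assumptions, so confirm both against your DumplingAI dashboard before sending anything.

```python
import json
import os
import urllib.request

# Assumed base URL and Bearer-token auth; verify both in your DumplingAI dashboard.
BASE_URL = "https://app.dumplingai.com"
API_KEY = os.environ.get("DUMPLING_API_KEY", "")

def build_scrape_request(url: str, fmt: str = "markdown",
                         cleaned: bool = True, render_js: bool = True) -> urllib.request.Request:
    """Build a POST /api/v1/scrape request with the tuning options described above."""
    payload = {
        "url": url,
        "format": fmt,          # "markdown", "html", or "screenshot"
        "cleaned": cleaned,     # keep True unless you need the raw page structure
        "renderJs": render_js,  # False is faster when no client-side rendering is needed
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/scrape",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

req = build_scrape_request("https://example.com", fmt="markdown", render_js=False)
# Once your API key is set, send it with urllib.request.urlopen(req)
# or any other HTTP client.
```

Keeping request construction in one function makes it easy to test the payload locally before you spend API credits.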
## 2. Crawl multiple pages from a site
Use `POST /api/v1/crawl` when you need more than one page. The response includes a `results` array, where each entry contains the page `url`, `status`, and extracted content.
### When to use crawl
- Use it for site audits, blog ingestion, and multi-page research
- Tune `depth` to control how far DumplingAI follows links from the starting URL
- Tune `limit` to cap page count and keep usage predictable
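As a sketch of the crawl request body and result handling: the field names below (`url`, `depth`, `limit` in the request; `results`, `url`, `status` in the response) follow the description above, but verify them against the Crawl API reference before relying on them.

```python
# Sketch of a POST /api/v1/crawl body and result iteration; field names are
# taken from the tutorial text above, not a verified API contract.

def build_crawl_payload(start_url: str, depth: int = 1, limit: int = 10) -> dict:
    """Body for a crawl request: depth bounds link-following, limit caps page count."""
    return {"url": start_url, "depth": depth, "limit": limit}

def summarize_results(response: dict) -> list:
    """List each crawled page's URL and status from the results array."""
    return [f'{page["url"]} -> {page["status"]}' for page in response.get("results", [])]

# Keep limit small while testing so usage stays predictable.
payload = build_crawl_payload("https://example.com/blog", depth=2, limit=25)
```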
## 3. Extract structured data from a page
Use `POST /api/v1/extract` when you want JSON output instead of raw content. The response includes:

- `results`: the extracted JSON object that matches your schema
- `screenshotUrl`: a screenshot of the analyzed page used during extraction
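A sketch of an extract request with a user-defined schema. The `schema` key and its shape here are assumptions drawn from the description above, and the example schema fields (`title`, `author`) are hypothetical; consult the Extract API reference for the exact request format.

```python
# Sketch of a POST /api/v1/extract body; the "schema" key and its shape
# are assumptions, so check the Extract API reference before use.

def build_extract_payload(url: str, schema: dict) -> dict:
    """Body for an extract request: the schema describes the JSON you want back."""
    return {"url": url, "schema": schema}

# Hypothetical schema asking for an article's title and author as strings.
article_schema = {
    "title": "string",
    "author": "string",
}
payload = build_extract_payload("https://example.com/post", article_schema)
```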
## Recommended workflow
- Start with `scrape` to confirm the page is reachable and the output is useful.
- Move to `crawl` if you need multiple pages from the same site.
- Use `extract` only when you need structured fields returned as JSON.
- Use `screenshot` when the visual layout matters more than the text.
## Best practices
- Start with a small `limit` when testing crawls
- Disable `renderJs` if a page loads correctly without JavaScript
- Validate extracted JSON before sending it downstream
- Be deliberate about which pages you fetch to keep usage efficient
- Respect the target site’s policies before automating scraping
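The validation step above can be as simple as checking for required fields. This is a minimal stdlib-only sketch (the field names in the example are hypothetical); for richer checks you might use a JSON Schema validation library instead.

```python
# Minimal validation sketch: confirm every expected field is present and
# non-empty in an extracted record before sending it downstream.

def validate_extracted(record: dict, required: list) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in required:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] in ("", None):
            problems.append(f"empty field: {field}")
    return problems

# Hypothetical extracted record checked against three expected fields.
issues = validate_extracted({"title": "Hello", "author": ""}, ["title", "author", "date"])
# issues -> ["empty field: author", "missing field: date"]
```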
## Next Steps
- Review the Scrape API for request and response details
- Review the Crawl API for multi-page scraping
- Review the Extract API for schema-based extraction
- Use MCP Server if you want to access these tools from an AI client