Web Scraping Tutorial

This tutorial will guide you through using Dumpling AI’s web scraping capabilities to extract and process content from websites.

Overview

Dumpling AI offers several endpoints for web data extraction:

  • Scrape: Extract content from a single webpage
  • Crawl: Recursively extract content from multiple pages
  • Screenshot: Capture visual screenshots of webpages

In this tutorial, we’ll focus on using the Scrape endpoint to extract content from a website and process it.

Prerequisites

Before you begin, make sure you have:

  • A Dumpling AI account with an API key
  • Basic knowledge of HTTP requests
  • A code editor and environment for running examples (Node.js, Python, etc.)

Basic Scraping

Let’s start with a simple example of scraping a webpage:

const axios = require('axios');

async function scrapePage(url, format = 'markdown') {
  try {
    const response = await axios.post('https://app.dumplingai.com/api/v1/scrape', 
      {
        url: url,
        format: format, // 'markdown' (default) or 'html'
        cleaned: true
      },
      {
        headers: {
          'Content-Type': 'application/json',
          'Authorization': 'Bearer YOUR_API_KEY' // replace with your Dumpling AI API key
        }
      }
    );
    
    console.log('Page Title:', response.data.title);
    console.log('Content:', response.data.content.substring(0, 500) + '...');
    return response.data;
  } catch (error) {
    console.error('Error:', error.response ? error.response.data : error.message);
  }
}

scrapePage('https://example.com');

This code will:

  1. Send a POST request to the scrape endpoint
  2. Extract the content from example.com in Markdown format
  3. Apply content cleaning to remove unnecessary elements
  4. Print the page title and the first 500 characters of content

Advanced Scraping

HTML Format

If you need the original HTML structure:

// Same as before, but with format set to 'html'
{
  url: 'https://example.com',
  format: 'html',
  cleaned: true
}

JavaScript Rendering

For websites that require JavaScript to load content:

{
  url: 'https://example.com',
  format: 'markdown',
  cleaned: true,
  renderJs: true
}
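One way to manage these options is a small helper that builds the request body, which you can then pass as the payload to the axios.post call from the basic example. This is an illustrative sketch; buildScrapeBody is a name chosen for this tutorial, not part of a Dumpling AI SDK:

```javascript
// Build the JSON body for the scrape endpoint from per-call options.
// Pass the result as the second argument to axios.post in scrapePage.
function buildScrapeBody(url, options = {}) {
  return {
    url,
    format: options.format || 'markdown', // 'markdown' or 'html'
    cleaned: options.cleaned !== false,   // cleaned defaults to true
    renderJs: Boolean(options.renderJs)   // enable for JS-heavy sites
  };
}

// Example: request rendered HTML from a JavaScript-heavy page
const body = buildScrapeBody('https://example.com', { format: 'html', renderJs: true });
```

Centralizing the defaults in one place keeps individual calls short and makes it harder to forget an option when you add new scraping code.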

Processing Scraped Content

Once you’ve scraped content, you might want to process it further. For example, you could extract specific information or analyze the content.

Extracting Specific Information

Let’s say you want to extract all links from a scraped webpage:

const cheerio = require('cheerio');

async function extractLinks(url) {
  // Request HTML output so the anchors can be parsed with cheerio
  const scrapedData = await scrapePage(url, 'html');

  if (!scrapedData) {
    return []; // scrapePage already logged the error; nothing to parse
  }

  const $ = cheerio.load(scrapedData.content);
  const links = [];

  $('a').each((i, element) => {
    const href = $(element).attr('href');
    const text = $(element).text().trim();
    if (href) {
      links.push({ href, text });
    }
  });

  return links;
}
}

extractLinks('https://example.com').then(links => {
  console.log(`Found ${links.length} links:`);
  links.slice(0, 10).forEach(link => {
    console.log(`- ${link.text}: ${link.href}`);
  });
});
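If you are working with Markdown output instead of HTML, inline links follow the [text](url) pattern, so for simple cases a regular expression can stand in for a full Markdown parser. A minimal sketch; the function name and regex are illustrative, and the pattern does not cover every Markdown link form (e.g. reference-style links):

```javascript
// Extract inline Markdown links of the form [text](url), optionally
// followed by a quoted title, e.g. [Docs](https://example.com "Docs").
function extractMarkdownLinks(markdown) {
  const links = [];
  const pattern = /\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)/g;
  let match;
  while ((match = pattern.exec(markdown)) !== null) {
    links.push({ text: match[1], href: match[2] });
  }
  return links;
}
```

This trades completeness for simplicity; if you need to handle the full Markdown spec, a dedicated parser is the safer choice.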

Combining with Other Dumpling AI Services

You can combine web scraping with other Dumpling AI services for more advanced workflows:

Adding Scraped Content to a Knowledge Base

async function scrapeAndStoreInKnowledgeBase(url, knowledgeBaseId) {
  const scrapedData = await scrapePage(url);
  
  const response = await axios.post('https://app.dumplingai.com/api/v1/knowledge-bases/add', 
    {
      knowledgeBaseId: knowledgeBaseId,
      content: scrapedData.content,
      metadata: {
        source: url,
        title: scrapedData.title,
        scrapedAt: new Date().toISOString()
      }
    },
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer YOUR_API_KEY'
      }
    }
  );
  
  return response.data;
}

Best Practices

  1. Respect robots.txt: Always check if the site allows scraping before proceeding.
  2. Rate limiting: Add delays between requests to avoid overwhelming websites.
  3. Error handling: Implement proper error handling for network issues or API rate limits.
  4. Data validation: Always validate and clean scraped data before using it.
  5. Optimize credit usage: Scrape only the pages you need to keep API credit consumption down.
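Practices 2 and 3 can be combined in a small sequential helper. This is an illustrative sketch, not part of the Dumpling AI API: the scrape function is passed in as a parameter so the helper stays generic, and the 1-second default delay is an example value, not a documented requirement.

```javascript
// Wait for the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scrape a list of URLs one at a time, pausing between requests and
// recording failures instead of aborting the whole batch.
async function scrapeSequentially(scrapeFn, urls, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    try {
      const data = await scrapeFn(url);
      if (data) results.push({ url, data });
    } catch (error) {
      results.push({ url, error: error.message });
    }
    await delay(delayMs); // rate limiting: space out requests
  }
  return results;
}

// Usage with the scrapePage function defined earlier:
// scrapeSequentially(scrapePage, ['https://example.com/a', 'https://example.com/b']);
```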

Conclusion

You’ve learned how to use Dumpling AI’s web scraping capabilities to extract and process content from websites. This is just the beginning: you can combine these techniques with other Dumpling AI services to build more powerful applications.

Next Steps