Web Scraping Tutorial

This tutorial will guide you through using Dumpling AI’s web scraping capabilities to extract and process content from websites.

Overview

Dumpling AI offers several endpoints for web data extraction:

  • Scrape: Extract content from a single webpage
  • Crawl: Recursively extract content from multiple pages
  • Screenshot: Capture visual screenshots of webpages

In this tutorial, we’ll focus on using the Scrape endpoint to extract content from a website and process it.

Prerequisites

Before you begin, make sure you have:

  • A Dumpling AI account with an API key
  • Basic knowledge of HTTP requests
  • A code editor and environment for running examples (Node.js, Python, etc.)

Basic Scraping

Let’s start with a simple example of scraping a webpage:

const axios = require('axios');

async function scrapePage(url, format = 'markdown') {
  try {
    const response = await axios.post('https://app.dumplingai.com/api/v1/scrape', 
      {
        url: url,
        format: format, // 'markdown' (default) or 'html'
        cleaned: true
      },
      {
        headers: {
          'Content-Type': 'application/json',
          'Authorization': 'Bearer YOUR_API_KEY' // replace with your Dumpling AI API key
        }
      }
    );
    
    console.log('Page Title:', response.data.title);
    console.log('Content:', response.data.content.substring(0, 500) + '...');
    return response.data;
  } catch (error) {
    console.error('Error:', error.response ? error.response.data : error.message);
  }
}

scrapePage('https://example.com');

This code will:

  1. Send a POST request to the scrape endpoint
  2. Extract the content from example.com in Markdown format
  3. Apply content cleaning to remove unnecessary elements
  4. Print the page title and the first 500 characters of content

Advanced Scraping

HTML Format

If you need the original HTML structure:

// Same as before, but with format set to 'html'
{
  url: 'https://example.com',
  format: 'html',
  cleaned: true
}

JavaScript Rendering

For websites that require JavaScript to load content:

{
  url: 'https://example.com',
  format: 'markdown',
  cleaned: true,
  renderJs: true
}
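One way to manage these options is a small helper that builds the request body, which you can then pass as the payload to the axios.post call from the basic example. This is an illustrative sketch; buildScrapeBody is a name chosen for this tutorial, not part of a Dumpling AI SDK:

```javascript
// Build the JSON body for the scrape endpoint from per-call options.
// Pass the result as the second argument to axios.post in scrapePage.
function buildScrapeBody(url, options = {}) {
  return {
    url,
    format: options.format || 'markdown', // 'markdown' or 'html'
    cleaned: options.cleaned !== false,   // cleaned defaults to true
    renderJs: Boolean(options.renderJs)   // enable for JS-heavy sites
  };
}

// Example: request rendered HTML from a JavaScript-heavy page
const body = buildScrapeBody('https://example.com', { format: 'html', renderJs: true });
```

Centralizing the defaults in one place keeps individual calls short and makes it harder to forget an option when you add new scraping code.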

Processing Scraped Content

Once you’ve scraped content, you might want to process it further. For example, you could extract specific information or analyze the content.

Extracting Specific Information

Let’s say you want to extract all links from a scraped webpage:

const cheerio = require('cheerio');

async function extractLinks(url) {
  // Request HTML output so the anchors can be parsed with cheerio
  const scrapedData = await scrapePage(url, 'html');

  if (!scrapedData) {
    return []; // scrapePage already logged the error; nothing to parse
  }

  const $ = cheerio.load(scrapedData.content);
  const links = [];

  $('a').each((i, element) => {
    const href = $(element).attr('href');
    const text = $(element).text().trim();
    if (href) {
      links.push({ href, text });
    }
  });

  return links;
}
}

extractLinks('https://example.com').then(links => {
  console.log(`Found ${links.length} links:`);
  links.slice(0, 10).forEach(link => {
    console.log(`- ${link.text}: ${link.href}`);
  });
});
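If you are working with Markdown output instead of HTML, inline links follow the [text](url) pattern, so for simple cases a regular expression can stand in for a full Markdown parser. A minimal sketch; the function name and regex are illustrative, and the pattern does not cover every Markdown link form (e.g. reference-style links):

```javascript
// Extract inline Markdown links of the form [text](url), optionally
// followed by a quoted title, e.g. [Docs](https://example.com "Docs").
function extractMarkdownLinks(markdown) {
  const links = [];
  const pattern = /\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)/g;
  let match;
  while ((match = pattern.exec(markdown)) !== null) {
    links.push({ text: match[1], href: match[2] });
  }
  return links;
}
```

This trades completeness for simplicity; if you need to handle the full Markdown spec, a dedicated parser is the safer choice.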

Combining with Other Dumpling AI Services

You can combine web scraping with other Dumpling AI services for more advanced workflows:

Adding Scraped Content to a Knowledge Base

async function scrapeAndStoreInKnowledgeBase(url, knowledgeBaseId) {
  const scrapedData = await scrapePage(url);
  
  const response = await axios.post('https://app.dumplingai.com/api/v1/knowledge-bases/add', 
    {
      knowledgeBaseId: knowledgeBaseId,
      content: scrapedData.content,
      metadata: {
        source: url,
        title: scrapedData.title,
        scrapedAt: new Date().toISOString()
      }
    },
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer YOUR_API_KEY'
      }
    }
  );
  
  return response.data;
}

Best Practices

  1. Respect robots.txt: Always check if the site allows scraping before proceeding.
  2. Rate limiting: Add delays between requests to avoid overwhelming websites.
  3. Error handling: Implement proper error handling for network issues or API rate limits.
  4. Data validation: Always validate and clean scraped data before using it.
  5. Optimize credit usage: Scrape only the pages you need to keep API credit consumption down.
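Practices 2 and 3 can be combined in a small sequential helper. This is an illustrative sketch, not part of the Dumpling AI API: the scrape function is passed in as a parameter so the helper stays generic, and the 1-second default delay is an example value, not a documented requirement.

```javascript
// Wait for the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scrape a list of URLs one at a time, pausing between requests and
// recording failures instead of aborting the whole batch.
async function scrapeSequentially(scrapeFn, urls, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    try {
      const data = await scrapeFn(url);
      if (data) results.push({ url, data });
    } catch (error) {
      results.push({ url, error: error.message });
    }
    await delay(delayMs); // rate limiting: space out requests
  }
  return results;
}

// Usage with the scrapePage function defined earlier:
// scrapeSequentially(scrapePage, ['https://example.com/a', 'https://example.com/b']);
```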

Conclusion

You’ve learned how to use Dumpling AI’s web scraping capabilities to extract and process content from websites. This is just the beginning: you can combine these techniques with other Dumpling AI services to build more powerful applications.

Next Steps