5 min read·Updated March 8, 2026

Diffbot

By Diffbot

Diffbot is an AI-powered web data extraction company that uses computer vision and machine learning to understand web pages and extract structured data from them without CSS selectors or XPath. It also provides a Knowledge Graph of 10 billion entities, built by continuously crawling and structuring the public web.

Learning Objectives

  • Understand what Diffbot is and how its AI-based extraction approach differs from traditional web scraping
  • Identify Diffbot's two main products: Automatic APIs and the Knowledge Graph
  • Evaluate when Diffbot is the right choice vs. Firecrawl, Apify, or SerpAPI

What Is Diffbot?

Diffbot is a web data extraction company, founded in 2008, that uses computer vision and natural language processing to identify what type of content a web page contains (article, product, person profile, company page, review) and extract the relevant structured data without requiring selectors, patterns, or per-site configuration.

Diffbot has built two complementary products:

  1. Automatic APIs: Extract structured data from any URL — article, product, image, video, discussion — without writing extraction code
  2. Knowledge Graph: A continuously updated graph database of 10 billion entities (people, companies, products, articles, locations) built from crawling and structuring the entire web

Tip

Try Diffbot: diffbot.com — free trial with limited API calls; plans starting at $299/month; Knowledge Graph has separate pricing; enterprise contracts for large-scale use

Core Products

Automatic APIs — Type-Based Extraction

Diffbot's Automatic APIs classify and extract web pages by content type:

Article API — for any news or editorial content:

  • Extracts: title, author, publication date, full text, images, tags, summary, sentiment
  • Works on any news site, blog, or editorial page without configuration
  • Returns clean structured JSON

Product API — for e-commerce and product pages:

  • Extracts: name, price, brand, specifications, availability, images, reviews, retailer
  • Handles product variations and pricing tiers
  • Works across different retailer layouts automatically
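As a rough illustration of type-based extraction, the sketch below builds a Product API request URL and flattens one result. The endpoint is assumed to follow the same v3 pattern as the Article API call in Getting Started; the field names (`title`, `offerPrice`, `brand`, `availability`) and the sample response are assumptions for illustration and should be checked against Diffbot's current API reference.

```python
from urllib.parse import urlencode

PRODUCT_ENDPOINT = "https://api.diffbot.com/v3/product"  # assumed to mirror /v3/article


def build_product_request(page_url: str, token: str) -> str:
    """Return the full GET URL for extracting one product page."""
    return f"{PRODUCT_ENDPOINT}?{urlencode({'url': page_url, 'token': token})}"


def summarize_product(response: dict) -> dict:
    """Pick a few fields from the first extracted object, tolerating gaps."""
    objects = response.get("objects") or [{}]
    first = objects[0]
    return {
        "name": first.get("title"),
        "price": first.get("offerPrice"),
        "brand": first.get("brand"),
        "in_stock": first.get("availability"),
    }


# Hypothetical response, shaped like the Article API's {"objects": [...]} envelope.
sample = {"objects": [{"title": "Trail Shoe", "offerPrice": "$89.00", "brand": "Acme"}]}
print(build_product_request("https://shop.example.com/p/123", "your-api-key"))
print(summarize_product(sample))
```

Reading fields with `.get()` keeps the same code usable across retailers whose pages expose different subsets of product data.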

Company API — for company websites and business profiles:

  • Extracts: company name, description, employees, revenue, social profiles, technologies used, founding date
  • Sources from company websites, LinkedIn, Crunchbase, and other business data sources

Person API — for person profiles:

  • Extracts: name, job title, employer, bio, social profiles, location
  • Works on LinkedIn profiles, author pages, and professional bios

💡Key Concept

AI-based vs. selector-based scraping: Traditional web scrapers use CSS selectors or XPath — brittle rules that break whenever a website changes its HTML structure. Diffbot uses computer vision and ML to understand page content semantically — the way a human reading the page would understand it — without caring about the underlying HTML structure. This makes Diffbot's extractions more robust to site redesigns and more generalizable across different sites of the same type (all news articles, all product pages).
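To make the brittleness concrete, here is a minimal sketch of a selector-style rule, written with Python's standard `html.parser` rather than any Diffbot code. The two HTML snippets and the class names are hypothetical; they stand in for the same page before and after a redesign.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element carrying a given CSS class.

    Simplified: void tags such as <br> inside the match are not handled.
    """

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside the matched element
        self.text = None    # stays None if the class is never seen

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.text is None and self.css_class in classes:
            self.depth = 1
            self.text = ""

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def select_text(html: str, css_class: str):
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.text.strip() if parser.text is not None else None


before = '<div><h1 class="headline">Rates hold steady</h1></div>'
after = '<div><h1 class="post-title">Rates hold steady</h1></div>'  # same content, new markup

print(select_text(before, "headline"))  # the rule matches
print(select_text(after, "headline"))   # None: the rule no longer matches
```

The visible content is identical in both snippets; only the markup changed, which is exactly the case a content-based extractor is meant to survive.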

Knowledge Graph — The Structured Web

Diffbot's most distinctive product: a graph database of 10 billion entities representing the structured knowledge of the web:

  • 10 billion entities: People, companies, products, articles, locations, events
  • Continuously updated: Diffbot crawls the web continuously, updating entity records as information changes
  • Relationships: Entities are linked — CEO works for Company, Company founded in Location, Person authored Article
  • Natural language queries: Ask questions about entities in plain language: "Who are the board members of OpenAI?"
  • Organization search: Find companies by industry, location, headcount, revenue, funding stage

The Knowledge Graph is used by enterprises for due diligence research, competitive intelligence, and training datasets for AI models.
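An organization search of the kind listed above might be composed as follows. This is a sketch under assumptions: the endpoint URL, the `type=query` parameter, and the filter field names (`industries`, `location.city.name`, `nbEmployees`) reflect one reading of Diffbot's query language and are not confirmed by this lesson, so verify them in the Knowledge Graph documentation before use. The code only builds the request and sends nothing.

```python
from urllib.parse import urlencode

KG_DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"  # assumption: verify in Diffbot's docs


def build_org_query(industry: str, city: str, min_employees: int) -> str:
    """Compose a filter string for organizations (field names are assumed)."""
    return (
        f'type:Organization industries:"{industry}" '
        f'location.city.name:"{city}" nbEmployees>={min_employees}'
    )


def build_kg_url(query: str, token: str, size: int = 10) -> str:
    """Return the GET URL for a Knowledge Graph search (parameter names assumed)."""
    params = {"type": "query", "token": token, "query": query, "size": size}
    return f"{KG_DQL_ENDPOINT}?{urlencode(params)}"


query = build_org_query("Software", "Austin", 50)
print(query)
print(build_kg_url(query, "your-api-key"))
```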

Enhance API — Entity Enrichment

Given partial information about a company or person, Diffbot's Enhance API fills in the gaps:

  • Input: company name → output: website, description, employees, revenue, location, social profiles, technologies used
  • Input: person name + company → output: job title, bio, contact information, social profiles
  • Bulk enrichment via CSV upload or API
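The gap-filling idea can be sketched offline. In the example below, the CSV contents and the enrichment results are hypothetical stand-ins; a real run would obtain the enriched records from the Enhance API rather than a hard-coded dictionary. The merge keeps any value the caller already has and takes enriched values only for empty fields.

```python
import csv
import io


def fill_gaps(partial: dict, enriched: dict) -> dict:
    """Keep existing values; use enriched values only where a field is empty."""
    merged = dict(partial)
    for key, value in enriched.items():
        if merged.get(key) in (None, ""):
            merged[key] = value
    return merged


# Hypothetical CSV export with sparse company rows.
csv_text = "name,website,employees\nAcme Corp,,\nGlobex,https://globex.example,\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical enrichment results keyed by company name.
enriched_by_name = {
    "Acme Corp": {"website": "https://acme.example", "employees": "120"},
    "Globex": {"website": "https://other.example", "employees": "4500"},
}

for row in rows:
    print(fill_gaps(row, enriched_by_name.get(row["name"], {})))
```

Preferring existing values is one possible policy; a pipeline that trusts the enrichment source more than its own records would invert the check.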

Pricing

Trial: Free
  • Limited
  • Automatic APIs
  • KG sandbox
  • Evaluation
Starter: $299/month
  • 10,000 API calls
  • Automatic APIs
  • Developers
  • Small applications
Business: $999/month
  • 50,000 API calls
  • All APIs
  • Basic KG access
  • Growing applications
Enterprise: Custom
  • Unlimited
  • All APIs
  • Full KG
  • SLAs

Diffbot costs more than tools like Firecrawl or SerpAPI. The premium reflects its enterprise positioning and what the plans bundle: semantic, type-based extraction and Knowledge Graph access.
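For comparison shopping, it helps to convert the plan prices into a cost per API call. The short calculation below uses only the Starter and Business figures listed above and assumes the full monthly allowance is used; the Trial and Enterprise tiers are left out because they have no fixed price or call count.

```python
plans = {
    "Starter": {"monthly_usd": 299, "api_calls": 10_000},
    "Business": {"monthly_usd": 999, "api_calls": 50_000},
}


def cost_per_call(monthly_usd: float, api_calls: int) -> float:
    """Effective price of one call when the whole allowance is consumed."""
    return monthly_usd / api_calls


for name, plan in plans.items():
    print(f"{name}: ${cost_per_call(**plan):.4f} per call")
```

At full utilization this works out to roughly 3 cents per call on Starter and 2 cents on Business; unused calls raise the effective rate.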

Strengths

  • No selector maintenance: ML-based extraction does not depend on CSS selectors, so it is more robust to layout changes than pattern-based scrapers
  • Knowledge Graph scale: 10 billion entities, linked by relationship data
  • Cross-site generalization: The same Article API works on any news site, with no per-site configuration
  • Semantic understanding: Extracts meaning, not just structure, and automatically distinguishes articles from products from person profiles
  • Maturity: One of the longest-established AI-based extraction companies, with a stable product

Limitations & Considerations

  • Price: Significantly more expensive than Firecrawl, SerpAPI, or DIY scraping — justified for enterprise use but prohibitive for individuals
  • Less real-time than SerpAPI: Not optimized for real-time search result extraction
  • Overkill for simple use cases: If you need to scrape a few specific pages consistently, simpler tools are more cost-effective
  • Knowledge Graph coverage gaps: Smaller or less prominent organizations and people are covered less thoroughly than major ones
  • Not open source: No self-hosting option; entirely cloud-dependent

Best Use Cases

| Task | Why Diffbot |
| --- | --- |
| Large-scale article extraction across news sites | Article API works without per-site configuration |
| Product data across many retailers | Product API generalizes across retailer layouts |
| Company data enrichment at scale | Knowledge Graph + Enhance API for firmographic data |
| AI training dataset construction | Structured, semantically understood web content |
| Due diligence research | Knowledge Graph queries on companies, people, and relationships |
| Competitive product monitoring | Product API extracts pricing across competitors automatically |

When to choose alternatives:

  • Simple LLM-optimized page scraping → Firecrawl (cheaper, simpler)
  • Production scraping platform with Actor marketplace → Apify
  • Search engine results → SerpAPI
  • Sales contact database → Apollo.io
  • Budget-conscious small-scale extraction → Firecrawl or BeautifulSoup

Getting Started

  1. Sign up for a free trial at diffbot.com — includes limited API calls
  2. Test the Article API with a news URL:
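import requests

# Article API endpoint; the token is the API key from your Diffbot account.
url = "https://api.diffbot.com/v3/article"
params = {
    "url": "https://techcrunch.com/some-article",  # page to extract
    "token": "your-api-key",
}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # fail fast on HTTP errors
article = response.json()["objects"][0]  # first extracted object

print(article["text"])  # Full article text
print(article.get("author"))  # Author name, or None if not detected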
  3. Try the Knowledge Graph interface at app.diffbot.com with a trial key
  4. Evaluate if the extraction quality and Knowledge Graph coverage match your specific use case before committing to a paid plan
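One simple way to put numbers on the evaluation step is to measure how often each field you care about comes back non-empty across a sample of your own URLs. The helper below is a sketch, not a Diffbot feature; the `results` list is hypothetical and stands in for the extracted objects collected during a trial run.

```python
def field_coverage(objects: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of extracted objects with a non-empty value for each field."""
    if not objects:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for obj in objects if obj.get(field)) / len(objects)
        for field in fields
    }


# Hypothetical extraction results collected from a trial run on four URLs.
results = [
    {"title": "A", "author": "Kim", "text": "..."},
    {"title": "B", "author": "", "text": "..."},
    {"title": "C", "text": "..."},
    {"title": "D", "author": "Lee", "text": ""},
]
print(field_coverage(results, ["title", "author", "text"]))
```

Coverage says nothing about correctness, so pair it with a manual spot check of a few extracted records against the live pages.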

📝Note

Diffbot for AI training data: Large AI research organizations use Diffbot's web crawl and extraction capabilities to construct structured datasets from web content. Diffbot's continuous crawl of the open web, combined with its semantic extraction, produces training data that is more structured than raw HTML scrapes — useful for building LLMs trained on typed, labeled content rather than raw web text.

Key Takeaways

  • Diffbot uses computer vision and ML to automatically classify and extract structured data from web pages without CSS selectors or maintenance
  • Its Automatic APIs (Article, Product, Company, Person) generalize across sites of the same type without per-site configuration
  • The Knowledge Graph contains 10 billion continuously updated entities with relationships — suitable for due diligence, AI training, and competitive intelligence
  • Pricing starts at $299/month — significantly more expensive than other web scraping tools, justified for enterprise applications
  • Best for large-scale, cross-site data extraction where the robustness of ML-based understanding justifies the higher cost
