Name: Diffbot
Author: Diffbot

Learning Objectives

Understand what Diffbot is and how its AI-based extraction approach differs from traditional web scraping
Identify Diffbot's two main products: Automatic APIs and the Knowledge Graph
Evaluate when Diffbot is the right choice vs. Firecrawl, Apify, or SerpAPI

What Is Diffbot?

Diffbot is a web data extraction company founded in 2008. It uses computer vision and natural language processing to identify what type of content a web page contains (article, product, person profile, company page, review) and to extract the relevant structured data without requiring selectors, patterns, or per-site configuration.

Diffbot has built two complementary products:

Automatic APIs: Extract structured data from any URL — article, product, image, video, discussion — without writing extraction code
Knowledge Graph: A continuously updated graph database of roughly 10 billion entities (people, companies, products, articles, locations) built by crawling and structuring the public web

✅Tip

Try Diffbot: diffbot.com — free trial with limited API calls; plans starting at $299/month; Knowledge Graph has separate pricing; enterprise contracts for large-scale use

Core Products

Automatic APIs — Type-Based Extraction

Diffbot's Automatic APIs classify and extract web pages by content type:

Article API — for any news or editorial content:

Extracts: title, author, publication date, full text, images, tags, summary, sentiment
Works on any news site, blog, or editorial page without configuration
Returns clean structured JSON

Product API — for e-commerce and product pages:

Extracts: name, price, brand, specifications, availability, images, reviews, retailer
Handles product variations and pricing tiers
Works across different retailer layouts automatically
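As a rough illustration of how a type-based call is put together, the sketch below builds a request URL and picks a few fields out of a response. It sends no request. The `/v3/` base mirrors the Article endpoint used elsewhere in this lesson; the `product` path segment, the field names `title`, `offerPrice`, and `brand`, the helper names, and the sample response are assumptions made for this example and should be checked against Diffbot's documentation.

```python
from urllib.parse import urlencode

API_BASE = "https://api.diffbot.com/v3"  # base taken from the Article example in this lesson


def build_extract_url(api_name, page_url, token):
    """Return the GET URL for one extraction call (no request is sent)."""
    query = urlencode({"url": page_url, "token": token})
    return f"{API_BASE}/{api_name}?{query}"


def summarize_product(response):
    """Pick a few fields from the first extracted object, tolerating gaps."""
    objects = response.get("objects") or [{}]
    first = objects[0]
    return {
        "name": first.get("title"),        # assumed field name
        "price": first.get("offerPrice"),  # assumed field name
        "brand": first.get("brand"),       # assumed field name
    }


# Hypothetical response fragment, hand-written for this example.
sample_response = {"objects": [{"title": "Trail Shoe", "offerPrice": "$89.00"}]}

product_url = build_extract_url("product", "https://shop.example.com/item?id=7", "your-api-key")
summary = summarize_product(sample_response)
print(product_url)
print(summary)
```

Reading fields with `.get()` keeps the summary usable when a page lacks a brand or price, which is common when the same code runs across many retailers.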

Company API — for company websites and business profiles:

Extracts: company name, description, employees, revenue, social profiles, technologies used, founding date
Sources from company websites, LinkedIn, Crunchbase, and other business data sources

Person API — for person profiles:

Extracts: name, job title, employer, bio, social profiles, location
Works on LinkedIn profiles, author pages, and professional bios

💡Key Concept

AI-based vs. selector-based scraping: Traditional web scrapers use CSS selectors or XPath, which are brittle rules that can break when a website changes its HTML structure. Diffbot uses computer vision and ML to interpret page content semantically, much as a human reader would, rather than depending on the underlying HTML structure. This makes Diffbot's extractions more robust to site redesigns and more generalizable across different sites of the same type (all news articles, all product pages).
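To make the brittleness concrete, here is a minimal sketch using only Python's standard library. The HTML snippets, the `headline` class name, and the `ClassTextExtractor` helper are invented for illustration. It demonstrates only the selector side of the comparison and says nothing about how Diffbot works internally.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects text inside the first element carrying a given CSS class.

    A stand-in for a CSS selector such as `.headline` (hypothetical helper).
    Void tags like <br> inside the match are not handled in this sketch.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self._depth = 0   # greater than 0 while inside the matched element
        self.text = None  # stays None if the class never appears

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif self.text is None and self.css_class in classes:
            self._depth = 1
            self.text = ""

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text += data


def extract_title(html, css_class="headline"):
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.text.strip() if parser.text else None


# Same article before and after a redesign that renames one class.
before_redesign = '<article><h1 class="headline">Rates hold steady</h1><p>Body</p></article>'
after_redesign = '<article><h1 class="post-title">Rates hold steady</h1><p>Body</p></article>'

title_before = extract_title(before_redesign)
title_after = extract_title(after_redesign)
print(title_before)
print(title_after)
```

The visible page is identical in both versions, yet the class-based rule returns nothing after the rename. That maintenance burden is what type-based extraction aims to remove.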

Knowledge Graph — The Structured Web

The Knowledge Graph is Diffbot's most distinctive product, a graph database of roughly 10 billion entities representing structured knowledge extracted from the web:

10 billion entities: People, companies, products, articles, locations, events
Continuously updated: Diffbot crawls the web continuously, updating entity records as information changes
Relationships: Entities are linked — CEO works for Company, Company founded in Location, Person authored Article
Natural language queries: Ask questions about entities in plain language: "Who are the board members of OpenAI?"
Organization search: Find companies by industry, location, headcount, revenue, funding stage

The Knowledge Graph is used by enterprises for due diligence research, competitive intelligence, and training datasets for AI models.
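An organization search of the kind listed above can be thought of as a set of filters joined into one query string. The sketch below composes such a string and a request URL without sending anything. Treat every specific as an assumption to verify against the Knowledge Graph documentation: the endpoint, the `type`, `query`, and `size` parameters, and the field names `industries`, `location.city.name`, and `nbEmployees` are all illustrative here.

```python
from urllib.parse import urlencode

KG_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"  # assumed endpoint; verify in the docs


def build_org_query(industry=None, city=None, min_employees=None):
    """Compose a space-separated filter string for organization entities."""
    clauses = ["type:Organization"]
    if industry:
        clauses.append(f'industries:"{industry}"')      # assumed field name
    if city:
        clauses.append(f'location.city.name:"{city}"')  # assumed field name
    if min_employees is not None:
        clauses.append(f"nbEmployees>={min_employees}")  # assumed field name
    return " ".join(clauses)


def build_kg_url(query, token, size=10):
    """Return a query URL; parameter names are assumptions for this sketch."""
    params = {"type": "query", "token": token, "query": query, "size": size}
    return f"{KG_ENDPOINT}?{urlencode(params)}"


org_query = build_org_query(industry="Software", city="Austin", min_employees=50)
kg_url = build_kg_url(org_query, "your-api-key")
print(org_query)
print(kg_url)
```

Keeping the filter builder separate from the URL builder makes it easy to log or review the query text before spending API calls on it.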

Enhance API — Entity Enrichment

Given partial information about a company or person, Diffbot's Enhance API fills in the gaps:

Input: company name → output: website, description, employees, revenue, location, social profiles, technologies used
Input: person name + company → output: job title, bio, contact information, social profiles
Bulk enrichment via CSV upload or API
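For bulk enrichment through the API, a common first step is turning a CSV of partial records into one set of request parameters per company. The sketch below does only that preparation step. The CSV columns and the parameter names `type`, `name`, and `url` are assumptions for illustration, not a confirmed description of the Enhance API's inputs.

```python
import csv
import io


def rows_to_enhance_params(csv_text, token):
    """Turn CSV rows (columns: name, url) into one params dict per company."""
    reader = csv.DictReader(io.StringIO(csv_text))
    batch = []
    for row in reader:
        name = (row.get("name") or "").strip()
        if not name:
            continue  # skip rows with nothing to match on
        params = {"type": "Organization", "name": name, "token": token}
        website = (row.get("url") or "").strip()
        if website:
            params["url"] = website  # extra hint to tell same-named companies apart
        batch.append(params)
    return batch


# Hypothetical input: one full row, one empty row, one row with a name only.
sample_csv = "name,url\nAcme Robotics,https://acme.example.com\n,\nGlobex,\n"
batch = rows_to_enhance_params(sample_csv, "your-api-key")
for params in batch:
    print(params)
```

Dropping empty rows before any request is sent avoids paying for calls that cannot match an entity.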

Pricing

Plan	Price	API calls	Included	Intended for
Trial	Free	Limited	Automatic APIs; KG sandbox	Evaluation
Starter	$299/month	10,000	Automatic APIs	Developers; small applications
Business	$999/month	50,000	All APIs; basic KG access	Growing applications
Enterprise	Custom	Unlimited	All APIs; full KG; SLAs	Large-scale use

Diffbot's pricing is higher than that of tools like Firecrawl or SerpAPI, which reflects its enterprise positioning and the broader scope of semantic, type-based extraction and Knowledge Graph access.
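A quick way to compare the paid tiers is cost per included call. The sketch below uses only the Starter and Business figures from the pricing table and assumes, for simplicity, that one extraction counts as one call and that there are no overage charges. Both assumptions should be confirmed against the current price list.

```python
# Figures copied from the pricing table in this lesson.
plans = {
    "Starter": {"monthly_usd": 299, "included_calls": 10_000},
    "Business": {"monthly_usd": 999, "included_calls": 50_000},
}


def cost_per_call(plan):
    """Monthly price divided by the included call quota."""
    return plan["monthly_usd"] / plan["included_calls"]


def cheapest_plan(calls_needed):
    """Cheapest listed plan whose quota covers the volume, else None."""
    fitting = [
        (plan["monthly_usd"], name)
        for name, plan in plans.items()
        if plan["included_calls"] >= calls_needed
    ]
    return min(fitting)[1] if fitting else None


for name, plan in plans.items():
    print(f"{name}: ${cost_per_call(plan):.4f} per call")
```

Under these assumptions Starter works out to about 3 cents per call and Business to about 2 cents, so the higher tier is cheaper per call only if most of the quota is actually used.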

Strengths

No selector maintenance: ML-based extraction works without CSS selectors, so it tends to survive site redesigns that break pattern-based scrapers
Knowledge Graph depth: Roughly 10 billion entities with relationship data, a large body of structured knowledge
Cross-site generalization: The same Article API works on any news site — no per-site configuration
Semantic understanding: Extracts meaning, not just structure — distinguishes article from product from person profile automatically
Long-standing reliability: One of the longest-running AI-based extraction companies, with a mature and stable product

Limitations & Considerations

Price: Significantly more expensive than Firecrawl, SerpAPI, or DIY scraping; this can be justified for enterprise use but is often prohibitive for individuals
Less real-time than SerpAPI: Not optimized for real-time search result extraction
Overkill for simple use cases: If you need to scrape a few specific pages consistently, simpler tools are more cost-effective
Knowledge Graph coverage gaps: Smaller or less prominent organizations and people are less well covered than major companies and public figures
Not open source: No self-hosting option; entirely cloud-dependent

Best Use Cases

Task	Why Diffbot
Large-scale article extraction across news sites	Article API works without per-site configuration
Product data across many retailers	Product API generalizes across retailer layouts
Company data enrichment at scale	Knowledge Graph + Enhance API for firmographic data
AI training dataset construction	Structured, semantically understood web content
Due diligence research	Knowledge Graph queries on companies, people, and relationships
Competitive product monitoring	Product API extracts pricing across competitors automatically

When to choose alternatives:

Simple LLM-optimized page scraping → Firecrawl (cheaper, simpler)
Production scraping platform with Actor marketplace → Apify
Search engine results → SerpAPI
Sales contact database → Apollo.io
Budget-conscious small-scale extraction → Firecrawl or BeautifulSoup

Getting Started

Sign up for a free trial at diffbot.com — includes limited API calls
Test the Article API with a news URL:

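import requests

# Article API endpoint; the token is passed as a query parameter
url = "https://api.diffbot.com/v3/article"
params = {
    "url": "https://techcrunch.com/some-article",  # placeholder: use a real article URL
    "token": "your-api-key",
}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # stop on HTTP-level failures
data = response.json()

# "objects" may be missing if extraction failed, so guard before indexing
objects = data.get("objects", [])
if objects:
    article = objects[0]
    print(article.get("text"))    # Full article text
    print(article.get("author"))  # Author name, if detected
else:
    print("No article extracted:", data)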

Try the Knowledge Graph interface at app.diffbot.com with a trial key
Evaluate if the extraction quality and Knowledge Graph coverage match your specific use case before committing to a paid plan
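One simple way to run that evaluation is to collect responses for a sample of your own URLs and measure how often each field you care about comes back filled. The sketch below assumes responses shaped like the Article example above, with an `objects` list. The sample data, the `error` key, and the helper name are hypothetical.

```python
def field_fill_rates(responses, fields):
    """Share of responses whose first object has a non-empty value per field."""
    counts = {field: 0 for field in fields}
    for response in responses:
        objects = response.get("objects") or []
        first = objects[0] if objects else {}
        for field in fields:
            if first.get(field):
                counts[field] += 1
    total = len(responses)
    return {field: (counts[field] / total if total else 0.0) for field in fields}


# Hand-written stand-ins for saved API responses (hypothetical data).
sample_responses = [
    {"objects": [{"title": "A", "author": "Kim Lee", "text": "Body one"}]},
    {"objects": [{"title": "B", "text": "Body two"}]},
    {"objects": []},
    {"error": "Could not download page"},
]

rates = field_fill_rates(sample_responses, ["title", "author", "text"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%}")
```

Failed or empty responses count against every field, so the rates reflect what a production pipeline would actually receive rather than only the successful extractions.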

📝Note

Diffbot for AI training data: Diffbot's web crawl and extraction capabilities can be used to construct structured datasets from web content. Its continuous crawl of the open web, combined with semantic extraction, produces data that is more structured than raw HTML scrapes, which is useful when training models on typed, labeled content rather than raw web text.

Key Takeaways

Diffbot uses computer vision and ML to automatically classify and extract structured data from web pages without CSS selectors or selector maintenance
Its Automatic APIs (Article, Product, Company, Person) generalize across sites of the same type without per-site configuration
The Knowledge Graph contains 10 billion continuously updated entities with relationships — suitable for due diligence, AI training, and competitive intelligence
Pricing starts at $299/month — significantly more expensive than other web scraping tools, justified for enterprise applications
Best for large-scale, cross-site data extraction where the robustness of ML-based understanding justifies the higher cost
