Learning Objectives
- Understand what Diffbot is and how its AI-based extraction approach differs from traditional web scraping
- Identify Diffbot's two main products: Automatic APIs and the Knowledge Graph
- Evaluate when Diffbot is the right choice vs. Firecrawl, Apify, or SerpAPI
What Is Diffbot?
Diffbot is a web data extraction company founded in 2008. It uses computer vision and natural language processing to determine what type of content a web page contains (article, product, person profile, company page, review) and to extract the relevant structured data without requiring selectors, patterns, or configuration.
Diffbot has built two complementary products:
- Automatic APIs: Extract structured data from any URL — article, product, image, video, discussion — without writing extraction code
- Knowledge Graph: A continuously updated graph database of 10 billion entities (people, companies, products, articles, locations) built by crawling and structuring the public web
✅Tip
Try Diffbot: diffbot.com — free trial with limited API calls; plans starting at $299/month; Knowledge Graph has separate pricing; enterprise contracts for large-scale use
Core Products
Automatic APIs — Type-Based Extraction
Diffbot's Automatic APIs classify and extract web pages by content type:
Article API — for any news or editorial content:
- Extracts: title, author, publication date, full text, images, tags, summary, sentiment
- Works on any news site, blog, or editorial page without configuration
- Returns clean structured JSON
Product API — for e-commerce and product pages:
- Extracts: name, price, brand, specifications, availability, images, reviews, retailer
- Handles product variations and pricing tiers
- Works across different retailer layouts automatically
Company API — for company websites and business profiles:
- Extracts: company name, description, employees, revenue, social profiles, technologies used, founding date
- Sources from company websites, LinkedIn, Crunchbase, and other business data sources
Person API — for person profiles:
- Extracts: name, job title, employer, bio, social profiles, location
- Works on LinkedIn profiles, author pages, and professional bios
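Because extraction is selected by content type rather than by site, a client mostly needs to pick the right endpoint for the page it is sending. The sketch below builds request URLs for the page types named earlier, assuming each follows the same `v3` URL pattern as the Article API call in Getting Started. The helper name `build_extract_url` and the `PAGE_TYPES` set are illustrative, not part of any Diffbot SDK.

```python
from urllib.parse import urlencode

# Base URL taken from the Article API example; the per-type paths are assumed
# to follow the same pattern and should be checked against the API reference.
API_BASE = "https://api.diffbot.com/v3"
PAGE_TYPES = {"article", "product", "image", "video", "discussion"}


def build_extract_url(page_type: str, page_url: str, token: str) -> str:
    """Return the request URL for one type-specific extraction call."""
    if page_type not in PAGE_TYPES:
        raise ValueError(f"unsupported page type: {page_type!r}")
    # urlencode percent-escapes the target page URL so it survives as a parameter.
    query = urlencode({"url": page_url, "token": token})
    return f"{API_BASE}/{page_type}?{query}"


print(build_extract_url("product", "https://example.com/item/42", "your-api-key"))
```

The same function serves every site of a given type, which is the practical meaning of "no per-site configuration": only the `page_type` argument changes.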
💡Key Concept
AI-based vs. selector-based scraping: Traditional web scrapers use CSS selectors or XPath — brittle rules that break whenever a website changes its HTML structure. Diffbot uses computer vision and ML to understand page content semantically — the way a human reading the page would understand it — without caring about the underlying HTML structure. This makes Diffbot's extractions more robust to site redesigns and more generalizable across different sites of the same type (all news articles, all product pages).
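To make the brittleness concrete, here is a small selector-style extractor written with Python's standard library only. It finds a headline by CSS class, which works until a redesign renames that class. The class names and HTML snippets are invented for illustration; this shows the failure mode of selector-based scraping, not how Diffbot works internally.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element carrying a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside the matched element
        self.text = None  # stays None if the class never appears

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.text is None and self.css_class in classes:
            self.depth = 1
            self.text = ""

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def select_text(html: str, css_class: str):
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.text.strip() if parser.text is not None else None


old_layout = '<div><h1 class="headline">Rates hold steady</h1></div>'
new_layout = '<div><h1 class="post-title">Rates hold steady</h1></div>'

print(select_text(old_layout, "headline"))  # Rates hold steady
print(select_text(new_layout, "headline"))  # None
```

The visible page is identical in both layouts, yet the rule silently returns nothing after the rename. A type-based extractor avoids this class of failure because it never depends on the class name in the first place.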
Knowledge Graph — The Structured Web
The Knowledge Graph is Diffbot's most distinctive product, a graph database of 10 billion entities that represents structured knowledge extracted from the web:
- 10 billion entities: People, companies, products, articles, locations, events
- Continuously updated: Diffbot crawls the web continuously, updating entity records as information changes
- Relationships: Entities are linked — CEO works for Company, Company founded in Location, Person authored Article
- Natural language queries: Ask questions about entities in plain language: "Who are the board members of OpenAI?"
- Organization search: Find companies by industry, location, headcount, revenue, funding stage
The Knowledge Graph is used by enterprises for due diligence research, competitive intelligence, and training datasets for AI models.
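The value of linked entities is easier to see with a toy example. The sketch below stores a few (subject, relation, object) triples in memory and answers a reverse lookup such as "who works for this company?". All names and relation labels are made up, and the structure is only an illustration of the graph idea, not Diffbot's schema or query language.

```python
from collections import defaultdict

# Toy triples (subject, relation, object); every value here is hypothetical.
TRIPLES = [
    ("Ada Park", "works_for", "Acme Robotics"),
    ("Acme Robotics", "founded_in", "Austin"),
    ("Ada Park", "authored", "Why Warehouses Need Robots"),
    ("Lee Moreno", "works_for", "Acme Robotics"),
]


def build_index(triples):
    """Index edges by (relation, object) so reverse lookups are cheap."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(relation, obj)].append(subject)
    return index


def subjects_of(index, relation, obj):
    """Return every subject linked to obj by relation, sorted by name."""
    return sorted(index.get((relation, obj), []))


index = build_index(TRIPLES)
print(subjects_of(index, "works_for", "Acme Robotics"))  # ['Ada Park', 'Lee Moreno']
```

Questions like the board-member example are this kind of lookup at web scale: the relationship edges, not the individual entity records, carry the answer.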
Enhance API — Entity Enrichment
Given partial information about a company or person, Diffbot's Enhance API fills in the gaps:
- Input: company name → output: website, description, employees, revenue, location, social profiles, technologies used
- Input: person name + company → output: job title, bio, contact information, social profiles
- Bulk enrichment via CSV upload or API
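On the client side, "filling in the gaps" usually means merging an enrichment result into a record you already hold without overwriting the values you trust. A minimal sketch of that merge follows. The function name `fill_gaps` and the sample records are hypothetical, and the enrichment result is hard-coded rather than fetched from the Enhance API.

```python
def fill_gaps(partial: dict, enriched: dict) -> dict:
    """Merge an enrichment result into a partial record.

    Values the caller already supplied win; enrichment only fills fields
    that are missing or empty in the partial record.
    """
    merged = dict(enriched)
    merged.update({k: v for k, v in partial.items() if v not in (None, "")})
    return merged


# Hypothetical input record and enrichment result, for illustration only.
partial = {"name": "Acme Robotics", "website": ""}
enriched = {"name": "ACME ROBOTICS INC", "website": "acme.example", "employees": 120}
print(fill_gaps(partial, enriched))
```

Keeping caller-supplied values on top is a design choice: it lets a CRM or spreadsheet remain the source of truth for fields a human has already verified.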
Pricing
| API calls | Included | Suited to |
|---|---|---|
| Limited | Automatic APIs, KG sandbox | Evaluation |
| 10,000 | Automatic APIs | Developers, small applications |
| 50,000 | All APIs, basic KG access | Growing applications |
| Unlimited | All APIs, full KG, SLAs | |
Diffbot's pricing is higher than that of tools like Firecrawl or SerpAPI, which reflects its enterprise positioning and the premium it charges for semantic, type-based extraction and Knowledge Graph access.
Strengths
- No selector maintenance: ML-based extraction works without CSS selectors, so it is more robust to layout changes than pattern-based scrapers
- Knowledge Graph depth: 10 billion entities linked by relationship data
- Cross-site generalization: The same Article API works on any news site — no per-site configuration
- Semantic understanding: Extracts meaning, not just structure — distinguishes article from product from person profile automatically
- Long-standing reliability: One of the oldest AI scraping companies — stable and mature
Limitations & Considerations
- Price: Significantly more expensive than Firecrawl, SerpAPI, or DIY scraping — justified for enterprise use but prohibitive for individuals
- Less real-time than SerpAPI: Not optimized for real-time search result extraction
- Overkill for simple use cases: If you need to scrape a few specific pages consistently, simpler tools are more cost-effective
- Knowledge Graph coverage gaps: Smaller or less prominent organizations and people are covered less thoroughly than major ones
- Not open source: No self-hosting option; entirely cloud-dependent
Best Use Cases
| Task | Why Diffbot |
|---|---|
| Large-scale article extraction across news sites | Article API works without per-site configuration |
| Product data across many retailers | Product API generalizes across retailer layouts |
| Company data enrichment at scale | Knowledge Graph + Enhance API for firmographic data |
| AI training dataset construction | Structured, semantically understood web content |
| Due diligence research | Knowledge Graph queries on companies, people, and relationships |
| Competitive product monitoring | Product API extracts pricing across competitors automatically |
When to choose alternatives:
- Simple LLM-optimized page scraping → Firecrawl (cheaper, simpler)
- Production scraping platform with Actor marketplace → Apify
- Search engine results → SerpAPI
- Sales contact database → Apollo.io
- Budget-conscious small-scale extraction → Firecrawl or BeautifulSoup
Getting Started
- Sign up for a free trial at diffbot.com — includes limited API calls
- Test the Article API with a news URL:
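```python
import requests

# Article API endpoint; "token" is your Diffbot API key.
url = "https://api.diffbot.com/v3/article"
params = {
    "url": "https://techcrunch.com/some-article",  # page to extract
    "token": "your-api-key",
}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # fail fast on HTTP errors
article = response.json()["objects"][0]

print(article["text"])        # Full article text
print(article.get("author"))  # Author name, if one was detected
```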
- Try the Knowledge Graph interface at app.diffbot.com with a trial key
- Evaluate if the extraction quality and Knowledge Graph coverage match your specific use case before committing to a paid plan
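When evaluating extraction quality across many URLs, it helps to treat failed or empty extractions explicitly, since the Article API example indexes `objects[0]` directly and would raise an opaque `KeyError` or `IndexError` otherwise. The helper below is a sketch that works on an already parsed response. It assumes failures arrive as a JSON body containing an `error` key, which is an assumption to verify against the Diffbot documentation.

```python
def first_object(payload: dict) -> dict:
    """Return the first extracted object from a parsed API response.

    Assumes failures are reported as a JSON body with an "error" key;
    that field name is an assumption, not confirmed by this lesson.
    """
    if "error" in payload:
        raise RuntimeError(f"extraction failed: {payload['error']}")
    objects = payload.get("objects") or []
    if not objects:
        raise LookupError("response contained no extracted objects")
    return objects[0]


# Hand-written payload shaped like the response used in the example above.
ok = {"objects": [{"title": "Example", "text": "Body text"}]}
print(first_object(ok)["title"])  # Example
```

Counting how often each exception fires over a sample of your own URLs gives a quick, concrete measure of coverage before you commit to a paid plan.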
📝Note
Diffbot for AI training data: Large AI research organizations use Diffbot's web crawl and extraction capabilities to construct structured datasets from web content. Diffbot's continuous crawl of the open web, combined with its semantic extraction, produces training data that is more structured than raw HTML scrapes — useful for building LLMs trained on typed, labeled content rather than raw web text.
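As a rough illustration of what "typed, labeled content" looks like in a dataset, the sketch below turns extracted article objects into JSON Lines, one record per article, keeping only selected fields. The field names match the Article API fields listed earlier; the function name `to_jsonl`, the sample objects, and the choice to drop articles without body text are assumptions made for this example.

```python
import json


def to_jsonl(articles, fields=("title", "author", "text")) -> str:
    """Serialize extracted article objects as one JSON record per line."""
    lines = []
    for article in articles:
        if not article.get("text"):
            continue  # skip pages where no body text was extracted
        record = {field: article.get(field) for field in fields}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical extracted objects, hand-written for illustration.
sample = [
    {"title": "A", "author": "X", "text": "First body", "images": []},
    {"title": "B", "author": None, "text": ""},
]
print(to_jsonl(sample))
```

Each output line carries named fields instead of raw HTML, so downstream filtering (by author, length, or language) can happen without re-parsing pages.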
Key Takeaways
- Diffbot uses computer vision and ML to automatically classify and extract structured data from web pages without CSS selectors or maintenance
- Its Automatic APIs (Article, Product, Company, Person) generalize across sites of the same type without per-site configuration
- The Knowledge Graph contains 10 billion continuously updated entities with relationships — suitable for due diligence, AI training, and competitive intelligence
- Paid plans start at $299/month, significantly more than tools such as Firecrawl or SerpAPI; the cost is easier to justify for enterprise applications
- Best for large-scale, cross-site data extraction where the robustness of ML-based understanding justifies the higher cost