Name: Baseten
Author: Baseten

Learning Objectives

Understand what Baseten is and how it fits into the AI infrastructure landscape
Compare Baseten's inference platform to alternatives like Together AI, Replicate, and Modal
Evaluate Baseten's pricing model and GPU options for production AI workloads

What Is Baseten?

Baseten is a production AI inference platform that makes it easy to deploy, serve, and scale AI models. Instead of managing GPU servers yourself, you deploy your model to Baseten and it handles containerization, GPU allocation, auto-scaling, and monitoring — billed by the minute with no upfront commitments.

Founded in 2019 and backed by a $150 million investment from NVIDIA, Baseten has become a go-to inference platform for AI-native companies such as Cursor, Notion, Superhuman, and Quora. By October 2025 the platform was processing over 1.3 quadrillion tokens per month, a 100x increase in volume over the course of that year, and by mid-2026 it was handling more than 1 billion inference calls a day. In June 2026 the company raised a $1.5 billion Series F at a $13 billion valuation, one of the largest AI-infrastructure rounds on record and a sign that inference has become a contested category of its own.

💡Key Concept

AI Inference Infrastructure: Training an AI model teaches it to generate responses. Inference is actually running that model to serve real users. Inference infrastructure handles the GPU compute, networking, load balancing, and auto-scaling needed to serve millions of AI requests reliably and affordably. It is the invisible layer between the model and the user.

Core Capabilities

Model Deployment (Truss)

Baseten's open-source framework Truss (~6,000 GitHub stars) handles the complexity of deploying AI models:

Containerizes your model with all dependencies
Configures GPU allocation and memory
Supports PyTorch, TensorFlow, vLLM, SGLang, TensorRT-LLM, and more
Deploy with a single command — no Kubernetes or Docker expertise needed
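
To make the packaging step concrete, here is a minimal sketch of the kind of model class Truss wraps. It assumes the class-based interface described in the Truss documentation (a Model class with load and predict methods). The uppercasing "model" and the prompt and output keys are hypothetical stand-ins, so treat this as an illustration rather than Baseten's reference implementation.

```python
from typing import Any, Callable, Dict, Optional


class Model:
    """Toy model in the load/predict shape that Truss is assumed to expect."""

    def __init__(self, **kwargs: Any) -> None:
        # Configuration arrives as keyword arguments; this sketch ignores it.
        self._model: Optional[Callable[[str], str]] = None

    def load(self) -> None:
        # Runs once at startup. Real code would load model weights here;
        # the uppercasing function is a hypothetical placeholder.
        self._model = lambda text: text.upper()

    def predict(self, model_input: Dict[str, Any]) -> Dict[str, Any]:
        # Runs per request. The "prompt" and "output" keys are illustrative.
        if self._model is None:
            raise RuntimeError("load() must be called before predict()")
        return {"output": self._model(model_input["prompt"])}


if __name__ == "__main__":
    model = Model()
    model.load()
    print(model.predict({"prompt": "hello"}))
```

In a real project, load would pull weights onto the GPU, and the dependencies and GPU type would be declared in the project's configuration file. The Truss documentation is the authority on the exact fields.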

Model APIs

Pre-optimized endpoints for popular models, ready to use without any deployment setup:

Model	Highlights
DeepSeek V3/R1	Reasoning-focused open-source models
Llama 4 Maverick	Latest Meta model; mixture-of-experts
Kimi K2	0.30-second time-to-first-token; 140+ tokens per second
GPT-OSS-120B	OpenAI's open-weight model; 0.25-second time-to-first-token

Training and Fine-Tuning

Multi-node fine-tuning infrastructure with seamless promotion to inference endpoints. Train a custom model and deploy it to production in the same platform.

Auto-Scaling

Dynamic scaling that adjusts GPU allocation based on traffic — including scale to zero for models with intermittent usage. No paying for idle GPUs.
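
The idea behind traffic-based scaling can be sketched in a few lines. This is a toy calculation, not Baseten's actual autoscaling algorithm: the function name, the target_concurrency parameter, and the replica limits are all assumptions chosen for illustration.

```python
import math


def desired_replicas(
    in_flight_requests: int,
    target_concurrency: int,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Return how many replicas are needed to serve the current load.

    With min_replicas=0, an idle model scales to zero replicas.
    """
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    if in_flight_requests < 0:
        raise ValueError("in_flight_requests must be non-negative")
    # Enough replicas that each handles at most target_concurrency requests.
    needed = math.ceil(in_flight_requests / target_concurrency)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for load in (0, 1, 9, 1000):
        print(load, "->", desired_replicas(load, target_concurrency=4))
```

Setting min_replicas to 1 instead of 0 trades a small idle cost for avoiding a cold start on the first request after a quiet period.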

Embeddings Inference

Optimized throughput and latency for embedding workloads used in RAG (Retrieval-Augmented Generation) and semantic search applications.

GPU Options and Pricing

Baseten offers pay-per-use pricing billed to the minute with no upfront commitments:

GPU	VRAM	Approximate Hourly Rate
NVIDIA T4	16 GB	Not listed (budget option for smaller models)
NVIDIA A10G	24 GB	~$1.21/hour
NVIDIA A100	80 GB	~$4.00/hour
NVIDIA H100 MIG	40 GB	~$3.75/hour
NVIDIA H100	80 GB	~$6.50/hour
NVIDIA B200 (Blackwell)	180 GB	~$9.98/hour

Model APIs use per-token pricing that varies by model. Baseten reports 225% better cost-performance on Google Cloud Blackwell instances for high-throughput workloads compared to previous-generation GPUs.
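
The approximate hourly rates in the table can be turned into a rough cost estimate. The sketch below assumes that usage is rounded up to whole minutes and that the listed rates apply unchanged; both are assumptions, and the function names are hypothetical. Check current pricing before relying on any figure.

```python
import math

# Approximate hourly rates in USD, copied from the table above.
HOURLY_RATE_USD = {
    "A10G": 1.21,
    "A100": 4.00,
    "H100 MIG": 3.75,
    "H100": 6.50,
    "B200": 9.98,
}


def billed_cost(gpu: str, active_seconds: float) -> float:
    """Cost of one active period, assuming rounding up to whole minutes."""
    minutes = math.ceil(active_seconds / 60)
    return round(minutes * HOURLY_RATE_USD[gpu] / 60, 4)


def monthly_cost(gpu: str, active_hours_per_day: float, days: int = 30) -> float:
    """Monthly cost for a GPU that is active a fixed number of hours per day."""
    return HOURLY_RATE_USD[gpu] * active_hours_per_day * days


if __name__ == "__main__":
    print("90 s on an H100:", billed_cost("H100", 90))
    print("H100 always on:", monthly_cost("H100", 24))
    print("H100 active 2 h/day:", monthly_cost("H100", 2))
```

Under these assumptions, an H100 that is busy two hours a day costs about one twelfth of an always-on instance, which is the practical benefit of scale to zero.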

Speed Benchmarks

Model	Time to First Token	Throughput
GPT-OSS-120B	0.25 seconds	Not quantified (described as high)
Kimi K2 Thinking	0.30 seconds	140+ tokens per second
Nemotron 3 Super	Not quantified (described as fast)	478.3 tokens per second
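
Time to first token and throughput combine into a rough end-to-end response time. The sketch below uses a simple assumed model, first-token latency plus output length divided by throughput, and ignores queueing and network overhead. The function name and the 500-token response length are hypothetical.

```python
def estimated_response_seconds(
    ttft_seconds: float,
    tokens_per_second: float,
    output_tokens: int,
) -> float:
    """Rough total time to stream a full response."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Kimi K2 Thinking figures from the table: 0.30 s and 140 tokens/s.
    total = estimated_response_seconds(0.30, 140, output_tokens=500)
    print(f"Estimated time for 500 tokens: {total:.2f} s")
```

Because the table lists 140+ tokens per second as a lower bound, the result is closer to a worst case than an average. For long responses, throughput dominates the total; for short ones, time to first token does.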

Baseten vs. Competitors

Platform	Best For	Key Difference
Baseten	Production inference for custom models	Truss open-source framework; auto-scaling; Blackwell GPU support; NVIDIA-backed
Together AI	Full-stack open-source AI (inference + training + fine-tuning)	200+ pre-built models; broader model catalog; more established
Replicate	Quick prototyping and model demos	Easiest setup; weaker for private production workloads
Modal	General Python compute and batch jobs	Developer-centric (Python decorators); better for workflows beyond just inference
RunPod	Cost-sensitive GPU rental	Lower-cost bare metal; less managed infrastructure
AWS SageMaker	Enterprise ML lifecycle on AWS	Full AWS ecosystem; heavier setup; Baseten deploys faster

Baseten's niche: Production-grade inference with maximum control over deployment. The Truss framework lets you bring any model with any configuration, while auto-scaling and per-minute billing keep costs aligned with actual usage.

Company Details

Detail	Info
Founded	2019
CEO	Tuhin Srivastava (co-founder)
Headquarters	San Francisco, California
Employees	~60+ (growing rapidly after 3 funding rounds in 12 months)
Valuation	$13 billion (June 2026 Series F; up from $5 billion in January 2026)
Latest Funding	$1.5 billion Series F (June 2026; led by Altimeter, Conviction, and Spark Capital)
Total Raised	~$2.1 billion
NVIDIA Investment	$150 million (participated in Series E)
Revenue Growth	~20x year over year (mid-2026); 10x in 2025
Inference Volume	More than 1 billion inference calls per day (2026); 1.3 quadrillion tokens per month (October 2025)
Notable Customers	Cursor; Notion; Superhuman; Quora; HeyGen; Writer; Clay
Open Source	Truss framework (~6,000 GitHub stars)
Website	baseten.co

Strengths

Production-grade inference — purpose-built for reliability, low latency, and auto-scaling at massive scale
NVIDIA-backed — $150 million investment and close hardware partnership; early access to Blackwell B200 GPUs
Truss open-source framework — deploy any model with any framework; full control over configuration without managing infrastructure
Scale to zero — no paying for idle GPUs; per-minute billing aligns costs with actual usage
Explosive growth — 10 times revenue and 100 times volume growth in 2025, with revenue up roughly 20 times year over year into mid-2026; a $1.5 billion Series F at a $13 billion valuation validates the platform

Limitations and Considerations

Inference-focused — while training was added in 2025, Baseten is primarily an inference platform; Together AI and Databricks offer more comprehensive AI development environments
Smaller model catalog — fewer pre-built Model API endpoints than Together AI (200+ models) or Replicate
Enterprise maturity — smaller company (~60+ employees) compared to established cloud providers
No permanent free tier confirmed — sign-up credits may be available, but no guaranteed free usage like Groq or Cerebras
Custom deployment complexity — while Truss simplifies things, deploying custom models still requires ML engineering expertise

Key Takeaways

Baseten is a high-performance AI inference platform backed by NVIDIA that was processing 1.3 quadrillion tokens per month by October 2025, a 100x increase over the year
The Truss open-source framework lets you deploy any AI model to production with auto-scaling, scale-to-zero, and per-minute billing on GPUs from T4 to Blackwell B200
Valued at $13 billion after a $1.5 billion Series F in June 2026 (about $2.1 billion raised in total, including $150 million from NVIDIA); revenue up roughly 20 times year over year
Best suited for AI-native companies that need production-grade inference with maximum control over model deployment and GPU configuration
