Learning Objectives
- Understand what Baseten is and how it fits into the AI infrastructure landscape
- Compare Baseten's inference platform to alternatives like Together AI, Replicate, and Modal
- Evaluate Baseten's pricing model and GPU options for production AI workloads
What Is Baseten?
Baseten is a production AI inference platform that makes it easy to deploy, serve, and scale AI models. Instead of managing GPU servers yourself, you deploy your model to Baseten and it handles containerization, GPU allocation, auto-scaling, and monitoring — billed by the minute with no upfront commitments.
Founded in 2019 and backed by a $150 million investment from NVIDIA, Baseten has become the go-to inference platform for AI-native companies like Cursor, Notion, Superhuman, and Quora. The platform processed over 1.3 quadrillion tokens per month by October 2025 — a 100 times volume increase over the course of that year.
💡Key Concept
AI Inference Infrastructure: Training an AI model teaches it to generate responses. Inference is actually running that model to serve real users. Inference infrastructure handles the GPU compute, networking, load balancing, and auto-scaling needed to serve millions of AI requests reliably and affordably. It is the invisible layer between the model and the user.
Core Capabilities
Model Deployment (Truss)
Baseten's open-source framework Truss (~6,000 GitHub stars) handles the complexity of deploying AI models:
- Containerizes your model with all dependencies
- Configures GPU allocation and memory
- Supports PyTorch, TensorFlow, vLLM, SGLang, TensorRT-LLM, and more
- Deploy with a single command — no Kubernetes or Docker expertise needed
Model APIs
Pre-optimized endpoints for popular models, ready to use without any deployment setup:
| Tool | Best For |
|---|
Training and Fine-Tuning
Multi-node fine-tuning infrastructure with seamless promotion to inference endpoints. Train a custom model and deploy it to production in the same platform.
Auto-Scaling
Dynamic scaling that adjusts GPU allocation based on traffic — including scale to zero for models with intermittent usage. No paying for idle GPUs.
Embeddings Inference
Optimized throughput and latency for embedding workloads used in RAG (Retrieval-Augmented Generation) and semantic search applications.
GPU Options and Pricing
Baseten offers pay-per-use pricing billed to the minute with no upfront commitments:
| GPU | VRAM | Approximate Hourly Rate |
|---|---|---|
| NVIDIA T4 | 16 GB | Budget option for smaller models |
| NVIDIA A10G | 24 GB | ~$1.21/hour |
| NVIDIA A100 | 80 GB | ~$4.00/hour |
| NVIDIA H100 MIG | 40 GB | ~$3.75/hour |
| NVIDIA H100 | 80 GB | ~$6.50/hour |
| NVIDIA B200 (Blackwell) | 180 GB | ~$9.98/hour |
Model APIs use per-token pricing that varies by model. Baseten reports 225% better cost-performance on Google Cloud Blackwell instances for high-throughput workloads compared to previous-generation GPUs.
Speed Benchmarks
| Model | Time to First Token | Throughput |
|---|---|---|
| GPT-OSS-120B | 0.25 seconds | High throughput |
| Kimi K2 Thinking | 300 milliseconds | 140+ tokens per second |
| Nemotron 3 Super | Fast | 478.3 tokens per second |
Baseten vs. Competitors
| Platform | Best For | Key Difference |
|---|---|---|
| Baseten | Production inference for custom models | Truss open-source framework; auto-scaling; Blackwell GPU support; NVIDIA-backed |
| Together AI | Full-stack open-source AI (inference + training + fine-tuning) | 200+ pre-built models; broader model catalog; more established |
| Replicate | Quick prototyping and model demos | Easiest setup; weaker for private production workloads |
| Modal | General Python compute and batch jobs | Developer-centric (Python decorators); better for workflows beyond just inference |
| RunPod | Cost-sensitive GPU rental | Lower-cost bare metal; less managed infrastructure |
| AWS SageMaker | Enterprise ML lifecycle on AWS | Full AWS ecosystem; heavier setup; Baseten deploys faster |
Baseten's niche: Production-grade inference with maximum control over deployment. The Truss framework lets you bring any model with any configuration, while auto-scaling and per-minute billing keep costs aligned with actual usage.
Company Details
| Detail | Info |
|---|---|
| Founded | 2019 |
| CEO | Tuhin Srivastava (co-founder) |
| Headquarters | San Francisco, California |
| Employees | ~60+ (growing rapidly after 3 funding rounds in 12 months) |
| Valuation | $5 billion (January 2026; up from $825 million in February 2025) |
| Latest Funding | $300 million Series E (January 2026; led by IVP and CapitalG) |
| Total Raised | ~$585 million |
| NVIDIA Investment | $150 million (participated in Series E) |
| Revenue Growth | 10x in 2025 |
| Inference Volume | 1.3 quadrillion tokens per month (October 2025) |
| Notable Customers | Cursor; Notion; Superhuman; Quora; HeyGen; Writer; Clay |
| Open Source | Truss framework (~6,000 GitHub stars) |
| Website | baseten.co |
Strengths
- Production-grade inference — purpose-built for reliability, low latency, and auto-scaling at massive scale
- NVIDIA-backed — $150 million investment and close hardware partnership; early access to Blackwell B200 GPUs
- Truss open-source framework — deploy any model with any framework; full control over configuration without managing infrastructure
- Scale to zero — no paying for idle GPUs; per-minute billing aligns costs with actual usage
- Explosive growth — 10x revenue and 100x volume growth in 2025; $5 billion valuation validates the platform
Limitations and Considerations
- Inference-focused — while training was added in 2025, Baseten is primarily an inference platform; Together AI and Databricks offer more comprehensive AI development environments
- Smaller model catalog — fewer pre-built Model API endpoints than Together AI (200+ models) or Replicate
- Enterprise maturity — smaller company (~60+ employees) compared to established cloud providers
- No permanent free tier confirmed — sign-up credits may be available, but no guaranteed free usage like Groq or Cerebras
- Custom deployment complexity — while Truss simplifies things, deploying custom models still requires ML engineering expertise
Key Takeaways
- Baseten is a high-performance AI inference platform backed by NVIDIA that processed 1.3 quadrillion tokens per month by October 2025 — a 100 times increase over the year
- The Truss open-source framework lets you deploy any AI model to production with auto-scaling, scale-to-zero, and per-minute billing on GPUs from T4 to Blackwell B200
- Valued at $5 billion after raising $585 million total (including $150 million from NVIDIA); 10x revenue growth in 2025
- Best suited for AI-native companies that need production-grade inference with maximum control over model deployment and GPU configuration