Learning Objectives
- Understand what Amazon S3 is and how it differs from consumer cloud storage like Drive or Dropbox
- Identify S3's core features: object storage, buckets, storage classes, and event-driven integrations
- Recognize S3's role in AI and ML workflows — data lakes, training data, and model artifact storage
What Is Amazon S3?
Amazon S3 (Simple Storage Service) is AWS's object storage service, launched in 2006 as one of the first AWS services and still among the most widely used. Unlike consumer-facing platforms such as Google Drive or Dropbox, S3 is designed for programmatic, application-level storage: developers, DevOps engineers, and data engineers use it to store and retrieve any amount of data from anywhere.
S3 stores objects (files, blobs, binary data) in buckets (containers), addressed by keys (file paths). It scales from a single file to exabytes of data with no pre-provisioning required. S3 is the storage layer beneath much of the modern internet — website assets, application files, log archives, database backups, and AI training datasets are all commonly stored in S3.
✅Tip
Try Amazon S3: aws.amazon.com/s3 — first 12 months free tier includes 5GB standard storage, 20,000 GET requests, and 2,000 PUT requests; after that, pricing is per-GB stored and per-request
Pricing
S3 pricing is usage-based with no minimum commitment:
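- S3 Standard storage: tiered per-GB rate for the first 50TB, lower at higher volumes
- PUT requests: charged per write operation
- GET requests: charged per read operation
- Data transfer: free within the same AWS region
- S3 Intelligent-Tiering: auto-moves data to cheaper tiers based on access
- S3 Glacier Deep Archive: lowest-cost archival; retrieval takes hours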
For most application use cases, S3 is very cost-effective: at the per-GB rates listed under Storage Classes, storing 100GB costs roughly $2.30/month in S3 Standard and about $1.25/month in S3 Standard-IA. Data transfer out to the internet is the most significant cost driver for high-traffic applications.
Core Features
Object Storage and Buckets
The fundamental model:
- Bucket: A named container for objects. Bucket names are globally unique across AWS.
- Object: Any file, from a text file to a multi-TB video, stored with its metadata and identified by a key
- Key: The path/filename for the object (e.g., `datasets/training/batch-001.jsonl`)
- No folder hierarchy: S3 uses flat storage with key prefixes that simulate folder paths
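To make the "flat keys, simulated folders" idea concrete, here is a minimal pure-Python sketch. It does not call S3; it only imitates how a listing with a prefix and a `/` delimiter groups keys into objects and "folders". The function name `list_level` and the sample keys are illustrative assumptions, not part of any AWS API.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Group flat keys the way a prefix + delimiter listing would.

    Returns (objects, common_prefixes) for one simulated "folder" level.
    """
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so this looks like a subfolder.
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(common_prefixes)


# Hypothetical keys in a single bucket: just strings, no real directories.
keys = [
    "datasets/training/batch-001.jsonl",
    "datasets/training/batch-002.jsonl",
    "datasets/eval/holdout.jsonl",
    "README.txt",
]

top_objects, top_prefixes = list_level(keys)
train_objects, train_prefixes = list_level(keys, prefix="datasets/training/")
```

At the top level only `README.txt` appears as an object and `datasets/` appears as a prefix, even though no directory was ever created.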
Storage Classes — Cost Optimization
S3 offers multiple storage classes with different cost/access tradeoff profiles:
| Storage Class | Use Case | Retrieval | Monthly Cost (per GB) |
|---|---|---|---|
| S3 Standard | Frequently accessed data; web assets; active datasets | Milliseconds | $0.023 |
| S3 Standard-IA | Infrequently accessed; backups kept warm | Milliseconds | $0.0125 |
| S3 One Zone-IA | Infrequently accessed; single availability zone | Milliseconds | $0.01 |
| S3 Glacier Instant Retrieval | Archives needing millisecond access | Milliseconds | $0.004 |
| S3 Glacier Flexible Retrieval | Archives; restore in minutes to hours | Minutes to hours | $0.0036 |
| S3 Glacier Deep Archive | Long-term archival; 7–10 year retention | Hours | $0.00099 |
| S3 Intelligent-Tiering | Unpredictable access patterns | Varies | Varies + $0.0025/1K objects |
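As a rough worked example, the per-GB rates in the table can be turned into a monthly storage estimate. This sketch uses only the table's figures and deliberately ignores request, retrieval, and data transfer charges, minimum storage durations, and Intelligent-Tiering (whose cost varies); the function name is an illustrative assumption.

```python
# Per-GB monthly rates copied from the table above (USD); actual prices
# vary by region and change over time.
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 One Zone-IA": 0.01,
    "S3 Glacier Instant Retrieval": 0.004,
    "S3 Glacier Flexible Retrieval": 0.0036,
    "S3 Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(size_gb, storage_class):
    """Storage-only estimate in USD, rounded to cents."""
    return round(size_gb * PRICE_PER_GB_MONTH[storage_class], 2)


# Compare what 100 GB would cost per month in each class.
for name in PRICE_PER_GB_MONTH:
    print(f"{name:32s} ${monthly_storage_cost(100, name):.2f}")
```

The gap between classes is what makes lifecycle transitions worthwhile for data that is rarely read.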
S3 in AI and ML Workflows
S3 is foundational infrastructure for AI development:
- Training data storage: Store datasets (images, text, JSON, Parquet) that SageMaker, Bedrock, or custom training scripts pull from
- Model artifact storage: Trained model weights and checkpoints are saved to S3 between training runs
- Data lakes: S3 serves as the storage layer for AWS Glue (ETL), Athena (SQL queries on S3 data), and Amazon Bedrock Knowledge Bases (RAG)
- Vector store inputs: Raw documents pre-processed and stored in S3 before ingestion into vector databases
- Feature stores: Pre-computed ML features stored as Parquet files in S3, accessible to training and inference pipelines
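The checkpoint-storage pattern from the list above might look like the following sketch. The key layout (`checkpoints/<run_id>/step-NNNNNN/...`), the function names, and the bucket name are hypothetical; the only AWS call is the boto3 S3 client's `upload_file(Filename, Bucket, Key)`, and the client is passed in so the logic can be exercised without credentials.

```python
def checkpoint_key(run_id, step, filename="model.pt"):
    """Build a key such as checkpoints/run-42/step-000500/model.pt.

    The layout is an illustrative convention, not an S3 requirement.
    """
    return f"checkpoints/{run_id}/step-{step:06d}/{filename}"


def save_checkpoint(s3_client, bucket, run_id, step, local_path):
    """Upload a local checkpoint file and return its s3:// URI."""
    key = checkpoint_key(run_id, step)
    s3_client.upload_file(local_path, bucket, key)  # boto3 S3 client method
    return f"s3://{bucket}/{key}"


# Usage with boto3 (assumes credentials are configured and the bucket exists):
#   import boto3
#   uri = save_checkpoint(boto3.client("s3"), "my-bucket", "run-42", 500, "model.pt")
```

Zero-padding the step number keeps keys in training order when they are listed lexicographically.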
💡Key Concept
S3 as the AI data backbone: When an AI company trains a large language model on web-crawled text, they typically store the raw corpus, cleaned datasets, tokenized batches, and model checkpoints in S3 or an equivalent object store. S3's virtually unlimited scale, high durability (99.999999999% — "11 nines"), and tight integration with AWS compute (EC2, SageMaker) make it the de facto standard for AI/ML data infrastructure.
Event-Driven Integrations
S3 supports event notifications that trigger downstream processing automatically:
- S3 → Lambda: A new file uploaded to a bucket triggers a Lambda function (resizing images, processing CSV, running inference)
- S3 → SQS/SNS: Fan out notifications to processing queues
- S3 → Bedrock: New documents in S3 can trigger re-indexing into a Bedrock Knowledge Base for RAG, typically through a Lambda function that starts an ingestion job
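The S3 → Lambda pattern can be sketched as a small Python handler. It reads the bucket name and object key from each record of an S3 event notification; the sample event below is a trimmed, hand-written stand-in for a real payload, and the processing step is left as a print statement.

```python
from urllib.parse import unquote_plus


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes '+'), so decode them.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    """Lambda entry point; real processing (resize, parse, infer) would go here."""
    objects = extract_objects(event)
    for bucket, key in objects:
        print(f"new object: s3://{bucket}/{key}")
    return {"processed": len(objects)}


# Trimmed sample event for local testing (real events carry more fields).
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "uploads/cat+photo.jpg", "size": 1024},
            },
        }
    ]
}
result = handler(sample_event, None)
```

Decoding the key matters in practice: using the raw, encoded key to fetch the object fails for names that contain spaces or special characters.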
Access Control and Security
- Bucket policies: JSON-based policies controlling who can access which objects
- IAM roles: Attach permissions to EC2 instances or Lambda functions without hard-coding credentials
- S3 Block Public Access: A bucket- or account-level setting that prevents accidental public exposure; critical for sensitive data
- Versioning: Maintain all versions of every object — essential for AI dataset lineage and rollback
- Server-side encryption: Automatic AES-256 encryption at rest with S3-managed keys (SSE-S3), or with KMS keys that you manage (SSE-KMS)
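As an illustration of what a JSON bucket policy looks like, the sketch below builds one common example: denying any request that does not use HTTPS. The function name and bucket name are placeholders, and this is one possible policy rather than a recommended baseline for every bucket.

```python
import json


def deny_insecure_transport_policy(bucket):
    """Build a bucket policy that rejects requests made without TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy_json = json.dumps(deny_insecure_transport_policy("my-bucket"), indent=2)
print(policy_json)
```

With boto3, a policy string like this could be attached using `put_bucket_policy(Bucket="my-bucket", Policy=policy_json)`, assuming the caller has permission to change the bucket's policy.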
Strengths
- Scale: No pre-provisioning; store from bytes to exabytes
- Durability: Designed for 99.999999999% durability (objects are replicated across multiple Availability Zones, except in the One Zone storage classes)
- AWS ecosystem integration: Native integration with most AWS services across compute, analytics, AI/ML, and streaming
- Cost efficiency at scale: Very low cost for cold data (Glacier Deep Archive under $1/TB/month)
- Ecosystem de facto standard: S3-compatible APIs are implemented by many cloud storage providers and on-premises systems
Limitations & Considerations
- Not for end users: S3 has no consumer UI for everyday file management — it requires AWS console knowledge or programmatic access via SDK/CLI
- Data transfer egress costs: Downloading large amounts of data out of AWS to the internet can become expensive
- No collaborative editing: S3 stores files; it has no Docs-style editing, search, or preview
- Complexity for newcomers: AWS IAM policies, bucket permissions, and lifecycle rules have a learning curve
Best Use Cases
| Task | Why S3 |
|---|---|
| AI training dataset storage | Scalable, durable, directly accessible by SageMaker and custom training code |
| Application asset hosting | Images, videos, static files served via CloudFront CDN |
| Database and system backups | Automated backup to S3 Standard-IA or Glacier |
| Data lake foundation | Store raw and processed data for Athena SQL queries and analytics |
| Model checkpoint storage | Save large model weights between training runs |
| Log archival | Compress and archive application logs to Glacier Deep Archive cheaply |
When to choose alternatives:
- Personal or team productivity file storage → Google Drive or OneDrive
- Simple cloud backup for consumers → Backblaze B2 (simpler and cheaper egress)
- Non-AWS developer storage → Google Cloud Storage or Azure Blob Storage
- Team file collaboration → Dropbox or SharePoint
Getting Started
- Create an AWS account at aws.amazon.com — free tier includes 5GB S3 Standard for 12 months
- Open the S3 console and click Create Bucket — choose a globally unique name and a region close to your users
- Enable Block Public Access for private buckets (default for new buckets)
- Upload a file via the console or use the AWS CLI: `aws s3 cp myfile.txt s3://my-bucket/`
- Configure a Lifecycle Rule to transition objects to Glacier after 90 days if they won't be frequently accessed
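The lifecycle step can also be done programmatically. The sketch below builds a rule that transitions objects to Glacier Flexible Retrieval (API storage class `GLACIER`) after 90 days and applies it through the boto3 S3 client's `put_bucket_lifecycle_configuration`. The helper names, rule ID, and bucket name are illustrative assumptions; the client is injected so the rule can be inspected without an AWS account.

```python
def glacier_after_days(days=90, prefix=""):
    """Build a lifecycle configuration that moves objects to Glacier.

    An empty prefix applies the rule to every object in the bucket.
    """
    return {
        "Rules": [
            {
                "ID": f"to-glacier-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, config):
    """Apply the configuration; note this replaces any existing rules."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = glacier_after_days(90)

# Usage with boto3 (assumes credentials and an existing bucket):
#   import boto3
#   apply_lifecycle(boto3.client("s3"), "my-bucket", config)
```

Because the call overwrites the bucket's whole lifecycle configuration, existing rules should be fetched and merged first if any are already in place.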
📝Note
S3-compatible storage: Many storage systems implement the S3 API — including Cloudflare R2 (no egress fees), Backblaze B2, MinIO (self-hosted), and DigitalOcean Spaces. This means code written against S3 can often be redirected to these alternatives with minimal changes, which is useful for cost optimization or compliance requirements.
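In boto3, redirecting code to an S3-compatible system usually comes down to passing a different `endpoint_url` when creating the client. The sketch below only builds the client arguments; the helper name is hypothetical, and the MinIO URL assumes a local server on its default port 9000. Credentials for the alternative provider would still need to be configured separately.

```python
def s3_client_kwargs(endpoint_url=None, region="us-east-1"):
    """Build keyword arguments for boto3.client("s3", ...).

    With no endpoint_url, boto3 talks to AWS itself.
    """
    kwargs = {"region_name": region}
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


aws_kwargs = s3_client_kwargs()
# Assumed local MinIO endpoint; replace with your provider's S3 endpoint.
minio_kwargs = s3_client_kwargs("http://localhost:9000")

# Usage (the rest of the application code stays unchanged):
#   import boto3
#   s3 = boto3.client("s3", **minio_kwargs)
#   s3.upload_file("myfile.txt", "my-bucket", "myfile.txt")
```

Compatibility is rarely complete, so features beyond basic reads and writes (storage classes, event notifications, some policy options) should be checked against each provider's documentation.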
Key Takeaways
- Amazon S3 is object storage infrastructure — not a consumer productivity tool but the storage backbone of the web and AI ecosystem
- It stores objects in buckets, scales to virtually unlimited size, and integrates natively with most AWS services, including SageMaker, Bedrock, Lambda, and Athena
- Multiple storage classes (Standard, Intelligent-Tiering, Glacier) allow dramatic cost optimization based on how frequently data is accessed
- S3 is foundational to AI/ML workflows: training datasets, model checkpoints, data lakes, and RAG document stores are all typically S3-backed
- For AI developers, comfort with S3 (and AWS IAM for access control) is an essential infrastructure skill