Name: Amazon S3
Author: Amazon

Learning Objectives

Understand what Amazon S3 is and how it differs from consumer cloud storage like Drive or Dropbox
Identify S3's core features: object storage, buckets, storage classes, and event-driven integrations
Recognize S3's role in AI and ML workflows — data lakes, training data, and model artifact storage

What Is Amazon S3?

Amazon S3 (Simple Storage Service) is AWS's object storage service, launched in 2006 as one of the first AWS services and still among the most widely used. Unlike consumer-facing platforms such as Google Drive or Dropbox, S3 is designed for programmatic, application-level storage: developers, DevOps engineers, and data engineers use it to store and retrieve any amount of data from anywhere.

S3 stores objects (files, blobs, binary data) in buckets (containers), addressed by keys (file paths). It scales from a single file to exabytes of data with no pre-provisioning required. S3 is the storage layer beneath much of the modern internet — website assets, application files, log archives, database backups, and AI training datasets are all commonly stored in S3.

✅Tip

Try Amazon S3: aws.amazon.com/s3. The 12-month free tier includes 5 GB of S3 Standard storage, 20,000 GET requests, and 2,000 PUT requests per month; beyond that, pricing is per GB stored and per request.

Pricing

S3 pricing is usage-based with no minimum commitment:

Item	Price	Notes
Storage (S3 Standard)	$0.023 per GB/month	First 50 TB; lower at higher volumes
PUT/COPY/POST/LIST requests	$0.005 per 1,000 requests	Write operations
GET and all other requests	$0.0004 per 1,000 requests	Read operations
Data transfer out to internet	$0.09 per GB (first 10 TB)	Free within the same AWS region
S3 Intelligent-Tiering	$0.023 per GB + monitoring fee	Auto-moves data to cheaper tiers based on access
S3 Glacier Deep Archive	$0.00099 per GB/month	Lowest-cost archival; retrieval takes hours

For most application use cases, S3 is cost-effective: storing 100 GB costs roughly $2.30/month in S3 Standard, or about $1.25/month in Standard-IA for infrequently accessed data. Data transfer out to the internet is usually the most significant cost driver for high-traffic applications.
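
As a rough worked example, the list prices in the pricing table can be combined into a monthly estimate. The sketch below is illustrative only: the function name and the workload figures are hypothetical, the rates are the first-tier prices quoted in this lesson, and real bills vary by region and volume tier.

```python
# Illustrative only: first-tier list prices quoted in this lesson (USD).
STORAGE_PER_GB = 0.023   # S3 Standard, per GB-month
PUT_PER_1000 = 0.005     # PUT/COPY/POST/LIST requests
GET_PER_1000 = 0.0004    # GET and all other requests
EGRESS_PER_GB = 0.09     # data transfer out to the internet


def estimate_monthly_cost(storage_gb, put_requests=0, get_requests=0, egress_gb=0):
    """Return an approximate monthly S3 Standard bill in USD."""
    storage = storage_gb * STORAGE_PER_GB
    requests = (put_requests / 1000) * PUT_PER_1000 + (get_requests / 1000) * GET_PER_1000
    egress = egress_gb * EGRESS_PER_GB
    return round(storage + requests + egress, 2)


# Hypothetical workload: 100 GB stored, 10,000 writes, 1,000,000 reads,
# and 500 GB served to the internet in one month.
print(estimate_monthly_cost(100))                          # storage only
print(estimate_monthly_cost(100, 10_000, 1_000_000, 500))  # with traffic
```

In this hypothetical workload, storage and requests together stay under $3, while the 500 GB of egress adds $45, which is why transfer out tends to dominate for high-traffic applications.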

Core Features

Object Storage and Buckets

The fundamental model:

Bucket: A named container for objects. Bucket names are globally unique across AWS.
Object: Any file, from a text file to a multi-TB video, stored with metadata and identified by a unique key
Key: The path/filename for the object (e.g., datasets/training/batch-001.jsonl)
No folder hierarchy: S3 uses flat storage with key prefixes that simulate folder paths
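
To make the flat-storage point concrete, here is a small local simulation (no AWS calls) of how a prefix and a delimiter turn flat keys into something that looks like folders. The function name `list_prefix` and the sample keys are hypothetical; the grouping is meant to mirror, in simplified form, what S3 list operations do with their prefix and delimiter parameters.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Group flat keys the way a prefix + delimiter listing presents them.

    Returns (objects, common_prefixes): keys directly "inside" the prefix,
    and the distinct next-level "folders" beneath it.
    """
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to and including the next delimiter acts as a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# Hypothetical keys in one bucket; the bucket itself stores them as a flat set.
keys = [
    "datasets/training/batch-001.jsonl",
    "datasets/training/batch-002.jsonl",
    "datasets/eval/holdout.jsonl",
    "README.txt",
]
print(list_prefix(keys))                      # top level
print(list_prefix(keys, prefix="datasets/"))  # one "folder" down
```

Nothing named `datasets/` exists as an object here; the "folder" is only the part of the key that several objects share.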

Storage Classes — Cost Optimization

S3 offers multiple storage classes with different cost/access tradeoff profiles:

Storage Class	Use Case	Retrieval	Monthly Cost (per GB)
S3 Standard	Frequently accessed data; web assets; active datasets	Milliseconds	$0.023
S3 Standard-IA	Infrequently accessed; backups kept warm	Milliseconds	$0.0125
S3 One Zone-IA	Infrequently accessed; single availability zone	Milliseconds	$0.01
S3 Glacier Instant Retrieval	Archives needing millisecond access	Milliseconds	$0.004
S3 Glacier Flexible Retrieval	Archives; restore in minutes to hours	Minutes to hours	$0.0036
S3 Glacier Deep Archive	Long-term archival; 7–10 year retention	Hours	$0.00099
S3 Intelligent-Tiering	Unpredictable access patterns	Varies	Varies + $0.0025/1K objects
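
To see how much the storage class matters, the per-GB prices in the table can be applied to a single dataset. This is a sketch under stated assumptions: the 1,000 GB dataset is hypothetical, and the figures cover storage only, ignoring retrieval fees, request charges, and minimum storage durations that apply to some classes.

```python
# Per-GB monthly storage prices from the table in this lesson (USD).
# Intelligent-Tiering is omitted because its cost varies with access patterns.
PRICE_PER_GB = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 One Zone-IA": 0.01,
    "S3 Glacier Instant Retrieval": 0.004,
    "S3 Glacier Flexible Retrieval": 0.0036,
    "S3 Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(size_gb):
    """Return {storage_class: monthly storage cost in USD} for size_gb of data."""
    return {name: round(size_gb * price, 2) for name, price in PRICE_PER_GB.items()}


# Hypothetical 1,000 GB dataset.
for name, cost in monthly_storage_cost(1000).items():
    print(f"{name:32s} ${cost:>6.2f}/month")
```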

S3 in AI and ML Workflows

S3 is foundational infrastructure for AI development:

Training data storage: Store datasets (images, text, JSON, Parquet) that SageMaker, Bedrock, or custom training scripts pull from
Model artifact storage: Trained model weights and checkpoints are saved to S3 between training runs
Data lakes: S3 serves as the storage layer for AWS Glue (ETL), Athena (SQL queries on S3 data), and Amazon Bedrock Knowledge Bases (RAG)
Vector store inputs: Raw documents pre-processed and stored in S3 before ingestion into vector databases
Feature stores: Pre-computed ML features stored as Parquet files in S3, accessible to training and inference pipelines
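
As one illustration of the model artifact pattern, a training script might upload checkpoints under a predictable key layout. The sketch below is a minimal example, not a prescribed convention: the key layout, function names, bucket, and run ID are all hypothetical, and the client is passed in so the code can be exercised without AWS credentials. It assumes a boto3-style S3 client, whose `upload_file(Filename, Bucket, Key)` method handles the transfer.

```python
def checkpoint_key(run_id, step):
    """Build a sortable object key for a checkpoint (layout is an assumption)."""
    return f"checkpoints/{run_id}/step-{step:08d}.pt"


def save_checkpoint(s3_client, bucket, run_id, step, local_path):
    """Upload a local checkpoint file and return the key it was stored under."""
    key = checkpoint_key(run_id, step)
    # boto3 S3 clients expose upload_file(Filename, Bucket, Key).
    s3_client.upload_file(local_path, bucket, key)
    return key


# Typical use, assuming boto3 is installed and credentials are configured:
#   import boto3
#   save_checkpoint(boto3.client("s3"), "my-ml-bucket", "run-42", 1500, "model.pt")
print(checkpoint_key("run-42", 1500))
```

Zero-padding the step number keeps keys in training order when they are listed lexicographically, which makes "latest checkpoint" lookups simpler.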

💡Key Concept

S3 as the AI data backbone: When an AI company trains a large language model on web-crawled text, they typically store the raw corpus, cleaned datasets, tokenized batches, and model checkpoints in S3 or an equivalent object store. S3's virtually unlimited scale, high durability (99.999999999% — "11 nines"), and tight integration with AWS compute (EC2, SageMaker) make it the de facto standard for AI/ML data infrastructure.

Event-Driven Integrations

S3 supports event notifications that trigger downstream processing automatically:

S3 → Lambda: A new file uploaded to a bucket triggers a Lambda function (resizing images, processing CSV, running inference)
S3 → SQS/SNS: Fan out notifications to processing queues
S3 → Bedrock (via Lambda or EventBridge): New documents in S3 can trigger a function that starts re-indexing into a Bedrock Knowledge Base for RAG
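
For the S3 to Lambda path, a handler typically pulls the bucket name and object key out of the notification payload before doing any work. The sketch below shows that step only; the processing itself is a placeholder, and the sample event is a trimmed, hypothetical payload for local testing (real notifications carry more fields).

```python
from urllib.parse import unquote_plus


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become "+"), so decode them.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    """Entry point that Lambda calls; the processing step is a placeholder."""
    for bucket, key in extract_objects(event):
        print(f"New object: s3://{bucket}/{key}")  # replace with real processing
    return {"processed": len(event.get("Records", []))}


# Trimmed, hypothetical event for local testing.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "uploads/my+report.csv"}}}
    ]
}
print(lambda_handler(sample_event, None))
```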

Access Control and Security

Bucket policies: JSON-based policies controlling who can access which objects
IAM roles: Attach permissions to EC2 instances or Lambda functions without hard-coding credentials
S3 Block Public Access: Setting to prevent any accidental public exposure — critical for sensitive data
Versioning: Maintain all versions of every object — essential for AI dataset lineage and rollback
Server-side encryption: Automatic AES-256 encryption at rest with S3-managed keys (SSE-S3), or encryption with keys managed in AWS KMS (SSE-KMS)
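
As an example of what a JSON bucket policy looks like, the sketch below builds one common pattern: denying any request that does not arrive over HTTPS. The bucket name and function name are hypothetical, and this is one illustrative policy rather than a complete security configuration.

```python
import json


def tls_only_policy(bucket_name):
    """Build a bucket policy that denies any request not made over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "my-bucket" is a placeholder name.
print(json.dumps(tls_only_policy("my-bucket"), indent=2))
```

The resulting JSON string is the form that gets attached to a bucket, for instance through the console's bucket policy editor or a boto3 client's `put_bucket_policy` call.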

Strengths

Scale: No pre-provisioning; store from bytes to exabytes
Durability: 99.999999999% durability (in most storage classes, objects are stored redundantly across multiple availability zones; One Zone-IA uses a single zone)
AWS ecosystem integration: Native integration with most AWS services across compute, analytics, AI/ML, and streaming
Cost efficiency at scale: Very low cost for cold data (Glacier Deep Archive is about $1/TB/month)
Ecosystem de facto standard: S3-compatible APIs are implemented by many cloud storage providers and on-premises systems

Limitations & Considerations

Not for end users: S3 has no consumer UI for everyday file management — it requires AWS console knowledge or programmatic access via SDK/CLI
Data transfer egress costs: Downloading large amounts of data out of AWS to the internet can become expensive
No collaborative editing: S3 stores files; it has no Docs-style editing, search, or preview
Complexity for newcomers: AWS IAM policies, bucket permissions, and lifecycle rules have a learning curve

Best Use Cases

Task	Why S3
AI training dataset storage	Scalable, durable, directly accessible by SageMaker and custom training code
Application asset hosting	Images, videos, static files served via CloudFront CDN
Database and system backups	Automated backup to S3 Standard-IA or Glacier
Data lake foundation	Store raw and processed data for Athena SQL queries and analytics
Model checkpoint storage	Save large model weights between training runs
Log archival	Compress and archive application logs to Glacier Deep Archive cheaply

When to choose alternatives:

Personal or team productivity file storage → Google Drive or OneDrive
Simple cloud backup for consumers → Backblaze B2 (simpler and cheaper egress)
Non-AWS developer storage → Google Cloud Storage or Azure Blob Storage
Team file collaboration → Dropbox or SharePoint

Getting Started

Create an AWS account at aws.amazon.com — free tier includes 5GB S3 Standard for 12 months
Open the S3 console and click Create Bucket — choose a globally unique name and a region close to your users
Keep Block Public Access enabled for private buckets (it is on by default for new buckets)
Upload a file via the console or use the AWS CLI: aws s3 cp myfile.txt s3://my-bucket/
Configure a Lifecycle Rule to transition objects to Glacier after 90 days if they won't be frequently accessed
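
The lifecycle rule in the last step can also be set programmatically. The sketch below builds a configuration that transitions objects to Glacier after 90 days, in the shape a boto3-style S3 client expects for `put_bucket_lifecycle_configuration`. The function names, rule ID, and bucket name are hypothetical, and the client is passed in so the configuration can be inspected without AWS credentials.

```python
def glacier_after_days(days=90, prefix=""):
    """Build a lifecycle configuration that transitions objects to Glacier."""
    return {
        "Rules": [
            {
                "ID": f"to-glacier-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = every object
                # "GLACIER" is the API name for S3 Glacier Flexible Retrieval.
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, config):
    """Send the configuration to S3 using a boto3-style client."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = glacier_after_days(90)
print(config["Rules"][0]["Transitions"])

# Typical use, assuming boto3 is installed and credentials are configured:
#   import boto3
#   apply_lifecycle(boto3.client("s3"), "my-bucket", config)
```

Note that this call replaces the bucket's existing lifecycle configuration, so any rules already in place need to be included in the new one.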

📝Note

S3-compatible storage: Many storage systems implement the S3 API — including Cloudflare R2 (no egress fees), Backblaze B2, MinIO (self-hosted), and DigitalOcean Spaces. This means code written against S3 can often be redirected to these alternatives with minimal changes, which is useful for cost optimization or compliance requirements.
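
In practice, the redirect often comes down to the endpoint the client is pointed at. The sketch below is a minimal illustration, assuming boto3, whose `client` function accepts `region_name` and `endpoint_url` keyword arguments. The helper name and the local MinIO address are hypothetical, and each provider documents its own endpoint, region, and credential requirements.

```python
def client_settings(endpoint_url=None, region="us-east-1"):
    """Return keyword arguments for boto3.client("s3", **settings)."""
    settings = {"region_name": region}
    if endpoint_url:
        # Pointing at a non-AWS endpoint is what redirects the same code
        # to an S3-compatible service.
        settings["endpoint_url"] = endpoint_url
    return settings


aws_settings = client_settings()
minio_settings = client_settings("http://localhost:9000")  # hypothetical local MinIO
print(aws_settings)
print(minio_settings)

# Typical use, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client("s3", **minio_settings)
```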

Key Takeaways

Amazon S3 is object storage infrastructure — not a consumer productivity tool but the storage backbone of the web and AI ecosystem
It stores objects in buckets, scales to virtually unlimited size, and integrates natively with most AWS services, including SageMaker, Bedrock, Lambda, and Athena
Multiple storage classes (Standard, Intelligent-Tiering, Glacier) allow dramatic cost optimization based on how frequently data is accessed
S3 is foundational to AI/ML workflows: training datasets, model checkpoints, data lakes, and RAG document stores are all typically S3-backed
For AI developers, comfort with S3 (and AWS IAM for access control) is an essential infrastructure skill
