- Overview
- Architecture Diagram
- Core Components
- 1. Data Ingestion Layer
- 2. Document Processing & Extraction
- 3. Chunking & Text Processing
- 4. Embedding Generation
- 5. Vector Database
- 6. LLM Integration
- 7. API & Application Layer
- 8. Caching Layer
- 9. Monitoring & Observability
- 10. Security & Governance
- 11. Deployment & CI/CD
- Architecture Flow
- Design Patterns
- Cost Optimization
This document outlines a complete, production-ready RAG (Retrieval-Augmented Generation) system built entirely on AWS-native services. The architecture is designed to be:
- Serverless-first: Leveraging Lambda, API Gateway, and managed services
- Scalable: Auto-scaling capabilities across all layers
- Cost-effective: Pay-per-use model with intelligent caching
- Secure: End-to-end encryption and fine-grained access control
- Observable: Comprehensive monitoring and tracing
┌─────────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ Web App │ Mobile App │ API Clients │ Chatbot Interface │
└────────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Amazon API Gateway + AWS WAF │
│ (Rate Limiting, Authentication, DDoS Protection) │
└────────────────────────────────┬────────────────────────────────────┘
│
┌────────────────┴────────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌────────────────────────┐
│ INGESTION PIPELINE │ │ QUERY PIPELINE │
│ (Offline) │ │ (Real-time) │
└────────────────────┘ └────────────────────────┘
┌──────────────────┐
│ Amazon S3 │ ← Document Upload
│ (Raw Storage) │
└────────┬─────────┘
│ (S3 Event Notification)
▼
┌────────────────────┐
│ EventBridge Rule │
└────────┬───────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AWS Step Functions (Orchestration) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Step 1: Document Processing │ │
│ │ Lambda + Textract → Extract Text │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 2: Metadata Extraction │ │
│ │ Lambda + Comprehend → NER, Entities, Key Phrases │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 3: Text Chunking │ │
│ │ Lambda → Semantic/Recursive/Fixed Chunking │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 4: Embedding Generation │ │
│ │ Lambda + Bedrock → Generate Vector Embeddings │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 5: Vector Storage │ │
│ │ Lambda → Index in OpenSearch Service │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────┐
│ Amazon OpenSearch Service │
│ - Vector Index │
│ - BM25 Lexical Index │
│ - Metadata Filters │
└────────────────────────────┘
┌──────────────────┐
│ User Query │
└────────┬─────────┘
│
▼
┌────────────────────────────────┐
│ Amazon API Gateway │
│ - Cognito Authorization │
│ - Request Validation │
│ - Rate Limiting │
└────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AWS Lambda (Query Handler) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Step 1: Cache Check │ │
│ │ ElastiCache Redis → Check for cached response │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ (Cache Miss) │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 2: Query Embedding │ │
│ │ Bedrock → Generate query vector │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 3: Hybrid Search │ │
│ │ OpenSearch → Vector + BM25 + Metadata Filters │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 4: Reranking (Optional) │ │
│ │ Bedrock/SageMaker → Cohere Rerank │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 5: Context Assembly │ │
│ │ Retrieve chunks + Apply MMR/Diversity │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 6: LLM Generation │ │
│ │ Bedrock → Claude/Llama/Mistral with Context │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 7: Cache Response │ │
│ │ ElastiCache → Store for future queries │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────┐
│ Response to Client │
└──────────────────────┘
| Component | Purpose | AWS Services |
|---|---|---|
| Storage | Document and data persistence | S3, DynamoDB |
| Processing | Document transformation and enrichment | Lambda, Step Functions, Textract, Comprehend |
| Embedding | Vector generation | Bedrock, SageMaker |
| Retrieval | Vector and hybrid search | OpenSearch Service, Aurora PostgreSQL |
| Generation | Answer synthesis | Bedrock (Claude, Llama, etc.) |
| Serving | API endpoints and routing | API Gateway, Lambda, App Runner |
| Caching | Performance optimization | ElastiCache (Redis) |
| Monitoring | System observability | CloudWatch, X-Ray, Grafana |
| Security | Access control and encryption | IAM, Cognito, KMS, WAF, Secrets Manager |
Collect, receive, and route documents from various sources into the processing pipeline.
- Role: Primary storage for all documents
- Configuration:
  - Bucket structure: `raw/`, `processed/`, `failed/`, `archives/`
  - Versioning enabled for audit trail
  - Lifecycle policies for cost optimization
  - Server-side encryption (SSE-KMS)
  - Event notifications to EventBridge
- Storage Classes:
- S3 Standard for active documents
- S3 Intelligent-Tiering for variable access patterns
- S3 Glacier for archival
- Role: Event-driven orchestration trigger
- Configuration:
- Rules to detect S3 object creation events
- Pattern matching by file type/prefix
- Route events to Step Functions
- Dead letter queue for failed events
- Event Patterns:
{ "source": ["aws.s3"], "detail-type": ["Object Created"], "detail": { "bucket": {"name": ["rag-documents-bucket"]} } }
- Role: Automated data transfer from on-premises/other clouds
- Use Cases:
- Scheduled sync from file servers
- Migration of existing document repositories
- Continuous sync from network shares
- Role: SFTP/FTP/FTPS endpoint for document upload
- Use Cases:
- Third-party integrations requiring FTP
- Legacy systems integration
- Secure file transfer workflows
External Sources → S3 Bucket → EventBridge → Step Functions
↓
Versioning
↓
Lifecycle Rules
Extract text, metadata, and structured information from various document formats.
- Role: Serverless compute for document processing logic
- Functions:
  - `document-classifier`: Detect document type and route
  - `text-extractor`: Extract text using the appropriate method
  - `metadata-enricher`: Add document metadata
  - `error-handler`: Process failed documents
- Configuration:
- Runtime: Python 3.11+
- Memory: 1024-3008 MB (based on document size)
- Timeout: 5-15 minutes (for large documents)
- Layers: Common libraries (boto3, PyPDF2, python-docx)
- VPC: Configured if accessing OpenSearch/ElastiCache
- Environment variables: S3 buckets, service endpoints
- Role: Extract text and structured data from documents
- Capabilities:
- Text detection and OCR
- Table extraction
- Form data extraction
- Handwriting recognition
- Document layout analysis
- Supported Formats:
- PDF, PNG, JPEG, TIFF
- Scanned documents
- Multi-page documents
- API Operations:
  - `DetectDocumentText`: Simple text extraction
  - `AnalyzeDocument`: Tables and forms
  - `StartDocumentAnalysis`: Async for large docs
- Role: Natural language processing and metadata extraction
- Capabilities:
- Entity recognition (people, places, organizations)
- Key phrase extraction
- Sentiment analysis
- Language detection
- PII detection and redaction
- Topic modeling
- Custom entity recognition
- Use Cases:
- Enrich chunks with entities for filtering
- Detect document categories
- Extract key topics for metadata
- Identify sensitive information
- Role: Orchestrate multi-step document processing workflow
- State Machine Components:
- Parallel States: Process multiple documents concurrently
- Choice States: Route based on document type
- Task States: Invoke Lambda functions
- Retry/Catch: Error handling and retries
- Map State: Batch processing
- Workflow Definition:
Start → Validate → Extract Text → Extract Metadata → Chunk → Embed → Index → Success
            ↓            ↓                ↓             ↓        ↓        ↓
          Error        Error            Error         Error    Error    Error
            ↓            ↓                ↓             ↓        ↓        ↓
        Retry/DLQ    Retry/DLQ        Retry/DLQ     Retry/DLQ  Retry/DLQ  DLQ
| Step | Input | Processing | Output | AWS Service |
|---|---|---|---|---|
| Validation | Raw file | Check format, size, corruption | Valid file or error | Lambda |
| Text Extraction | PDF/Image | OCR, text extraction | Plain text | Lambda + Textract |
| Metadata Extraction | Text | NER, key phrases, entities | JSON metadata | Lambda + Comprehend |
| Quality Check | Text | Validate extraction quality | Approved text | Lambda |
| Storage | Processed data | Store intermediate results | S3 location | S3 |
- Retry Logic: 3 attempts with exponential backoff (sketched below)
- Dead Letter Queue: SNS topic for failed documents
- Fallback: Alternative extraction methods
- Logging: CloudWatch Logs for debugging
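To make the retry behavior concrete, here is a minimal sketch of one Task state from the ingestion state machine, expressed as a Python dict in Amazon States Language form. The function ARN, state names, and DLQ handler are illustrative, not taken from this document:

```python
# One Task state with exponential-backoff retries and a catch-all route
# to a failure handler that publishes to the dead letter queue.
extract_text_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:text-extractor",
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,    # initial delay before the first retry
        "BackoffRate": 2.0,      # delay doubles on each attempt
        "MaxAttempts": 3
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "PublishToDLQ"   # failed documents are kept for analysis
    }],
    "Next": "ExtractMetadata"
}
```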
Split documents into optimal-sized chunks for embedding and retrieval.
- Role: Execute chunking algorithms
- Functions:
  - `semantic-chunker`: Content-aware splitting
  - `recursive-chunker`: Hierarchical splitting
  - `fixed-chunker`: Simple size-based splitting
  - `sentence-window-chunker`: Sentence-based with context windows
- Configuration:
- Runtime: Python 3.11
- Memory: 512-1024 MB
- Libraries: langchain, llama-index, spaCy, nltk
- Method: Split by character/token count
- Parameters:
- Chunk size: 500-1000 tokens
- Overlap: 50-200 tokens (10-20%)
- Use Case: Simple documents, consistent structure
- Implementation:
```python
# Minimal fixed-size splitter (character-based for simplicity)
def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text(text, chunk_size=1000, overlap=200)
```
- Method: Split by semantic boundaries (paragraphs, sections)
- Parameters:
- Min chunk size: 200 tokens
- Max chunk size: 1500 tokens
- Boundary markers: headers, paragraph breaks
- Use Case: Structured documents with clear sections
- Implementation:
- Detect section headers
- Respect paragraph boundaries
- Maintain semantic coherence
- Method: Hierarchical splitting with multiple separators
- Separators Priority:
  1. `\n\n` (paragraph breaks)
  2. `\n` (line breaks)
  3. `.` (sentences)
  4. ` ` (words)
- Use Case: Mixed content types
- Library: LangChain `RecursiveCharacterTextSplitter` (example below)
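A minimal sketch using LangChain's splitter with the separator priority above; the chunk sizes reuse the example values from the fixed-size strategy:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],  # priority order from the list above
    chunk_size=1000,     # target chunk size in characters
    chunk_overlap=200,   # overlap between adjacent chunks
)

document_text = "..."  # plain text produced by the extraction step
chunks = splitter.split_text(document_text)
```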
- Method: Each chunk is a sentence with surrounding context
- Parameters:
- Window size: 1-3 sentences before/after
- Overlap: Natural sentence boundaries
- Use Case: Q&A systems, precise retrieval
- Advantage: Better context preservation (sketch below)
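A minimal sketch of the sentence-window idea using nltk; the function name and default window size are illustrative:

```python
import nltk

def sentence_window_chunks(text: str, window: int = 2) -> list[dict]:
    """Each chunk is one sentence plus `window` sentences of context per side."""
    sentences = nltk.sent_tokenize(text)  # requires nltk.download('punkt')
    chunks = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        chunks.append({
            'sentence': sentence,                      # unit that gets embedded
            'window': ' '.join(sentences[start:end]),  # context handed to the LLM
        })
    return chunks
```

Retrieval matches against the precise sentence, while the surrounding window is what gets placed into the prompt.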
Each chunk includes:
```json
{
  "chunk_id": "doc123_chunk_5",
  "document_id": "doc123",
  "chunk_index": 5,
  "content": "chunk text...",
  "chunk_size": 523,
  "start_position": 2500,
  "end_position": 3023,
  "metadata": {
    "document_title": "...",
    "section": "Introduction",
    "page_number": 3,
    "entities": ["AWS", "Lambda"],
    "key_phrases": ["serverless computing"],
    "parent_chunk": "doc123_chunk_4",
    "child_chunks": ["doc123_chunk_5_1", "doc123_chunk_5_2"]
  }
}
```

- Intermediate: S3 bucket (`processed/chunks/`)
- Final: Included in OpenSearch document
- Format: JSON with metadata
Convert text chunks into dense vector representations for semantic search.
- Role: Managed service for foundation model APIs
- Embedding Models Available:

| Model | Dimensions | Max Input | Use Case |
|---|---|---|---|
| Titan Embeddings v2 | 256, 512, 1024 | 8192 tokens | General purpose, multilingual |
| Titan Multimodal Embeddings | 384, 1024 | Text + Images | Multi-modal RAG |
| Cohere Embed English v3 | 1024 | 512 tokens | English text, high quality |
| Cohere Embed Multilingual v3 | 1024 | 512 tokens | 100+ languages |

- Configuration:
  - Model ID: `amazon.titan-embed-text-v2:0`
  - Dimensions: 1024 (configurable)
  - Batch size: Up to 100 texts per API call
  - Input type: `search_document` for indexing, `search_query` for queries
- Role: Deploy custom embedding models
- Use Cases:
- Fine-tuned embeddings for domain-specific data
- Custom models (BERT, sentence-transformers)
- Models not available in Bedrock
- Deployment:
- Real-time endpoint for synchronous inference
- Asynchronous endpoint for batch processing
- Serverless inference for variable workloads
- Popular Models:
  - `sentence-transformers/all-MiniLM-L6-v2`
  - `sentence-transformers/all-mpnet-base-v2`
  - Custom fine-tuned BERT variants
- Role: Batch processing for millions of documents
- Configuration:
- Compute environments: Spot instances for cost savings
- Job queues: Priority-based processing
- Job definitions: Container with embedding logic
- Use Case: Initial bulk indexing, reindexing
```python
import json
import boto3

bedrock_client = boto3.client('bedrock-runtime')

def lambda_handler(event, context):
    # Input: chunks from the previous Step Functions step
    chunks = event['chunks']

    # Titan v2's InvokeModel takes a single inputText per request, so iterate;
    # throughput comes from concurrent Lambda invocations over chunk batches
    embeddings = []
    for chunk in chunks:
        response = bedrock_client.invoke_model(
            modelId='amazon.titan-embed-text-v2:0',
            body=json.dumps({
                'inputText': chunk['content'],
                'dimensions': 1024,
                'normalize': True
            })
        )
        body = json.loads(response['body'].read())
        embeddings.append(body['embedding'])

    # Return chunks paired with their embeddings
    return {
        'chunks_with_embeddings': [
            {**chunk, 'embedding': emb}
            for chunk, emb in zip(chunks, embeddings)
        ]
    }
```

- Cache Embeddings: Store in ElastiCache (a sketch follows this list)
- Key: Hash of chunk text
- TTL: Long-lived (weeks/months)
- Benefit: Avoid regeneration for duplicate content
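A minimal sketch of that cache with redis-py, assuming a `bedrock_embed` helper that wraps the Bedrock call shown above; the cluster endpoint is a placeholder:

```python
import hashlib
import json
import redis

r = redis.Redis(host='rag-cache.xxxxxx.cache.amazonaws.com', port=6379)  # placeholder endpoint

def get_or_create_embedding(text: str) -> list[float]:
    key = "embedding:" + hashlib.sha256(text.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)            # cache hit: skip Bedrock entirely
    embedding = bedrock_embed(text)          # assumed helper wrapping InvokeModel
    r.set(key, json.dumps(embedding), ex=30 * 24 * 3600)  # long TTL (~1 month)
    return embedding
```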
- Batch Size: 50-100 chunks per API call
- Parallel Batches: Multiple concurrent Lambda invocations
- Rate Limiting: Implement exponential backoff
- SQS Queue: Queue chunks for embedding (producer sketched below)
- Worker Lambdas: Poll queue and process
- Scalability: Auto-scale based on queue depth
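A minimal sketch of the producer side with boto3, assuming a hypothetical queue named `rag-embedding-queue` and the `chunks` list from the chunking step:

```python
import json
import boto3

sqs = boto3.client('sqs')
queue_url = sqs.get_queue_url(QueueName='rag-embedding-queue')['QueueUrl']  # hypothetical queue

chunks = [{'chunk_id': 'doc123_chunk_5', 'content': '...'}]  # from the chunking step

# SQS accepts at most 10 messages per batch call
for i in range(0, len(chunks), 10):
    batch = chunks[i:i + 10]
    sqs.send_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {'Id': str(j), 'MessageBody': json.dumps(chunk)}
            for j, chunk in enumerate(batch)
        ]
    )
```

Worker Lambdas then consume via an SQS event source mapping, which scales concurrency with queue depth.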
- Bedrock Pricing: Pay per input token
- Titan v2: ~$0.0001 per 1K tokens
- Optimization:
- Cache embeddings
- Use smaller dimensions if sufficient (512 vs 1024)
- Batch requests to reduce API calls
Store and efficiently search vector embeddings with metadata filtering.
- Role: Primary vector database with hybrid search capabilities
- Key Features:
  - Vector Engine: k-NN plugin with HNSW/IVF algorithms
  - Lexical Search: BM25 for keyword matching
  - Hybrid Search: Combine vector + keyword results
  - Metadata Filtering: Pre-filter before vector search
  - Scalability: Auto-scaling, multi-AZ deployment
- Index Configuration:

```json
{
  "mappings": {
    "properties": {
      "chunk_id": {"type": "keyword"},
      "content": {"type": "text", "analyzer": "standard"},
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          }
        }
      },
      "metadata": {
        "properties": {
          "document_id": {"type": "keyword"},
          "source": {"type": "keyword"},
          "timestamp": {"type": "date"},
          "category": {"type": "keyword"},
          "entities": {"type": "keyword"}
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 3,
      "number_of_replicas": 2
    }
  }
}
```

- Search Types:

1. Vector Search (k-NN)

```json
{
  "size": 10,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.1, 0.2, ...],
        "k": 10
      }
    }
  }
}
```

2. Hybrid Search (Vector + BM25)

```json
{
  "size": 10,
  "query": {
    "hybrid": {
      "queries": [
        {"knn": {"embedding": {"vector": [...], "k": 50}}},
        {"match": {"content": "query text"}}
      ]
    }
  }
}
```

3. Filtered Search

```json
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {"knn": {"embedding": {"vector": [...], "k": 10}}}
      ],
      "filter": [
        {"term": {"metadata.category": "technical"}},
        {"range": {"metadata.timestamp": {"gte": "2024-01-01"}}}
      ]
    }
  }
}
```

- Cluster Sizing:
  - Development: 1 master, 2 data nodes (t3.small)
  - Production: 3 masters, 3+ data nodes (r6g.xlarge)
  - Storage: 100 GB - 10 TB+ depending on corpus size
  - Instance types: r6g (memory optimized) for vector search
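For reference, a sketch of executing the filtered search above with the opensearch-py client; the endpoint and index name are placeholders, and authentication (e.g., SigV4 signing) is omitted for brevity:

```python
from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{'host': 'vpc-rag-vectors.us-east-1.es.amazonaws.com', 'port': 443}],  # placeholder
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

query_vector = [0.1] * 1024  # placeholder; normally produced by the embedding step

response = client.search(index='rag-chunks', body={
    "size": 10,
    "query": {
        "bool": {
            "must": [{"knn": {"embedding": {"vector": query_vector, "k": 10}}}],
            "filter": [{"term": {"metadata.category": "technical"}}],
        }
    },
})
hits = response['hits']['hits']  # each hit carries _score and the _source document
```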
- Role: Relational database with vector extension
- Key Features:
  - Native SQL queries
  - ACID transactions
  - Familiar database paradigm
  - Vector similarity search via pgvector extension
- Configuration:

```sql
-- Enable extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE embeddings (
    id SERIAL PRIMARY KEY,
    chunk_id VARCHAR(255),
    content TEXT,
    embedding vector(1024),
    metadata JSONB
);

-- Create index
CREATE INDEX ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Search query
SELECT chunk_id, content,
       1 - (embedding <=> query_vector) AS similarity
FROM embeddings
WHERE metadata->>'category' = 'technical'
ORDER BY embedding <=> query_vector
LIMIT 10;
```

- Use Case:
  - Smaller datasets (<1M vectors)
  - Need for ACID transactions
  - SQL-first organizations
- Role: Ultra-low latency vector search
- Key Features:
  - Redis 7.0+ with vector similarity search
  - In-memory performance
  - Persistence and durability
  - Microsecond latency
- Use Case:
  - Real-time applications requiring <10ms latency
  - High-throughput scenarios
  - Smaller vector datasets
Step Functions → Lambda (Batch Index) → OpenSearch
- Batch size: 500-1000 documents
- Parallel indexing: 5-10 concurrent Lambdas
- Refresh interval: 30s during bulk load
New Document → Process → Embed → Lambda → OpenSearch
- Real-time indexing
- Refresh interval: 1s
- Immediate availability
- Create new index with updated mapping
- Keep the read alias pointed at the old index
- Bulk reindex into the new index
- Atomic alias switch (sketched below)
- Delete old index
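A sketch of the atomic alias switch with opensearch-py; the index and alias names are hypothetical, and client setup follows the earlier example:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'vpc-rag-vectors.us-east-1.es.amazonaws.com', 'port': 443}], use_ssl=True)  # placeholder

# One atomic alias operation: readers never see an empty or half-built index
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "rag-chunks-v1", "alias": "rag-chunks"}},
        {"add": {"index": "rag-chunks-v2", "alias": "rag-chunks"}},
    ]
})

# Only after validating the new index
client.indices.delete(index="rag-chunks-v1")
```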
| Parameter | Development | Production |
|---|---|---|
| ef_construction | 256 | 512 |
| M | 8 | 16 |
| Shards | 1 | 3-5 |
| Replicas | 1 | 2-3 |
| Refresh Interval | 5s | 1s |
Generate natural language responses using retrieved context.
- Role: Managed LLM inference service
- Available Models:

| Model | Context Window | Use Case | Cost (per 1M tokens) |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K tokens | General purpose, best quality | Input: $3, Output: $15 |
| Claude 3.5 Haiku | 200K tokens | Fast, cost-effective | Input: $0.25, Output: $1.25 |
| Claude 3 Opus | 200K tokens | Complex reasoning | Input: $15, Output: $75 |
| Llama 3.1 70B | 128K tokens | Open source alternative | Input: $0.99, Output: $0.99 |
| Mistral Large | 128K tokens | Multilingual, technical | Input: $4, Output: $12 |
| Amazon Titan Text | 32K tokens | AWS-native, cost-effective | Input: $0.50, Output: $1.50 |

- API Configuration:

```python
# Pseudocode
response = bedrock_runtime.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 4096,
        'temperature': 0.7,
        'top_p': 0.9,
        'messages': [
            {'role': 'user', 'content': prompt}
        ],
        'system': system_prompt
    })
)
```
- Features:
  - Streaming responses (sketched below)
  - System prompts for behavior control
  - Guardrails for content safety
  - Prompt caching (reduced cost for repeated context)
  - Multi-modal support (Claude 3.5 Sonnet)
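A minimal streaming sketch with the Bedrock Runtime API; the prompt is a placeholder, and the event shapes follow Anthropic's streaming format on Bedrock:

```python
import json
import boto3

bedrock_runtime = boto3.client('bedrock-runtime')
prompt = "What is AWS Lambda?"  # placeholder; normally the assembled RAG prompt

response = bedrock_runtime.invoke_model_with_response_stream(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [{'role': 'user', 'content': prompt}],
    })
)

# Each stream event carries a JSON-encoded Anthropic event; emit text deltas as they arrive
for event in response['body']:
    chunk = json.loads(event['chunk']['bytes'])
    if chunk.get('type') == 'content_block_delta':
        print(chunk['delta'].get('text', ''), end='', flush=True)
```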
- Role: Deploy custom or open-source LLMs
- Use Cases:
  - Fine-tuned models on proprietary data
  - Models not available in Bedrock
  - Custom inference logic
  - Cost optimization with reserved capacity
- Deployment Options:
  - Real-time Endpoints: ml.g5.xlarge, ml.g5.2xlarge
  - Serverless Inference: Auto-scaling, pay-per-use
  - Async Endpoints: Batch inference, queue-based
- Popular Models:
  - Llama 2/3 (7B, 13B, 70B)
  - Mistral (7B, 8x7B)
  - Falcon (7B, 40B)
  - Fine-tuned GPT variants
You are a helpful AI assistant. Answer questions based ONLY on the provided context.
Guidelines:
- If the context doesn't contain the answer, say "I don't have enough information"
- Cite sources by mentioning the document name
- Be concise and accurate
- Do not make up information
Context:
{retrieved_chunks}
Question: {user_question}
Instructions:
- Answer based only on the context above
- Cite the relevant document sections
- If uncertain, acknowledge limitations
Answer:
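One way the template above might be assembled in code; the helper name is illustrative, and the chunk fields follow the chunk metadata structure from the chunking section:

```python
def build_prompt(retrieved_chunks: list[dict], user_question: str) -> str:
    # Label each chunk with its source document so the model can cite it
    context = "\n\n".join(
        f"[{chunk['metadata']['document_title']}] {chunk['content']}"
        for chunk in retrieved_chunks
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {user_question}\n\n"
        "Instructions:\n"
        "- Answer based only on the context above\n"
        "- Cite the relevant document sections\n"
        "- If uncertain, acknowledge limitations\n\n"
        "Answer:"
    )
```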
- Few-Shot Prompting: Include examples in system prompt
- Chain-of-Thought: Request step-by-step reasoning
- Citations: Force structured source attribution
- Hedging: Encourage uncertainty expression
Retrieve → Build Prompt → LLM → Response
- Single API call
- Fastest approach
- Best for simple queries
Retrieve → Generate Draft → Re-retrieve → Refine → Final Response
- Multiple LLM calls
- Better accuracy
- Higher latency
Query → Agent → [Tool: Search] → [Tool: Calculator] → [Tool: Database] → Response
- Multi-step reasoning
- External tool integration
- Complex workflows
- Bedrock Feature: Cache system prompt and context
- Benefit: 90% cost reduction on cached tokens
- TTL: 5 minutes
- Use Case: Repeated queries with same context
- Simple queries: Use Haiku (25x cheaper than Opus)
- Complex reasoning: Use Sonnet
- Fallback strategy: Try Haiku first, escalate if needed (sketched below)
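A minimal sketch of that escalation, assuming an `invoke_llm` helper that wraps the Bedrock call; treating the canned refusal string as the low-confidence signal is a deliberately simple stand-in for a real confidence check:

```python
def generate_with_fallback(prompt: str) -> str:
    # Try the cheap model first (model IDs are current Bedrock identifiers)
    answer = invoke_llm('anthropic.claude-3-5-haiku-20241022-v1:0', prompt)
    # Escalate only when the small model declines to answer
    if "I don't have enough information" in answer:
        answer = invoke_llm('anthropic.claude-3-5-sonnet-20241022-v2:0', prompt)
    return answer
```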
- Limit retrieved chunks: 5-10 most relevant
- Summarize long documents: Pre-process with Haiku
- Compress context: Remove redundant information
Expose RAG functionality through secure, scalable APIs.
- Role: Managed API endpoint
- API Types:
  - REST API
    - Traditional request-response
    - Full API management features
    - Custom domains, API keys
  - HTTP API
    - Lower latency, lower cost
    - Simpler configuration
    - JWT/OAuth 2.0 support
  - WebSocket API
    - Bidirectional communication
    - Streaming responses
    - Chat applications
- Configuration:

```yaml
Endpoints:
  - POST /query
  - POST /ingest
  - GET /document/{id}
  - GET /health
  - GET /metrics
Features:
  - Request validation
  - Response transformation
  - CORS configuration
  - API keys / Usage plans
  - Rate limiting (1000 req/sec per key)
  - Request/response logging
```

- Integration Types:
  - Lambda Proxy: Full request/response control
  - Lambda Non-Proxy: Custom mapping templates
  - HTTP Proxy: Forward to other services
  - VPC Link: Private integrations
- Security:
  - AWS WAF integration
  - Cognito authorizers
  - IAM authorization
  - API keys and usage plans
  - Resource policies
- Role: Business logic execution
- Key Functions:

| Function | Purpose | Trigger | Timeout |
|---|---|---|---|
| `query-handler` | Process user queries | API Gateway | 30s |
| `ingest-trigger` | Initiate document ingestion | API Gateway | 5s |
| `document-status` | Check processing status | API Gateway | 3s |
| `feedback-collector` | Store user feedback | API Gateway | 5s |

- Configuration Best Practices:
  - Memory: 1024-3008 MB (CPU scales proportionally)
  - Timeout: Query: 30s, Ingest: 5s
  - Concurrency: Reserved for production (avoid cold starts)
  - VPC: Only if accessing VPC resources (OpenSearch, ElastiCache)
  - Environment Variables: Service endpoints, config
  - Layers: Shared dependencies (opensearch-py, redis-py)
- Query Handler Lambda Flow:

```python
import json

def lambda_handler(event, context):
    # 1. Parse request (proxy integration delivers the body as a JSON string)
    body = json.loads(event['body'])
    query = body['query']
    filters = body.get('filters', {})

    # 2. Check cache
    cached = check_redis_cache(query)
    if cached:
        return cached

    # 3. Generate query embedding
    query_vector = bedrock_embed(query)

    # 4. Search OpenSearch
    results = opensearch_hybrid_search(
        vector=query_vector,
        text=query,
        filters=filters,
        top_k=10
    )

    # 5. Build context (named to avoid shadowing the handler's `context` argument)
    context_text = build_context(results)

    # 6. Generate response
    response = bedrock_generate(query=query, context=context_text)

    # 7. Cache result
    cache_response(query, response)

    # 8. Return
    return {
        'statusCode': 200,
        'body': json.dumps({
            'answer': response,
            'sources': results,
            'cached': False
        })
    }
```
- Role: Deploy containerized RAG application
- Use Cases:
  - FastAPI/Flask applications
  - Complex dependencies
  - Persistent connections
  - Custom runtime requirements
- Configuration:

```yaml
Service:
  Name: rag-api-service
  Source:
    Type: ECR
    Image: account.dkr.ecr.region.amazonaws.com/rag-api:latest
  Instance:
    CPU: 1 vCPU
    Memory: 2 GB
  AutoScaling:
    MinSize: 1
    MaxSize: 10
    TargetConcurrency: 100
  HealthCheck:
    Path: /health
    Interval: 10s
```

- Advantages:
  - No cold starts
  - WebSocket support
  - Long-running connections
  - Easier debugging
- Role: Container orchestration without servers
- Use Cases:
- Microservices architecture
- Complex multi-container applications
- Need for load balancing
- Blue-green deployments
Request:
{
"query": "What is AWS Lambda?",
"filters": {
"category": "documentation",
"date_range": {"start": "2024-01-01", "end": "2024-12-31"}
},
"options": {
"top_k": 5,
"rerank": true,
"stream": false
}
}
Response:
{
"answer": "AWS Lambda is a serverless compute service...",
"sources": [
{
"chunk_id": "doc123_chunk_5",
"content": "...",
"document": "aws-lambda-guide.pdf",
"score": 0.92,
"page": 3
}
],
"metadata": {
"latency_ms": 234,
"cached": false,
"model": "claude-3-5-sonnet",
"tokens_used": 1523
}
}

Request:
{
"document_url": "s3://bucket/document.pdf",
"metadata": {
"category": "technical",
"author": "John Doe",
"tags": ["AWS", "serverless"]
}
}
Response:
{
"job_id": "ingest-12345",
"status": "processing",
"estimated_completion": "2024-01-30T10:30:00Z"
}

Response:
{
"document_id": "doc123",
"status": "completed",
"progress": {
"extraction": "completed",
"chunking": "completed",
"embedding": "completed",
"indexing": "completed"
},
"chunks_created": 45,
"processing_time_seconds": 12.3
}

Reduce latency and costs by caching embeddings, queries, and responses.
- Role: In-memory caching for low-latency access
- Cache Types:

1. Query Embedding Cache
   - Key: hash(query_text)
   - Value: embedding_vector
   - TTL: 7 days
   - Purpose: Avoid regenerating embeddings for repeat queries

2. Query Result Cache
   - Key: hash(query + filters)
   - Value: {results, context, metadata}
   - TTL: 1 hour
   - Purpose: Return cached results for identical queries

3. LLM Response Cache
   - Key: hash(query + context + model)
   - Value: generated_response
   - TTL: 24 hours
   - Purpose: Avoid LLM calls for identical questions

4. Session Cache
   - Key: session_id
   - Value: conversation_history
   - TTL: 30 minutes
   - Purpose: Multi-turn conversations

- Cluster Configuration:

Development:

```yaml
Node Type: cache.t3.micro
Nodes: 1
Engine: Redis 7.0
Multi-AZ: false
Encryption: In-transit only
```

Production:

```yaml
Node Type: cache.r6g.large
Nodes: 3 (1 primary, 2 replicas)
Engine: Redis 7.0
Multi-AZ: true
Encryption: At-rest + In-transit
Backup: Automated snapshots
```

- Eviction Policy: `allkeys-lru` (Least Recently Used)
- Role: CDN for API responses and static assets
- Use Cases:
  - Cache GET endpoints globally
  - Reduce API Gateway costs
  - Improve response times for geographically distributed users
  - Serve static assets (UI, documentation)
- Configuration:

```yaml
Origin: API Gateway
Behaviors:
  - PathPattern: /query
    CachePolicy: Disabled (dynamic content)
  - PathPattern: /health
    CachePolicy: CachingOptimized (5 min TTL)
  - PathPattern: /static/*
    CachePolicy: CachingOptimized (24 hour TTL)
```
Request → CloudFront (if enabled) → API Gateway → Lambda
↓
Check ElastiCache
↓
Cache Hit → Return
↓
Cache Miss → Process
↓
Update Cache → Return
Query Cache Key:
cache_key = f"query:{hash(query_text)}:{hash(filters)}:{model}"
# Example: "query:a3f5e9:b2c4d6:claude-3-5-sonnet"Embedding Cache Key:
cache_key = f"embedding:{hash(text)}:{model}:{dimensions}"
# Example: "embedding:d4e6f8:titan-v2:1024"1. Preload Common Queries
# During deployment or scheduled job
common_queries = [
"What is AWS Lambda?",
"How to deploy containers?",
# ... top 100 queries
]
for query in common_queries:
    result = process_query(query)
    cache.set(cache_key(query), result, ttl=86400)

2. Analytics-Driven Warming
Daily Job:
- Analyze query logs (CloudWatch Insights)
- Identify top 100 queries
- Pre-compute and cache results
- Update cache before peak hours
Strategies:
- TTL-Based: Automatic expiration
  - Query results: 1 hour
  - Embeddings: 7 days
  - LLM responses: 24 hours
- Event-Based: Invalidate on data changes (helper sketched after this list)

```python
# When a document is reindexed
invalidate_pattern("query:*")  # Clear all query caches
```

- Manual: Admin endpoint for cache clearing

```
POST /admin/cache/clear
{
  "pattern": "query:*",
  "reason": "Index updated"
}
```
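The `invalidate_pattern` helper used above is not defined in this document; a minimal sketch with redis-py, using SCAN rather than KEYS so invalidation does not block the cache:

```python
import redis

r = redis.Redis(host='rag-cache.xxxxxx.cache.amazonaws.com', port=6379)  # placeholder endpoint

def invalidate_pattern(pattern: str) -> int:
    """Delete all keys matching `pattern` via SCAN (non-blocking, unlike KEYS)."""
    deleted = 0
    for key in r.scan_iter(match=pattern, count=500):
        r.delete(key)
        deleted += 1
    return deleted
```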
| Metric | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Avg Latency | 2.5s | 150ms | 94% faster |
| P99 Latency | 5.2s | 200ms | 96% faster |
| LLM Cost | $0.05/query | $0.005/query | 90% cheaper |
| Throughput | 100 QPS | 1000 QPS | 10x increase |
Track system health, performance, and usage for debugging and optimization.
- Role: Central monitoring and logging service
- Components:

1. CloudWatch Logs
   - Log Groups per Lambda function
   - API Gateway access logs
   - Step Functions execution logs
   - OpenSearch slow query logs
   - Application logs (structured JSON)

Retention Policy:

```yaml
Development: 7 days
Production: 90 days
Compliance: 1-7 years
```

2. CloudWatch Metrics
Standard Metrics:
- Lambda: Invocations, Duration, Errors, Throttles
- API Gateway: Count, Latency, 4xx/5xx errors
- OpenSearch: CPU, Memory, Disk, Indexing rate
- ElastiCache: CPU, Memory, Cache hits/misses
- S3: Storage, Requests
- Bedrock: Model invocations, Throttles
Custom Metrics:
```python
cloudwatch.put_metric_data(
    Namespace='RAG/Application',
    MetricData=[
        {
            'MetricName': 'QueryLatency',
            'Value': latency_ms,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'ModelName', 'Value': 'claude-3-5-sonnet'}
            ]
        },
        {
            'MetricName': 'RetrievalQuality',
            'Value': relevance_score,
            'Unit': 'None'
        },
        {
            'MetricName': 'CacheHitRate',
            'Value': 100 * cache_hits / total_requests,  # Percent expects 0-100
            'Unit': 'Percent'
        }
    ]
)
```
3. CloudWatch Dashboards
System Overview Dashboard:
```yaml
Widgets:
  - Total Queries (Count)
  - Average Latency (ms)
  - Error Rate (%)
  - Active Users
  - Cost per Query ($)
  - Cache Hit Rate (%)
```

Performance Dashboard:

```yaml
Widgets:
  - Query Pipeline Latency (breakdown)
  - Embedding Generation Time
  - OpenSearch Query Time
  - LLM Inference Time
  - End-to-End Latency (P50, P95, P99)
```

Infrastructure Dashboard:

```yaml
Widgets:
  - Lambda Concurrent Executions
  - OpenSearch CPU/Memory
  - ElastiCache Hit Rate
  - S3 Storage Used
  - API Gateway Throttles
```

4. CloudWatch Alarms

Critical Alarms:

```yaml
- Name: HighErrorRate
  Metric: Errors
  Threshold: "> 5%"
  Period: 5 minutes
  Action: SNS → PagerDuty
- Name: HighLatency
  Metric: Duration
  Threshold: "> 5000ms"
  Period: 5 minutes
  Action: SNS → Email
- Name: OpenSearchClusterRed
  Metric: ClusterStatus.red
  Threshold: ">= 1"
  Period: 1 minute
  Action: SNS → PagerDuty
- Name: LambdaThrottling
  Metric: Throttles
  Threshold: "> 10"
  Period: 5 minutes
  Action: SNS → Slack
```

Cost Alarms:

```yaml
- Name: DailyBudgetExceeded
  Metric: EstimatedCharges
  Threshold: "> $100/day"
  Action: SNS → Email + Slack
```
- Role: Distributed tracing for request flows
- Instrumentation:

```python
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, redis, requests, ...)
patch_all()

@xray_recorder.capture('query_handler')
def lambda_handler(event, context):
    with xray_recorder.capture('check_cache'):
        cached = redis_client.get(cache_key)

    with xray_recorder.capture('opensearch_search'):
        results = opensearch.search(...)

    with xray_recorder.capture('bedrock_generate'):
        response = bedrock.invoke_model(...)

    return response
```
- Service Map: Visual representation of request flow
  `API Gateway → Lambda → [ElastiCache, OpenSearch, Bedrock]`
- Trace Analysis:
- Identify bottlenecks (which service is slow)
- Detect errors and exceptions
- Analyze cold start impact
- Measure service dependencies
- Role: Advanced visualization and analytics
- Features:
  - Multi-source dashboards (CloudWatch + custom)
  - Advanced querying and filtering
  - Alerting and notification
  - Team collaboration
- Use Cases:
  - Executive dashboards
  - Real-time operations monitoring
  - Historical trend analysis
  - Cost attribution
- Role: Interactive log analytics
- Query Examples:

Find slow queries:

```
fields @timestamp, query, latency
| filter latency > 2000
| sort latency desc
| limit 20
```

Error analysis:

```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by error_type
```

Cache hit rate:

```
fields cache_hit
| stats count() by cache_hit
```

Cost per query:

```
fields query, bedrock_cost, opensearch_cost
| stats sum(bedrock_cost + opensearch_cost) as total_cost by bin(5m)
```
| Metric | Target | Alerting Threshold |
|---|---|---|
| Query Latency (P95) | < 2s | > 5s |
| Error Rate | < 1% | > 5% |
| Cache Hit Rate | > 60% | < 40% |
| Retrieval Precision | > 0.8 | < 0.6 |
| LLM Hallucination Rate | < 5% | > 10% |
| Metric | Target | Alerting Threshold |
|---|---|---|
| Lambda Cold Starts | < 5% | > 15% |
| OpenSearch CPU | < 70% | > 85% |
| ElastiCache Memory | < 80% | > 90% |
| API Gateway 5xx | < 0.1% | > 1% |
| Metric | Target | Review Frequency |
|---|---|---|
| Daily Active Users | Track growth | Daily |
| Queries per User | > 5/session | Weekly |
| User Satisfaction | > 4/5 | Weekly |
| Cost per Query | < $0.05 | Daily |
```python
import json
import logging
from datetime import datetime

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_query(query, latency, cached, error=None):
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'event': 'query_processed',
        'query': query,
        'latency_ms': latency,
        'cached': cached,
        'model': 'claude-3-5-sonnet',
        'environment': 'production'
    }
    if error:
        log_entry['error'] = str(error)
        log_entry['level'] = 'ERROR'
    else:
        log_entry['level'] = 'INFO'
    logger.info(json.dumps(log_entry))
```

```python
# API Gateway passes the request ID
request_id = event['requestContext']['requestId']

# Add it to all logs
logger.info(json.dumps({
    'request_id': request_id,
    'message': 'Processing query'
}))

# Pass it to downstream services
headers = {'X-Request-ID': request_id}
```

Ensure data protection, access control, and compliance.
- Role: Access control and permissions
- IAM Policies:

Lambda Execution Role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::rag-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": "arn:aws:bedrock:*:*:model/*"
    },
    {
      "Effect": "Allow",
      "Action": ["es:ESHttpPost", "es:ESHttpPut", "es:ESHttpGet"],
      "Resource": "arn:aws:es:region:account:domain/rag-vectors/*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    }
  ]
}
```

Principle of Least Privilege: Each service gets only the permissions it needs.
- Role: User authentication and authorization
- Features:
  - User pools for authentication
  - Identity pools for AWS resource access
  - OAuth 2.0 / SAML / Social identity providers
  - Multi-factor authentication (MFA)
  - Custom authentication flows
- Configuration:

```yaml
UserPool:
  Name: rag-users
  MFA: Optional
  PasswordPolicy:
    MinimumLength: 12
    RequireUppercase: true
    RequireNumbers: true
    RequireSymbols: true
  CustomAttributes:
    - organization
    - access_level
AppClients:
  - Name: web-app
    AuthFlows: [USER_PASSWORD_AUTH]
  - Name: mobile-app
    AuthFlows: [USER_SRP_AUTH]
```

- API Gateway Integration:

```yaml
Authorizer:
  Type: COGNITO_USER_POOLS
  UserPoolArn: arn:aws:cognito-idp:region:account:userpool/pool-id
  IdentitySource: method.request.header.Authorization
```
- Role: Store and rotate sensitive credentials
- Stored Secrets:
  - Third-party API keys
  - Database passwords (if using Aurora)
  - OpenSearch credentials
  - External service tokens
- Secret Rotation:

```yaml
Secret:
  Name: /rag/opensearch/credentials
  RotationSchedule: 90 days
  RotationLambda: rotate-opensearch-credentials
```

- Lambda Access:

```python
import json
import boto3

secrets_client = boto3.client('secretsmanager')

def get_secret(secret_name):
    response = secrets_client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])
```
- Role: Encryption key management
- Encryption Scope:
  - S3 buckets (server-side encryption)
  - OpenSearch domain (at-rest encryption)
  - ElastiCache (at-rest + in-transit)
  - EBS volumes (Lambda in VPC)
  - Secrets Manager
- Key Configuration:

```yaml
CustomerManagedKey:
  Alias: rag/encryption-key
  Description: Master key for RAG system encryption
  KeyPolicy:
    - Sid: Enable IAM policies
      Principal: {AWS: "arn:aws:iam::account:root"}
      Action: kms:*
    - Sid: Allow services to use key
      Principal:
        Service:
          - s3.amazonaws.com
          - es.amazonaws.com
          - lambda.amazonaws.com
      Action:
        - kms:Decrypt
        - kms:GenerateDataKey
  KeyRotation: Enabled (automatic annual rotation)
```
- Role: Protect API Gateway from attacks
- Rules:

1. Rate Limiting:

```yaml
- Name: RateLimitRule
  Priority: 1
  Statement:
    RateBasedStatement:
      Limit: 1000  # requests per 5 minutes
      AggregateKeyType: IP
  Action: Block
```

2. SQL Injection Protection:

```yaml
- Name: SQLiProtection
  Priority: 2
  Statement:
    SqliMatchStatement:
      FieldToMatch: {Body: {}}
      TextTransformations:
        - {Priority: 0, Type: URL_DECODE}
  Action: Block
```

3. Geo-Blocking (if needed):

```yaml
- Name: GeoBlock
  Priority: 3
  Statement:
    GeoMatchStatement:
      CountryCodes: [CN, RU]  # Example
  Action: Block
```

4. IP Whitelist (for admin endpoints):

```yaml
- Name: AdminIPWhitelist
  Priority: 4
  Statement:
    NotStatement:
      Statement:
        IPSetReferenceStatement:
          Arn: arn:aws:wafv2:region:account:ipset/admin-ips
  Action: Block
  Scope: /admin/*
```
- Role: Discover and protect sensitive data
- Use Cases:
  - Scan S3 buckets for PII/PHI
  - Classify documents by sensitivity
  - Alert on sensitive data exposure
  - Compliance reporting
- Sensitive Data Types:
  - Credit card numbers
  - Social security numbers
  - Email addresses
  - Phone numbers
  - Medical records
  - Credentials
- Role: Audit logging for all AWS API calls
- Configuration:

```yaml
Trail:
  Name: rag-audit-trail
  S3BucketName: rag-audit-logs
  IncludeGlobalEvents: true
  IsMultiRegion: true
  EnableLogFileValidation: true
  EventSelectors:
    - ReadWriteType: All
      IncludeManagementEvents: true
      DataResources:
        - Type: AWS::S3::Object
          Values: ["arn:aws:s3:::rag-documents/*"]
        - Type: AWS::Lambda::Function
          Values: ["arn:aws:lambda:*:*:function/rag-*"]
```

- Monitoring:
  - Track all document access
  - Monitor configuration changes
  - Detect unauthorized API calls
  - Compliance auditing
- Role: Resource configuration tracking and compliance
- Rules:
```yaml
ConfigRules:
  - s3-bucket-encryption-enabled
  - lambda-function-settings-check
  - opensearch-encrypted-at-rest
  - iam-password-policy
  - approved-amis-by-tag
```
At Rest:
- S3: SSE-KMS with customer-managed keys
- OpenSearch: Node-to-node encryption + at-rest
- ElastiCache: At-rest encryption enabled
- EBS (Lambda): Encrypted volumes
- Secrets Manager: KMS encryption
In Transit:
- API Gateway: TLS 1.2+
- Lambda → OpenSearch: HTTPS
- Lambda → Bedrock: HTTPS
- Lambda → ElastiCache: TLS
- All inter-service: AWS PrivateLink (optional)
VPC Configuration (if needed):
VPC:
CIDR: 10.0.0.0/16
Subnets:
Public:
- 10.0.1.0/24 # AZ1
- 10.0.2.0/24 # AZ2
Private:
- 10.0.10.0/24 # AZ1 - Lambda
- 10.0.11.0/24 # AZ2 - Lambda
- 10.0.20.0/24 # AZ1 - OpenSearch
- 10.0.21.0/24 # AZ2 - OpenSearch
SecurityGroups:
Lambda:
Ingress: [] # No inbound
Egress:
- Protocol: HTTPS
Destination: 0.0.0.0/0
OpenSearch:
Ingress:
- Protocol: HTTPS
Source: Lambda-SG
Egress: []
ElastiCache:
Ingress:
- Protocol: Redis (6379)
Source: Lambda-SG
    Egress: []

VPC Endpoints (reduce NAT costs):
VPCEndpoints:
- Service: com.amazonaws.region.s3 (Gateway)
- Service: com.amazonaws.region.bedrock (Interface)
  - Service: com.amazonaws.region.secretsmanager (Interface)

Layer 1: AWS WAF → Block malicious traffic
Layer 2: API Gateway → API key / Cognito auth
Layer 3: IAM Roles → Service-level permissions
Layer 4: Resource Policies → Fine-grained access
Layer 5: Encryption → Data protection
GDPR:
- Data minimization in logs
- User consent for data processing
- Right to be forgotten (delete user data)
- Data portability (export user data)
- Data retention policies

HIPAA:
- BAA with AWS
- Enable CloudTrail logging
- Encrypt all data at rest and in transit
- Access controls and audit logs
- Incident response procedures

SOC 2:
- Security monitoring and alerting
- Access control and authentication
- Change management procedures
- Incident response and disaster recovery
- Regular security assessments
Automate infrastructure provisioning and application deployment.
- Role: Infrastructure as Code
- AWS CDK (Recommended):
```python
from aws_cdk import (
    Duration,
    Stack,
    aws_s3 as s3,
    aws_lambda as lambda_,
    aws_apigateway as apigw,
    aws_opensearchservice as opensearch,
    aws_ec2 as ec2,
)

class RagStack(Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)

        # S3 Bucket
        documents_bucket = s3.Bucket(
            self, "DocumentsBucket",
            encryption=s3.BucketEncryption.KMS,
            versioned=True,
            lifecycle_rules=[...]
        )

        # OpenSearch Domain (defined before the Lambda that reads its endpoint)
        opensearch_domain = opensearch.Domain(
            self, "VectorDB",
            version=opensearch.EngineVersion.OPENSEARCH_2_11,
            capacity={
                "data_nodes": 3,
                "data_node_instance_type": "r6g.large.search"
            },
            ebs={
                "volume_size": 100,
                "volume_type": ec2.EbsDeviceVolumeType.GP3
            },
            encryption_at_rest={"enabled": True},
            node_to_node_encryption=True
        )

        # Lambda Functions
        query_handler = lambda_.Function(
            self, "QueryHandler",
            runtime=lambda_.Runtime.PYTHON_3_11,
            handler="index.handler",
            code=lambda_.Code.from_asset("lambda/query"),
            memory_size=1024,
            timeout=Duration.seconds(30),
            environment={
                "OPENSEARCH_ENDPOINT": opensearch_domain.domain_endpoint,
                "BEDROCK_MODEL": "claude-3-5-sonnet"
            }
        )

        # API Gateway
        api = apigw.RestApi(
            self, "RagAPI",
            rest_api_name="RAG Service API"
        )
        query_integration = apigw.LambdaIntegration(query_handler)
        api.root.add_resource("query").add_method("POST", query_integration)
```
- Role: Continuous deployment orchestration
- Pipeline Stages:

```yaml
Pipeline:
  Name: rag-deployment-pipeline
  Stages:
    - Name: Source
      Actions:
        - Name: SourceCode
          ActionType: Source
          Provider: GitHub / CodeCommit
          Configuration:
            Repository: rag-system
            Branch: main
    - Name: Build
      Actions:
        - Name: BuildAndTest
          ActionType: Build
          Provider: CodeBuild
          Configuration:
            ProjectName: rag-build-project
    - Name: DeployDev
      Actions:
        - Name: DeployToDev
          ActionType: Deploy
          Provider: CloudFormation
          Configuration:
            StackName: rag-stack-dev
            TemplatePath: BuildArtifact::template.yaml
            RoleArn: arn:aws:iam::account:role/CloudFormationRole
    - Name: IntegrationTests
      Actions:
        - Name: RunTests
          ActionType: Build
          Provider: CodeBuild
          Configuration:
            ProjectName: rag-integration-tests
    - Name: ApprovalForProd
      Actions:
        - Name: ManualApproval
          ActionType: Approval
    - Name: DeployProd
      Actions:
        - Name: DeployToProduction
          ActionType: Deploy
          Provider: CloudFormation
          Configuration:
            StackName: rag-stack-prod
            TemplatePath: BuildArtifact::template.yaml
```
- Role: Build and test execution
- buildspec.yml:
```yaml
version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.11
    commands:
      - pip install -r requirements.txt
      - pip install pytest pytest-cov
  pre_build:
    commands:
      - echo "Running linters..."
      - black --check .
      - flake8 .
      - mypy .
  build:
    commands:
      - echo "Running unit tests..."
      - pytest tests/unit --cov=src --cov-report=xml
      - echo "Building Lambda deployment packages..."
      # Subshells keep the working directory stable between commands
      - (cd lambda/query && zip -r ../../query-handler.zip .)
      - (cd lambda/ingest && zip -r ../../ingest-handler.zip .)
      - echo "Synthesizing CDK stack..."
      - cdk synth --output cdk.out
  post_build:
    commands:
      - echo "Build completed"

artifacts:
  files:
    - '**/*'
  name: BuildArtifact

reports:
  coverage:
    files:
      - 'coverage.xml'
    file-format: 'COBERTURAXML'
```
1. Blue-Green Deployment (Lambda):
Lambda Function:
Alias: Production
Versions:
- Version: $LATEST (Blue - current)
- Version: 2 (Green - new)
Deployment:
Type: AllAtOnce / Linear / Canary
Linear10PercentEvery10Minutes:
- 0 min: 10% traffic to Green
- 10 min: 20% traffic to Green
- ...
- 90 min: 100% traffic to Green
Rollback:
Trigger: CloudWatch Alarm (error rate > 5%)
  Action: Route 100% traffic back to Blue

2. Canary Deployment (API Gateway):
API Gateway Stage:
Name: production
DeploymentId: latest-deployment
CanarySettings:
PercentTraffic: 10
UseStageCache: false
Progression:
- Deploy new version with 10% traffic
- Monitor metrics for 30 minutes
- If healthy: Promote canary to 100%
  - If unhealthy: Rollback to previous version

Environments:
Development:
Purpose: Developer testing
Infrastructure: Minimal (1 OpenSearch node, t3.small)
Data: Synthetic test data
Access: Development team
Staging:
Purpose: Pre-production validation
Infrastructure: Production-like (3 OpenSearch nodes, r6g.large)
Data: Sanitized production data
Access: QA team + stakeholders
Production:
Purpose: Live customer traffic
Infrastructure: Fully redundant, multi-AZ
Data: Real customer data
  Access: Operations team only

Lambda Rollback:
# Automated rollback via CodeDeploy
aws deploy stop-deployment --deployment-id d-XXXXX
# Manual version rollback
aws lambda update-alias \
--function-name query-handler \
--name production \
  --function-version 5  # Previous stable version

Infrastructure Rollback:
# CloudFormation rollback
aws cloudformation rollback-stack --stack-name rag-stack-prod
# Or deploy previous template version
cdk deploy --version-tag v1.2.3

Document Ingestion Flow:

1. User uploads document to S3
↓
2. S3 triggers EventBridge rule
↓
3. EventBridge starts Step Functions workflow
↓
4. Step Functions orchestrates:
a. Lambda + Textract: Extract text
b. Lambda + Comprehend: Extract metadata
c. Lambda: Chunk text
d. Lambda + Bedrock: Generate embeddings
e. Lambda: Index in OpenSearch
↓
5. Document indexed and searchable
↓
6. SNS notification: "Document XYZ processed successfully"
Query Flow:

1. User sends query via web/mobile app
↓
2. API Gateway receives request
↓
3. Cognito validates JWT token
↓
4. WAF checks for malicious patterns
↓
5. API Gateway invokes Lambda
↓
6. Lambda checks ElastiCache for cached response
↓ (Cache miss)
7. Lambda calls Bedrock to embed query
↓
8. Lambda searches OpenSearch (hybrid: vector + BM25)
↓
9. Lambda optionally reranks results
↓
10. Lambda assembles context from top results
↓
11. Lambda calls Bedrock to generate response
↓
12. Lambda caches response in ElastiCache
↓
13. Lambda returns response to API Gateway
↓
14. API Gateway returns to client
↓
15. CloudWatch logs all metrics
1. Event-Driven Processing
- Pattern: S3 events trigger processing pipelines
- Benefits: Decoupled components, automatic scaling
- Implementation: EventBridge + Step Functions + Lambda

2. Serverless-First Compute
- Pattern: Use managed services and Lambda over EC2
- Benefits: No server management, pay-per-use, auto-scaling
- Trade-offs: Cold starts, execution time limits

3. Multi-Layer Caching
- Pattern: Multi-layer caching (CloudFront → ElastiCache → Source)
- Benefits: Reduced latency, lower costs
- Implementation: Cache frequently accessed data at each layer

4. Circuit Breaker
- Pattern: Fail fast when dependencies are down
- Benefits: Prevent cascade failures
- Implementation: Lambda error handling + exponential backoff

5. Retry with Exponential Backoff
- Pattern: Automatically retry failed operations with increasing delays
- Benefits: Handle transient failures
- Implementation: Step Functions retry config, Lambda custom logic

6. Dead Letter Queue
- Pattern: Route failed messages to separate queue for analysis
- Benefits: Don't lose failed requests, debug issues
- Implementation: SQS DLQ for Lambda, SNS for notifications
| Service | Usage | Monthly Cost |
|---|---|---|
| Bedrock (Embeddings) | 10M tokens | $100 |
| Bedrock (LLM - Sonnet) | 50M input tokens, 10M output | $300 |
| OpenSearch | 3x r6g.large nodes | $900 |
| ElastiCache | 3x r6g.large nodes | $600 |
| Lambda | 1M invocations, 512MB, 2s avg | $40 |
| API Gateway | 10M requests | $35 |
| S3 Storage | 1TB documents | $23 |
| Data Transfer | 500GB outbound | $45 |
| CloudWatch | Logs + Metrics | $50 |
| Step Functions | 100K executions | $25 |
| Total | | ~$2,118/month |
- Titan v2: 1024 → 512 dimensions
- Benefit: 50% smaller storage, faster search
- Trade-off: Slightly lower accuracy
- Cache hit rate 60% → 90%
- Reduce Bedrock calls by 30%
- Save ~$100/month on LLM costs
- AWS Batch with Spot for bulk embedding
- Save up to 90% on compute costs
- Suitable for non-time-sensitive workloads
- Automatically move infrequent access data
- Save 40-70% on storage costs
- Zero retrieval fees
- Start small (r6g.large), monitor usage
- Use UltraWarm for historical data
- Consider Aurora pgvector for smaller workloads
- Cache system prompts and context
- Save 90% on cached token costs
- Especially valuable for repetitive queries
- Use AWS Compute Optimizer recommendations
- Right-size memory allocation
- Over-provisioning wastes money, under-provisioning increases duration
- OpenSearch Reserved Instances: 35-75% savings
- ElastiCache Reserved Nodes: 35-55% savings
- Commit to 1-year term for predictable workloads
- Set up AWS account and basic infrastructure
- Implement simple text ingestion pipeline
- Deploy basic vector search with OpenSearch
- Create minimal query API
- Add hybrid search (vector + BM25)
- Implement caching layer
- Set up monitoring and alerting
- Deploy to production
- Fine-tune retrieval quality
- Optimize costs
- Implement advanced features (reranking, multi-modal)
- Comprehensive testing and evaluation
- Handle increased traffic
- Add new document types
- Improve user experience
- Continuous optimization
This AWS-native RAG architecture provides:
✅ Fully Serverless: No servers to manage
✅ Scalable: Handles from 10 to 10M queries/day
✅ Cost-Effective: Pay only for what you use
✅ Secure: End-to-end encryption, compliance-ready
✅ Observable: Comprehensive monitoring and tracing
✅ Maintainable: Infrastructure as Code, automated deployments
Ready to build with confidence on AWS!
- Title: AWS-Native RAG Service Architecture
- Version: 1.0
- Last Updated: January 30, 2026
- Author: Paul - Senior AI Consultant at KrishAI Technologies
- Purpose: Complete architectural documentation for building production-ready RAG systems on AWS