- Overview
- Architecture Diagram
- Core Components
- 1. Data Ingestion Layer
- 2. Document Processing & Extraction
- 3. Chunking & Text Processing
- 4. Embedding Generation
- 5. Vector Database
- 6. LLM Integration
- 7. API & Application Layer
- 8. Caching Layer
- 9. Monitoring & Observability
- 10. Security & Governance
- 11. Deployment & CI/CD
- Architecture Flow
- Design Patterns
- Cost Optimization
This document outlines a complete, production-ready RAG (Retrieval-Augmented Generation) system built entirely on AWS-native services. The architecture is designed to be:
- Serverless-first: Leveraging Lambda, API Gateway, and managed services
- Scalable: Auto-scaling capabilities across all layers
- Cost-effective: Pay-per-use model with intelligent caching
- Secure: End-to-end encryption and fine-grained access control
- Observable: Comprehensive monitoring and tracing
┌─────────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ Web App │ Mobile App │ API Clients │ Chatbot Interface │
└────────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Amazon API Gateway + AWS WAF │
│ (Rate Limiting, Authentication, DDoS Protection) │
└────────────────────────────────┬────────────────────────────────────┘
│
┌────────────────┴────────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌────────────────────────┐
│ INGESTION PIPELINE │ │ QUERY PIPELINE │
│ (Offline) │ │ (Real-time) │
└────────────────────┘ └────────────────────────┘
┌──────────────────┐
│ Amazon S3 │ ← Document Upload
│ (Raw Storage) │
└────────┬─────────┘
│ (S3 Event Notification)
▼
┌────────────────────┐
│ EventBridge Rule │
└────────┬───────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AWS Step Functions (Orchestration) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Step 1: Document Processing │ │
│ │ Lambda + Textract → Extract Text │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 2: Metadata Extraction │ │
│ │ Lambda + Comprehend → NER, Entities, Key Phrases │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 3: Text Chunking │ │
│ │ Lambda → Semantic/Recursive/Fixed Chunking │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 4: Embedding Generation │ │
│ │ Lambda + Bedrock → Generate Vector Embeddings │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 5: Vector Storage │ │
│ │ Lambda → Index in OpenSearch Service │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────┐
│ Amazon OpenSearch Service │
│ - Vector Index │
│ - BM25 Lexical Index │
│ - Metadata Filters │
└────────────────────────────┘
┌──────────────────┐
│ User Query │
└────────┬─────────┘
│
▼
┌────────────────────────────────┐
│ Amazon API Gateway │
│ - Cognito Authorization │
│ - Request Validation │
│ - Rate Limiting │
└────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AWS Lambda (Query Handler) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Step 1: Cache Check │ │
│ │ ElastiCache Redis → Check for cached response │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ (Cache Miss) │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 2: Query Embedding │ │
│ │ Bedrock → Generate query vector │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 3: Hybrid Search │ │
│ │ OpenSearch → Vector + BM25 + Metadata Filters │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 4: Reranking (Optional) │ │
│ │ Bedrock/SageMaker → Cohere Rerank │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 5: Context Assembly │ │
│ │ Retrieve chunks + Apply MMR/Diversity │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 6: LLM Generation │ │
│ │ Bedrock → Claude/Llama/Mistral with Context │ │
│ └─────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────────────┐ │
│ │ Step 7: Cache Response │ │
│ │ ElastiCache → Store for future queries │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────┐
│ Response to Client │
└──────────────────────┘
| Component | Purpose | AWS Services |
|---|---|---|
| Storage | Document and data persistence | S3, DynamoDB |
| Processing | Document transformation and enrichment | Lambda, Step Functions, Textract, Comprehend |
| Embedding | Vector generation | Bedrock, SageMaker |
| Retrieval | Vector and hybrid search | OpenSearch Service, Aurora PostgreSQL |
| Generation | Answer synthesis | Bedrock (Claude, Llama, etc.) |
| Serving | API endpoints and routing | API Gateway, Lambda, App Runner |
| Caching | Performance optimization | ElastiCache (Redis) |
| Monitoring | System observability | CloudWatch, X-Ray, Grafana |
| Security | Access control and encryption | IAM, Cognito, KMS, WAF, Secrets Manager |
Collect, receive, and route documents from various sources into the processing pipeline.
- Role: Primary storage for all documents
- Configuration:
  - Bucket structure: `raw/`, `processed/`, `failed/`, `archives/`
  - Versioning enabled for audit trail
  - Lifecycle policies for cost optimization
  - Server-side encryption (SSE-KMS)
  - Event notifications to EventBridge
- Storage Classes:
- S3 Standard for active documents
- S3 Intelligent-Tiering for variable access patterns
- S3 Glacier for archival
- Role: Event-driven orchestration trigger
- Configuration:
- Rules to detect S3 object creation events
- Pattern matching by file type/prefix
- Route events to Step Functions
- Dead letter queue for failed events
- Event Patterns:
{ "source": ["aws.s3"], "detail-type": ["Object Created"], "detail": { "bucket": {"name": ["rag-documents-bucket"]} } }
- Role: Automated data transfer from on-premises/other clouds
- Use Cases:
- Scheduled sync from file servers
- Migration of existing document repositories
- Continuous sync from network shares
- Role: SFTP/FTP/FTPS endpoint for document upload
- Use Cases:
- Third-party integrations requiring FTP
- Legacy systems integration
- Secure file transfer workflows
External Sources → S3 Bucket → EventBridge → Step Functions
↓
Versioning
↓
Lifecycle Rules
Extract text, metadata, and structured information from various document formats.
- Role: Serverless compute for document processing logic
- Functions:
  - `document-classifier`: Detect document type and route
  - `text-extractor`: Extract text using the appropriate method
  - `metadata-enricher`: Add document metadata
  - `error-handler`: Process failed documents
- Configuration:
- Runtime: Python 3.11+
- Memory: 1024-3008 MB (based on document size)
- Timeout: 5-15 minutes (for large documents)
- Layers: Common libraries (boto3, PyPDF2, python-docx)
- VPC: Configured if accessing OpenSearch/ElastiCache
- Environment variables: S3 buckets, service endpoints
- Role: Extract text and structured data from documents
- Capabilities:
- Text detection and OCR
- Table extraction
- Form data extraction
- Handwriting recognition
- Document layout analysis
- Supported Formats:
- PDF, PNG, JPEG, TIFF
- Scanned documents
- Multi-page documents
- API Operations:
  - `DetectDocumentText`: Simple text extraction
  - `AnalyzeDocument`: Tables and forms
  - `StartDocumentAnalysis`: Async for large docs
- Role: Natural language processing and metadata extraction
- Capabilities:
- Entity recognition (people, places, organizations)
- Key phrase extraction
- Sentiment analysis
- Language detection
- PII detection and redaction
- Topic modeling
- Custom entity recognition
- Use Cases:
- Enrich chunks with entities for filtering
- Detect document categories
- Extract key topics for metadata
- Identify sensitive information
- Role: Orchestrate multi-step document processing workflow
- State Machine Components:
- Parallel States: Process multiple documents concurrently
- Choice States: Route based on document type
- Task States: Invoke Lambda functions
- Retry/Catch: Error handling and retries
- Map State: Batch processing
- Workflow Definition:
Start → Validate → Extract Text → Extract Metadata → Chunk → Embed → Index → Success
            ↓            ↓                ↓             ↓        ↓        ↓
          Error        Error            Error         Error    Error    Error
            ↓            ↓                ↓             ↓        ↓        ↓
        Retry/DLQ    Retry/DLQ        Retry/DLQ     Retry/DLQ  Retry/DLQ  DLQ
| Step | Input | Processing | Output | AWS Service |
|---|---|---|---|---|
| Validation | Raw file | Check format, size, corruption | Valid file or error | Lambda |
| Text Extraction | PDF/Image | OCR, text extraction | Plain text | Lambda + Textract |
| Metadata Extraction | Text | NER, key phrases, entities | JSON metadata | Lambda + Comprehend |
| Quality Check | Text | Validate extraction quality | Approved text | Lambda |
| Storage | Processed data | Store intermediate results | S3 location | S3 |
- Retry Logic: 3 attempts with exponential backoff (sketched below)
- Dead Letter Queue: SNS topic for failed documents
- Fallback: Alternative extraction methods
- Logging: CloudWatch Logs for debugging
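To make the retry behavior concrete, here is a minimal sketch of one Task state from the ingestion state machine, expressed as a Python dict in Amazon States Language form. The function ARN, state names, and DLQ handler are illustrative, not taken from this document:

```python
# One Task state with exponential-backoff retries and a catch-all route
# to a failure handler that publishes to the dead letter queue.
extract_text_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:text-extractor",
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,    # initial delay before the first retry
        "BackoffRate": 2.0,      # delay doubles on each attempt
        "MaxAttempts": 3
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "PublishToDLQ"   # failed documents are kept for analysis
    }],
    "Next": "ExtractMetadata"
}
```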
Split documents into optimal-sized chunks for embedding and retrieval.
- Role: Execute chunking algorithms
- Functions:
  - `semantic-chunker`: Content-aware splitting
  - `recursive-chunker`: Hierarchical splitting
  - `fixed-chunker`: Simple size-based splitting
  - `sentence-window-chunker`: Sentence-based with context windows
- Configuration:
- Runtime: Python 3.11
- Memory: 512-1024 MB
- Libraries: langchain, llama-index, spaCy, nltk
- Method: Split by character/token count
- Parameters:
- Chunk size: 500-1000 tokens
- Overlap: 50-200 tokens (10-20%)
- Use Case: Simple documents, consistent structure
- Implementation:
```python
# Minimal fixed-size splitter (character-based for simplicity)
def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text(text, chunk_size=1000, overlap=200)
```
- Method: Split by semantic boundaries (paragraphs, sections)
- Parameters:
- Min chunk size: 200 tokens
- Max chunk size: 1500 tokens
- Boundary markers: headers, paragraph breaks
- Use Case: Structured documents with clear sections
- Implementation:
- Detect section headers
- Respect paragraph boundaries
- Maintain semantic coherence
- Method: Hierarchical splitting with multiple separators
- Separators Priority:
  1. `\n\n` (paragraph breaks)
  2. `\n` (line breaks)
  3. `.` (sentences)
  4. ` ` (words)
- Use Case: Mixed content types
- Library: LangChain `RecursiveCharacterTextSplitter` (example below)
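A minimal sketch using LangChain's splitter with the separator priority above; the chunk sizes reuse the example values from the fixed-size strategy:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],  # priority order from the list above
    chunk_size=1000,     # target chunk size in characters
    chunk_overlap=200,   # overlap between adjacent chunks
)

document_text = "..."  # plain text produced by the extraction step
chunks = splitter.split_text(document_text)
```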
- Method: Each chunk is a sentence with surrounding context
- Parameters:
- Window size: 1-3 sentences before/after
- Overlap: Natural sentence boundaries
- Use Case: Q&A systems, precise retrieval
- Advantage: Better context preservation (sketch below)
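A minimal sketch of the sentence-window idea using nltk; the function name and default window size are illustrative:

```python
import nltk

def sentence_window_chunks(text: str, window: int = 2) -> list[dict]:
    """Each chunk is one sentence plus `window` sentences of context per side."""
    sentences = nltk.sent_tokenize(text)  # requires nltk.download('punkt')
    chunks = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        chunks.append({
            'sentence': sentence,                      # unit that gets embedded
            'window': ' '.join(sentences[start:end]),  # context handed to the LLM
        })
    return chunks
```

Retrieval matches against the precise sentence, while the surrounding window is what gets placed into the prompt.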
Each chunk includes:
```json
{
  "chunk_id": "doc123_chunk_5",
  "document_id": "doc123",
  "chunk_index": 5,
  "content": "chunk text...",
  "chunk_size": 523,
  "start_position": 2500,
  "end_position": 3023,
  "metadata": {
    "document_title": "...",
    "section": "Introduction",
    "page_number": 3,
    "entities": ["AWS", "Lambda"],
    "key_phrases": ["serverless computing"],
    "parent_chunk": "doc123_chunk_4",
    "child_chunks": ["doc123_chunk_5_1", "doc123_chunk_5_2"]
  }
}
```

- Intermediate: S3 bucket (`processed/chunks/`)
- Final: Included in OpenSearch document
- Format: JSON with metadata
Convert text chunks into dense vector representations for semantic search.
- Role: Managed service for foundation model APIs
- Embedding Models Available:

| Model | Dimensions | Max Input | Use Case |
|---|---|---|---|
| Titan Embeddings v2 | 256, 512, 1024 | 8192 tokens | General purpose, multilingual |
| Titan Multimodal Embeddings | 384, 1024 | Text + Images | Multi-modal RAG |
| Cohere Embed English v3 | 1024 | 512 tokens | English text, high quality |
| Cohere Embed Multilingual v3 | 1024 | 512 tokens | 100+ languages |

- Configuration:
  - Model ID: `amazon.titan-embed-text-v2:0`
  - Dimensions: 1024 (configurable)
  - Batch size: Up to 100 texts per API call
  - Input type: `search_document` for indexing, `search_query` for queries
- Role: Deploy custom embedding models
- Use Cases:
- Fine-tuned embeddings for domain-specific data
- Custom models (BERT, sentence-transformers)
- Models not available in Bedrock
- Deployment:
- Real-time endpoint for synchronous inference
- Asynchronous endpoint for batch processing
- Serverless inference for variable workloads
- Popular Models:
  - `sentence-transformers/all-MiniLM-L6-v2`
  - `sentence-transformers/all-mpnet-base-v2`
  - Custom fine-tuned BERT variants
- Role: Batch processing for millions of documents
- Configuration:
- Compute environments: Spot instances for cost savings
- Job queues: Priority-based processing
- Job definitions: Container with embedding logic
- Use Case: Initial bulk indexing, reindexing
```python
import json
import boto3

bedrock_client = boto3.client('bedrock-runtime')

def lambda_handler(event, context):
    # Input: chunks from the previous Step Functions step
    chunks = event['chunks']

    # Titan v2's InvokeModel takes a single inputText per request, so iterate;
    # throughput comes from concurrent Lambda invocations over chunk batches
    embeddings = []
    for chunk in chunks:
        response = bedrock_client.invoke_model(
            modelId='amazon.titan-embed-text-v2:0',
            body=json.dumps({
                'inputText': chunk['content'],
                'dimensions': 1024,
                'normalize': True
            })
        )
        body = json.loads(response['body'].read())
        embeddings.append(body['embedding'])

    # Return chunks paired with their embeddings
    return {
        'chunks_with_embeddings': [
            {**chunk, 'embedding': emb}
            for chunk, emb in zip(chunks, embeddings)
        ]
    }
```

- Cache Embeddings: Store in ElastiCache (a sketch follows this list)
- Key: Hash of chunk text
- TTL: Long-lived (weeks/months)
- Benefit: Avoid regeneration for duplicate content
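A minimal sketch of that cache with redis-py, assuming a `bedrock_embed` helper that wraps the Bedrock call shown above; the cluster endpoint is a placeholder:

```python
import hashlib
import json
import redis

r = redis.Redis(host='rag-cache.xxxxxx.cache.amazonaws.com', port=6379)  # placeholder endpoint

def get_or_create_embedding(text: str) -> list[float]:
    key = "embedding:" + hashlib.sha256(text.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)            # cache hit: skip Bedrock entirely
    embedding = bedrock_embed(text)          # assumed helper wrapping InvokeModel
    r.set(key, json.dumps(embedding), ex=30 * 24 * 3600)  # long TTL (~1 month)
    return embedding
```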
- Batch Size: 50-100 chunks per API call
- Parallel Batches: Multiple concurrent Lambda invocations
- Rate Limiting: Implement exponential backoff
- SQS Queue: Queue chunks for embedding (producer sketched below)
- Worker Lambdas: Poll queue and process
- Scalability: Auto-scale based on queue depth
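A minimal sketch of the producer side with boto3, assuming a hypothetical queue named `rag-embedding-queue` and the `chunks` list from the chunking step:

```python
import json
import boto3

sqs = boto3.client('sqs')
queue_url = sqs.get_queue_url(QueueName='rag-embedding-queue')['QueueUrl']  # hypothetical queue

chunks = [{'chunk_id': 'doc123_chunk_5', 'content': '...'}]  # from the chunking step

# SQS accepts at most 10 messages per batch call
for i in range(0, len(chunks), 10):
    batch = chunks[i:i + 10]
    sqs.send_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {'Id': str(j), 'MessageBody': json.dumps(chunk)}
            for j, chunk in enumerate(batch)
        ]
    )
```

Worker Lambdas then consume via an SQS event source mapping, which scales concurrency with queue depth.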
- Bedrock Pricing: Pay per input token
- Titan v2: ~$0.0001 per 1K tokens
- Optimization:
- Cache embeddings
- Use smaller dimensions if sufficient (512 vs 1024)
- Batch requests to reduce API calls
Store and efficiently search vector embeddings with metadata filtering.
- Role: Primary vector database with hybrid search capabilities
- Key Features:
  - Vector Engine: k-NN plugin with HNSW/IVF algorithms
  - Lexical Search: BM25 for keyword matching
  - Hybrid Search: Combine vector + keyword results
  - Metadata Filtering: Pre-filter before vector search
  - Scalability: Auto-scaling, multi-AZ deployment
- Index Configuration:

```json
{
  "mappings": {
    "properties": {
      "chunk_id": {"type": "keyword"},
      "content": {"type": "text", "analyzer": "standard"},
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          }
        }
      },
      "metadata": {
        "properties": {
          "document_id": {"type": "keyword"},
          "source": {"type": "keyword"},
          "timestamp": {"type": "date"},
          "category": {"type": "keyword"},
          "entities": {"type": "keyword"}
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 3,
      "number_of_replicas": 2
    }
  }
}
```

- Search Types:

1. Vector Search (k-NN)

```json
{
  "size": 10,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.1, 0.2, ...],
        "k": 10
      }
    }
  }
}
```

2. Hybrid Search (Vector + BM25)

```json
{
  "size": 10,
  "query": {
    "hybrid": {
      "queries": [
        {"knn": {"embedding": {"vector": [...], "k": 50}}},
        {"match": {"content": "query text"}}
      ]
    }
  }
}
```

3. Filtered Search

```json
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {"knn": {"embedding": {"vector": [...], "k": 10}}}
      ],
      "filter": [
        {"term": {"metadata.category": "technical"}},
        {"range": {"metadata.timestamp": {"gte": "2024-01-01"}}}
      ]
    }
  }
}
```

- Cluster Sizing:
  - Development: 1 master, 2 data nodes (t3.small)
  - Production: 3 masters, 3+ data nodes (r6g.xlarge)
  - Storage: 100 GB - 10 TB+ depending on corpus size
  - Instance types: r6g (memory optimized) for vector search
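For reference, a sketch of executing the filtered search above with the opensearch-py client; the endpoint and index name are placeholders, and authentication (e.g., SigV4 signing) is omitted for brevity:

```python
from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{'host': 'vpc-rag-vectors.us-east-1.es.amazonaws.com', 'port': 443}],  # placeholder
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

query_vector = [0.1] * 1024  # placeholder; normally produced by the embedding step

response = client.search(index='rag-chunks', body={
    "size": 10,
    "query": {
        "bool": {
            "must": [{"knn": {"embedding": {"vector": query_vector, "k": 10}}}],
            "filter": [{"term": {"metadata.category": "technical"}}],
        }
    },
})
hits = response['hits']['hits']  # each hit carries _score and the _source document
```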
- Role: Relational database with vector extension
- Key Features:
  - Native SQL queries
  - ACID transactions
  - Familiar database paradigm
  - Vector similarity search via pgvector extension
- Configuration:

```sql
-- Enable extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE embeddings (
    id SERIAL PRIMARY KEY,
    chunk_id VARCHAR(255),
    content TEXT,
    embedding vector(1024),
    metadata JSONB
);

-- Create index
CREATE INDEX ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Search query
SELECT chunk_id, content,
       1 - (embedding <=> query_vector) AS similarity
FROM embeddings
WHERE metadata->>'category' = 'technical'
ORDER BY embedding <=> query_vector
LIMIT 10;
```

- Use Case:
  - Smaller datasets (<1M vectors)
  - Need for ACID transactions
  - SQL-first organizations
- Role: Ultra-low latency vector search
- Key Features:
  - Redis 7.0+ with vector similarity search
  - In-memory performance
  - Persistence and durability
  - Microsecond latency
- Use Case:
  - Real-time applications requiring <10ms latency
  - High-throughput scenarios
  - Smaller vector datasets
Step Functions → Lambda (Batch Index) → OpenSearch
- Batch size: 500-1000 documents
- Parallel indexing: 5-10 concurrent Lambdas
- Refresh interval: 30s during bulk load
New Document → Process → Embed → Lambda → OpenSearch
- Real-time indexing
- Refresh interval: 1s
- Immediate availability
- Create new index with updated mapping
- Keep the read alias pointed at the old index
- Bulk reindex into the new index
- Atomic alias switch (sketched below)
- Delete old index
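A sketch of the atomic alias switch with opensearch-py; the index and alias names are hypothetical, and client setup follows the earlier example:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'vpc-rag-vectors.us-east-1.es.amazonaws.com', 'port': 443}], use_ssl=True)  # placeholder

# One atomic alias operation: readers never see an empty or half-built index
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "rag-chunks-v1", "alias": "rag-chunks"}},
        {"add": {"index": "rag-chunks-v2", "alias": "rag-chunks"}},
    ]
})

# Only after validating the new index
client.indices.delete(index="rag-chunks-v1")
```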
| Parameter | Development | Production |
|---|---|---|
| ef_construction | 256 | 512 |
| M | 8 | 16 |
| Shards | 1 | 3-5 |
| Replicas | 1 | 2-3 |
| Refresh Interval | 5s | 1s |
Generate natural language responses using retrieved context.
- Role: Managed LLM inference service
- Available Models:

| Model | Context Window | Use Case | Cost (per 1M tokens) |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K tokens | General purpose, best quality | Input: $3, Output: $15 |
| Claude 3.5 Haiku | 200K tokens | Fast, cost-effective | Input: $0.25, Output: $1.25 |
| Claude 3 Opus | 200K tokens | Complex reasoning | Input: $15, Output: $75 |
| Llama 3.1 70B | 128K tokens | Open source alternative | Input: $0.99, Output: $0.99 |
| Mistral Large | 128K tokens | Multilingual, technical | Input: $4, Output: $12 |
| Amazon Titan Text | 32K tokens | AWS-native, cost-effective | Input: $0.50, Output: $1.50 |

- API Configuration:

```python
# Pseudocode
response = bedrock_runtime.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 4096,
        'temperature': 0.7,
        'top_p': 0.9,
        'messages': [
            {'role': 'user', 'content': prompt}
        ],
        'system': system_prompt
    })
)
```
- Features:
  - Streaming responses (sketched below)
  - System prompts for behavior control
  - Guardrails for content safety
  - Prompt caching (reduced cost for repeated context)
  - Multi-modal support (Claude 3.5 Sonnet)
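A minimal streaming sketch with the Bedrock Runtime API; the prompt is a placeholder, and the event shapes follow Anthropic's streaming format on Bedrock:

```python
import json
import boto3

bedrock_runtime = boto3.client('bedrock-runtime')
prompt = "What is AWS Lambda?"  # placeholder; normally the assembled RAG prompt

response = bedrock_runtime.invoke_model_with_response_stream(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [{'role': 'user', 'content': prompt}],
    })
)

# Each stream event carries a JSON-encoded Anthropic event; emit text deltas as they arrive
for event in response['body']:
    chunk = json.loads(event['chunk']['bytes'])
    if chunk.get('type') == 'content_block_delta':
        print(chunk['delta'].get('text', ''), end='', flush=True)
```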
- Role: Deploy custom or open-source LLMs
- Use Cases:
  - Fine-tuned models on proprietary data
  - Models not available in Bedrock
  - Custom inference logic
  - Cost optimization with reserved capacity
- Deployment Options:
  - Real-time Endpoints: ml.g5.xlarge, ml.g5.2xlarge
  - Serverless Inference: Auto-scaling, pay-per-use
  - Async Endpoints: Batch inference, queue-based
- Popular Models:
  - Llama 2/3 (7B, 13B, 70B)
  - Mistral (7B, 8x7B)
  - Falcon (7B, 40B)
  - Fine-tuned GPT variants
You are a helpful AI assistant. Answer questions based ONLY on the provided context.
Guidelines:
- If the context doesn't contain the answer, say "I don't have enough information"
- Cite sources by mentioning the document name
- Be concise and accurate
- Do not make up information
Context:
{retrieved_chunks}
Question: {user_question}
Instructions:
- Answer based only on the context above
- Cite the relevant document sections
- If uncertain, acknowledge limitations
Answer:
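One way the template above might be assembled in code; the helper name is illustrative, and the chunk fields follow the chunk metadata structure from the chunking section:

```python
def build_prompt(retrieved_chunks: list[dict], user_question: str) -> str:
    # Label each chunk with its source document so the model can cite it
    context = "\n\n".join(
        f"[{chunk['metadata']['document_title']}] {chunk['content']}"
        for chunk in retrieved_chunks
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {user_question}\n\n"
        "Instructions:\n"
        "- Answer based only on the context above\n"
        "- Cite the relevant document sections\n"
        "- If uncertain, acknowledge limitations\n\n"
        "Answer:"
    )
```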
- Few-Shot Prompting: Include examples in system prompt
- Chain-of-Thought: Request step-by-step reasoning
- Citations: Force structured source attribution
- Hedging: Encourage uncertainty expression
Retrieve → Build Prompt → LLM → Response
- Single API call
- Fastest approach
- Best for simple queries
Retrieve → Generate Draft → Re-retrieve → Refine → Final Response
- Multiple LLM calls
- Better accuracy
- Higher latency
Query → Agent → [Tool: Search] → [Tool: Calculator] → [Tool: Database] → Response
- Multi-step reasoning
- External tool integration
- Complex workflows
- Bedrock Feature: Cache system prompt and context
- Benefit: 90% cost reduction on cached tokens
- TTL: 5 minutes
- Use Case: Repeated queries with same context
- Simple queries: Use Haiku (25x cheaper than Opus)
- Complex reasoning: Use Sonnet
- Fallback strategy: Try Haiku first, escalate if needed (sketched below)
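A minimal sketch of that escalation, assuming an `invoke_llm` helper that wraps the Bedrock call; treating the canned refusal string as the low-confidence signal is a deliberately simple stand-in for a real confidence check:

```python
def generate_with_fallback(prompt: str) -> str:
    # Try the cheap model first (model IDs are current Bedrock identifiers)
    answer = invoke_llm('anthropic.claude-3-5-haiku-20241022-v1:0', prompt)
    # Escalate only when the small model declines to answer
    if "I don't have enough information" in answer:
        answer = invoke_llm('anthropic.claude-3-5-sonnet-20241022-v2:0', prompt)
    return answer
```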
- Limit retrieved chunks: 5-10 most relevant
- Summarize long documents: Pre-process with Haiku
- Compress context: Remove redundant information
Expose RAG functionality through secure, scalable APIs.
- Role: Managed API endpoint
- API Types:
  - REST API
    - Traditional request-response
    - Full API management features
    - Custom domains, API keys
  - HTTP API
    - Lower latency, lower cost
    - Simpler configuration
    - JWT/OAuth 2.0 support
  - WebSocket API
    - Bidirectional communication
    - Streaming responses
    - Chat applications
- Configuration:

```yaml
Endpoints:
  - POST /query
  - POST /ingest
  - GET /document/{id}
  - GET /health
  - GET /metrics
Features:
  - Request validation
  - Response transformation
  - CORS configuration
  - API keys / Usage plans
  - Rate limiting (1000 req/sec per key)
  - Request/response logging
```

- Integration Types:
  - Lambda Proxy: Full request/response control
  - Lambda Non-Proxy: Custom mapping templates
  - HTTP Proxy: Forward to other services
  - VPC Link: Private integrations
- Security:
  - AWS WAF integration
  - Cognito authorizers
  - IAM authorization
  - API keys and usage plans
  - Resource policies
- Role: Business logic execution
- Key Functions:

| Function | Purpose | Trigger | Timeout |
|---|---|---|---|
| `query-handler` | Process user queries | API Gateway | 30s |
| `ingest-trigger` | Initiate document ingestion | API Gateway | 5s |
| `document-status` | Check processing status | API Gateway | 3s |
| `feedback-collector` | Store user feedback | API Gateway | 5s |

- Configuration Best Practices:
  - Memory: 1024-3008 MB (CPU scales proportionally)
  - Timeout: Query: 30s, Ingest: 5s
  - Concurrency: Reserved for production (avoid cold starts)
  - VPC: Only if accessing VPC resources (OpenSearch, ElastiCache)
  - Environment Variables: Service endpoints, config
  - Layers: Shared dependencies (opensearch-py, redis-py)
- Query Handler Lambda Flow:

```python
import json

def lambda_handler(event, context):
    # 1. Parse request (proxy integration delivers the body as a JSON string)
    body = json.loads(event['body'])
    query = body['query']
    filters = body.get('filters', {})

    # 2. Check cache
    cached = check_redis_cache(query)
    if cached:
        return cached

    # 3. Generate query embedding
    query_vector = bedrock_embed(query)

    # 4. Search OpenSearch
    results = opensearch_hybrid_search(
        vector=query_vector,
        text=query,
        filters=filters,
        top_k=10
    )

    # 5. Build context (named to avoid shadowing the handler's `context` argument)
    context_text = build_context(results)

    # 6. Generate response
    response = bedrock_generate(query=query, context=context_text)

    # 7. Cache result
    cache_response(query, response)

    # 8. Return
    return {
        'statusCode': 200,
        'body': json.dumps({
            'answer': response,
            'sources': results,
            'cached': False
        })
    }
```
- Role: Deploy containerized RAG application
- Use Cases:
  - FastAPI/Flask applications
  - Complex dependencies
  - Persistent connections
  - Custom runtime requirements
- Configuration:

```yaml
Service:
  Name: rag-api-service
  Source:
    Type: ECR
    Image: account.dkr.ecr.region.amazonaws.com/rag-api:latest
  Instance:
    CPU: 1 vCPU
    Memory: 2 GB
  AutoScaling:
    MinSize: 1
    MaxSize: 10
    TargetConcurrency: 100
  HealthCheck:
    Path: /health
    Interval: 10s
```

- Advantages:
  - No cold starts
  - WebSocket support
  - Long-running connections
  - Easier debugging
- Role: Container orchestration without servers
- Use Cases:
- Microservices architecture
- Complex multi-container applications
- Need for load balancing
- Blue-green deployments
Request:
{
"query": "What is AWS Lambda?",
"filters": {
"category": "documentation",
"date_range": {"start": "2024-01-01", "end": "2024-12-31"}
},
"options": {
"top_k": 5,
"rerank": true,
"stream": false
}
}
Response:
{
"answer": "AWS Lambda is a serverless compute service...",
"sources": [
{
"chunk_id": "doc123_chunk_5",
"content": "...",
"document": "aws-lambda-guide.pdf",
"score": 0.92,
"page": 3
}
],
"metadata": {
"latency_ms": 234,
"cached": false,
"model": "claude-3-5-sonnet",
"tokens_used": 1523
}
}

Request:
{
"document_url": "s3://bucket/document.pdf",
"metadata": {
"category": "technical",
"author": "John Doe",
"tags": ["AWS", "serverless"]
}
}
Response:
{
"job_id": "ingest-12345",
"status": "processing",
"estimated_completion": "2024-01-30T10:30:00Z"
}

Response:
{
"document_id": "doc123",
"status": "completed",
"progress": {
"extraction": "completed",
"chunking": "completed",
"embedding": "completed",
"indexing": "completed"
},
"chunks_created": 45,
"processing_time_seconds": 12.3
}

Reduce latency and costs by caching embeddings, queries, and responses.
- Role: In-memory caching for low-latency access
- Cache Types:

1. Query Embedding Cache
   - Key: hash(query_text)
   - Value: embedding_vector
   - TTL: 7 days
   - Purpose: Avoid regenerating embeddings for repeat queries

2. Query Result Cache
   - Key: hash(query + filters)
   - Value: {results, context, metadata}
   - TTL: 1 hour
   - Purpose: Return cached results for identical queries

3. LLM Response Cache
   - Key: hash(query + context + model)
   - Value: generated_response
   - TTL: 24 hours
   - Purpose: Avoid LLM calls for identical questions

4. Session Cache
   - Key: session_id
   - Value: conversation_history
   - TTL: 30 minutes
   - Purpose: Multi-turn conversations

- Cluster Configuration:

Development:

```yaml
Node Type: cache.t3.micro
Nodes: 1
Engine: Redis 7.0
Multi-AZ: false
Encryption: In-transit only
```

Production:

```yaml
Node Type: cache.r6g.large
Nodes: 3 (1 primary, 2 replicas)
Engine: Redis 7.0
Multi-AZ: true
Encryption: At-rest + In-transit
Backup: Automated snapshots
```

- Eviction Policy: `allkeys-lru` (Least Recently Used)
- Role: CDN for API responses and static assets
- Use Cases:
  - Cache GET endpoints globally
  - Reduce API Gateway costs
  - Improve response times for geographically distributed users
  - Serve static assets (UI, documentation)
- Configuration:

```yaml
Origin: API Gateway
Behaviors:
  - PathPattern: /query
    CachePolicy: Disabled (dynamic content)
  - PathPattern: /health
    CachePolicy: CachingOptimized (5 min TTL)
  - PathPattern: /static/*
    CachePolicy: CachingOptimized (24 hour TTL)
```
Request → CloudFront (if enabled) → API Gateway → Lambda
↓
Check ElastiCache
↓
Cache Hit → Return
↓
Cache Miss → Process
↓
Update Cache → Return
Query Cache Key:
cache_key = f"query:{hash(query_text)}:{hash(filters)}:{model}"
# Example: "query:a3f5e9:b2c4d6:claude-3-5-sonnet"Embedding Cache Key:
cache_key = f"embedding:{hash(text)}:{model}:{dimensions}"
# Example: "embedding:d4e6f8:titan-v2:1024"1. Preload Common Queries
# During deployment or scheduled job
common_queries = [
"What is AWS Lambda?",
"How to deploy containers?",
# ... top 100 queries
]
for query in common_queries:
    result = process_query(query)
    cache.set(cache_key(query), result, ttl=86400)

2. Analytics-Driven Warming
Daily Job:
- Analyze query logs (CloudWatch Insights)
- Identify top 100 queries
- Pre-compute and cache results
- Update cache before peak hours
Strategies:
- TTL-Based: Automatic expiration
  - Query results: 1 hour
  - Embeddings: 7 days
  - LLM responses: 24 hours
- Event-Based: Invalidate on data changes (helper sketched after this list)

```python
# When a document is reindexed
invalidate_pattern("query:*")  # Clear all query caches
```

- Manual: Admin endpoint for cache clearing

```
POST /admin/cache/clear
{
  "pattern": "query:*",
  "reason": "Index updated"
}
```
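The `invalidate_pattern` helper used above is not defined in this document; a minimal sketch with redis-py, using SCAN rather than KEYS so invalidation does not block the cache:

```python
import redis

r = redis.Redis(host='rag-cache.xxxxxx.cache.amazonaws.com', port=6379)  # placeholder endpoint

def invalidate_pattern(pattern: str) -> int:
    """Delete all keys matching `pattern` via SCAN (non-blocking, unlike KEYS)."""
    deleted = 0
    for key in r.scan_iter(match=pattern, count=500):
        r.delete(key)
        deleted += 1
    return deleted
```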
| Metric | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Avg Latency | 2.5s | 150ms | 94% faster |
| P99 Latency | 5.2s | 200ms | 96% faster |
| LLM Cost | $0.05/query | $0.005/query | 90% cheaper |
| Throughput | 100 QPS | 1000 QPS | 10x increase |
Track system health, performance, and usage for debugging and optimization.
- Role: Central monitoring and logging service
- Components:

1. CloudWatch Logs
   - Log Groups per Lambda function
   - API Gateway access logs
   - Step Functions execution logs
   - OpenSearch slow query logs
   - Application logs (structured JSON)

Retention Policy:

```yaml
Development: 7 days
Production: 90 days
Compliance: 1-7 years
```

2. CloudWatch Metrics
Standard Metrics:
- Lambda: Invocations, Duration, Errors, Throttles
- API Gateway: Count, Latency, 4xx/5xx errors
- OpenSearch: CPU, Memory, Disk, Indexing rate
- ElastiCache: CPU, Memory, Cache hits/misses
- S3: Storage, Requests
- Bedrock: Model invocations, Throttles
Custom Metrics:
```python
cloudwatch.put_metric_data(
    Namespace='RAG/Application',
    MetricData=[
        {
            'MetricName': 'QueryLatency',
            'Value': latency_ms,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'ModelName', 'Value': 'claude-3-5-sonnet'}
            ]
        },
        {
            'MetricName': 'RetrievalQuality',
            'Value': relevance_score,
            'Unit': 'None'
        },
        {
            'MetricName': 'CacheHitRate',
            'Value': 100 * cache_hits / total_requests,  # Percent expects 0-100
            'Unit': 'Percent'
        }
    ]
)
```
3. CloudWatch Dashboards
System Overview Dashboard:
```yaml
Widgets:
  - Total Queries (Count)
  - Average Latency (ms)
  - Error Rate (%)
  - Active Users
  - Cost per Query ($)
  - Cache Hit Rate (%)
```

Performance Dashboard:

```yaml
Widgets:
  - Query Pipeline Latency (breakdown)
  - Embedding Generation Time
  - OpenSearch Query Time
  - LLM Inference Time
  - End-to-End Latency (P50, P95, P99)
```

Infrastructure Dashboard:

```yaml
Widgets:
  - Lambda Concurrent Executions
  - OpenSearch CPU/Memory
  - ElastiCache Hit Rate
  - S3 Storage Used
  - API Gateway Throttles
```

4. CloudWatch Alarms

Critical Alarms:

```yaml
- Name: HighErrorRate
  Metric: Errors
  Threshold: "> 5%"
  Period: 5 minutes
  Action: SNS → PagerDuty
- Name: HighLatency
  Metric: Duration
  Threshold: "> 5000ms"
  Period: 5 minutes
  Action: SNS → Email
- Name: OpenSearchClusterRed
  Metric: ClusterStatus.red
  Threshold: ">= 1"
  Period: 1 minute
  Action: SNS → PagerDuty
- Name: LambdaThrottling
  Metric: Throttles
  Threshold: "> 10"
  Period: 5 minutes
  Action: SNS → Slack
```

Cost Alarms:

```yaml
- Name: DailyBudgetExceeded
  Metric: EstimatedCharges
  Threshold: "> $100/day"
  Action: SNS → Email + Slack
```
- Role: Distributed tracing for request flows
- Instrumentation:

```python
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all supported libraries (boto3, redis, requests, ...)
patch_all()

@xray_recorder.capture('query_handler')
def lambda_handler(event, context):
    with xray_recorder.capture('check_cache'):
        cached = redis_client.get(cache_key)

    with xray_recorder.capture('opensearch_search'):
        results = opensearch.search(...)

    with xray_recorder.capture('bedrock_generate'):
        response = bedrock.invoke_model(...)

    return response
```
- Service Map: Visual representation of request flow
  `API Gateway → Lambda → [ElastiCache, OpenSearch, Bedrock]`
- Trace Analysis:
- Identify bottlenecks (which service is slow)
- Detect errors and exceptions
- Analyze cold start impact
- Measure service dependencies
- Role: Advanced visualization and analytics
- Features:
  - Multi-source dashboards (CloudWatch + custom)
  - Advanced querying and filtering
  - Alerting and notification
  - Team collaboration
- Use Cases:
  - Executive dashboards
  - Real-time operations monitoring
  - Historical trend analysis
  - Cost attribution
- Role: Interactive log analytics
- Query Examples:

Find slow queries:

```
fields @timestamp, query, latency
| filter latency > 2000
| sort latency desc
| limit 20
```

Error analysis:

```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by error_type
```

Cache hit rate:

```
fields cache_hit
| stats count() by cache_hit
```

Cost per query:

```
fields query, bedrock_cost, opensearch_cost
| stats sum(bedrock_cost + opensearch_cost) as total_cost by bin(5m)
```
| Metric | Target | Alerting Threshold |
|---|---|---|
| Query Latency (P95) | < 2s | > 5s |
| Error Rate | < 1% | > 5% |
| Cache Hit Rate | > 60% | < 40% |
| Retrieval Precision | > 0.8 | < 0.6 |
| LLM Hallucination Rate | < 5% | > 10% |
| Metric | Target | Alerting Threshold |
|---|---|---|
| Lambda Cold Starts | < 5% | > 15% |
| OpenSearch CPU | < 70% | > 85% |
| ElastiCache Memory | < 80% | > 90% |
| API Gateway 5xx | < 0.1% | > 1% |
| Metric | Target | Review Frequency |
|---|---|---|
| Daily Active Users | Track growth | Daily |
| Queries per User | > 5/session | Weekly |
| User Satisfaction | > 4/5 | Weekly |
| Cost per Query | < $0.05 | Daily |
```python
import json
import logging
from datetime import datetime

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_query(query, latency, cached, error=None):
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'event': 'query_processed',
        'query': query,
        'latency_ms': latency,
        'cached': cached,
        'model': 'claude-3-5-sonnet',
        'environment': 'production'
    }
    if error:
        log_entry['error'] = str(error)
        log_entry['level'] = 'ERROR'
    else:
        log_entry['level'] = 'INFO'
    logger.info(json.dumps(log_entry))
```

```python
# API Gateway passes the request ID
request_id = event['requestContext']['requestId']

# Add it to all logs
logger.info(json.dumps({
    'request_id': request_id,
    'message': 'Processing query'
}))

# Pass it to downstream services
headers = {'X-Request-ID': request_id}
```

Ensure data protection, access control, and compliance.
- Role: Access control and permissions
- IAM Policies:

Lambda Execution Role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::rag-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": "arn:aws:bedrock:*:*:model/*"
    },
    {
      "Effect": "Allow",
      "Action": ["es:ESHttpPost", "es:ESHttpPut", "es:ESHttpGet"],
      "Resource": "arn:aws:es:region:account:domain/rag-vectors/*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    }
  ]
}
```

Principle of Least Privilege: Each service gets only the permissions it needs.
- Role: User authentication and authorization
- Features:
  - User pools for authentication
  - Identity pools for AWS resource access
  - OAuth 2.0 / SAML / Social identity providers
  - Multi-factor authentication (MFA)
  - Custom authentication flows
- Configuration:

```yaml
UserPool:
  Name: rag-users
  MFA: Optional
  PasswordPolicy:
    MinimumLength: 12
    RequireUppercase: true
    RequireNumbers: true
    RequireSymbols: true
  CustomAttributes:
    - organization
    - access_level
AppClients:
  - Name: web-app
    AuthFlows: [USER_PASSWORD_AUTH]
  - Name: mobile-app
    AuthFlows: [USER_SRP_AUTH]
```

- API Gateway Integration:

```yaml
Authorizer:
  Type: COGNITO_USER_POOLS
  UserPoolArn: arn:aws:cognito-idp:region:account:userpool/pool-id
  IdentitySource: method.request.header.Authorization
```
- Role: Store and rotate sensitive credentials
- Stored Secrets:
  - Third-party API keys
  - Database passwords (if using Aurora)
  - OpenSearch credentials
  - External service tokens
- Secret Rotation:

```yaml
Secret:
  Name: /rag/opensearch/credentials
  RotationSchedule: 90 days
  RotationLambda: rotate-opensearch-credentials
```

- Lambda Access:

```python
import json
import boto3

secrets_client = boto3.client('secretsmanager')

def get_secret(secret_name):
    response = secrets_client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])
```
- Role: Encryption key management
- Encryption Scope:
  - S3 buckets (server-side encryption)
  - OpenSearch domain (at-rest encryption)
  - ElastiCache (at-rest + in-transit)
  - EBS volumes (Lambda in VPC)
  - Secrets Manager
- Key Configuration:

```yaml
CustomerManagedKey:
  Alias: rag/encryption-key
  Description: Master key for RAG system encryption
  KeyPolicy:
    - Sid: Enable IAM policies
      Principal: {AWS: "arn:aws:iam::account:root"}
      Action: kms:*
    - Sid: Allow services to use key
      Principal:
        Service:
          - s3.amazonaws.com
          - es.amazonaws.com
          - lambda.amazonaws.com
      Action:
        - kms:Decrypt
        - kms:GenerateDataKey
  KeyRotation: Enabled (automatic annual rotation)
```
- Role: Protect API Gateway from attacks
- Rules:

1. Rate Limiting:

```yaml
- Name: RateLimitRule
  Priority: 1
  Statement:
    RateBasedStatement:
      Limit: 1000  # requests per 5 minutes
      AggregateKeyType: IP
  Action: Block
```

2. SQL Injection Protection:

```yaml
- Name: SQLiProtection
  Priority: 2
  Statement:
    SqliMatchStatement:
      FieldToMatch: {Body: {}}
      TextTransformations:
        - {Priority: 0, Type: URL_DECODE}
  Action: Block
```

3. Geo-Blocking (if needed):

```yaml
- Name: GeoBlock
  Priority: 3
  Statement:
    GeoMatchStatement:
      CountryCodes: [CN, RU]  # Example
  Action: Block
```

4. IP Whitelist (for admin endpoints):

```yaml
- Name: AdminIPWhitelist
  Priority: 4
  Statement:
    NotStatement:
      Statement:
        IPSetReferenceStatement:
          Arn: arn:aws:wafv2:region:account:ipset/admin-ips
  Action: Block
  Scope: /admin/*
```
- Role: Discover and protect sensitive data
- Use Cases:
  - Scan S3 buckets for PII/PHI
  - Classify documents by sensitivity
  - Alert on sensitive data exposure
  - Compliance reporting
- Sensitive Data Types:
  - Credit card numbers
  - Social security numbers
  - Email addresses
  - Phone numbers
  - Medical records
  - Credentials
- Role: Audit logging for all AWS API calls
- Configuration:

```yaml
Trail:
  Name: rag-audit-trail
  S3BucketName: rag-audit-logs
  IncludeGlobalEvents: true
  IsMultiRegion: true
  EnableLogFileValidation: true
  EventSelectors:
    - ReadWriteType: All
      IncludeManagementEvents: true
      DataResources:
        - Type: AWS::S3::Object
          Values: ["arn:aws:s3:::rag-documents/*"]
        - Type: AWS::Lambda::Function
          Values: ["arn:aws:lambda:*:*:function/rag-*"]
```

- Monitoring:
  - Track all document access
  - Monitor configuration changes
  - Detect unauthorized API calls
  - Compliance auditing
- Role: Resource configuration tracking and compliance
- Rules:
```yaml
ConfigRules:
  - s3-bucket-encryption-enabled
  - lambda-function-settings-check
  - opensearch-encrypted-at-rest
  - iam-password-policy
  - approved-amis-by-tag
```
At Rest:
- S3: SSE-KMS with customer-managed keys
- OpenSearch: Node-to-node encryption + at-rest
- ElastiCache: At-rest encryption enabled
- EBS (Lambda): Encrypted volumes
- Secrets Manager: KMS encryption
In Transit:
- API Gateway: TLS 1.2+
- Lambda → OpenSearch: HTTPS
- Lambda → Bedrock: HTTPS
- Lambda → ElastiCache: TLS
- All inter-service: AWS PrivateLink (optional)
VPC Configuration (if needed):
VPC:
CIDR: 10.0.0.0/16
Subnets:
Public:
- 10.0.1.0/24 # AZ1
- 10.0.2.0/24 # AZ2
Private:
- 10.0.10.0/24 # AZ1 - Lambda
- 10.0.11.0/24 # AZ2 - Lambda
- 10.0.20.0/24 # AZ1 - OpenSearch
- 10.0.21.0/24 # AZ2 - OpenSearch
SecurityGroups:
Lambda:
Ingress: [] # No inbound
Egress:
- Protocol: HTTPS
Destination: 0.0.0.0/0
OpenSearch:
Ingress:
- Protocol: HTTPS
Source: Lambda-SG
Egress: []
ElastiCache:
Ingress:
- Protocol: Redis (6379)
Source: Lambda-SG
    Egress: []

VPC Endpoints (reduce NAT costs):
VPCEndpoints:
- Service: com.amazonaws.region.s3 (Gateway)
- Service: com.amazonaws.region.bedrock (Interface)
  - Service: com.amazonaws.region.secretsmanager (Interface)

Layer 1: AWS WAF → Block malicious traffic
Layer 2: API Gateway → API key / Cognito auth
Layer 3: IAM Roles → Service-level permissions
Layer 4: Resource Policies → Fine-grained access
Layer 5: Encryption → Data protection
GDPR:
- Data minimization in logs
- User consent for data processing
- Right to be forgotten (delete user data)
- Data portability (export user data)
- Data retention policies

HIPAA:
- BAA with AWS
- Enable CloudTrail logging
- Encrypt all data at rest and in transit
- Access controls and audit logs
- Incident response procedures

SOC 2:
- Security monitoring and alerting
- Access control and authentication
- Change management procedures
- Incident response and disaster recovery
- Regular security assessments
Automate infrastructure provisioning and application deployment.
- Role: Infrastructure as Code
- AWS CDK (Recommended):
```python
from aws_cdk import (
    Duration,
    Stack,
    aws_s3 as s3,
    aws_lambda as lambda_,
    aws_apigateway as apigw,
    aws_opensearchservice as opensearch,
    aws_ec2 as ec2,
)

class RagStack(Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)

        # S3 Bucket
        documents_bucket = s3.Bucket(
            self, "DocumentsBucket",
            encryption=s3.BucketEncryption.KMS,
            versioned=True,
            lifecycle_rules=[...]
        )

        # OpenSearch Domain (defined before the Lambda that reads its endpoint)
        opensearch_domain = opensearch.Domain(
            self, "VectorDB",
            version=opensearch.EngineVersion.OPENSEARCH_2_11,
            capacity={
                "data_nodes": 3,
                "data_node_instance_type": "r6g.large.search"
            },
            ebs={
                "volume_size": 100,
                "volume_type": ec2.EbsDeviceVolumeType.GP3
            },
            encryption_at_rest={"enabled": True},
            node_to_node_encryption=True
        )

        # Lambda Functions
        query_handler = lambda_.Function(
            self, "QueryHandler",
            runtime=lambda_.Runtime.PYTHON_3_11,
            handler="index.handler",
            code=lambda_.Code.from_asset("lambda/query"),
            memory_size=1024,
            timeout=Duration.seconds(30),
            environment={
                "OPENSEARCH_ENDPOINT": opensearch_domain.domain_endpoint,
                "BEDROCK_MODEL": "claude-3-5-sonnet"
            }
        )

        # API Gateway
        api = apigw.RestApi(
            self, "RagAPI",
            rest_api_name="RAG Service API"
        )
        query_integration = apigw.LambdaIntegration(query_handler)
        api.root.add_resource("query").add_method("POST", query_integration)
```
- Role: Continuous deployment orchestration
- Pipeline Stages:

```yaml
Pipeline:
  Name: rag-deployment-pipeline
  Stages:
    - Name: Source
      Actions:
        - Name: SourceCode
          ActionType: Source
          Provider: GitHub / CodeCommit
          Configuration:
            Repository: rag-system
            Branch: main
    - Name: Build
      Actions:
        - Name: BuildAndTest
          ActionType: Build
          Provider: CodeBuild
          Configuration:
            ProjectName: rag-build-project
    - Name: DeployDev
      Actions:
        - Name: DeployToDev
          ActionType: Deploy
          Provider: CloudFormation
          Configuration:
            StackName: rag-stack-dev
            TemplatePath: BuildArtifact::template.yaml
            RoleArn: arn:aws:iam::account:role/CloudFormationRole
    - Name: IntegrationTests
      Actions:
        - Name: RunTests
          ActionType: Build
          Provider: CodeBuild
          Configuration:
            ProjectName: rag-integration-tests
    - Name: ApprovalForProd
      Actions:
        - Name: ManualApproval
          ActionType: Approval
    - Name: DeployProd
      Actions:
        - Name: DeployToProduction
          ActionType: Deploy
          Provider: CloudFormation
          Configuration:
            StackName: rag-stack-prod
            TemplatePath: BuildArtifact::template.yaml
```
- Role: Build and test execution
- buildspec.yml:
```yaml
version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.11
    commands:
      - pip install -r requirements.txt
      - pip install pytest pytest-cov
  pre_build:
    commands:
      - echo "Running linters..."
      - black --check .
      - flake8 .
      - mypy .
  build:
    commands:
      - echo "Running unit tests..."
      - pytest tests/unit --cov=src --cov-report=xml
      - echo "Building Lambda deployment packages..."
      # Subshells keep the working directory stable between commands
      - (cd lambda/query && zip -r ../../query-handler.zip .)
      - (cd lambda/ingest && zip -r ../../ingest-handler.zip .)
      - echo "Synthesizing CDK stack..."
      - cdk synth --output cdk.out
  post_build:
    commands:
      - echo "Build completed"

artifacts:
  files:
    - '**/*'
  name: BuildArtifact

reports:
  coverage:
    files:
      - 'coverage.xml'
    file-format: 'COBERTURAXML'
```
1. Blue-Green Deployment (Lambda):
Lambda Function:
Alias: Production
Versions:
- Version: $LATEST (Blue - current)
- Version: 2 (Green - new)
Deployment:
Type: AllAtOnce / Linear / Canary
Linear10PercentEvery10Minutes:
- 0 min: 10% traffic to Green
- 10 min: 20% traffic to Green
- ...
- 90 min: 100% traffic to Green
Rollback:
Trigger: CloudWatch Alarm (error rate > 5%)
  Action: Route 100% traffic back to Blue

2. Canary Deployment (API Gateway):
API Gateway Stage:
Name: production
DeploymentId: latest-deployment
CanarySettings:
PercentTraffic: 10
UseStageCache: false
Progression:
- Deploy new version with 10% traffic
- Monitor metrics for 30 minutes
- If healthy: Promote canary to 100%
  - If unhealthy: Rollback to previous version

Environments:
Development:
Purpose: Developer testing
Infrastructure: Minimal (1 OpenSearch node, t3.small)
Data: Synthetic test data
Access: Development team
Staging:
Purpose: Pre-production validation
Infrastructure: Production-like (3 OpenSearch nodes, r6g.large)
Data: Sanitized production data
Access: QA team + stakeholders
Production:
Purpose: Live customer traffic
Infrastructure: Fully redundant, multi-AZ
Data: Real customer data
  Access: Operations team only

Lambda Rollback:
# Automated rollback via CodeDeploy
aws deploy stop-deployment --deployment-id d-XXXXX
# Manual version rollback
aws lambda update-alias \
--function-name query-handler \
--name production \
  --function-version 5  # Previous stable version

Infrastructure Rollback:
# CloudFormation rollback
aws cloudformation rollback-stack --stack-name rag-stack-prod
# Or deploy previous template version
cdk deploy --version-tag v1.2.3

Document Ingestion Flow:

1. User uploads document to S3
↓
2. S3 triggers EventBridge rule
↓
3. EventBridge starts Step Functions workflow
↓
4. Step Functions orchestrates:
a. Lambda + Textract: Extract text
b. Lambda + Comprehend: Extract metadata
c. Lambda: Chunk text
d. Lambda + Bedrock: Generate embeddings
e. Lambda: Index in OpenSearch
↓
5. Document indexed and searchable
↓
6. SNS notification: "Document XYZ processed successfully"
Query Flow:

1. User sends query via web/mobile app
↓
2. API Gateway receives request
↓
3. Cognito validates JWT token
↓
4. WAF checks for malicious patterns
↓
5. API Gateway invokes Lambda
↓
6. Lambda checks ElastiCache for cached response
↓ (Cache miss)
7. Lambda calls Bedrock to embed query
↓
8. Lambda searches OpenSearch (hybrid: vector + BM25)
↓
9. Lambda optionally reranks results
↓
10. Lambda assembles context from top results
↓
11. Lambda calls Bedrock to generate response
↓
12. Lambda caches response in ElastiCache
↓
13. Lambda returns response to API Gateway
↓
14. API Gateway returns to client
↓
15. CloudWatch logs all metrics
1. Event-Driven Processing
- Pattern: S3 events trigger processing pipelines
- Benefits: Decoupled components, automatic scaling
- Implementation: EventBridge + Step Functions + Lambda

2. Serverless-First Compute
- Pattern: Use managed services and Lambda over EC2
- Benefits: No server management, pay-per-use, auto-scaling
- Trade-offs: Cold starts, execution time limits

3. Multi-Layer Caching
- Pattern: Multi-layer caching (CloudFront → ElastiCache → Source)
- Benefits: Reduced latency, lower costs
- Implementation: Cache frequently accessed data at each layer

4. Circuit Breaker
- Pattern: Fail fast when dependencies are down
- Benefits: Prevent cascade failures
- Implementation: Lambda error handling + exponential backoff

5. Retry with Exponential Backoff
- Pattern: Automatically retry failed operations with increasing delays
- Benefits: Handle transient failures
- Implementation: Step Functions retry config, Lambda custom logic

6. Dead Letter Queue
- Pattern: Route failed messages to separate queue for analysis
- Benefits: Don't lose failed requests, debug issues
- Implementation: SQS DLQ for Lambda, SNS for notifications
| Service | Usage | Monthly Cost |
|---|---|---|
| Bedrock (Embeddings) | 10M tokens | $100 |
| Bedrock (LLM - Sonnet) | 50M input tokens, 10M output | $300 |
| OpenSearch | 3x r6g.large nodes | $900 |
| ElastiCache | 3x r6g.large nodes | $600 |
| Lambda | 1M invocations, 512MB, 2s avg | $40 |
| API Gateway | 10M requests | $35 |
| S3 Storage | 1TB documents | $23 |
| Data Transfer | 500GB outbound | $45 |
| CloudWatch | Logs + Metrics | $50 |
| Step Functions | 100K executions | $25 |
| Total | | ~$2,118/month |
- Titan v2: 1024 → 512 dimensions
- Benefit: 50% smaller storage, faster search
- Trade-off: Slightly lower accuracy
- Cache hit rate 60% → 90%
- Reduce Bedrock calls by 30%
- Save ~$100/month on LLM costs
- AWS Batch with Spot for bulk embedding
- Save up to 90% on compute costs
- Suitable for non-time-sensitive workloads
- Automatically move infrequent access data
- Save 40-70% on storage costs
- Zero retrieval fees
- Start small (r6g.large), monitor usage
- Use UltraWarm for historical data
- Consider Aurora pgvector for smaller workloads
- Cache system prompts and context
- Save 90% on cached token costs
- Especially valuable for repetitive queries
- Use AWS Compute Optimizer recommendations
- Right-size memory allocation
- Over-provisioning wastes money, under-provisioning increases duration
- OpenSearch Reserved Instances: 35-75% savings
- ElastiCache Reserved Nodes: 35-55% savings
- Commit to 1-year term for predictable workloads
- Set up AWS account and basic infrastructure
- Implement simple text ingestion pipeline
- Deploy basic vector search with OpenSearch
- Create minimal query API
- Add hybrid search (vector + BM25)
- Implement caching layer
- Set up monitoring and alerting
- Deploy to production
- Fine-tune retrieval quality
- Optimize costs
- Implement advanced features (reranking, multi-modal)
- Comprehensive testing and evaluation
- Handle increased traffic
- Add new document types
- Improve user experience
- Continuous optimization
This AWS-native RAG architecture provides:
✅ Fully Serverless: No servers to manage
✅ Scalable: Handles from 10 to 10M queries/day
✅ Cost-Effective: Pay only for what you use
✅ Secure: End-to-end encryption, compliance-ready
✅ Observable: Comprehensive monitoring and tracing
✅ Maintainable: Infrastructure as Code, automated deployments
Ready to build with confidence on AWS!
- Title: AWS-Native RAG Service Architecture
- Version: 1.0
- Last Updated: January 30, 2026
- Author: Paul - Senior AI Consultant at KrishAI Technologies
- Purpose: Complete architectural documentation for building production-ready RAG systems on AWS