Building Production RAG Pipelines on AWS with Bedrock and OpenSearch

Introduction: Why Build a RAG Pipeline on AWS Bedrock

Building a production RAG pipeline on AWS Bedrock has become the standard approach for enterprises that need to augment large language models with proprietary knowledge. Retrieval-augmented generation (RAG) addresses the fundamental limitation of foundation models: they can only generate responses from their training data, which quickly becomes stale and never includes your organization's private documents, policies, or domain-specific knowledge.

AWS Bedrock provides a fully managed foundation model service that eliminates the infrastructure complexity of hosting LLMs. Combined with OpenSearch Serverless for vector search, you get a production-grade RAG stack that scales automatically, integrates with AWS security services, and requires no ML infrastructure expertise. This is a decisive advantage over self-managed solutions that demand GPU cluster management, model serving infrastructure, and vector database operations.

In this guide, we walk through the complete architecture for building production RAG pipelines on AWS. We cover data ingestion, chunking strategies, embedding model selection, vector store configuration, retrieval optimization, and hallucination mitigation. The Terraform modules and Python code referenced here are available in our open-source repositories, including terraform-aws-rag-pipeline for infrastructure provisioning.

The patterns described are drawn from production deployments at Citadel Cloud Management, where we have built RAG systems processing millions of documents for enterprise knowledge management, customer support automation, and regulatory compliance applications. As documented in the AWS Bedrock documentation, the service supports multiple foundation models from Anthropic, Amazon, Meta, Cohere, and others, giving you flexibility to choose the best model for your use case.

Production RAG Architecture Overview

A production RAG pipeline on AWS consists of two primary workflows: an ingestion pipeline that processes documents into searchable vectors, and a query pipeline that retrieves relevant context and generates responses. Understanding the separation of these workflows is critical for building systems that scale independently and can be optimized separately.

Ingestion Pipeline

The ingestion pipeline follows this flow: source documents are collected from S3, databases, or APIs. A preprocessing step cleans and normalizes the content. A chunking strategy splits documents into semantically coherent segments. Each chunk is passed through an embedding model (Amazon Titan Embeddings or Cohere Embed via Bedrock) to produce vector representations. These vectors, along with their source text and metadata, are indexed in OpenSearch Serverless using a vector index with k-NN enabled.

For document processing, AWS Bedrock Knowledge Bases provide a managed ingestion pipeline that handles chunking, embedding, and indexing automatically. For more control, you can build a custom pipeline using Lambda, Step Functions, or ECS tasks. The terraform-aws-bedrock-platform module provisions the complete infrastructure for both approaches.

Query Pipeline

The query pipeline processes user queries through the following stages: the user's question is embedded using the same embedding model used during ingestion. The query vector is used to search OpenSearch Serverless for the most semantically similar document chunks (typically top-k where k ranges from 3 to 10). Retrieved chunks are assembled into a prompt template that provides context to the foundation model. The foundation model generates a response grounded in the retrieved context. Post-processing applies guardrails, citation extraction, and confidence scoring before returning the response to the user.

Infrastructure Components

The production architecture requires several AWS services working in concert: Amazon Bedrock for foundation model inference and embeddings, OpenSearch Serverless (vector search collection type) for vector storage and retrieval, S3 for document storage, Lambda or ECS for compute orchestration, Step Functions for pipeline orchestration, CloudWatch for monitoring and alerting, and IAM for fine-grained access control. Each component must be provisioned with production-grade configurations including encryption, VPC endpoints, and least-privilege access.

Data Ingestion and Chunking Strategies

The quality of your RAG pipeline depends heavily on how you process and chunk your source documents. Poor chunking leads to poor retrieval, which leads to poor generation. This is the single most impactful optimization you can make in a RAG system.

Chunking Approaches

Fixed-size chunking splits documents into segments of a predetermined token count (typically 256-512 tokens) with an overlap window (typically 50-100 tokens). This is the simplest approach and works reasonably well for homogeneous document types. However, it frequently splits information across chunk boundaries, losing semantic coherence.
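As a concrete illustration, here is a minimal fixed-size chunker with an overlap window. It approximates token counts with whitespace-delimited words, which is a simplification: production code would count tokens with the embedding model's actual tokenizer.

```python
from typing import List

def chunk_fixed_size(text: str, chunk_size: int = 256, overlap: int = 64) -> List[str]:
    """Split text into fixed-size chunks with an overlap window.

    Sizes are in whitespace-delimited words as a rough proxy for
    tokens; swap in the embedding model's tokenizer for production.
    """
    words = text.split()
    if not words:
        return []
    step = max(1, chunk_size - overlap)  # advance, keeping the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks
```

The overlap window means each chunk repeats the trailing words of its predecessor, so facts near a boundary survive in at least one chunk.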

Semantic chunking uses natural boundaries in the document structure: paragraphs, sections, headers, or topic changes. This preserves semantic coherence within each chunk but produces variable-size chunks that may require padding or truncation at the embedding stage. For structured documents like technical documentation, policies, or knowledge base articles, semantic chunking consistently outperforms fixed-size approaches.
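A minimal semantic chunker for Markdown-style documents might split on headings and then merge consecutive paragraphs up to a word budget. The boundary rules here are illustrative, not a fixed recipe:

```python
import re
from typing import List

def chunk_semantic(markdown_text: str, max_words: int = 300) -> List[str]:
    """Chunk on natural boundaries: split at headings, then merge
    consecutive paragraphs until a size budget is reached."""
    # Split immediately before each heading line (#, ##, ... ######).
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown_text)
    chunks = []
    for section in sections:
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        current, count = [], 0
        for para in paragraphs:
            n = len(para.split())
            if current and count + n > max_words:
                chunks.append("\n\n".join(current))  # flush the budget
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

Because each chunk stays within one section, the heading travels with its body text, which helps the embedding capture the topic.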

Hierarchical chunking creates multiple representations of the same content at different granularity levels: document-level summaries, section-level chunks, and paragraph-level chunks. At query time, the system can retrieve at the appropriate level of detail. This approach is more complex to implement but excels for large document collections where queries range from high-level questions to specific detail lookups.

Metadata Enrichment

Each chunk should be stored with rich metadata that enables filtered retrieval. Essential metadata includes: document source and URL, document type and category, creation and modification dates, section headings and hierarchy path, author and department, access control tags, and any domain-specific attributes. During retrieval, metadata filters narrow the search space before vector similarity is computed, dramatically improving both relevance and performance.
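For example, a chunk document carrying this kind of metadata, and the filter clauses that would narrow retrieval, might look like the following. The field names are illustrative, not a required schema:

```python
# Illustrative chunk document with retrieval metadata (hypothetical
# field names and values, not a mandated schema).
chunk_document = {
    "text": "Employees may carry over up to five unused PTO days per year.",
    "embedding": [0.0] * 1024,  # vector produced by the embedding model
    "source": "s3://corp-docs/hr/pto-policy.pdf",
    "doc_type": "policy",
    "category": "hr",
    "modified_at": "2024-11-02",
    "section_path": "PTO Policy > Carryover",
    "acl_tags": ["employees-all"],
}

# Filter clauses like these narrow the candidate set before any
# vector similarity is computed.
metadata_filter = [
    {"term": {"doc_type": "policy"}},
    {"term": {"category": "hr"}},
    {"range": {"modified_at": {"gte": "2024-01-01"}}},
]
```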

Embedding Models and Vector Storage

The embedding model transforms text into dense vector representations that capture semantic meaning. Choosing the right embedding model and configuring your vector store correctly are critical decisions that affect retrieval quality throughout the life of your RAG system.

Embedding Model Selection

Amazon Titan Embeddings V2 is the default choice for AWS-native RAG pipelines. It supports up to 8,192 input tokens, produces 1,024-dimensional vectors, and is optimized for low-latency inference through Bedrock. Cohere Embed (available through Bedrock) offers multilingual support with 1,024 dimensions and strong performance on retrieval benchmarks. For specialized domains, you may need to fine-tune an embedding model, though this adds significant complexity.

A critical consideration is embedding model lock-in: once you embed your document corpus with a specific model, switching models requires re-embedding the entire corpus. Choose your embedding model carefully and plan for the re-embedding cost if you need to upgrade in the future.

OpenSearch Serverless Vector Configuration

OpenSearch Serverless with the vector search collection type provides a managed vector database that scales automatically. Key configuration decisions include the vector dimension (must match your embedding model output), the similarity algorithm (cosine similarity is standard for normalized embeddings, L2 for unnormalized), and the engine type (FAISS for high-throughput workloads, nmslib for balanced performance, though nmslib has been deprecated in recent OpenSearch releases, so verify engine support for your version). The OpenSearch vector search documentation provides detailed guidance on index configuration and tuning.
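An index mapping along these lines creates the k-NN vector field. Treat it as a sketch to validate against your deployment: engine and parameter support varies by OpenSearch version, and the FAISS engine does not expose a cosine space type, so inner product is used here on the assumption that embeddings are normalized (for unit vectors, inner product ranks identically to cosine similarity).

```python
# Example k-NN index body for an OpenSearch vector index; verify the
# engine, space_type, and HNSW parameters against the OpenSearch
# version you actually deploy.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "source": {"type": "keyword"},
            "metadata": {"type": "object"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # must match the embedding model output
                "method": {
                    "name": "hnsw",              # graph-based ANN search
                    "engine": "faiss",
                    "space_type": "innerproduct",  # cosine-equivalent for normalized vectors
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            },
        }
    },
}
# client.indices.create(index="rag-knowledge-base", body=index_body)
```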

Semantic Search and Retrieval Optimization

Retrieval quality directly determines generation quality. A RAG system that retrieves irrelevant documents will generate irrelevant or hallucinated responses regardless of how capable the foundation model is. Optimizing retrieval is therefore the highest-leverage activity in RAG pipeline development.

Hybrid Search

Pure vector search excels at semantic similarity but can miss exact keyword matches that are critical for technical queries (error codes, product names, configuration parameters). Hybrid search combines vector search with traditional BM25 keyword search and uses reciprocal rank fusion (RRF) to merge results. OpenSearch supports this natively through its compound query types, and Bedrock Knowledge Bases offer hybrid search as a built-in option.
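Reciprocal rank fusion itself is only a few lines. This standalone sketch merges rankings of document IDs, with k = 60 as the conventional smoothing constant:

```python
from typing import Dict, List

def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge best-first rankings of document IDs with RRF.

    Each document's fused score is the sum of 1 / (k + rank) over
    every list it appears in; k dampens the impact of top ranks.
    """
    scores: Dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # vector search ranking
    ["doc1", "doc9", "doc3"],   # BM25 keyword ranking
])
# → ["doc1", "doc3", "doc9", "doc7"]
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the incompatible scales of BM25 and cosine similarity.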

Query Transformation

User queries are often ambiguous, incomplete, or poorly formulated for retrieval. Query transformation techniques improve retrieval by reformulating the query before search. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query and uses that answer's embedding for retrieval, which often produces better matches than the raw query. Query expansion adds related terms and synonyms. Multi-query retrieval generates multiple reformulations and takes the union of results.
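The HyDE pattern can be sketched independently of any particular backend. Here generate_fn, embed_fn, and search_fn are hypothetical injected callables; in this stack they would wrap Bedrock generation, Bedrock embeddings, and the OpenSearch k-NN query respectively:

```python
from typing import Callable, List, Sequence

def hyde_retrieve(
    query: str,
    generate_fn: Callable[[str], str],
    embed_fn: Callable[[str], Sequence[float]],
    search_fn: Callable[[Sequence[float]], List],
) -> List:
    """HyDE: embed a hypothetical answer instead of the raw query.

    The callables are injected so the pattern stays backend-agnostic;
    only the control flow is shown here.
    """
    hypothetical = generate_fn(
        f"Write a short passage that plausibly answers: {query}"
    )
    # The fabricated passage usually lands closer to real answer
    # passages in embedding space than the bare question does.
    return search_fn(embed_fn(hypothetical))
```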

Re-ranking

After initial retrieval, a re-ranking step can dramatically improve the relevance of the top-k results. Cross-encoder re-rankers (such as those available through Cohere Rerank via Bedrock) evaluate each query-document pair jointly, producing more accurate relevance scores than the bi-encoder embedding models used for initial retrieval. The trade-off is latency: re-ranking adds 100-300ms to the query pipeline, so it should be applied only to the top candidates (typically top-20 narrowed to top-5).
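The narrowing step looks like this in outline. score_fn is a stand-in for a real cross-encoder call (such as a rerank model invocation), not an actual API binding:

```python
from typing import Callable, List

def rerank_top_k(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    final_k: int = 5,
) -> List[str]:
    """Narrow a wide candidate set with a cross-encoder-style scorer.

    score_fn(query, text) -> float sees the query and document
    together, so its scores are more accurate (and more expensive)
    than bi-encoder similarity; apply it only to the shortlist.
    """
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:final_k]]
```

In practice the initial retrieval returns roughly 20 candidates and this step keeps the best 5 for the prompt.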

Python Implementation: RAG Pipeline with Bedrock and OpenSearch

The following Python implementation demonstrates a production-grade RAG pipeline using AWS Bedrock for embeddings and generation, with OpenSearch Serverless as the vector store. This code handles document embedding, vector storage, semantic retrieval, and augmented generation with citation tracking.

# Production RAG Pipeline with AWS Bedrock + OpenSearch Serverless
# Repository: https://github.com/kogunlowo123/terraform-aws-rag-pipeline

import json
import boto3
import hashlib
from typing import List, Dict, Optional
from dataclasses import dataclass, field
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# --- Configuration ---
BEDROCK_REGION = "us-east-1"
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"
GENERATION_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
OPENSEARCH_ENDPOINT = "your-collection.us-east-1.aoss.amazonaws.com"
INDEX_NAME = "rag-knowledge-base"
VECTOR_DIMENSION = 1024
TOP_K = 5

@dataclass
class RetrievedChunk:
    text: str
    source: str
    score: float
    metadata: Dict = field(default_factory=dict)

class ProductionRAGPipeline:
    """Production RAG pipeline with Bedrock + OpenSearch Serverless."""

    def __init__(self):
        self.bedrock = boto3.client("bedrock-runtime", region_name=BEDROCK_REGION)
        self.session = boto3.Session()
        credentials = self.session.get_credentials()
        aws_auth = AWS4Auth(
            credentials.access_key,
            credentials.secret_key,
            BEDROCK_REGION,
            "aoss",
            session_token=credentials.token,
        )
        self.opensearch = OpenSearch(
            hosts=[{"host": OPENSEARCH_ENDPOINT, "port": 443}],
            http_auth=aws_auth,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection,
        )

    def generate_embedding(self, text: str) -> List[float]:
        """Generate embeddings using Amazon Titan Embeddings V2."""
        response = self.bedrock.invoke_model(
            modelId=EMBEDDING_MODEL_ID,
            body=json.dumps({
                "inputText": text,
                "dimensions": VECTOR_DIMENSION,
                "normalize": True
            }),
        )
        result = json.loads(response["body"].read())
        return result["embedding"]

    def index_document(
        self, text: str, source: str, metadata: Optional[Dict] = None
    ) -> str:
        """Embed and index a document chunk in OpenSearch."""
        embedding = self.generate_embedding(text)
        doc_id = hashlib.sha256(text.encode()).hexdigest()[:16]

        document = {
            "text": text,
            "embedding": embedding,
            "source": source,
            "metadata": metadata or {},
        }
        self.opensearch.index(
            index=INDEX_NAME, id=doc_id, body=document
        )
        return doc_id

    def retrieve(
        self, query: str, top_k: int = TOP_K, filters: Optional[Dict] = None
    ) -> List[RetrievedChunk]:
        """Retrieve relevant chunks using hybrid search."""
        query_embedding = self.generate_embedding(query)

        # Hybrid query: vector similarity + BM25 keyword match
        search_body = {
            "size": top_k,
            "_source": ["text", "source", "metadata"],
            "query": {
                "bool": {
                    "should": [
                        {
                            "knn": {
                                "embedding": {
                                    "vector": query_embedding,
                                    "k": top_k * 2,
                                }
                            }
                        },
                        {
                            "match": {
                                "text": {
                                    "query": query,
                                    "boost": 0.3,
                                }
                            }
                        },
                    ]
                }
            },
        }

        # Apply metadata filters if provided. Term filters match
        # exact values, so the targeted fields should be mapped as
        # keyword (not analyzed text) in the index.
        if filters:
            search_body["query"]["bool"]["filter"] = [
                {"term": {k: v}} for k, v in filters.items()
            ]

        response = self.opensearch.search(
            index=INDEX_NAME, body=search_body
        )

        return [
            RetrievedChunk(
                text=hit["_source"]["text"],
                source=hit["_source"]["source"],
                score=hit["_score"],
                metadata=hit["_source"].get("metadata", {}),
            )
            for hit in response["hits"]["hits"]
        ]

    def generate_response(
        self, query: str, chunks: List[RetrievedChunk]
    ) -> Dict:
        """Generate grounded response using Bedrock with citations."""
        context_blocks = []
        for i, chunk in enumerate(chunks):
            context_blocks.append(
                f"[Source {i+1}: {chunk.source}]\n{chunk.text}"
            )
        context = "\n\n---\n\n".join(context_blocks)

        prompt = f"""Based on the following retrieved context, answer the
user's question accurately. Only use information from the provided
context. If the context does not contain sufficient information to
answer the question, explicitly state that. Cite your sources using
[Source N] notation.

Context:
{context}

Question: {query}

Answer:"""

        response = self.bedrock.invoke_model(
            modelId=GENERATION_MODEL_ID,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 2048,
                "temperature": 0.1,
                "messages": [
                    {"role": "user", "content": prompt}
                ],
            }),
        )

        result = json.loads(response["body"].read())
        answer = result["content"][0]["text"]

        return {
            "answer": answer,
            "sources": [
                {"source": c.source, "score": c.score}
                for c in chunks
            ],
            "model": GENERATION_MODEL_ID,
        }

    def query(
        self, question: str, filters: Optional[Dict] = None
    ) -> Dict:
        """End-to-end RAG query: retrieve then generate."""
        chunks = self.retrieve(question, filters=filters)
        if not chunks:
            return {
                "answer": "No relevant information found.",
                "sources": [],
            }
        return self.generate_response(question, chunks)


# --- Usage Example ---
if __name__ == "__main__":
    pipeline = ProductionRAGPipeline()

    # Index documents
    pipeline.index_document(
        text="Amazon EKS supports Kubernetes versions 1.28 and 1.29...",
        source="eks-docs/versions.md",
        metadata={"category": "kubernetes", "service": "eks"},
    )

    # Query the knowledge base
    result = pipeline.query(
        "What Kubernetes versions does EKS support?",
        filters={"metadata.service": "eks"},
    )
    print(json.dumps(result, indent=2))

This implementation demonstrates several production patterns: hybrid search combining vector and keyword retrieval, metadata-based filtering, citation tracking, low-temperature generation for factual accuracy, and structured response formatting. The Terraform infrastructure to deploy this pipeline, including OpenSearch Serverless collection, IAM roles, and VPC endpoints, is available in the terraform-aws-bedrock-agents repository.

RAG vs Fine-Tuning vs Prompt Engineering Comparison

RAG is not the only approach to customizing LLM behavior. Understanding when to use RAG versus fine-tuning versus prompt engineering is critical for making the right architectural decision.

| Dimension | RAG | Fine-Tuning | Prompt Engineering |
|---|---|---|---|
| Knowledge Source | External retrieval from document store | Embedded in model weights | Provided in prompt context |
| Knowledge Freshness | Real-time (update documents anytime) | Static until re-trained | Real-time (update prompt anytime) |
| Setup Complexity | Medium (vector DB, embeddings, retrieval) | High (training data, compute, evaluation) | Low (prompt design only) |
| Cost | Medium (embedding + retrieval + generation) | High upfront, lower per-query | Low (generation only) |
| Hallucination Risk | Low (grounded in retrieved docs) | Medium (can still hallucinate) | High (limited by context window) |
| Scalability | Scales with document volume | Fixed after training | Limited by context window |
| Auditability | High (traceable to source docs) | Low (opaque model internals) | Medium (visible in prompt) |
| Best For | Knowledge-intensive Q&A, doc search | Style, tone, domain adaptation | Simple tasks, prototyping |
| Update Frequency | Minutes (re-index documents) | Days to weeks (re-train) | Instant (edit prompt) |
| Data Volume Supported | Unlimited (vector DB scales) | Limited by training budget | Limited by context window (200K tokens) |

For most enterprise use cases involving proprietary knowledge, RAG is the recommended starting point. It provides the best balance of accuracy, freshness, auditability, and cost. Fine-tuning is appropriate when you need to change the model's behavior, tone, or output format in ways that cannot be achieved through prompting alone. Prompt engineering is ideal for prototyping and simple tasks where the required context fits within the model's context window.

In practice, the most effective production systems combine all three approaches: prompt engineering for output formatting and instruction, RAG for dynamic knowledge retrieval, and fine-tuning for domain-specific language understanding when needed.

Hallucination Mitigation Strategies

Hallucination, where the model generates plausible but factually incorrect information, is the primary risk in production RAG systems. While RAG inherently reduces hallucination by grounding responses in retrieved documents, it does not eliminate it entirely. The model can still fabricate details, misinterpret context, or blend information from multiple sources incorrectly.

Retrieval-Side Mitigations

Improving retrieval quality is the most effective hallucination mitigation. Ensure your chunking preserves complete, self-contained pieces of information. Use relevance score thresholds to filter out low-quality retrievals. Implement re-ranking to ensure the most relevant documents are prioritized. When no sufficiently relevant documents are found (all scores below threshold), return "I don't have enough information to answer that" rather than allowing the model to generate from its parametric knowledge.
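The abstention logic is simple to implement. This sketch assumes retrievals arrive as (score, text) pairs, and the 0.72 cutoff is illustrative, to be tuned on your own evaluation set:

```python
from typing import List, Optional, Tuple

NO_ANSWER = "I don't have enough information to answer that."

def guard_low_confidence(
    chunks: List[Tuple[float, str]], min_score: float = 0.72
) -> Optional[List[Tuple[float, str]]]:
    """Drop retrievals below a relevance threshold.

    If nothing survives, return None so the caller can send NO_ANSWER
    instead of letting the model answer from parametric knowledge.
    The threshold is an assumption to calibrate per corpus.
    """
    kept = [(score, text) for score, text in chunks if score >= min_score]
    return kept or None
```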

Generation-Side Mitigations

On the generation side, use low temperature settings (0.0-0.2) for factual queries. Include explicit instructions in the prompt to only use provided context and to acknowledge uncertainty. Implement citation requirements so the model must reference specific sources. Use structured output formats that make it easier to verify claims against sources programmatically.

Post-Processing Verification

Automated verification checks each claim in the generated response against the source documents. This can be implemented as a separate LLM call that acts as a fact-checker, comparing the response against the retrieved chunks and flagging unsupported claims. While this adds latency and cost, it provides a critical safety net for high-stakes applications like medical, legal, or financial advisory systems.
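The control flow for such a verifier can be sketched with the fact-checking call abstracted away. supported_fn here is a hypothetical callable that would wrap an LLM judgment or an NLI entailment model:

```python
from typing import Callable, List

def verify_response(
    answer_sentences: List[str],
    chunks: List[str],
    supported_fn: Callable[[str, str], bool],
) -> List[str]:
    """Flag answer sentences that no retrieved chunk supports.

    supported_fn(sentence, chunk_text) -> bool stands in for the
    fact-checking model call; only the control flow is shown here.
    """
    unsupported = []
    for sentence in answer_sentences:
        if not any(supported_fn(sentence, chunk) for chunk in chunks):
            unsupported.append(sentence)  # no chunk backs this claim
    return unsupported
```

Flagged sentences can be stripped, rewritten, or surfaced to the user with a warning, depending on the application's risk tolerance.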

Best Practices for Production RAG Pipelines

Drawing from production RAG deployments processing millions of queries, these best practices ensure reliability, accuracy, and scalability:

  1. Invest heavily in chunking quality. Spend 40% of your RAG development time on chunking strategy. Test multiple approaches (fixed-size, semantic, hierarchical) against your actual query patterns. Chunking quality is the single biggest determinant of retrieval quality.
  2. Implement hybrid search from day one. Pure vector search misses exact keyword matches. Pure keyword search misses semantic understanding. Hybrid search with reciprocal rank fusion consistently outperforms either approach alone.
  3. Set retrieval score thresholds and gracefully handle low-confidence results. A RAG system that says "I don't know" when it truly does not have sufficient information is far more valuable than one that confidently produces incorrect answers. Set cosine similarity thresholds (typically 0.7-0.8) and return explicit uncertainty signals below the threshold.
  4. Use metadata filtering to narrow search scope. Before computing vector similarity across your entire corpus, filter by metadata (document type, date range, category, access permissions). This improves both relevance and performance.
  5. Monitor retrieval and generation quality continuously. Track metrics including retrieval precision@k, answer faithfulness, latency percentiles (p50, p95, p99), and user feedback signals. Set up CloudWatch dashboards and alerts for quality degradation.
  6. Version your embedding model and re-embed on upgrades. When you change embedding models, all existing vectors become incompatible. Plan for full re-embedding as part of model upgrades. Maintain the ability to run parallel indexes during migration.
  7. Implement document lifecycle management. Stale documents in your vector store produce stale answers. Build automated pipelines that detect document updates, re-chunk, re-embed, and replace old vectors. Track document freshness as a quality metric.
  8. Use guardrails for content safety and compliance. AWS Bedrock Guardrails allow you to filter harmful content, enforce topic boundaries, and redact PII in both inputs and outputs. Configure guardrails appropriate to your use case and regulatory requirements.
  9. Design for multi-turn conversation. Production RAG systems rarely handle single queries in isolation. Build conversation history management that provides relevant prior context without overwhelming the retrieval step. Use the conversation history to refine retrieval queries for follow-up questions.
  10. Load test with realistic query patterns. RAG pipeline performance depends on query complexity, corpus size, and concurrent users. Load test with representative query distributions and establish performance baselines. OpenSearch Serverless auto-scales, but Bedrock has account-level throughput limits that require capacity planning.

Frequently Asked Questions

What is a RAG pipeline and why use it on AWS?

A RAG (Retrieval Augmented Generation) pipeline combines information retrieval with language model generation to produce responses grounded in specific source documents. On AWS, Amazon Bedrock provides managed access to foundation models for both embeddings and generation, while OpenSearch Serverless offers scalable vector search. This fully managed stack eliminates the need to operate GPU clusters or vector databases, reducing operational overhead significantly compared to self-managed alternatives.

How does RAG reduce hallucinations in LLM responses?

RAG reduces hallucinations by grounding the LLM's responses in retrieved source documents rather than relying solely on the model's parametric knowledge. By providing relevant context from a verified knowledge base and instructing the model to only use provided information, the model generates answers based on factual data. Additional mitigations include relevance score thresholds, citation requirements, low temperature generation, and post-processing verification checks.

What embedding models does AWS Bedrock support?

AWS Bedrock supports Amazon Titan Embeddings V2 (1,024 dimensions, 8,192 token input), Cohere Embed models (multilingual support, 1,024 dimensions), and other embedding models as they become available. Amazon Titan Embeddings V2 is the most commonly used for AWS-native RAG pipelines due to its native integration, competitive performance on retrieval benchmarks, and support for dimension reduction to optimize storage costs.

Should I use OpenSearch Serverless or provisioned for RAG?

OpenSearch Serverless is recommended for most RAG workloads because it auto-scales compute and storage independently, requires no capacity planning, and integrates natively with Bedrock Knowledge Bases. Use provisioned OpenSearch clusters only when you need fine-grained tuning of shard allocation, require specific plugin versions, or can achieve significant cost savings at very large scale with reserved instances. Serverless pricing is based on OCU (OpenSearch Compute Units) hours consumed.

How do I evaluate RAG pipeline quality in production?

Evaluate RAG quality using four key dimensions: retrieval precision and recall (are the right documents being retrieved?), answer faithfulness (does the response accurately reflect the source documents without adding unsupported information?), answer relevance (does the response address the user's question?), and latency (are responses generated within acceptable time bounds?). The RAGAS framework provides automated metrics for these dimensions. Additionally, collect user feedback through thumbs-up/down signals and periodically conduct human evaluation on random samples.


Need Enterprise-Grade RAG Pipelines?

Kehinde Ogunlowo and the team at Citadel Cloud Management build production RAG systems on AWS for enterprises requiring knowledge management, compliance automation, and intelligent document processing.

Contact Kehinde at citadelcloudmanagement.com

Kehinde Ogunlowo

Principal Multi-Cloud DevSecOps Architect | Citadel Cloud Management

Kehinde designs and deploys production AI/ML platforms on AWS, Azure, and GCP. With extensive experience in RAG architectures, vector databases, and LLM orchestration, he helps enterprises build intelligent systems that scale. He contributes open-source Terraform modules and architectural patterns through his GitHub.