AI Engineer's Guide to Search and Information Retrieval
From BM25 to RAG: understanding search systems that power modern AI applications
Why Search Is the AI Engineer’s Most Underrated Skill
Every AI engineer eventually builds a search system. Whether it’s RAG (Retrieval-Augmented Generation), a recommendation engine, or a knowledge base, you need to understand how search works at a fundamental level. The difference between a mediocre AI product and a great one often comes down to retrieval quality.
Search is not a solved problem. It’s a spectrum of techniques, each with different trade-offs, and modern AI systems combine multiple approaches.
The Search Stack: From Keywords to Semantics
Level 1: Lexical Search (BM25)
BM25 (Best Matching 25) is the workhorse of traditional search. It’s a bag-of-words model that scores documents based on term frequency and inverse document frequency.
BM25(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))
Where:
- f(qi, D) = frequency of term qi in document D
- |D| = document length
- avgdl = average document length
- k1 = term frequency saturation (typically 1.2-2.0)
- b = length normalization (typically 0.75)
Strengths: fast, interpretable, excellent for exact matches, no training needed.
Weaknesses: no understanding of synonyms, context, or meaning.
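The formula is small enough to implement directly, which makes each term concrete. A from-scratch sketch using the common smoothed IDF, log((N - df + 0.5) / (df + 0.5) + 1):

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.5, b=0.75):
    """Score one tokenized document against a query, term by term."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)           # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = doc.count(term)                              # term frequency f(qi, D)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [d.split() for d in [
    "machine learning algorithms for classification",
    "deep neural networks for image recognition",
]]
# The second document contains both query terms and scores higher
scores = [bm25_score(["neural", "image"], d, docs) for d in docs]
```

In practice you'd use a library that maintains an inverted index rather than scanning every document per term, which is what the example below does.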
# Using rank_bm25
from rank_bm25 import BM25Okapi
corpus = [
    "machine learning algorithms for classification",
    "deep neural networks for image recognition",
    "natural language processing with transformers",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "neural network image classification"
scores = bm25.get_scores(query.split())
Level 2: Semantic Search (Dense Retrieval)
Encode queries and documents into the same embedding space. Similar meanings → close vectors.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
docs = ["The cat sat on the mat", "A feline rested on the rug"]
query = "Where is the kitty?"
doc_embeddings = model.encode(docs)
query_embedding = model.encode(query)
# Cosine similarity
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
# Both documents score high despite no word overlap with "kitty"
Strengths: understands meaning, handles synonyms, multilingual capable.
Weaknesses: misses exact terms, computationally expensive, needs good embeddings.
Level 3: Hybrid Search
The production standard. Combine lexical and semantic search, then fuse the results.
Reciprocal Rank Fusion (RRF):
def reciprocal_rank_fusion(ranked_lists, k=60):
    """
    Combine multiple ranked lists using RRF.
    k=60 is the standard constant from the original paper.
    """
    scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
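For example, fusing a BM25 ranking with a vector ranking (the doc ids are hypothetical, and the RRF function is included so the snippet runs on its own):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine multiple ranked lists using RRF with the standard k=60."""
    scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical ids
vector_ranking = ["doc_a", "doc_b", "doc_d"]

# doc_a is ranked first in both lists, so it ends up on top
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Note that RRF only looks at ranks, never raw scores, which is why it needs no score normalization.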
Convex Combination (Weighted):
def normalize(scores):
    """Min-max normalize a {doc_id: score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_hybrid(bm25_scores, vector_scores, alpha=0.7):
    """
    alpha controls the weight: 1.0 = pure vector, 0.0 = pure BM25
    """
    # Normalize scores to [0, 1] so the two score scales are comparable
    bm25_norm = normalize(bm25_scores)
    vector_norm = normalize(vector_scores)
    combined = {}
    for doc_id in set(bm25_norm) | set(vector_norm):
        bm25_s = bm25_norm.get(doc_id, 0)
        vector_s = vector_norm.get(doc_id, 0)
        combined[doc_id] = alpha * vector_s + (1 - alpha) * bm25_s
    return combined
Level 4: Learned Sparse Retrieval (SPLADE)
A middle ground: neural networks that produce sparse representations, combining the interpretability of BM25 with the learning capacity of neural models.
Query: "best programming language"
SPLADE expansion: {
"best": 2.1, "programming": 3.4, "language": 2.8,
"code": 1.5, "software": 1.2, "developer": 0.8, # expanded terms
"python": 0.6, "java": 0.5 # semantic expansion
}
SPLADE models learn which terms to expand and how to weight them. This means you get BM25-like efficiency with semantic understanding.
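Scoring with these expanded representations reduces to a sparse dot product over the terms both vectors share, which is exactly what an inverted index computes efficiently. A sketch with illustrative (made-up) weights:

```python
def sparse_dot(query_weights, doc_weights):
    """Score = sum of weight products over terms present in both sparse vectors."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

# Illustrative weights, not real SPLADE outputs
query = {"programming": 3.4, "language": 2.8, "code": 1.5}
doc = {"python": 2.0, "programming": 2.5, "code": 1.0}

# Only "programming" and "code" overlap: 3.4*2.5 + 1.5*1.0 = 10.0
score = sparse_dot(query, doc)
```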
Building a RAG Pipeline
RAG (Retrieval-Augmented Generation) is the most common search application for AI engineers. Here’s the production architecture:
Ingestion Pipeline
Documents → Chunking ─┬─→ Embedding → Vector DB
                      ├─→ Metadata Extraction → Metadata Store
                      └─→ Keyword Indexing → Search Engine (Elasticsearch/BM25)
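The chunking step is worth sketching, since it shapes everything downstream. A minimal fixed-size word-window splitter with overlap (the sizes are arbitrary illustrative defaults, not recommendations):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

Production pipelines usually chunk on structural boundaries (headings, paragraphs, sentences) rather than raw word counts, but the overlap idea carries over.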
Query Pipeline
User Query
↓
Query Understanding (rewrite, expand, classify)
↓
Parallel Retrieval:
├── Vector Search (semantic)
├── Keyword Search (BM25)
└── Metadata Filters
↓
Fusion (RRF or weighted)
↓
Reranking (cross-encoder)
↓
Context Assembly
↓
LLM Generation (with citations)
Query Rewriting
Raw user queries are often poor search queries. Rewrite them:
REWRITE_PROMPT = """Given the user's question and conversation history,
rewrite the question to be a standalone search query.
Conversation: {history}
Question: {question}
Rewritten search query:"""
# Example:
# History: "Tell me about Python web frameworks"
# Question: "What about the async ones?"
# Rewritten: "Python async web frameworks like FastAPI and Sanic"
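Wiring this into a pipeline is a single formatted call. Here `complete` stands in for whatever text-in/text-out LLM client you use (an assumed interface, not a specific library), and the prompt is repeated so the snippet is self-contained:

```python
REWRITE_PROMPT = """Given the user's question and conversation history,
rewrite the question to be a standalone search query.
Conversation: {history}
Question: {question}
Rewritten search query:"""

def rewrite_query(history, question, complete):
    """Rewrite a follow-up question into a standalone search query.

    `complete` is any callable taking a prompt string and returning text.
    """
    prompt = REWRITE_PROMPT.format(history=history, question=question)
    return complete(prompt).strip()
```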
Multi-Query Retrieval
Generate multiple search queries from one question, retrieve for each, then merge:
def multi_query_retrieve(question, retriever, llm):
    # Generate 3 different query perspectives
    queries = llm.generate_queries(question, n=3)
    # Example output:
    # ["vector database performance benchmarks",
    #  "fastest vector search engine comparison",
    #  "vector DB latency throughput benchmarks 2025"]
    all_docs = set()
    for query in queries:
        docs = retriever.search(query, top_k=5)
        all_docs.update(docs)
    return list(all_docs)
Reranking: The Secret Weapon
First-stage retrieval (BM25 + vector) is fast but approximate. A cross-encoder reranker scores each (query, document) pair more carefully:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')
query = "How do vector databases handle scaling?"
passages = [...] # Top 20 from first-stage retrieval
# Cross-encoder sees both query and passage together
pairs = [(query, passage) for passage in passages]
scores = reranker.predict(pairs)
# Rerank by cross-encoder score
reranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
top_5 = reranked[:5]
Why this matters: cross-encoders are 10-100x more accurate than bi-encoders (embedding similarity) but too slow to run on the full corpus. Use them as a second-stage filter on the top 20-50 results.
Elasticsearch for AI Engineers
Elasticsearch remains the backbone of many production search systems. Key concepts:
Analyzers and Tokenizers
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "custom_analyzer" },
      "embedding": { "type": "dense_vector", "dims": 1536, "index": true, "similarity": "cosine" },
      "metadata": { "type": "keyword" }
    }
  }
}
Elasticsearch Hybrid Search (kNN + BM25)
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "vector database performance",
              "boost": 0.3
            }
          }
        }
      ]
    }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [0.1, 0.2, ...],
    "k": 10,
    "num_candidates": 100,
    "boost": 0.7
  }
}
Evaluation: Measuring Search Quality
You can’t improve what you can’t measure. Key metrics:
Retrieval Metrics
| Metric | What It Measures | Formula |
|---|---|---|
| Recall@K | Coverage of relevant docs in top K | relevant_in_top_k / total_relevant |
| Precision@K | Accuracy of top K results | relevant_in_top_k / K |
| MRR | Rank of first relevant result | 1 / rank_of_first_relevant |
| NDCG@K | Quality considering position | Uses graded relevance with log discount |
| MAP | Average precision across queries | Mean of AP across query set |
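Of these, NDCG@K is the only one that needs graded (non-binary) relevance. A minimal sketch, with the simplification that the ideal ranking is computed from the retrieved grades themselves rather than the full judgment pool:

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@K from graded relevances listed in retrieved-rank order.

    Ideal DCG uses the same grades re-sorted descending (a common
    simplification when unretrieved relevant docs aren't tracked).
    """
    def dcg(rels):
        # rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), ...
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ordering scores 1.0; pushing graded docs down lowers it
perfect = ndcg_at_k([3, 2, 0], k=3)
swapped = ndcg_at_k([0, 3, 2], k=3)
```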
Building an Evaluation Set
# Minimum viable evaluation set
eval_set = [
    {
        "query": "How to fine-tune BERT?",
        "relevant_docs": ["doc_123", "doc_456", "doc_789"],
        "irrelevant_docs": ["doc_111", "doc_222"],  # hard negatives
    },
    # ... at least 50-100 queries for meaningful evaluation
]
import numpy as np

def evaluate_retriever(retriever, eval_set, k=10):
    recalls, precisions, mrrs = [], [], []
    for item in eval_set:
        results = retriever.search(item["query"], top_k=k)
        result_ids = [r.id for r in results]
        relevant = set(item["relevant_docs"])
        retrieved_relevant = relevant & set(result_ids)
        recalls.append(len(retrieved_relevant) / len(relevant))
        precisions.append(len(retrieved_relevant) / k)
        for rank, doc_id in enumerate(result_ids, 1):
            if doc_id in relevant:
                mrrs.append(1.0 / rank)
                break
        else:
            mrrs.append(0.0)
    return {
        "recall@k": np.mean(recalls),
        "precision@k": np.mean(precisions),
        "mrr": np.mean(mrrs),
    }
End-to-End RAG Evaluation
Retrieval metrics alone don’t tell the full story. Also measure:
- Answer Correctness: Does the LLM’s answer match the ground truth?
- Faithfulness: Is the answer grounded in the retrieved context? (no hallucination)
- Context Relevance: Are the retrieved passages actually relevant?
Tools: RAGAS, DeepEval, LangSmith, Braintrust
Performance Optimization Checklist
- Index tuning: Adjust HNSW ef_construction and M parameters
- Quantization: Use scalar or binary quantization to reduce memory
- Caching: Cache frequent query embeddings and results
- Batch embedding: Embed in batches, not one-by-one
- Pre-filtering: Apply metadata filters before vector search, not after
- Connection pooling: Reuse database connections
- Async retrieval: Run BM25 and vector search in parallel
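The last item can be sketched with asyncio.gather. Here bm25_search and vector_search are assumed to be async callables; blocking clients can be wrapped with asyncio.to_thread:

```python
import asyncio

async def parallel_retrieve(query, bm25_search, vector_search):
    """Run both first-stage retrievers concurrently; return both result lists."""
    bm25_task = asyncio.create_task(bm25_search(query))
    vector_task = asyncio.create_task(vector_search(query))
    return await asyncio.gather(bm25_task, vector_task)
```

Since the two searches are independent I/O calls, total latency drops to roughly the slower of the two rather than their sum.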
Takeaways
- Always use hybrid search in production—pure vector or pure BM25 leaves performance on the table
- Reranking is the highest-leverage improvement you can make to any search system
- Query rewriting is underrated—spend time improving query understanding
- Build evaluation sets early—you can’t optimize what you can’t measure
- Chunking strategy matters more than which vector DB you pick
- Start simple (BM25 + embeddings + RRF), measure, then add complexity where metrics show gaps