# AI Engineer's Guide to Foundational ML Concepts

*Essential machine learning theory, neural network architectures, and training fundamentals every AI engineer must know*
## Why Fundamentals Still Matter
AI engineering has shifted from “build models from scratch” to “integrate and orchestrate models.” But understanding the fundamentals makes you dramatically more effective. When your fine-tune diverges, when your embeddings don’t cluster, when your RAG pipeline returns nonsense—the engineer who understands the theory diagnoses the problem in minutes, not days.
## Core ML Concepts

### The ML Problem Types
| Type | Task | Output | Example |
|---|---|---|---|
| Supervised | Learn from labeled data | Prediction | Spam detection, price prediction |
| Unsupervised | Find structure in unlabeled data | Clusters/patterns | Customer segmentation, anomaly detection |
| Self-supervised | Create labels from data itself | Representations | Language models, contrastive learning |
| Reinforcement Learning | Learn from reward signals | Policy/actions | Game playing, RLHF |
### The Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Noise
High Bias (Underfitting):
- Model too simple
- Poor training AND test performance
- Fix: more features, more complex model, less regularization
High Variance (Overfitting):
- Model too complex
- Great training, poor test performance
- Fix: more data, regularization, simpler model, dropout, early stopping
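The tradeoff is easy to see numerically. A minimal, self-contained sketch (synthetic sine data, numpy only): a degree-1 polynomial underfits, a degree-15 polynomial overfits.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 20)
x_test = rng.uniform(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 20)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.1, 20)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

underfit = fit_and_score(1)    # high bias: poor on train AND test
overfit = fit_and_score(15)    # high variance: tiny train error, larger test error
```

The degree-1 model is poor on both splits (high bias); the degree-15 model drives training error toward zero while test error stays much higher (high variance).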
### Loss Functions

```python
import numpy as np

# Classification
def binary_cross_entropy(y_true, y_pred):
    """Standard for binary classification."""
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Standard for multi-class classification."""
    return -np.sum(y_true * np.log(y_pred)) / len(y_true)

# Regression
def mse_loss(y_true, y_pred):
    """Mean Squared Error - penalizes large errors more."""
    return np.mean((y_true - y_pred) ** 2)

def huber_loss(y_true, y_pred, delta=1.0):
    """Robust to outliers - combines MSE and MAE."""
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(is_small, squared_loss, linear_loss))

# Contrastive Learning
def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss - used in CLIP, SimCLR, embedding training."""
    pos_sim = np.dot(anchor, positive) / temperature
    neg_sims = np.dot(negatives, anchor) / temperature
    return -pos_sim + np.log(np.exp(pos_sim) + np.sum(np.exp(neg_sims)))
```
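A quick sanity check of the regression losses above (redefined here so the snippet runs standalone): because Huber grows only linearly beyond `delta`, a single outlier inflates it far less than MSE.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(is_small, squared_loss, linear_loss))

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last point is an outlier
y_pred = np.array([1.1, 2.1, 2.9, 3.0])

# MSE is dominated by the outlier's squared error; Huber caps its influence.
print(mse_loss(y_true, y_pred))    # ~2352.26
print(huber_loss(y_true, y_pred))  # ~24.13
```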
### Optimization

```python
# SGD: simple, needs careful tuning
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: adaptive learning rates, good default
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)

# AdamW: Adam with decoupled weight decay (use this for Transformers)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate scheduling
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# or
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, total_steps=1000
)
```
## Neural Network Architectures

### The Transformer (You Must Know This)
The architecture behind every modern LLM, and increasingly used in vision, audio, and multimodal models.
```python
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        return torch.matmul(attention_weights, V)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(attn_output)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm architecture (modern standard)
        normed = self.norm1(x)
        attn_out = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_out)
        return x
```
Key concepts to understand:
- Self-attention: Each token attends to all other tokens. O(n²) in sequence length.
- Positional encoding: Transformers have no inherent sense of order; positions must be encoded.
- KV-cache: During inference, cache key-value pairs to avoid recomputation. Critical for serving.
- Flash Attention: Memory-efficient exact attention that’s 2-4x faster. Use it whenever your hardware and framework support it.
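The KV-cache idea above can be sketched without any framework. This toy numpy loop uses random matrices standing in for trained projection weights (`W_q`, `W_k`, `W_v` here are illustrative, not a real model):

```python
import numpy as np

def attend(q, K, V):
    # Softmax attention of a single query over all cached positions
    scores = K @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = rng.normal(size=(3, d, d))  # stand-ins for trained weights

k_cache, v_cache = [], []
for step in range(4):              # autoregressive decode, one token per step
    x = rng.normal(size=d)         # hidden state of the newly generated token
    k_cache.append(x @ W_k)        # compute this token's K/V exactly once...
    v_cache.append(x @ W_v)        # ...and reuse them on every later step
    out = attend(x @ W_q, np.stack(k_cache), np.stack(v_cache))
# Without the cache, each step would recompute K and V for every earlier token.
```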
### Architecture Variants
| Architecture | Key Idea | Use Case |
|---|---|---|
| Encoder-only (BERT) | Bidirectional context | Classification, NER, embeddings |
| Decoder-only (GPT) | Autoregressive generation | Text generation, LLMs |
| Encoder-decoder (T5) | Sequence-to-sequence | Translation, summarization |
| Vision Transformer (ViT) | Patch embeddings + Transformer | Image classification |
| Diffusion Transformer (DiT) | Transformer backbone for diffusion | Image generation |
| Mamba / SSM | State space models (linear-time in sequence length) | Long sequences, efficient inference |
## Fine-Tuning

### Full Fine-Tuning vs. Parameter-Efficient Methods
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Full fine-tuning: update all parameters
# Expensive but most flexible
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# All 8B parameters are trainable

# LoRA: Low-Rank Adaptation (most popular PEFT method)
lora_config = LoraConfig(
    r=16,              # rank (lower = fewer parameters)
    lora_alpha=32,     # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6.5M (0.08% of 8B)

# QLoRA: Quantized LoRA (fits on consumer GPUs)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
)
model = get_peft_model(model, lora_config)
# Now fits on a single 24GB GPU
```
### Training Data Preparation
```python
# Instruction fine-tuning data format
training_examples = [
    {
        "instruction": "Summarize the following legal document",
        "input": "WHEREAS, the parties have agreed to the following terms...",
        "output": "This agreement establishes a partnership between..."
    },
    {
        "instruction": "Extract key entities from this medical report",
        "input": "Patient presented with acute chest pain...",
        "output": '{"conditions": ["acute chest pain"], "tests": ["ECG", "troponin"]}'
    },
]

# Chat format (preferred for modern models)
chat_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a legal document assistant."},
            {"role": "user", "content": "Summarize this contract: ..."},
            {"role": "assistant", "content": "This contract establishes..."},
        ]
    }
]
```
### RLHF and DPO
RLHF (Reinforcement Learning from Human Feedback):
Step 1: Supervised Fine-Tuning (SFT) on demonstration data
Step 2: Train a reward model on human preference data
Step 3: Optimize the policy (LLM) using PPO against the reward model
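Step 2's reward model is commonly trained with a pairwise Bradley-Terry objective: push the reward of the chosen response above that of the rejected one. A minimal numpy sketch, with made-up reward scores (the function name and values are illustrative):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry) loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))  # stable form of -log(sigmoid)

# Hypothetical scalar rewards the model assigned to chosen/rejected responses
r_chosen = np.array([1.2, 0.4, 2.0])
r_rejected = np.array([0.3, 0.5, -1.0])
loss = reward_model_loss(r_chosen, r_rejected)
# The loss shrinks as the reward margin between chosen and rejected grows.
```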
DPO (Direct Preference Optimization) — simpler alternative:
```python
import torch

# DPO directly optimizes on preference pairs without a separate reward model
preference_data = [
    {
        "prompt": "Explain quantum computing",
        "chosen": "Quantum computing uses quantum bits (qubits)...",   # preferred
        "rejected": "Quantum computing is really complicated...",      # less preferred
    }
]

# DPO loss (simplified)
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    return -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
```
## Embeddings and Representation Learning

### What Makes a Good Embedding?
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Good embeddings have these properties:
# 1. Similar items are close in vector space
# 2. Dissimilar items are far apart
# 3. Meaningful directions exist (king - man + woman ≈ queen)
# 4. Clusters correspond to semantic categories

# Training embeddings with contrastive learning
class ContrastiveEmbeddingModel(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.projection = nn.Linear(encoder.output_dim, 256)

    def forward(self, anchor, positive, negatives):
        anchor_emb = self.projection(self.encoder(anchor))
        positive_emb = self.projection(self.encoder(positive))
        negative_embs = [self.projection(self.encoder(n)) for n in negatives]

        # InfoNCE loss
        pos_similarity = F.cosine_similarity(anchor_emb, positive_emb)  # (batch,)
        neg_similarities = torch.stack([
            F.cosine_similarity(anchor_emb, neg) for neg in negative_embs
        ], dim=1)                                                       # (batch, num_negatives)

        temperature = 0.07
        logits = torch.cat([pos_similarity.unsqueeze(1), neg_similarities], dim=1) / temperature
        labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive is at index 0
        return F.cross_entropy(logits, labels)
```
### Hard Negative Mining
The quality of negatives in contrastive learning dramatically affects embedding quality:
```python
from sklearn.metrics.pairwise import cosine_similarity

def mine_hard_negatives(query_embedding, corpus_embeddings, positive_ids, k=10):
    """Find examples that are similar to the query but not positive examples."""
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    # Rank by similarity, exclude positives
    candidates = []
    for idx in similarities.argsort()[::-1]:
        if idx not in positive_ids:
            candidates.append(idx)
        if len(candidates) >= k:
            break
    return candidates
```
## Practical Model Selection Guide
| Task | First Try | Scale Up |
|---|---|---|
| Text classification | Fine-tuned BERT/RoBERTa | LLM with few-shot |
| Named Entity Recognition | SpaCy + fine-tuned Transformer | LLM extraction |
| Semantic similarity | Sentence-BERT | Fine-tuned embedding model |
| Text generation | API (Claude/GPT) | Fine-tuned open-source LLM |
| Image classification | CLIP zero-shot | Fine-tuned ViT |
| Object detection | YOLO v8/v11 | Custom-trained model |
| Speech-to-text | Whisper | Fine-tuned Whisper |
| Tabular data | XGBoost/LightGBM | Neural network ensemble |
| Time series | Prophet/XGBoost | Temporal Fusion Transformer |
| Anomaly detection | Isolation Forest | Autoencoder |
## Key Metrics to Know

### Classification
```python
from sklearn.metrics import classification_report, roc_auc_score

# Precision: Of predicted positives, how many are correct?
# Recall: Of actual positives, how many did we find?
# F1: Harmonic mean of precision and recall
# AUC-ROC: Ranking quality across all thresholds

print(classification_report(y_true, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_true, y_prob)}")

# When to optimize for what:
# - High precision: When false positives are costly (a spam filter flagging important email)
# - High recall: When false negatives are costly (cancer screening)
# - AUC: When you need to choose a threshold later
```
### Generative AI
| Metric | What It Measures | Use When |
|---|---|---|
| Perplexity | Language model quality | Comparing LLMs |
| BLEU | N-gram overlap with reference | Translation (dated) |
| ROUGE | Recall of reference n-grams | Summarization |
| BERTScore | Semantic similarity to reference | General generation |
| Pass@k | Code correctness (k attempts) | Code generation |
| LLM-as-Judge | Human-like quality assessment | Open-ended generation |
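Perplexity, the first metric above, is just the exponential of the average per-token negative log-likelihood. A tiny numpy example with hypothetical token probabilities:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return float(np.exp(-np.mean(np.log(token_probs))))

# Hypothetical probabilities the model assigned to each token of a reference text
probs = np.array([0.5, 0.25, 0.5, 0.25])
print(perplexity(probs))  # ~2.83: as uncertain as choosing among ~2.8 equally likely tokens
```

Lower is better; a model that assigned probability 1.0 to every token would have perplexity 1.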
## Takeaways
- Understand Transformers deeply—attention, positional encoding, KV-cache. This is the foundation of modern AI.
- Start with existing models, fine-tune if needed, train from scratch only as a last resort
- LoRA/QLoRA make fine-tuning accessible on consumer hardware—learn to use them
- Embedding quality determines the ceiling of your retrieval/search system
- The best model for your task often isn’t the largest—XGBoost still wins on tabular data
- DPO has largely replaced RLHF for preference optimization—it’s simpler and works well
- Hard negative mining is the single most impactful technique for improving embedding quality