The Comprehensive AI & ML Glossary for 2026
A definitive guide to the essential abbreviations, architectures, and optimization terms in modern artificial intelligence and machine learning.
Navigating the Alphabet Soup of AI
If you've spent any time reading recent machine learning papers or participating in discussions about Large Language Models (LLMs), you've likely encountered a daunting wall of acronyms. From optimization algorithms like AdamW to alignment techniques like DPO, the terminology moves as fast as the field itself.
This post serves as a curated glossary of the key terms, architectures, and abbreviations that define the current AI landscape. Whether you are an engineer scaling models in production or a researcher keeping up with the state of the art, this guide maps out what these terms mean and why they matter.
1. General AI & ML Concepts
- AI (Artificial Intelligence): The broad discipline of creating systems capable of performing tasks that typically require human intelligence, such as reasoning, planning, and perception.
- DL (Deep Learning): A powerful subfield of machine learning that utilizes deep, multi-layered neural networks to learn hierarchical representations of data.
- LLM (Large Language Model): A neural network with billions of parameters, almost always built on the Transformer architecture. LLMs are pre-trained on vast text corpora to predict the next token, which makes them the foundation of modern generative AI.
2. Mathematical Foundations & Optimization
Optimization is the engine of model training. These terms define how we update model weights to minimize errors.
- SGD (Stochastic Gradient Descent): The classic optimization algorithm that updates parameters using the gradient computed from a randomly sampled mini-batch of data.
- Adam (Adaptive Moment Estimation): A widely used optimizer that computes adaptive learning rates for each parameter by combining momentum with RMSprop-like scaling of the gradient.
- AdamW (Adam with Weight Decay): A crucial variant of Adam that decouples weight decay from the gradient update. It has become the standard optimizer for training Transformers.
- K-FAC (Kronecker-Factored Approximate Curvature): A second-order optimization method that approximates the Fisher information matrix using Kronecker products, enabling highly efficient natural gradient steps.
- SVD (Singular Value Decomposition): A fundamental matrix factorization technique ($A = U\Sigma V^T$). In modern ML, it is widely used for dimensionality reduction, and it supplies the low-rank intuition behind adaptation methods like LoRA.
- KL Divergence (Kullback-Leibler Divergence): A statistical measure of how one probability distribution differs from a reference distribution. It is frequently used as a regularization penalty in alignment methods (like RLHF and DPO) to prevent the fine-tuned model from drifting too far from the base model.
- NTK (Neural Tangent Kernel): A theoretical framework describing the evolution of infinitely wide neural networks during training. It has practical applications today in techniques like NTK-aware RoPE scaling for extending context windows.
- Big O Notation (e.g., $\mathcal{O}(n^2)$): The asymptotic upper-bound for an algorithm's time or memory complexity. For instance, standard self-attention has an $\mathcal{O}(n^2)$ complexity with respect to sequence length $n$, which drives the need for linear-time alternatives.
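To make the difference between Adam and AdamW concrete, here is a minimal sketch of a single AdamW update for one scalar parameter. The function name `adamw_step` and the default hyperparameters are illustrative assumptions, not taken from any particular library. The point to notice is that the weight-decay term is applied directly to the weight instead of being folded into the gradient.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative sketch).

    w: parameter, g: gradient, m / v: running first / second moments,
    t: 1-based step count. Returns the updated (w, m, v).
    """
    m = beta1 * m + (1 - beta1) * g        # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g * g    # RMSprop-like second moment
    m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w          # decoupled weight decay
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive gradient step
    return w, m, v

w, m, v = adamw_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
```

In classic Adam with L2 regularization, the decay term would instead be added to `g` before the moment updates, so it would be rescaled by the adaptive denominator.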
3. Activation Functions & Normalization
These micro-architectural choices dictate how signals propagate through a network, critically impacting training stability.
- ReLU (Rectified Linear Unit): The simple $\max(0, x)$ function. While historically dominant, it suffers from the "dying ReLU" problem: a unit whose inputs stay negative receives zero gradient and stops learning.
- GELU (Gaussian Error Linear Unit): A smooth approximation of ReLU ($x \cdot \Phi(x)$) that has become the default activation in models like BERT and GPT.
- SwiGLU (Swish-Gated Linear Unit): A feed-forward variant that combines a Swish activation with a gating mechanism. Used in LLaMA and PaLM, it empirically improves training efficiency and downstream performance.
- LayerNorm (Layer Normalization): Normalizes activations across the feature dimensions for each token independently of the batch. It is the backbone of Transformer stability.
- RMSNorm (Root Mean Square Layer Normalization): A streamlined version of LayerNorm that omits mean-centering. It is computationally faster and widely adopted in modern architectures like the LLaMA family.
- Pre-LN vs. Post-LN: Refers to where normalization occurs. Pre-LN (before the attention/FFN layers) is standard today as it provides much better training stability for deep networks compared to the original Transformer's Post-LN (after the residual addition).
- BatchNorm (Batch Normalization): Normalizes across the mini-batch dimension. While ubiquitous in computer vision, it is rarely used in sequence models due to dependencies on batch size and sequence length.
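The LayerNorm and RMSNorm entries above differ only in whether the mean is subtracted. The sketch below shows both on a plain Python list standing in for one token's feature vector. It deliberately omits the learned gain and bias that real implementations apply, and the function names are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean-centering step."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(mean_sq + eps) for xi in x]

features = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(features)   # zero mean, roughly unit variance
rms_out = rms_norm(features)    # roughly unit RMS, mean is not removed
```

When the input already has zero mean, the two functions return the same values, which is one way to see RMSNorm as LayerNorm with a step removed.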
4. Attention Mechanisms & Memory
Attention allows models to focus on relevant parts of the input, but it is notoriously memory-hungry.
- MHA (Multi-Head Attention): The original formulation where multiple attention "heads" run in parallel, allowing the model to attend to different representation subspaces simultaneously.
- MQA (Multi-Query Attention): An optimization that shares a single Key-Value (KV) head across all Query heads. This drastically reduces the memory footprint of the KV cache during inference.
- GQA (Grouped-Query Attention): A highly successful compromise between MHA and MQA used in LLaMA-2 and 3. Queries are divided into groups, and each group shares one KV head, balancing model quality with inference efficiency.
- FlashAttention: A revolutionary, hardware-aware exact attention algorithm. By tiling computation to minimize slow memory reads/writes to GPU HBM, it significantly speeds up training and enables massive context windows.
- KV Cache (Key-Value Cache): During autoregressive generation, previously computed Keys and Values are cached in memory to avoid redundant computations for past tokens. Because the cache grows with sequence length and batch size, managing it is one of the primary bottlenecks in LLM serving.
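A quick back-of-the-envelope calculation shows why MQA and GQA matter for the KV cache. The model shape below (32 layers, 32 query heads, head dimension 128, 4096-token context, 16-bit values) is a hypothetical configuration chosen for round numbers, not a description of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size; the factor 2 covers Keys plus Values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical shape: only the number of KV heads changes between variants.
shape = dict(num_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(num_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **shape)   # all query heads share one

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1024**2:.0f} MiB per sequence")
```

Under these assumptions the cache shrinks from 2048 MiB (MHA) to 512 MiB (GQA) to 64 MiB (MQA) per sequence, in direct proportion to the number of KV heads.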
5. Positional Encodings & Context Extension
Transformers don't natively understand sequence order; positional encodings provide this crucial inductive bias.
- RoPE (Rotary Position Embedding): Encodes absolute position by rotating pairs of query and key dimensions in the complex plane by a position-dependent angle. Because the rotations cancel in the query-key dot product, the attention score depends only on the relative distance between tokens. It is the standard for models like LLaMA and Mistral.
- ALiBi (Attention with Linear Biases): Adds a static, non-learned linear penalty to attention scores based on the distance between tokens. It enables models to extrapolate to sequences longer than those seen during training.
- YaRN (Yet another RoPE extensioN): An advanced method for extending the context window of RoPE-based models post-training. It uses NTK-aware interpolation and a ramp function to stretch the context limit efficiently.
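The relative-position property of RoPE can be checked numerically. The sketch below represents a vector as a list of complex numbers (one per dimension pair) and uses the common frequency base of 10000. The helper names `rope_rotate` and `score` are illustrative, and the code is a toy version of the idea, not a drop-in attention implementation.

```python
import cmath

def rope_rotate(pairs, position, base=10000.0):
    """Rotate each complex pair by an angle proportional to the position."""
    d = len(pairs)
    rotated = []
    for i, z in enumerate(pairs):
        theta = base ** (-i / d)  # lower frequency for later pairs
        rotated.append(z * cmath.exp(1j * position * theta))
    return rotated

def score(q, k):
    """Real dot product of the underlying vectors (unscaled attention logit)."""
    return sum((qz * kz.conjugate()).real for qz, kz in zip(q, k))

q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.7j, -1.2 + 0.4j]

# Both pairs of positions are 2 tokens apart, so the scores should match.
near = score(rope_rotate(q, 5), rope_rotate(k, 3))
far = score(rope_rotate(q, 12), rope_rotate(k, 10))
```

Context-extension methods such as YaRN work by rescaling the per-pair frequencies (`theta` here) so that positions beyond the training length map to angles the model has already seen.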
6. Recurrent & State-Space Models
While Transformers dominate, alternative architectures are emerging to solve the $\mathcal{O}(n^2)$ attention bottleneck.
- LSTM / GRU: Classic gated recurrent neural networks. They mitigate vanishing gradients and capture long-term dependencies but are difficult to parallelize and slow to train at massive scale.
- S4 (Structured State Space for Sequences): A state-space model utilizing a structured transition matrix to process incredibly long sequences efficiently.
- Mamba: A selective state-space model. By making the state transitions input-dependent ("selective scanning"), Mamba reports modeling quality competitive with similarly sized Transformers while scaling linearly, $\mathcal{O}(n)$, in sequence length for both time and memory.
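The linear-time claim for state-space models comes from the fact that they can be run as a recurrence with a fixed-size state. The sketch below uses a scalar, time-invariant recurrence $h_t = a h_{t-1} + b x_t$, $y_t = c h_t$, which is far simpler than S4 or Mamba and is only meant to show the two equivalent views: a single $\mathcal{O}(n)$ scan, and a convolution with the kernel $c\,a^k b$. The function names and the scalar parameters are assumptions made for illustration.

```python
def ssm_recurrent(xs, a, b, c):
    """Single pass over the sequence with a fixed-size (scalar) state."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # state update
        ys.append(c * h)    # readout
    return ys

def ssm_convolution(xs, a, b, c):
    """Same outputs, computed as a causal convolution with kernel c * a**k * b."""
    kernel = [c * (a ** k) * b for k in range(len(xs))]
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1))
            for t in range(len(xs))]

xs = [1.0, 2.0, 3.0, 4.0]
rec = ssm_recurrent(xs, a=0.5, b=1.0, c=2.0)
conv = ssm_convolution(xs, a=0.5, b=1.0, c=2.0)
```

A selective model makes `a` and `b` depend on the current input, which breaks the fixed-kernel convolution view but keeps the linear-time scan.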
7. Architectures & Model Components
- FFN (Feed-Forward Network): The position-wise, two-layer Multi-Layer Perceptron (MLP) found inside every Transformer block, responsible for the bulk of the model's parameters and factual recall.
- MoE (Mixture of Experts): An architecture that replaces dense FFNs with multiple "expert" sub-networks. A learned routing mechanism sparsely activates only a few experts (e.g., 2 out of 8) per token, drastically increasing parameter count without proportionally increasing inference compute.
- BERT / T5: Pioneering models. BERT is an encoder-only model trained via masked language modeling. T5 is an encoder-decoder framework that formulates all NLP tasks as text-to-text generation.
- CLIP (Contrastive Language-Image Pre-training): A landmark dual-encoder model by OpenAI that maps images and text into a shared latent space, enabling powerful zero-shot image classification.
- LVLM / LLaVA (Large Vision-Language Model): Multimodal models that understand both text and images. LLaVA specifically connects a pre-trained vision encoder to an LLM via a simple projection layer, yielding impressive visual chat capabilities.
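The sparse routing described in the MoE entry can be sketched in a few lines. This example picks the top 2 of 8 experts for one token and renormalizes the router scores over the selected experts only, which is one common convention and not the only one. The experts are stand-in scalar functions, and all names are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the chosen experts; the other experts are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Stand-in experts: expert i simply multiplies its input by i.
experts = [lambda x, scale=i: scale * x for i in range(8)]
logits = [0.0, 2.0, -1.0, 1.0, 0.5, -0.5, 0.2, 0.1]
route = top_k_route(logits)          # experts 1 and 3 are selected
out = moe_forward(1.0, experts, logits)
```

Only 2 of the 8 expert functions run for this token, which is why parameter count can grow much faster than per-token compute.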
8. Scaling Laws & Data
- Kaplan Scaling Laws: Empirical findings showing power-law relationships between a model's cross-entropy loss and its parameter count, compute budget, and dataset size.
- Chinchilla Scaling Law: A crucial refinement by DeepMind demonstrating that earlier models (like GPT-3) were significantly undertrained. It established that for compute-optimal training, model size and training tokens should be scaled up in equal proportion, which works out to roughly 20 training tokens per parameter.
- BPE (Byte Pair Encoding) & Unigram LM: Standard subword tokenization algorithms that compress text into integer IDs. BPE merges frequent pairs, while Unigram prunes a large vocabulary based on probabilities.
- C4 & FineWeb: Massive, curated text datasets. C4 is derived from Common Crawl, while FineWeb represents the modern standard for ultra-high-quality, filtered web-scale training data.
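The Chinchilla rule of thumb turns into simple arithmetic once a compute estimate is fixed. The sketch below assumes the widely used approximation that training costs about 6 FLOPs per parameter per token ($C \approx 6ND$), which is not stated in this post, together with the roughly 20 tokens per parameter mentioned above. Solving $C = 6 \cdot N \cdot 20N$ gives $N = \sqrt{C/120}$. The function name is illustrative.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens) under the stated assumptions."""
    params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Budget chosen to match a 70B-parameter model trained on 1.4T tokens.
params, tokens = compute_optimal(5.88e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

Both constants are rough empirical fits, so the output is best read as an order-of-magnitude guide.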
9. Fine-Tuning, Alignment, & Reasoning
Base models only predict the next word. Alignment techniques shape them into helpful, safe assistants.
- SFT (Supervised Fine-Tuning): The first step of alignment, where a base model is trained on high-quality instruction-response pairs to learn formatting and conversational style.
- PEFT (Parameter-Efficient Fine-Tuning) & LoRA (Low-Rank Adaptation): Techniques to adapt massive models with limited compute. LoRA freezes the base weights and injects small, trainable low-rank matrices into the network, cutting memory usage drastically.
- QLoRA & NF4: QLoRA pushes efficiency further by quantizing the frozen base model to 4 bits using NF4 (4-bit NormalFloat), a data type designed to be information-theoretically optimal for normally distributed weights. This allows large models to be fine-tuned on consumer GPUs.
- RLHF (Reinforcement Learning from Human Feedback): The classic three-stage alignment pipeline involving SFT, training a Reward Model on human preferences, and optimizing the policy via RL.
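- PPO (Proximal Policy Optimization): The standard policy gradient RL algorithm used in RLHF. It clips the probability ratio between the new and old policies so that each update stays close to the previous policy. In RLHF, a separate KL penalty against the reference model keeps the policy from drifting far from its pre-trained behavior.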
- DPO (Direct Preference Optimization): A highly popular offline alignment method that bypasses the separate Reward Model entirely, directly optimizing the LLM to maximize the likelihood of preferred responses over dispreferred ones.
- GRPO / ORPO / SimPO / KTO: Modern alignment variants. GRPO (used in DeepSeek) eliminates the need for a value network. ORPO combines SFT and preference alignment. SimPO relies on average log-probabilities, and KTO uses a binary good/bad signal based on human utility functions.
- RLAIF (RL from AI Feedback): Scaling alignment by replacing expensive human annotators with feedback generated by a superior LLM.
- CoT (Chain-of-Thought): A prompting strategy that asks the model to generate step-by-step intermediate reasoning before outputting a final answer, which often substantially improves performance on logic and math tasks.
- PRM (Process Reward Model) & MCTS (Monte Carlo Tree Search): Advanced reasoning techniques. A PRM scores each step of a CoT reasoning trace rather than just the final outcome. When combined with search algorithms like MCTS, models can explore multiple reasoning paths and self-correct, simulating "System 2" thinking.
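The DPO entry above can be made concrete with the loss for a single preference pair. The sketch below takes summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model. The argument names and the default `beta=0.1` are illustrative assumptions. The loss is $-\log \sigma(\beta \cdot \text{margin})$, where the margin measures how much more the policy has shifted toward the chosen response than toward the rejected one, relative to the reference.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed response log-probs."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

The reference log-probabilities play the role of the KL anchor described in the glossary: the policy is rewarded for moving relative to the reference, not for raw likelihood.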
10. Inference, Quantization, & Serving
Deploying LLMs requires mitigating massive memory and compute bandwidth bottlenecks.
- Quantization (INT8 / INT4 / GGUF): Converting floating-point model weights into lower-precision integers. Formats like GGUF (used heavily in llama.cpp) package these quantized weights for highly efficient CPU/GPU inference.
- GPTQ & AWQ: Advanced post-training quantization methods. GPTQ uses second-order Hessian information, while AWQ observes activation distributions to preserve salient weight channels, maintaining much higher accuracy than naive quantization.
- vLLM & PagedAttention: vLLM is a widely used open-source serving engine. It utilizes PagedAttention, a technique inspired by virtual memory paging, to allocate the KV cache in fixed-size blocks on demand, which nearly eliminates fragmentation and substantially increases serving throughput.
- Prefill vs. Decode: The two phases of inference. Prefill processes the entire input prompt in parallel to compute the initial KV cache (compute-bound). Decode generates tokens one-by-one autoregressively (memory bandwidth-bound).
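For intuition about what "naive quantization" means in the entries above, here is a minimal sketch of symmetric absmax INT8 quantization over a small list of weights. It is the simple baseline that methods like GPTQ and AWQ improve on, not an implementation of either, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is at most half the scale, so a single large outlier inflates the error for every other weight that shares the scale. That sensitivity is a large part of why outlier-aware methods exist.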
11. Evaluation Benchmarks
How do we measure "intelligence"? These benchmarks define the leaderboards.
- MMLU (Massive Multitask Language Understanding): The standard general knowledge benchmark covering 57 subjects via multiple-choice questions.
- GSM8K & MATH: Rigorous benchmarks for mathematical reasoning, ranging from elementary school word problems (GSM8K) to competition-level high school math (MATH).
- HumanEval: The primary code generation benchmark, evaluating functional correctness (pass@k) of Python functions based on docstrings.
- TruthfulQA & HellaSwag: Tests for robustness. TruthfulQA probes against common misconceptions, while HellaSwag tests commonsense natural language inference using adversarially mined distractors.
- Chatbot Arena (LMSYS) & Elo Rating: A widely cited benchmark for holistic model evaluation. It relies on crowdsourced, blind side-by-side human voting and computes an Elo-style rating (similar to chess rankings) to rank models by human preference.
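The pass@k metric mentioned under HumanEval is usually computed with the unbiased estimator introduced alongside that benchmark: generate $n$ samples per problem, count the $c$ that pass the unit tests, and estimate the chance that at least one of $k$ randomly drawn samples passes as $1 - \binom{n-c}{k} / \binom{n}{k}$. A small sketch of that formula for a single problem:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

The benchmark score is the average of this quantity over all problems. With `k=1` it reduces to the fraction of correct samples, `c / n`.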
12. Distributed Training & Parallelism
Training a 100-billion parameter model requires orchestrating thousands of GPUs seamlessly.
- DP / DDP / FSDP: Data Parallelism strategies. DDP replicates the model across GPUs. FSDP (Fully Sharded Data Parallel) shards model parameters, gradients, and optimizer states across workers to break the single-GPU memory barrier.
- ZeRO (Zero Redundancy Optimizer): The memory optimization stages behind libraries like DeepSpeed. ZeRO-1/2/3 progressively shard optimizer states, gradients, and parameters, heavily influencing FSDP.
- TP (Tensor Parallelism): Slicing individual matrix operations (like a single attention layer) across multiple GPUs. Crucial for models too large to fit on a single chip, though it requires incredibly fast inter-GPU communication (NVLink).
- PP (Pipeline Parallelism): Splitting the network layer-by-layer across GPUs. Execution is staggered into micro-batches using schedules like 1F1B (One-Forward-One-Backward) to minimize idle GPU time ("pipeline bubbles").
- SP (Sequence Parallelism): Dividing the sequence length dimension across GPUs, vital for training models on massive context windows.
- AllReduce / AllGather / ReduceScatter: The fundamental collective communication primitives used by NCCL to synchronize gradients and weights across the cluster.
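The three collectives named above are related: an AllReduce gives the same result as a ReduceScatter followed by an AllGather, which is the decomposition that sharded schemes like ZeRO and FSDP exploit. The sketch below simulates the semantics on plain Python lists, one list per rank. It models only what each rank ends up holding, not how NCCL actually moves the data, and it assumes the buffer length is divisible by the number of ranks.

```python
def reduce_scatter(buffers):
    """Rank r ends up with the element-wise sum of shard r only."""
    world = len(buffers)
    shard = len(buffers[0]) // world
    return [
        [sum(buf[r * shard + j] for buf in buffers) for j in range(shard)]
        for r in range(world)
    ]

def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def all_reduce(buffers):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

# Two ranks, each holding its own local gradients.
grads = [[1, 2, 3, 4], [10, 20, 30, 40]]
sharded = reduce_scatter(grads)       # rank 0: [11, 22], rank 1: [33, 44]
two_step = all_gather(sharded)
one_step = all_reduce(grads)
```

Plain DDP needs the full `one_step` result on every rank. A sharded optimizer can stop after `reduce_scatter`, update only its own shard, and gather parameters later.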
13. Retrieval-Augmented Generation (RAG) & Agents
Connecting static LLMs to dynamic external knowledge and actions.
- RAG (Retrieval-Augmented Generation): The dominant enterprise architecture. Instead of relying on internal model weights for facts, relevant documents are retrieved from a database and injected into the prompt.
- BM25 vs. Dense Embeddings: BM25 is the classic, highly effective sparse lexical (keyword) search. Dense embeddings (like BGE or E5) use neural networks to map text into vector spaces for semantic similarity search.
- SPLADE: A hybrid approach generating learned sparse representations, combining the exact-match benefits of BM25 with the semantic understanding of dense models.
- RRF (Reciprocal Rank Fusion): A robust algorithm for combining search results from multiple retrieval strategies (e.g., lexical + semantic) into a single, highly accurate ranked list.
- Vector Search (FAISS, ScaNN, HNSW): Libraries and index algorithms designed for fast approximate nearest-neighbor (ANN) search across millions of vector embeddings. FAISS and ScaNN are libraries; HNSW (Hierarchical Navigable Small World) is a graph-based index, while IVF-PQ combines clustering with product quantization.
- ReAct (Reasoning + Acting): A foundational agentic paradigm. The LLM is prompted to explicitly interleave internal reasoning traces ("Thought:") with external environment actions ("Action: Tool Call"), enabling it to solve multi-step problems autonomously.
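Reciprocal Rank Fusion is simple enough to show in full. Each document receives $1 / (k + \text{rank})$ from every ranked list it appears in, and the sums determine the fused order. The constant $k = 60$ is the value commonly used in practice; the document IDs below are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

lexical = ["a", "b", "c"]    # e.g. BM25 results
semantic = ["b", "c", "d"]   # e.g. dense-embedding results
fused = reciprocal_rank_fusion([lexical, semantic])
```

Because only ranks are used, the method needs no score normalization between retrievers. Documents that appear in both lists ("b" and "c" here) rise above documents that top only one list.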
14. Frameworks & Ecosystem
- PyTorch: The dominant open-source deep learning framework (by Meta), heavily favored for its dynamic computation graph and developer experience.
- JAX: A high-performance numerical computing library from Google featuring automatic differentiation and Just-In-Time compilation, favored for large-scale model training.
- XLA (Accelerated Linear Algebra): A domain-specific compiler that optimizes linear algebra operations across backend accelerators (GPUs/TPUs). Used extensively by JAX and TensorFlow.
- cs.CL / cs.LG: The standard arXiv categories where cutting-edge ML research is published (Computation and Language / Machine Learning).
Conclusion
The terminology in artificial intelligence evolves just as rapidly as the underlying architectures. By understanding these core concepts—from how we optimize weights with AdamW, to how we serve models with vLLM, to how we align them with DPO—you gain the fundamental vocabulary needed to navigate research papers, architectural design documents, and the broader AI engineering ecosystem.