AI Is Mathematics, Not Magic
The Foundation: From Text to Numbers
Tokenization: Breaking Language into Pieces
Large Language Models (LLMs) cannot read text. They process numbers. Tokenization converts text into numerical representations.
Byte-Pair Encoding (BPE) Example
Input: "It's over 9000!"
Tokenization (GPT-2):
- "It" → 1026
- "'s" → 338
- " over" → 625
- " 9000" → 50138
- "!" → 0
Result: [1026, 338, 625, 50138, 0]
Vocabulary Building Process
1. Start with individual characters (a-z, punctuation)
2. Scan training corpus for most frequent pairs
3. Merge frequent pairs into new tokens
4. Repeat until vocabulary limit reached (e.g., 50,000 tokens)
5. Final vocabulary: mix of characters, subwords, common words
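The merge loop above can be sketched in a few lines of Python. This is a toy illustration of one BPE iteration on a hand-picked corpus, not GPT-2's actual merge table or vocabulary:

```python
from collections import Counter

def bpe_merge_step(corpus):
    """One BPE iteration: find the most frequent adjacent symbol pair
    and merge it into a single new token everywhere it occurs."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])  # merge the pair
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged, best

# Step 1: start from individual characters
corpus = [list("lower"), list("lowest"), list("low")]
# Steps 2-3: one scan-and-merge pass
corpus, pair = bpe_merge_step(corpus)
print(pair)    # ('l', 'o') — the most frequent adjacent pair
print(corpus)  # [['lo', 'w', 'e', 'r'], ['lo', 'w', 'e', 's', 't'], ['lo', 'w']]
```

Repeating this step until the vocabulary limit is reached (step 4) yields the final mix of characters, subwords, and common words.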
Token ID Properties
- Same token = same ID across all contexts
- The token “a” appearing twice in a sentence gets the same ID both times
- Sequence order is preserved separately, through position information
Embeddings: Mapping Tokens to Vectors
After tokenization, each token ID is converted to a high-dimensional vector called an embedding.
Embedding Characteristics
Dimensions: 512 (small models) to 12,288 (GPT-3 175B)
Values: Learned during training, typically small values roughly in the -1 to 1 range
Semantic relationships preserved:
- "king" - "man" + "woman" ≈ "queen"
- "Paris" - "France" + "Italy" ≈ "Rome"
Positional Encoding
Since transformers process all tokens simultaneously, position information must be added explicitly:
Original: [embedding_of_"The"]
After positional encoding: [embedding_of_"The" + position_0_vector]
Mathematical forms:
- Sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), with cosine for odd dimensions
- Learned: Trainable position embeddings
- Rotary (RoPE): Rotates query/key vectors by position angle
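The sinusoidal form can be sketched directly from the formula (interleaving sin/cos pairs per frequency; actual implementations differ in how they lay out the dimensions):

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal positional encoding: even dims use sin, odd dims use cos,
    at geometrically spaced frequencies (from 'Attention Is All You Need')."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

pe0 = sinusoidal_position(0, 8)
print(pe0)  # position 0: all sin terms are 0.0, all cos terms are 1.0
```

This vector is added element-wise to the token's embedding, exactly as in the "After positional encoding" line above.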
The Transformer Architecture
Core Components
Self-Attention Mechanism
Purpose: Calculate relevance between every token pair
Input sequence: "The cat sat on the mat"
Attention scores (simplified):
        The   cat   sat   on    the   mat
The     0.3   0.2   0.1   0.1   0.2   0.1
cat     0.2   0.3   0.2   0.1   0.1   0.1
sat     0.1   0.2   0.3   0.2   0.1   0.1
...
Mathematical operation:
Attention(Q,K,V) = softmax(QK^T / √d_k) × V
Where:
Q (Query): What am I looking for?
K (Key): What do I contain?
V (Value): What information do I provide?
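The attention formula can be sketched in pure Python for a toy 2-token example (loop-based stand-in for the matrix multiplies; the Q/K/V values are illustrative, not from a real model):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, with each matrix a list of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot each query against every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two tokens, d_k = 2; each query matches one key strongly
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
```

The first output row leans toward the first value vector and the second toward the second, because each query aligns with the corresponding key.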
Multi-Head Attention
Instead of a single attention calculation, run several heads in parallel; in trained models, individual heads often specialize, for example:
Head 1: Syntactic relationships (subject-verb agreement)
Head 2: Semantic relationships (word meanings)
Head 3: Positional relationships (token distances)
Head 4: Coreference resolution (pronoun antecedents)
Concatenate the outputs of all heads (commonly 8-16 in mid-sized models) and project back to the model dimension
Feed-Forward Networks
After attention, each token is processed independently by a position-wise feed-forward network:
Adds non-linearity, allows complex pattern learning
Typically 4x expansion: 512 → 2048 → 512 dimensions
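A toy sketch of the expand-then-project shape (2 → 4 → 2 here instead of 512 → 2048 → 512; the weights are hand-picked for illustration, whereas real weights are learned):

```python
def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: linear expansion, ReLU non-linearity, linear
    projection back down. W1/W2 are lists of per-unit weight vectors
    (a loop-based stand-in for matrix multiplies)."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)  # ReLU
              for col, b in zip(W1, b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(W2, b2)]

# Toy weights chosen so the net computes relu(x) - relu(-x) = x,
# showing how the 4x expansion lets ReLU units split a signal apart
W1 = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b1 = [0, 0, 0, 0]
W2 = [[1, 0, -1, 0], [0, 1, 0, -1]]
b2 = [0, 0]
print(feed_forward([2.0, -3.0], W1, b1, W2, b2))  # [2.0, -3.0]
```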
Layer Stack
Input Embeddings + Positional Encoding
↓
[Layer 1]
Multi-Head Self-Attention
Add & Normalize (residual connection)
Feed-Forward Network
Add & Normalize
↓
[Layer 2] (repeat 12-96 times)
↓
[Layer N]
↓
Linear Projection to Vocabulary Size
Softmax → Probability Distribution
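The data flow of one layer can be sketched with stub sub-blocks. This is a post-norm variant (normalize after the residual add, as in the original Transformer paper); the stubs stand in for the learned attention and FFN:

```python
def add(a, b):
    """Element-wise residual add over a sequence of vectors."""
    return [[ai + bi for ai, bi in zip(va, vb)] for va, vb in zip(a, b)]

def transformer_layer(x, attention, ffn, norm):
    """One layer: each sub-block is wrapped in a residual connection
    followed by normalization ('Add & Normalize')."""
    x = norm(add(x, attention(x)))  # Multi-Head Self-Attention
    x = norm(add(x, ffn(x)))        # Feed-Forward Network
    return x

# Stub sub-blocks, just to show the wiring (real ones are learned)
identity = lambda seq: seq
zeros = lambda seq: [[0.0] * len(v) for v in seq]

seq = [[1.0, 2.0], [3.0, 4.0]]
out = transformer_layer(seq, zeros, zeros, identity)
print(out)  # unchanged: the residual path carries the input through
```

The zero-output stubs make the residual connections visible: even when a sub-block contributes nothing, the input survives the layer, which is what makes 12-96 layer stacks trainable.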
Decoder-Only Architecture (GPT-style)
Autoregressive Generation
Training phase:
Input: "The cat sat on the..."
Target: "cat sat on the mat"
Masking prevents looking at future tokens:
Position 0 can see: [The]
Position 1 can see: [The, cat]
Position 2 can see: [The, cat, sat]
Loss calculation: Cross-entropy between predicted and actual next token
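The causal mask described above is a simple lower-triangular pattern; a minimal sketch:

```python
def causal_mask(n):
    """n x n mask: position i may attend to positions 0..i (True = visible)."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
mask = causal_mask(len(tokens))

# Position 2 ("sat") can see only itself and earlier tokens:
visible = [t for t, ok in zip(tokens, mask[2]) if ok]
print(visible)  # ['The', 'cat', 'sat']
```

In practice the mask is applied by setting the disallowed attention scores to -infinity before the softmax, so future tokens receive zero weight.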
Generation Process
1. Start with prompt tokens: "Once upon a"
2. Forward pass through model
3. Output: probability distribution over vocabulary
4. Sample next token (greedy, temperature, or nucleus sampling)
5. Append token to sequence
6. Repeat until stop condition
Temperature sampling:
- T=0.1: Near-deterministic; almost always picks the highest-probability tokens
- T=1.0: Samples from the model's unmodified distribution
- T=2.0: Very random; more creative but often incoherent
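Temperature sampling (step 4 of the generation loop) can be sketched as follows; the 3-token logits are hypothetical model outputs:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Scale logits by 1/T, softmax into probabilities, sample an index.
    T -> 0 is treated as greedy (argmax) decoding."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                     # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]  # hypothetical distribution over 3 tokens
print(sample_with_temperature(logits, 0.0))  # 0 (greedy: the argmax)
```

Low temperatures sharpen the distribution toward the argmax; high temperatures flatten it, which is exactly the T=0.1 vs T=2.0 behavior listed above.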
Diffusion Models for Images
Forward Process: Adding Noise
Original image: x₀
Timestep 1: x₁ = √(1-β₁)x₀ + √β₁·ε₁ (slight noise)
Timestep 2: x₂ = √(1-β₂)x₁ + √β₂·ε₂ (more noise)
...
Timestep T: x_T ≈ pure Gaussian noise
Mathematical property: Can jump to any timestep directly
q(x_t | x_0) = N(x_t; √ᾱₜ x₀, (1-ᾱₜ)I)
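The closed form above can be sketched on a toy 3-pixel "image" (pure-Python stand-in; real models operate on image tensors):

```python
import math
import random

def noisy_sample(x0, alpha_bar_t, rng=random):
    """Sample from q(x_t | x_0) = N(sqrt(a_bar)*x0, (1 - a_bar)*I):
    scale the clean signal down and add scaled Gaussian noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * rng.gauss(0.0, 1.0) for x in x0]

x0 = [0.5, -0.2, 0.9]             # a tiny "image" of 3 pixel values
x_early = noisy_sample(x0, 0.99)  # early timestep: almost no noise
x_late = noisy_sample(x0, 0.01)   # late timestep: almost pure noise
```

As ᾱₜ falls from 1 toward 0 over the timesteps, the signal term shrinks and the noise term dominates, which is why x_T is approximately pure Gaussian noise.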
Reverse Process: Denoising
Neural network learns:
Input: Noisy image
Output: Predicted noise
Update rule:
x_{t-1} = (1/√αₜ)·(xₜ − ((1−αₜ)/√(1−ᾱₜ))·ε_θ(xₜ, t)) + σₜ·z
Where z ~ N(0,I) for stochastic sampling
Training Objective
Minimize:
L = E_{t, x₀, ε}[‖ε − ε_θ(xₜ, t)‖²]
Simple mean squared error between actual and predicted noise
Text-to-Image Conditioning
Cross-attention layers connect text and image:
Text encoder (CLIP/T5) → Text embeddings
↓
U-Net denoising network ← Cross-attention
↓
Generated image
Classifier-Free Guidance (CFG):
- Generate with text conditioning: ε_c
- Generate without conditioning: ε_u
- Final output: ε = ε_u + s·(ε_c − ε_u)
Where s = guidance scale (7.5 typical, higher = more literal)
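The CFG combination is a single linear blend per element; a sketch on toy noise predictions (the values are illustrative):

```python
def cfg_combine(eps_uncond, eps_cond, scale=7.5):
    """Classifier-free guidance: eps_u + s * (eps_c - eps_u).
    scale=1 recovers the plain conditional prediction; larger values
    push the sample harder toward the text prompt."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.1, 0.2]  # hypothetical unconditional noise prediction
eps_c = [0.3, 0.0]  # hypothetical text-conditioned prediction
blended = cfg_combine(eps_u, eps_c, scale=7.5)
# scale=1.0 reproduces eps_c (up to float rounding):
print(cfg_combine(eps_u, eps_c, scale=1.0))
```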
Synthetic Media Detection
Forensic Indicators
Image Artifacts
Diffusion model fingerprints:
- Unnatural hair strands at boundaries
- Inconsistent eye reflections (pupil shape, catchlights)
- Teeth irregularities (extra/missing incisors)
- Ear structure anomalies
- Background texture repetition
- EXIF metadata inconsistencies
Video Analysis
Temporal inconsistencies:
- Irregular blinking patterns
- Lip-sync mismatches (audio-visual desynchronization)
- Inconsistent head pose across frames
- Lighting changes between shots
- Frame-to-frame jitter in static backgrounds
Audio Detection
Voice cloning artifacts:
- Spectrogram discontinuities
- Unnatural breathing patterns
- Prosody inconsistencies (stress, intonation)
- Room impulse response mismatches
- High-frequency noise in silent segments
Detection Architectures
CNN-Based Detectors
Input: Image
↓
Convolutional layers (ResNet/EfficientNet backbone)
↓
Global average pooling
↓
Fully connected classifier
↓
Output: Real / Fake probability
Training: Binary cross-entropy on labeled datasets
(FaceForensics++, Celeb-DF, DeepFakeDetection)
Vision Transformers for Deepfakes
Patch embedding: 16×16 pixel patches → vectors
↓
Transformer encoder (self-attention across patches)
↓
Classification token
↓
MLP head
↓
Real / Fake prediction
Advantage: Captures global inconsistencies better than CNNs
Mathematical Foundations Summary
| Component | Core Mathematics | Purpose |
|---|---|---|
| Tokenization | Frequency analysis, string algorithms | Text → discrete units |
| Embeddings | High-dimensional vector spaces | Semantic representation |
| Attention | Matrix multiplication, softmax | Contextual relationships |
| Transformers | Linear algebra, gradient descent | Sequence modeling |
| Diffusion | Stochastic differential equations | Image generation |
| Sampling | Probability distributions, temperature | Controlled randomness |
| Training | Backpropagation, cross-entropy | Parameter optimization |
Key Equations
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
QKᵀ / √d_k produces the attention scores
softmax normalizes each row into a probability distribution
Layer Normalization
LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β
μ = mean across features
σ² = variance across features
γ, β = learned scale and shift parameters
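The LayerNorm equation translates directly to code; a minimal pure-Python sketch (γ and β default to identity, as at initialization):

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm(x) = gamma * (x - mu) / sqrt(var + eps) + beta,
    with mean and variance taken across the feature dimension."""
    n = len(x)
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    return [g * (xi - mu) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
# output has approximately zero mean and unit variance
```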
Diffusion Training Loss
L = E_{t, x₀, ε}[‖ε − ε_θ(xₜ, t)‖²]
Predict the added noise; minimize its mean squared error