How GenAI Works in Text and Synthetic Media

AI Is Mathematics, Not Magic


The Foundation: From Text to Numbers

Tokenization: Breaking Language into Pieces

Large Language Models (LLMs) cannot read text. They process numbers. Tokenization converts text into numerical representations.

Byte-Pair Encoding (BPE) Example

Input: "It's over 9000!"

Tokenization (GPT-2):
- "It" → 1026
- "'s" → 338  
- " over" → 625
- " 9000" → 50138
- "!" → 0

Result: [1026, 338, 625, 50138, 0]

Vocabulary Building Process

1. Start with individual characters (a-z, punctuation)
2. Scan training corpus for most frequent pairs
3. Merge frequent pairs into new tokens
4. Repeat until vocabulary limit reached (e.g., 50,000 tokens)
5. Final vocabulary: mix of characters, subwords, common words
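The merge loop above can be sketched in a few lines of Python. This is a toy illustration on an invented three-word corpus, not a production tokenizer:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three merge rounds
    best = pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
print(list(corpus))  # frequent pairs like "we" have become subword tokens
```

Real tokenizers run tens of thousands of such merges over a huge corpus, which is how common words end up as single tokens while rare words split into subwords.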

Token ID Properties

  • Same token = same ID across all contexts
  • "a" appearing twice in a sentence gets the same ID both times
  • Sequence order is preserved through position information

Embeddings: Mapping Tokens to Vectors

After tokenization, each ID converts to a high-dimensional vector (embedding).

Embedding Characteristics

Dimensions: 512 (small models) to 12,288 (GPT-3 175B class)
Values: Learned during training, typically -1 to 1 range

Semantic relationships preserved:
- "king" - "man" + "woman" ≈ "queen"
- "Paris" - "France" + "Italy" ≈ "Rome"
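The analogy arithmetic can be demonstrated with toy vectors. The 3-dimensional embeddings below are invented for illustration; real embeddings are learned and have hundreds or thousands of dimensions:

```python
import numpy as np

# Toy embeddings, invented for illustration only
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

result = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(result, emb[w]))
print(nearest)  # → queen
```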

Positional Encoding

Since transformers process all tokens simultaneously, position information must be added explicitly:

Original: [embedding_of_"The"]
After positional encoding: [embedding_of_"The" + position_0_vector]

Mathematical forms:
- Sinusoidal: sin(position / 10000^(2i/d_model))
- Learned: Trainable position embeddings
- Rotary (RoPE): Rotates query/key vectors by position angle
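The sinusoidal form can be implemented directly from the formula above. A minimal NumPy sketch (interleaving sine and cosine across the feature dimension):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Classic sinusoidal positional encoding (d_model must be even)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16), one vector per position
```

Each position gets a unique pattern of frequencies, so the model can recover both absolute and relative positions from the sums.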

The Transformer Architecture

Core Components

Self-Attention Mechanism

Purpose: Calculate relevance between every token pair

Input sequence: "The cat sat on the mat"

Attention scores (simplified):
          The   cat   sat   on   the   mat
The      0.3   0.2   0.1   0.1   0.2   0.1
cat      0.2   0.3   0.2   0.1   0.1   0.1
sat      0.1   0.2   0.3   0.2   0.1   0.1
...

Mathematical operation:
Attention(Q,K,V) = softmax(QK^T / √d_k) × V

Where:
Q (Query): What am I looking for?
K (Key): What do I contain?
V (Value): What information do I provide?
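The full equation fits in a few lines of NumPy. A single-head sketch with random toy inputs (real models batch this and add projections for Q, K, V):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): relevance of every token pair
    weights = softmax(scores)          # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 6, 8                          # 6 tokens, head dimension 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))
out, weights = attention(Q, K, V)
print(out.shape)  # (6, 8): one context-mixed vector per token
```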

Multi-Head Attention

Instead of a single attention calculation, multiple parallel heads are used. In a simplified, illustrative view, different heads can specialize:

Head 1: Syntactic relationships (subject-verb agreement)
Head 2: Semantic relationships (word meanings)
Head 3: Positional relationships (token distances)
Head 4: Coreference resolution (pronoun antecedents)

Concatenate outputs from all 8-16 heads, project to model dimension

Feed-Forward Networks

After attention, each token is processed independently:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Adds non-linearity, allows complex pattern learning
Typically 4x expansion: 512 → 2048 → 512 dimensions
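The expand-ReLU-project pattern, sketched in NumPy with toy random weights (real weights are learned):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                       # the 4x expansion
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))              # 10 token vectors
W1 = rng.normal(size=(d_model, d_ff)) * 0.02    # toy random weights
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b1, b2 = np.zeros(d_ff), np.zeros(d_model)
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (10, 512): same shape in and out
```

Because the same weights apply to every position, the FFN adds capacity without mixing information across tokens; that mixing is attention's job.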

Layer Stack

Input Embeddings + Positional Encoding
        ↓
    [Layer 1]
    Multi-Head Self-Attention
    Add & Normalize (residual connection)
    Feed-Forward Network  
    Add & Normalize
        ↓
    [Layer 2] (repeat 12-96 times)
        ↓
    [Layer N]
        ↓
Linear Projection to Vocabulary Size
Softmax → Probability Distribution

Decoder-Only Architecture (GPT-style)

Autoregressive Generation

Training phase:
Input:  "The cat sat on the..."
Target: "cat sat on the mat"

Masking prevents looking at future tokens:
Position 0 can see: [The]
Position 1 can see: [The, cat]
Position 2 can see: [The, cat, sat]
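The mask is a lower-triangular matrix applied to the attention scores before softmax. A minimal sketch with random toy scores:

```python
import numpy as np

seq_len = 4
# Causal mask: True where attention is allowed (position i sees j <= i)
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))
masked = np.where(mask, scores, -np.inf)   # future positions get -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # position 0 attends only to itself: [1. 0. 0. 0.]
```

Setting masked scores to -inf makes their softmax weight exactly zero, so no probability mass ever flows from future tokens.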

Loss calculation: Cross-entropy between predicted and actual next token

Generation Process

1. Start with prompt tokens: "Once upon a"
2. Forward pass through model
3. Output: probability distribution over vocabulary
4. Sample next token (greedy, temperature, or nucleus sampling)
5. Append token to sequence
6. Repeat until stop condition

Temperature sampling:
- T=0.1: Nearly deterministic; the highest-probability tokens dominate
- T=1.0: Random sampling from distribution
- T=2.0: Very random, creative but incoherent
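Temperature is just a divisor on the logits before softmax. A toy sketch with an invented four-token vocabulary:

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Sample a token index after temperature-scaling the logits."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # stabilized softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5, 0.1])    # toy vocabulary of 4 tokens
rng = np.random.default_rng(0)
low_t  = [sample_token(logits, 0.1, rng) for _ in range(100)]
high_t = [sample_token(logits, 2.0, rng) for _ in range(100)]
print(len(set(low_t)), len(set(high_t)))   # low T collapses to the top token
```

Dividing by a small T stretches the gaps between logits, concentrating probability on the top token; a large T compresses them toward uniform.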

Diffusion Models for Images

Forward Process: Adding Noise

Original image: x₀
Timestep 1: x₁ = √(1-β₁)x₀ + √β₁·ε₁  (slight noise)
Timestep 2: x₂ = √(1-β₂)x₁ + √β₂·ε₂  (more noise)
...
Timestep T: x_T ≈ pure Gaussian noise

Mathematical property: Can jump to any timestep directly
q(x_t | x_0) = N(x_t; √ᾱₜ x₀, (1-ᾱₜ)I)
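That closed-form jump is easy to verify numerically. A sketch assuming the common linear β schedule (as in DDPM), with a random array standing in for an image:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (a common choice)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # ᾱₜ = product of α₁…αₜ

def q_sample(x0, t, rng):
    """Jump straight to timestep t: x_t = √ᾱₜ·x₀ + √(1-ᾱₜ)·ε."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))         # stand-in for an image
x_late = q_sample(x0, T - 1, rng)    # by t = T-1 this is nearly pure noise
print(alpha_bar[-1])                 # close to zero: the signal is almost gone
```

This is why training is efficient: any timestep can be sampled in one step, with no need to simulate the chain.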

Reverse Process: Denoising

Neural network learns: p_θ(x_{t-1} | x_t)

Input: Noisy image x_t + timestep t
Output: Predicted noise ε_θ(x_t, t)

Update rule (DDPM sampling step):
x_{t-1} = (1/√αₜ) · (x_t - ((1-αₜ)/√(1-ᾱₜ)) · ε_θ(x_t, t)) + σₜ·z
Where z ~ N(0,I) for stochastic sampling

Training Objective

Minimize: E[||ε - ε_θ(x_t, t)||²]

Simple mean squared error between actual and predicted noise

Text-to-Image Conditioning

Cross-attention layers connect text and image:

Text encoder (CLIP/T5) → Text embeddings
                              ↓
U-Net denoising network ← Cross-attention
                              ↓
Generated image

Classifier-Free Guidance (CFG):
- Generate with text conditioning: ε_c
- Generate without conditioning: ε_u
- Final output: ε = ε_u + s·(ε_c - ε_u)

Where s = guidance scale (7.5 typical, higher = more literal)
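The guidance formula itself is one line of arithmetic. A sketch with invented toy noise predictions standing in for the two model passes:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate toward the conditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)

# Toy noise predictions, invented for illustration
eps_u = np.array([0.1, 0.2])
eps_c = np.array([0.3, 0.0])
print(cfg(eps_u, eps_c, 1.0))  # s=1 recovers the conditional prediction
print(cfg(eps_u, eps_c, 7.5))  # typical scale extrapolates well past it
```

With s > 1 the output moves beyond the conditional prediction in the direction away from the unconditional one, which is what makes generations follow the prompt more literally.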

Synthetic Media Detection

Forensic Indicators

Image Artifacts

Diffusion model fingerprints:
- Unnatural hair strands at boundaries
- Inconsistent eye reflections (pupil shape, catchlights)
- Teeth irregularities (extra/missing incisors)
- Ear structure anomalies
- Background texture repetition
- EXIF metadata inconsistencies

Video Analysis

Temporal inconsistencies:
- Irregular blinking patterns
- Lip-sync mismatches (audio-visual desynchronization)
- Inconsistent head pose across frames
- Lighting changes between shots
- Frame-to-frame jitter in static backgrounds

Audio Detection

Voice cloning artifacts:
- Spectrogram discontinuities
- Unnatural breathing patterns
- Prosody inconsistencies (stress, intonation)
- Room impulse response mismatches
- High-frequency noise in silent segments

Detection Architectures

CNN-Based Detectors

Input: Image
↓
Convolutional layers (ResNet/EfficientNet backbone)
↓
Global average pooling
↓
Fully connected classifier
↓
Output: Real / Fake probability

Training: Binary cross-entropy on labeled datasets
(FaceForensics++, Celeb-DF, DeepFakeDetection)

Vision Transformers for Deepfakes

Patch embedding: 16×16 pixel patches → vectors
↓
Transformer encoder (self-attention across patches)
↓
Classification token
↓
MLP head
↓
Real / Fake prediction

Advantage: Captures global inconsistencies better than CNNs

Mathematical Foundations Summary

Component      Core Mathematics                        Purpose
Tokenization   Frequency analysis, string algorithms   Text → discrete units
Embeddings     High-dimensional vector spaces          Semantic representation
Attention      Matrix multiplication, softmax          Contextual relationships
Transformers   Linear algebra, gradient descent        Sequence modeling
Diffusion      Stochastic differential equations       Image generation
Sampling       Probability distributions, temperature  Controlled randomness
Training       Backpropagation, cross-entropy          Parameter optimization

Key Equations

Scaled Dot-Product Attention


Attention(Q, K, V) = softmax(QK^T / √d_k)V

Q, K, V ∈ R^(n×d_k)
QK^T ∈ R^(n×n)  (attention scores)
softmax normalizes each row to a probability distribution

Layer Normalization

LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β

μ = mean across features
σ² = variance across features
γ, β = learned scale and shift parameters
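The equation above, implemented directly in NumPy (γ and β initialized to the identity transform, as they are at the start of training):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector across its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # one token, 4 features
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(), out.std())            # ≈ 0 and ≈ 1 after normalization
```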

Diffusion Training Loss

L = E_{x₀~q(x₀), ε~N(0,I), t~Uniform(1,T)} [||ε - ε_θ(√ᾱₜ·x₀ + √(1-ᾱₜ)·ε, t)||²]
