How GenAI Works in Text and Synthetic Media

AI Is Mathematics, Not Magic


The Foundation: From Text to Numbers

Tokenization: Breaking Language into Pieces

Large Language Models (LLMs) cannot read text. They process numbers. Tokenization converts text into numerical representations.

Byte-Pair Encoding (BPE) Example

Input: "It's over 9000!"

Tokenization (GPT-2):
- "It" → 1026
- "'s" → 338  
- " over" → 625
- " 9000" → 50138
- "!" → 0

Result: [1026, 338, 625, 50138, 0]

Vocabulary Building Process

1. Start with individual characters (a-z, punctuation)
2. Scan training corpus for most frequent pairs
3. Merge frequent pairs into new tokens
4. Repeat until vocabulary limit reached (e.g., 50,000 tokens)
5. Final vocabulary: mix of characters, subwords, common words
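The merge loop above can be sketched in a few lines of Python. This is a toy illustration on an invented three-word corpus, not a production tokenizer:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three merge rounds
    best = pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
print(list(corpus))  # frequent pairs like "we" have become subword tokens
```

Real tokenizers run tens of thousands of such merges over a huge corpus, which is how common words end up as single tokens while rare words split into subwords.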

Token ID Properties

  • Same token = same ID across all contexts
  • "a" appearing twice in a sentence gets the same ID both times
  • Sequence order is preserved through position information

Embeddings: Mapping Tokens to Vectors

After tokenization, each ID converts to a high-dimensional vector (embedding).

Embedding Characteristics

Dimensions: 512 (small models) to 12,288 (GPT-3 175B class)
Values: Learned during training, typically -1 to 1 range

Semantic relationships preserved:
- "king" - "man" + "woman" ≈ "queen"
- "Paris" - "France" + "Italy" ≈ "Rome"
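The analogy arithmetic can be demonstrated with toy vectors. The 3-dimensional embeddings below are invented for illustration; real embeddings are learned and have hundreds or thousands of dimensions:

```python
import numpy as np

# Toy embeddings, invented for illustration only
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

result = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(result, emb[w]))
print(nearest)  # → queen
```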

Positional Encoding

Since transformers process all tokens simultaneously, position information must be added explicitly:

Original: [embedding_of_"The"]
After positional encoding: [embedding_of_"The" + position_0_vector]

Mathematical forms:
- Sinusoidal: sin(position / 10000^(2i/d_model))
- Learned: Trainable position embeddings
- Rotary (RoPE): Rotates query/key vectors by position angle
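The sinusoidal form can be implemented directly from the formula above. A minimal NumPy sketch (interleaving sine and cosine across the feature dimension):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Classic sinusoidal positional encoding (d_model must be even)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16), one vector per position
```

Each position gets a unique pattern of frequencies, so the model can recover both absolute and relative positions from the sums.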

The Transformer Architecture

Core Components

Self-Attention Mechanism

Purpose: Calculate relevance between every token pair

Input sequence: "The cat sat on the mat"

Attention scores (simplified):
          The   cat   sat   on   the   mat
The      0.3   0.2   0.1   0.1   0.2   0.1
cat      0.2   0.3   0.2   0.1   0.1   0.1
sat      0.1   0.2   0.3   0.2   0.1   0.1
...

Mathematical operation:
Attention(Q,K,V) = softmax(QK^T / √d_k) × V

Where:
Q (Query): What am I looking for?
K (Key): What do I contain?
V (Value): What information do I provide?
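The full equation fits in a few lines of NumPy. A single-head sketch with random toy inputs (real models batch this and add projections for Q, K, V):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): relevance of every token pair
    weights = softmax(scores)          # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 6, 8                          # 6 tokens, head dimension 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))
out, weights = attention(Q, K, V)
print(out.shape)  # (6, 8): one context-mixed vector per token
```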

Multi-Head Attention

Instead of a single attention calculation, multiple parallel heads are used. In a simplified, illustrative view, different heads can specialize:

Head 1: Syntactic relationships (subject-verb agreement)
Head 2: Semantic relationships (word meanings)
Head 3: Positional relationships (token distances)
Head 4: Coreference resolution (pronoun antecedents)

Concatenate outputs from all 8-16 heads, project to model dimension

Feed-Forward Networks

After attention, each token is processed independently:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Adds non-linearity, allows complex pattern learning
Typically 4x expansion: 512 → 2048 → 512 dimensions
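The expand-ReLU-project pattern, sketched in NumPy with toy random weights (real weights are learned):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                       # the 4x expansion
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))              # 10 token vectors
W1 = rng.normal(size=(d_model, d_ff)) * 0.02    # toy random weights
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b1, b2 = np.zeros(d_ff), np.zeros(d_model)
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (10, 512): same shape in and out
```

Because the same weights apply to every position, the FFN adds capacity without mixing information across tokens; that mixing is attention's job.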

Layer Stack

Input Embeddings + Positional Encoding
        ↓
    [Layer 1]
    Multi-Head Self-Attention
    Add & Normalize (residual connection)
    Feed-Forward Network  
    Add & Normalize
        ↓
    [Layer 2] (repeat 12-96 times)
        ↓
    [Layer N]
        ↓
Linear Projection to Vocabulary Size
Softmax → Probability Distribution

Decoder-Only Architecture (GPT-style)

Autoregressive Generation

Training phase:
Input:  "The cat sat on the..."
Target: "cat sat on the mat"

Masking prevents looking at future tokens:
Position 0 can see: [The]
Position 1 can see: [The, cat]
Position 2 can see: [The, cat, sat]
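The mask is a lower-triangular matrix applied to the attention scores before softmax. A minimal sketch with random toy scores:

```python
import numpy as np

seq_len = 4
# Causal mask: True where attention is allowed (position i sees j <= i)
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))
masked = np.where(mask, scores, -np.inf)   # future positions get -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # position 0 attends only to itself: [1. 0. 0. 0.]
```

Setting masked scores to -inf makes their softmax weight exactly zero, so no probability mass ever flows from future tokens.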

Loss calculation: Cross-entropy between predicted and actual next token

Generation Process

1. Start with prompt tokens: "Once upon a"
2. Forward pass through model
3. Output: probability distribution over vocabulary
4. Sample next token (greedy, temperature, or nucleus sampling)
5. Append token to sequence
6. Repeat until stop condition

Temperature sampling:
- T=0.1: Nearly deterministic; the highest-probability tokens dominate
- T=1.0: Random sampling from distribution
- T=2.0: Very random, creative but incoherent
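Temperature is just a divisor on the logits before softmax. A toy sketch with an invented four-token vocabulary:

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Sample a token index after temperature-scaling the logits."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # stabilized softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5, 0.1])    # toy vocabulary of 4 tokens
rng = np.random.default_rng(0)
low_t  = [sample_token(logits, 0.1, rng) for _ in range(100)]
high_t = [sample_token(logits, 2.0, rng) for _ in range(100)]
print(len(set(low_t)), len(set(high_t)))   # low T collapses to the top token
```

Dividing by a small T stretches the gaps between logits, concentrating probability on the top token; a large T compresses them toward uniform.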

Diffusion Models for Images

Forward Process: Adding Noise

Original image: x₀
Timestep 1: x₁ = √(1-β₁)x₀ + √β₁·ε₁  (slight noise)
Timestep 2: x₂ = √(1-β₂)x₁ + √β₂·ε₂  (more noise)
...
Timestep T: x_T ≈ pure Gaussian noise

Mathematical property: Can jump to any timestep directly
q(x_t | x_0) = N(x_t; √ᾱₜ x₀, (1-ᾱₜ)I)
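That closed-form jump is easy to verify numerically. A sketch assuming the common linear β schedule (as in DDPM), with a random array standing in for an image:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (a common choice)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # ᾱₜ = product of α₁…αₜ

def q_sample(x0, t, rng):
    """Jump straight to timestep t: x_t = √ᾱₜ·x₀ + √(1-ᾱₜ)·ε."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))         # stand-in for an image
x_late = q_sample(x0, T - 1, rng)    # by t = T-1 this is nearly pure noise
print(alpha_bar[-1])                 # close to zero: the signal is almost gone
```

This is why training is efficient: any timestep can be sampled in one step, with no need to simulate the chain.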

Reverse Process: Denoising

Neural network learns: p_θ(x_{t-1} | x_t)

Input: Noisy image x_t + timestep t
Output: Predicted noise ε_θ(x_t, t)

Update rule (DDPM sampling step):
x_{t-1} = (1/√αₜ) · (x_t - ((1-αₜ)/√(1-ᾱₜ)) · ε_θ(x_t, t)) + σₜ·z
Where z ~ N(0,I) for stochastic sampling

Training Objective

Minimize: E[||ε - ε_θ(x_t, t)||²]

Simple mean squared error between actual and predicted noise

Text-to-Image Conditioning

Cross-attention layers connect text and image:

Text encoder (CLIP/T5) → Text embeddings
                              ↓
U-Net denoising network ← Cross-attention
                              ↓
Generated image

Classifier-Free Guidance (CFG):
- Generate with text conditioning: ε_c
- Generate without conditioning: ε_u
- Final output: ε = ε_u + s·(ε_c - ε_u)

Where s = guidance scale (7.5 typical, higher = more literal)
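The guidance formula itself is one line of arithmetic. A sketch with invented toy noise predictions standing in for the two model passes:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate toward the conditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)

# Toy noise predictions, invented for illustration
eps_u = np.array([0.1, 0.2])
eps_c = np.array([0.3, 0.0])
print(cfg(eps_u, eps_c, 1.0))  # s=1 recovers the conditional prediction
print(cfg(eps_u, eps_c, 7.5))  # typical scale extrapolates well past it
```

With s > 1 the output moves beyond the conditional prediction in the direction away from the unconditional one, which is what makes generations follow the prompt more literally.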

Synthetic Media Detection

Forensic Indicators

Image Artifacts

Diffusion model fingerprints:
- Unnatural hair strands at boundaries
- Inconsistent eye reflections (pupil shape, catchlights)
- Teeth irregularities (extra/missing incisors)
- Ear structure anomalies
- Background texture repetition
- EXIF metadata inconsistencies

Video Analysis

Temporal inconsistencies:
- Irregular blinking patterns
- Lip-sync mismatches (audio-visual desynchronization)
- Inconsistent head pose across frames
- Lighting changes between shots
- Frame-to-frame jitter in static backgrounds

Audio Detection

Voice cloning artifacts:
- Spectrogram discontinuities
- Unnatural breathing patterns
- Prosody inconsistencies (stress, intonation)
- Room impulse response mismatches
- High-frequency noise in silent segments

Detection Architectures

CNN-Based Detectors

Input: Image
↓
Convolutional layers (ResNet/EfficientNet backbone)
↓
Global average pooling
↓
Fully connected classifier
↓
Output: Real / Fake probability

Training: Binary cross-entropy on labeled datasets
(FaceForensics++, Celeb-DF, DeepFakeDetection)

Vision Transformers for Deepfakes

Patch embedding: 16×16 pixel patches → vectors
↓
Transformer encoder (self-attention across patches)
↓
Classification token
↓
MLP head
↓
Real / Fake prediction

Advantage: Captures global inconsistencies better than CNNs

Mathematical Foundations Summary

Component      Core Mathematics                        Purpose
Tokenization   Frequency analysis, string algorithms   Text → discrete units
Embeddings     High-dimensional vector spaces          Semantic representation
Attention      Matrix multiplication, softmax          Contextual relationships
Transformers   Linear algebra, gradient descent        Sequence modeling
Diffusion      Stochastic differential equations       Image generation
Sampling       Probability distributions, temperature  Controlled randomness
Training       Backpropagation, cross-entropy          Parameter optimization

Key Equations

Scaled Dot-Product Attention


Attention(Q, K, V) = softmax(QK^T / √d_k)V

Q, K, V ∈ R^(n×d_k)
QK^T ∈ R^(n×n)  (attention scores)
softmax normalizes each row to a probability distribution

Layer Normalization

LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β

μ = mean across features
σ² = variance across features
γ, β = learned scale and shift parameters
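The equation above, implemented directly in NumPy (γ and β initialized to the identity transform, as they are at the start of training):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector across its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # one token, 4 features
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(), out.std())            # ≈ 0 and ≈ 1 after normalization
```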

Diffusion Training Loss

L = E_{x₀~q(x₀), ε~N(0,I), t~Uniform(1,T)} [||ε - ε_θ(√ᾱₜ·x₀ + √(1-ᾱₜ)·ε, t)||²]
