How the AI Detector Works – 7-Layer Technical Breakdown

System Architecture

The Scascan AI detector is a multi-layer statistical classifier that runs 100% inside the browser using the Canvas 2D API, WebGL, and optionally TensorFlow.js (loaded on-demand, never stored). No model weights or user files are sent to any server at any point.

Each file is routed through a context detection step that classifies it as JPEG, PNG, video, or screenshot before any analysis begins. This determines the weight profile used to combine layer scores — ensuring the most reliable signals are always emphasised for the given format.

File input → Context detection (JPEG / PNG / video / screenshot)

↓ Dynamic weight profile selected

↓ Layers run in parallel (image) or per-frame (video)

↓ Weighted aggregate → AI probability %

↓ Temporal multiplier applied (video only)

↓ Explanation generated → UI rendered

The 7 Detection Layers

Each layer produces a score from 0–100 (100 = almost certainly AI) and a confidence rating of low / medium / high. The layers are independently designed to catch different AI generator signatures.

Metadata Analysis

10% base weight · EXIF · XMP · PNG tEXt

Scans the first 64KB of file bytes for AI software signatures: Stable Diffusion, Midjourney, DALL·E, Grok, ComfyUI, Flux, Kling, Runway, and 20+ others. For JPEG files, checks for the presence of an EXIF APP1 segment, since missing EXIF in a high-resolution JPEG is a significant AI indicator. GPS coordinates and camera make/model are treated as authenticity signals (reduce score). PNG tEXt and iTXt chunks containing parameters, steps, or cfg scale fields are direct Stable Diffusion fingerprints.

Weight auto-reduces to 0% for videos (metadata stripped by codec) and screenshots (no EXIF from screen capture).
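The signature scan described above can be sketched as a plain byte-prefix search. This is a minimal illustration, not the detector's actual code: the function name findAiSignatures, the constant names, and the shortened signature list are assumptions made for the example.

```typescript
// Hypothetical subset of the signature list; the full list is not given in the text.
const AI_SIGNATURES: string[] = ["stable diffusion", "midjourney", "comfyui", "runway"];

// Only the first 64 KB of the file is inspected, as described above.
const SCAN_LIMIT_BYTES = 64 * 1024;

// Returns the signatures found in the scanned prefix (case-insensitive match).
function findAiSignatures(fileBytes: Uint8Array): string[] {
  const prefix = fileBytes.subarray(0, SCAN_LIMIT_BYTES);
  // Map each byte to one character so ASCII metadata strings become searchable.
  const text = Array.from(prefix, (b) => String.fromCharCode(b))
    .join("")
    .toLowerCase();
  return AI_SIGNATURES.filter((sig) => text.indexOf(sig) !== -1);
}
```

A non-empty result would raise the metadata layer score; how many points each match contributes is not stated in the text, so it is left out here.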

Visual Artifact Detection

18% base weight · Canvas pixel analysis

Divides the image into non-overlapping 8×8 pixel patches and computes luminance variance per patch. Diffusion models produce unnaturally smooth mid-frequency regions — patches with variance < 15 are flagged. A smooth ratio above 45% of all patches triggers a high-confidence AI signal.

Also measures RGB inter-channel balance. AI images tend to have artificially equal R/G/B channel distributions (average channel difference < 18) because they synthesise colour globally rather than inheriting sensor noise. A variance-of-variance metric checks whether patch variance is itself too uniform across the image — another diffusion hallmark.
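The smooth-patch ratio can be illustrated with a short sketch. It assumes the caller has already converted the canvas pixels to a flat luminance array; the function name smoothPatchRatio and its parameter names are hypothetical, while the 8-pixel patch size and the variance threshold of 15 come from the description above.

```typescript
// Fraction of non-overlapping patches whose luminance variance is below the threshold.
// `lum` is a row-major luminance array of size width * height (values 0-255).
function smoothPatchRatio(
  lum: Float32Array,
  width: number,
  height: number,
  patch: number = 8,
  varThreshold: number = 15
): number {
  const cols = Math.floor(width / patch);
  const rows = Math.floor(height / patch);
  if (cols === 0 || rows === 0) return 0; // image smaller than one patch

  let smooth = 0;
  const n = patch * patch;
  for (let py = 0; py < rows; py++) {
    for (let px = 0; px < cols; px++) {
      let sum = 0;
      let sumSq = 0;
      for (let y = 0; y < patch; y++) {
        for (let x = 0; x < patch; x++) {
          const v = lum[(py * patch + y) * width + (px * patch + x)];
          sum += v;
          sumSq += v * v;
        }
      }
      const mean = sum / n;
      const variance = sumSq / n - mean * mean; // population variance of the patch
      if (variance < varThreshold) smooth++;
    }
  }
  return smooth / (rows * cols);
}
```

Under the rule stated above, a returned ratio greater than 0.45 would count as a high-confidence AI signal.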

Frequency Domain Analysis (FFT)

8% base weight · GAN fingerprint · reduced from 20%

Applies a row-wise 1D Fast Fourier Transform to a 128×128 centre crop of the image. Computes three ratios: DC energy concentration (very smooth image), periodic peak energy at quarter-frequency (GAN upsampling fingerprint), and high-frequency energy ratio (over-smoothing signal).

This layer's weight has been deliberately reduced from 20% to 8%. GAN-based generators (StyleGAN, BigGAN) leave strong periodic spectral peaks, but modern diffusion models (Stable Diffusion XL, Flux, Grok) do not. Keeping the weight high causes false confidence for diffusion images. It remains useful as a tiebreaker for older GAN-based synthetic media.
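The three spectral ratios can be sketched for a single image row as follows. The text describes an FFT; this example swaps in a direct discrete Fourier transform, which yields the same spectrum but is slower, purely to keep the code short. The function names and the choice of which bins count as "high frequency" (the upper half of the one-sided spectrum) are assumptions.

```typescript
// One-sided power spectrum of a real-valued row, computed with a direct DFT.
function rowPowerSpectrum(row: number[]): number[] {
  const n = row.length;
  const power: number[] = [];
  for (let k = 0; k <= n / 2; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (-2 * Math.PI * k * t) / n;
      re += row[t] * Math.cos(angle);
      im += row[t] * Math.sin(angle);
    }
    power.push(re * re + im * im);
  }
  return power;
}

interface SpectralRatios {
  dc: number;          // share of energy in the zero-frequency bin
  quarterPeak: number; // share of energy in the bin at one quarter of the row length
  highFreq: number;    // share of energy in the upper half of the spectrum (assumption)
}

function spectralRatios(row: number[]): SpectralRatios {
  const power = rowPowerSpectrum(row);
  const total = power.reduce((a, b) => a + b, 0);
  if (total === 0) return { dc: 0, quarterPeak: 0, highFreq: 0 };
  const quarterBin = Math.round(row.length / 4);
  const highStart = Math.floor(power.length / 2);
  const high = power.slice(highStart).reduce((a, b) => a + b, 0);
  return {
    dc: power[0] / total,
    quarterPeak: power[quarterBin] / total,
    highFreq: high / total,
  };
}
```

In a full layer these ratios would be averaged over all 128 rows of the centre crop; the thresholds applied to them are not given in the text.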

ML Feature Analysis

30% base weight · TF.js · tensor statistics · strongest cross-format signal

Loads a 224×224 canvas slice and converts it to a normalised TensorFlow.js tensor in the range [−1, 1]. Computes four statistical signals without any trained classification model:

Global mean — diffusion outputs are over-normalised, producing means very close to 0.0 (|mean| < 0.05 → strong signal)
Inter-channel variance balance — AI images have suspiciously equal R/G/B variances (< 0.015 difference → strong signal)
Kurtosis — diffusion images produce flat pixel distributions (kurtosis < 2.2 is highly indicative; real photos typically show kurtosis > 3.0)
Local contrast entropy — divides the 224×224 image into an 8×8 grid of 28px blocks; low mean contrast (< 60) combined with low block std dev (< 20) indicates artificially uniform local detail

This approach requires no labelled training data and generalises well across generator architectures. TF.js is loaded lazily — only when detection begins — and the tensor is disposed immediately after stats are extracted.
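Two of the four signals, the global mean and the kurtosis, can be illustrated without TensorFlow.js at all. The sketch below uses plain arrays instead of tensors, so it is not the detector's implementation; the names normalisePixels and pixelStats are hypothetical, and kurtosis is taken as the non-excess fourth standardised moment, which matches the "real photos > 3.0" reference point above.

```typescript
// Map 8-bit channel values (0-255) into the range [-1, 1].
function normalisePixels(bytes: number[]): number[] {
  return bytes.map((b) => b / 127.5 - 1);
}

interface PixelStats {
  mean: number;
  kurtosis: number; // fourth standardised moment; a normal distribution gives 3
}

function pixelStats(values: number[]): PixelStats {
  const n = values.length;
  if (n === 0) return { mean: 0, kurtosis: 0 };
  const mean = values.reduce((a, b) => a + b, 0) / n;
  let m2 = 0;
  let m4 = 0;
  for (const v of values) {
    const d = v - mean;
    m2 += d * d;
    m4 += d * d * d * d;
  }
  m2 /= n;
  m4 /= n;
  // A constant input has zero variance; report kurtosis 0 in that case.
  const kurtosis = m2 === 0 ? 0 : m4 / (m2 * m2);
  return { mean, kurtosis };
}

// Thresholds taken from the list above.
function isStrongMeanSignal(stats: PixelStats): boolean {
  return Math.abs(stats.mean) < 0.05;
}
function isFlatDistribution(stats: PixelStats): boolean {
  return stats.kurtosis < 2.2;
}
```

A perfectly flat (uniform) pixel distribution has a kurtosis of about 1.8, which is why values under 2.2 are read as a flat, diffusion-like histogram.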

Compression & Context

10% base weight · DCT · file size ratio · dimension heuristics

For JPEG files: computes bytes-per-pixel ratio (unusually small → over-processed AI image) and scans DCT block boundaries at every 8th row for blocking artifacts. Counterintuitively, very clean DCT boundaries (ratio < 0.02) in a JPEG also indicate AI generation, because diffusion models produce images without the block-boundary errors of real-camera encoding.

For PNG files: checks whether dimensions match common AI output sizes (512, 768, 1024, 1536, 2048px) or are exact multiples of 64, the latent-space tile size of diffusion models. Also checks for exact standard aspect ratios (16:9, 4:3, 1:1, etc.), which are far more common in AI outputs than in real photography.
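The size-based checks reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and requiring both width and height to match is an assumption, since the text does not say whether one matching side is enough.

```typescript
// Common AI output sizes listed in the text.
const COMMON_AI_SIZES: number[] = [512, 768, 1024, 1536, 2048];

interface DimensionSignals {
  commonAiSize: boolean; // both sides are in the common-size list (assumption: both)
  multipleOf64: boolean; // both sides are exact multiples of 64 (assumption: both)
}

function dimensionSignals(width: number, height: number): DimensionSignals {
  return {
    commonAiSize:
      COMMON_AI_SIZES.indexOf(width) !== -1 && COMMON_AI_SIZES.indexOf(height) !== -1,
    multipleOf64: width % 64 === 0 && height % 64 === 0,
  };
}

// Bytes-per-pixel ratio used by the JPEG branch; unusually small values are suspicious.
function bytesPerPixel(fileSizeBytes: number, width: number, height: number): number {
  return fileSizeBytes / (width * height);
}
```

For example, a 1024×768 PNG trips both dimension checks, while a typical 4032×3024 phone photo trips neither.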

Error Level Analysis (ELA)

12% base weight (16% JPEG · 6% PNG · 0% video) · New in v2 · JPEG recompression uniformity

Re-encodes the canvas as JPEG at quality 75 using canvas.toDataURL('image/jpeg', 0.75), reloads the output as an image, and computes the absolute per-pixel difference between original and re-encoded versions.

The key metric is the Coefficient of Variation (CoV) of the error distribution: CoV = std(errors) / mean(errors). Authentic photographs show non-uniform error maps — edges and high-detail areas have higher re-encoding error than smooth regions. AI diffusion images are generated at globally consistent quality, producing unnaturally uniform error maps (CoV < 0.5 = strong signal; CoV < 0.8 = moderate signal).

Additionally, patch-level ELA uniformity is checked: the image is divided into 8×8 coarse patches and the standard deviation of patch-level mean errors is computed. Real photos show high variation between patches; AI images show minimal patch-to-patch error variation.
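The CoV metric and its two thresholds can be written out directly. This sketch assumes the per-pixel absolute errors have already been computed from the original and re-encoded canvases; the function names and the handling of an all-zero error map are assumptions.

```typescript
type ElaSignal = "strong" | "moderate" | "none";

// CoV = std(errors) / mean(errors), using the population standard deviation.
function elaCoefficientOfVariation(errors: number[]): number {
  const n = errors.length;
  if (n === 0) return 0;
  const mean = errors.reduce((a, b) => a + b, 0) / n;
  if (mean === 0) return 0; // assumption: an all-zero error map counts as perfectly uniform
  const variance = errors.reduce((acc, e) => acc + (e - mean) * (e - mean), 0) / n;
  return Math.sqrt(variance) / mean;
}

// Thresholds from the text: CoV < 0.5 is strong, CoV < 0.8 is moderate.
function classifyElaCov(cov: number): ElaSignal {
  if (cov < 0.5) return "strong";
  if (cov < 0.8) return "moderate";
  return "none";
}
```

A uniform error map such as [4, 4, 4, 4] gives a CoV of 0 and is classified as a strong signal, whereas an error map concentrated in a few pixels gives a CoV above 0.8.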

Multi-Scale Patch Self-Similarity

12% base weight · New in v2 · fractal texture test

Computes the average luminance variance across non-overlapping patches at three scales: 4px, 16px, and 32px. In authentic photographs, variance is self-similar across scales — a property arising from the fractal nature of natural scenes. The 16px/4px variance ratio in real photos typically falls between 0.7 and 1.3.

AI diffusion models over-smooth at medium and coarse scales, producing a sharp variance drop-off: 16px/4px ratio < 0.35 contributes 40 score points. A 32px/16px ratio < 0.4 contributes 30 additional points. This layer is effective even after JPEG recompression because the multi-scale structure is a property of image content, not encoding artifacts.
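The point contributions described above can be sketched as a small scoring function. It takes the three mean patch variances as inputs; the function name multiScalePoints and the guard against zero variance are assumptions, and any further contributions the real layer may add are not covered.

```typescript
// v4, v16, v32: mean luminance variance over non-overlapping 4px, 16px and 32px patches.
function multiScalePoints(v4: number, v16: number, v32: number): number {
  let points = 0;
  // Sharp drop from fine to medium scale: 16px/4px ratio below 0.35.
  if (v4 > 0 && v16 / v4 < 0.35) points += 40;
  // Sharp drop from medium to coarse scale: 32px/16px ratio below 0.4.
  if (v16 > 0 && v32 / v16 < 0.4) points += 30;
  return points;
}
```

With variances of 100, 30 and 10 the ratios are 0.30 and 0.33, so both rules fire and the layer accumulates 70 points; with 100, 90 and 95 neither rule fires.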

Dynamic Weight System

Base weights are only a starting point. Before scoring begins, a dynamicWeights(context) function selects the optimal weight profile based on detected file type. This is the key advance in v2 — different formats carry fundamentally different AI fingerprint distributions, and a one-size-fits-all weight table is a common failure mode in AI detectors.

Layer	Default	JPEG	PNG	Video	Screenshot
Metadata	10%	10%	14%	0%	0%
Visual Artifact	18%	16%	16%	25%	20%
FFT	8%	8%	8%	5%	5%
ML Analysis	30%	28%	28%	40%	40%
Compression	10%	10%	12%	5%	0%
ELA (new)	12%	16%	6%	0%	5%
Multi-Scale (new)	12%	12%	16%	25%	30%
Total	100%	100%	100%	100%	100%

Weights for video apply per extracted frame. Metadata analysis on video is run once at the file level and shared across frames.
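The table maps naturally onto a lookup keyed by context. The text names a dynamicWeights(context) function but does not show it, so the body below, the type names, and the layer keys are a sketch under that assumption; the numbers are copied from the table.

```typescript
type FileContext = "default" | "jpeg" | "png" | "video" | "screenshot";
type LayerName =
  | "metadata" | "visual" | "fft" | "ml" | "compression" | "ela" | "multiScale";
type WeightProfile = Record<LayerName, number>;

// Percent weights per context, taken from the table above.
const WEIGHT_PROFILES: Record<FileContext, WeightProfile> = {
  default:    { metadata: 10, visual: 18, fft: 8, ml: 30, compression: 10, ela: 12, multiScale: 12 },
  jpeg:       { metadata: 10, visual: 16, fft: 8, ml: 28, compression: 10, ela: 16, multiScale: 12 },
  png:        { metadata: 14, visual: 16, fft: 8, ml: 28, compression: 12, ela: 6,  multiScale: 16 },
  video:      { metadata: 0,  visual: 25, fft: 5, ml: 40, compression: 5,  ela: 0,  multiScale: 25 },
  screenshot: { metadata: 0,  visual: 20, fft: 5, ml: 40, compression: 0,  ela: 5,  multiScale: 30 },
};

// Select the weight profile for the detected file context.
function dynamicWeights(context: FileContext): WeightProfile {
  return WEIGHT_PROFILES[context];
}
```

Keeping each profile as a complete row makes it easy to verify that every context sums to 100%.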

Scoring Algorithm

The final AI probability is a dynamic weighted average of layer scores:

probability = Σ(layer.score × weight) / Σ(weight)

// clamped to [0, 99] — never displays 100% (no certainty)

Confidence (Low / Moderate / High) is determined by the number of individual layers that returned high confidence: High = 3+ layers, Moderate = 1–2 layers, Low = 0 layers.

Score thresholds for the final verdict:

0–41% — Likely Authentic: Majority of layers return low or no signals. Always combine with other verification methods.

42–71% — Uncertain: Mixed signals detected. Could be AI-generated, heavily post-processed, or recompressed authentic content.

72–99% — Likely AI: Multiple layers agree on strong AI signals. Use additional verification before acting on this result.
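The aggregate, the confidence rule, and the verdict bands fit together as in the sketch below. The type and function names are hypothetical, and rounding the probability to a whole number before applying the bands is an assumption made so that every value falls into exactly one band.

```typescript
type LayerConfidence = "low" | "medium" | "high";

interface LayerResult {
  score: number;  // 0-100, where 100 = almost certainly AI
  weight: number; // weight from the active profile
  confidence: LayerConfidence;
}

// probability = sum(score * weight) / sum(weight), clamped to [0, 99].
function aiProbability(layers: LayerResult[]): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const layer of layers) {
    weighted += layer.score * layer.weight;
    totalWeight += layer.weight;
  }
  if (totalWeight === 0) return 0; // assumption: no active layers means no signal
  const rounded = Math.round(weighted / totalWeight); // rounding is an assumption
  return Math.min(99, Math.max(0, rounded));
}

// High = 3+ high-confidence layers, Moderate = 1-2, Low = 0.
function overallConfidence(layers: LayerResult[]): "Low" | "Moderate" | "High" {
  const highCount = layers.filter((l) => l.confidence === "high").length;
  if (highCount >= 3) return "High";
  return highCount >= 1 ? "Moderate" : "Low";
}

function verdictFor(probability: number): string {
  if (probability >= 72) return "Likely AI";
  if (probability >= 42) return "Uncertain";
  return "Likely Authentic";
}
```

For instance, a layer scoring 80 at weight 30 and a layer scoring 20 at weight 10 combine to (80×30 + 20×10) / 40 = 65, which lands in the Uncertain band.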

Video Temporal Pipeline

Video analysis adds a critical dimension unavailable to image detectors: temporal consistency. Real video has natural variance in exposure, noise grain, motion blur, and scene content across frames. AI video generators produce frames that score too similarly — a statistical signature that survives heavy recompression.

1. Frame Extraction

Up to 20 frames are extracted from the first 90 seconds of the video using HTML5 video.currentTime seek operations. Frames are evenly distributed starting at 0.5s (to skip potential black leader frames). Each frame is drawn to a 320×320 canvas for analysis — sufficient resolution for statistical features while keeping per-frame compute under 100ms.
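The seek timestamps can be computed ahead of the extraction loop. The sketch below shows one plausible reading of "evenly distributed": equal steps from 0.5s up to the end of the analysis window. The function name frameTimestamps and the exact spacing rule are assumptions.

```typescript
// Returns the video.currentTime values to seek to, in seconds.
function frameTimestamps(
  durationSec: number,
  maxFrames: number = 20,
  startSec: number = 0.5,
  windowSec: number = 90
): number[] {
  // Only the first `windowSec` seconds of the video are sampled.
  const endSec = Math.min(durationSec, windowSec);
  if (endSec <= startSec) return []; // clip too short to sample
  const step = (endSec - startSec) / maxFrames;
  const times: number[] = [];
  for (let i = 0; i < maxFrames; i++) {
    times.push(startSec + i * step);
  }
  return times;
}
```

Each returned value would then be assigned to video.currentTime in turn, with the frame drawn to the 320×320 canvas once the seek completes.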

2. Per-Frame Scoring

Visual artifact detection, FFT analysis, and multi-scale patch analysis run on every frame with the video weight profile (metadata=0%, ELA=0%, ML=40%, multi-scale=25%). ML tensor statistics run on the first 4 frames only; the remaining frames reuse the average ML score of those frames to stay within the performance budget.

3. Temporal Multiplier

After per-frame scoring, the standard deviation of all frame scores is computed. The multiplier table:

σ condition	Multiplier	Interpretation
σ < 3%	× 1.4	Extremely consistent — near-certain AI video
σ 3–6%	× 1.2	Low variance — strong AI signal
σ 6–12%	× 1.0	Normal variance — inconclusive
σ > 12%	× 0.9	High variance — suggests authentic content
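The multiplier table translates into a short lookup on the standard deviation of frame scores. In this sketch the function names, the use of the population standard deviation, and the neutral multiplier for fewer than two frames are assumptions; the thresholds come from the table.

```typescript
// Population standard deviation of the per-frame scores (0-100 scale).
function frameScoreStdDev(scores: number[]): number {
  const n = scores.length;
  if (n === 0) return 0;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((acc, s) => acc + (s - mean) * (s - mean), 0) / n;
  return Math.sqrt(variance);
}

function temporalMultiplier(frameScores: number[]): number {
  // Assumption: with fewer than two frames there is no temporal evidence.
  if (frameScores.length < 2) return 1.0;
  const sigma = frameScoreStdDev(frameScores);
  if (sigma < 3) return 1.4;   // extremely consistent
  if (sigma < 6) return 1.2;   // low variance
  if (sigma <= 12) return 1.0; // normal variance
  return 0.9;                  // high variance
}
```

Frame scores of 60 and 70 have σ = 5 and receive × 1.2, while scores of 20 and 80 have σ = 30 and receive × 0.9.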

4. Final Video Score

Layer results are averaged across all frames. The temporal multiplier is applied to all layer scores before the weighted aggregate is computed. The σ value is shown as a badge in the UI. Metadata analysis from the file level is injected into the averaged layer set.

Screenshot Detection Heuristic

Screenshots of AI images are a common evasion vector: the original AI image is screenshotted and shared, stripping EXIF data and adding compression artifacts that confuse detectors. The detectScreenshot() function identifies likely screenshots using four independent signals:

Filename pattern — screenshot, screen_, or img_e prefix
Screen-standard dimensions — width/height matching common device screen resolutions (1920×1080, 2560×1440, 390×844, etc.)
Vignette absence — real camera photos exhibit corner darkening; screenshots have uniform corner luminance (corner/centre diff < 8 lum units)
Noise floor — pixel-to-pixel luminance variance in a sampled interior region; screenshots have near-zero noise floor (< 1.5 avg diff) compared to camera sensor noise

When screenshot mode activates, metadata and compression weights drop to 0%, and ML feature analysis and multi-scale patch weights are boosted to compensate. A blue notice is shown in the result UI.
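The four signals can be combined as in the sketch below. This is not the actual detectScreenshot() implementation: the names are hypothetical, the resolution list is shortened, the vignette and noise measurements are taken as precomputed inputs, and the rule that two signals must agree is purely an assumption, since the text does not say how the signals are combined.

```typescript
// A few screen resolutions from the text; a real list would be longer.
const SCREEN_SIZES: Array<[number, number]> = [[1920, 1080], [2560, 1440], [390, 844]];
const SCREENSHOT_PREFIXES: string[] = ["screenshot", "screen_", "img_e"];

interface ScreenshotInput {
  fileName: string;
  width: number;
  height: number;
  cornerCentreDiff: number; // luminance difference between corners and centre
  noiseFloor: number;       // average pixel-to-pixel luminance difference
}

// Counts how many of the four signals fire for the given file.
function countScreenshotSignals(input: ScreenshotInput): number {
  const lower = input.fileName.toLowerCase();
  const filenameHit = SCREENSHOT_PREFIXES.some((p) => lower.indexOf(p) === 0);
  // Accept either orientation of a known screen size (assumption).
  const sizeHit = SCREEN_SIZES.some(
    ([w, h]) =>
      (input.width === w && input.height === h) ||
      (input.width === h && input.height === w)
  );
  const noVignette = input.cornerCentreDiff < 8;
  const lowNoise = input.noiseFloor < 1.5;
  return [filenameHit, sizeHit, noVignette, lowNoise].filter(Boolean).length;
}

// Assumption: screenshot mode activates when at least `minSignals` signals agree.
function looksLikeScreenshot(input: ScreenshotInput, minSignals: number = 2): boolean {
  return countScreenshotSignals(input) >= minSignals;
}
```

Requiring agreement between signals, rather than acting on any single one, would keep a camera photo that merely happens to be 1920×1080 from being treated as a screenshot.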

Why 100% Accuracy Is Impossible

All AI detection techniques — including this one — operate on statistical distributions, not deterministic rules. Several fundamental challenges prevent 100% accuracy:

⚠ Adversarial evasion

AI model developers actively study and counteract detection methods. Diffusion models already produce images with more natural pixel distributions than early GANs. Each generation of models reduces the statistical gap between AI and real images.

⚠ Distribution overlap

Some authentic photographs share statistical properties with AI images — extremely clean studio photography, digitally processed news photos, and heavily edited social media content can all produce high scores. Conversely, some AI images (especially post-processed or recompressed ones) show scores well below 50%.

⚠ Recompression degradation

Heavy JPEG recompression (common on social media platforms) destroys most spatial artifacts and changes pixel statistics. Our ELA and multi-scale patch layers are designed to survive recompression, but extreme compression (quality < 40) can reduce signal fidelity significantly.

⚠ Novel architectures

A detector trained (explicitly or implicitly) on signals from current generators may yield poor results on new architectures released after its calibration. The system is updated regularly but cannot anticipate future AI models.

⚠ Edge case diversity

Infrared photography, macro photography, satellite imagery, medical scans, and other non-standard authentic image types can produce unusual statistical profiles that elevate scores.

Do not use this tool as sole evidence in any consequential decision. Use it as one signal among several, alongside reverse image search, OSINT source verification, and cross-referencing with established news organisations.

False Positive Prevention

Several design decisions specifically reduce false positives on legitimate authentic content:

✓ Camera fingerprints as negative signals — detected camera make/model, GPS coordinates, and EXIF photometric interpretation are all treated as authenticity evidence and subtract from the metadata layer score.
✓ High kurtosis discount — pixel distributions with kurtosis > 4.5 (highly peaky, typical of natural scene photography) reduce the ML layer score rather than adding to it.
✓ High local contrast variance reward — images with high local contrast variation (lcStd > 50) receive a score reduction in the ML layer, reflecting the diversity of natural scene content.
✓ High ELA mean penalisation — images with very high mean error level (meanErr > 12) receive a score reduction in the ELA layer, consistent with high-detail authentic photography.
✓ Capped layer scores — most layers cap at 90–95%, ensuring that even multiple strong signals from a single layer cannot dominate the final probability.
✓ Max output of 99% — the probability display is clamped to 99%. A score of 100% would imply mathematical certainty, which no classifier can provide.

Browser-Only Architecture & Privacy

Every computation described in this document executes inside your browser tab using standard Web APIs. Here is a technical account of what happens — and what doesn't:

Canvas 2D pixel access: Image bytes are decoded and drawn to an HTMLCanvasElement. getImageData() reads raw pixel values into a Uint8ClampedArray in browser memory. No network call.

TensorFlow.js tensor compute: tf.browser.fromPixels() reads the canvas into an int32 tensor, which is then cast to float32 and normalised. Math operations (mean, moments, split) run on the WebGL GPU backend or CPU fallback. The tensor is disposed via tf.dispose() immediately after stat extraction, so no tensor is retained.

Video seek & frame capture: HTMLVideoElement.currentTime is set to each timestamp. onseeked fires, the frame is drawn to an offscreen canvas, and the object URL is revoked. Video bytes stay in browser memory and are never transmitted.

ELA JPEG re-encoding: canvas.toDataURL('image/jpeg', 0.75) runs locally in the browser's built-in JPEG encoder (for example, libjpeg-turbo via Skia in Chromium). The resulting data URL is never sent anywhere; it is drawn to a second offscreen canvas for pixel comparison.

No analytics on file content: Scascan does not log filenames, file sizes, classification results, or any derived data. Standard page-level analytics (if any) cannot access the contents of your Canvas or TF.js computations by browser security policy.
