How the AI Image & Video Detector Works
A full technical breakdown of Scascan's 7-layer browser-based AI detection engine. No black boxes — every signal, weight, and scoring decision is documented here. Runs entirely client-side: zero bytes transmitted to any server.
System Architecture
The Scascan AI detector is a multi-layer statistical classifier that runs 100% inside the browser using the Canvas 2D API, WebGL, and optionally TensorFlow.js (loaded on-demand, never stored). No model weights or user files are sent to any server at any point.
Each file is routed through a context detection step that classifies it as JPEG, PNG, video, or screenshot before any analysis begins. This determines the weight profile used to combine layer scores — ensuring the most reliable signals are always emphasised for the given format.
The 7 Detection Layers
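Each layer produces a score from 0–100 (100 = almost certainly AI) and a confidence rating of low / medium / high. The layers are designed independently, each targeting a different AI generator signature. They are described below in the same order as the weight table: Metadata, Visual Artifact, FFT, ML Analysis, Compression, ELA, and Multi-Scale.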
Scans the first 64KB of file bytes for AI software signatures: Stable Diffusion, Midjourney, DALL·E, Grok, ComfyUI, Flux, Kling, Runway, and 20+ others. For JPEG files, checks for the presence of an EXIF APP1 segment — missing EXIF in a high-resolution JPEG is a significant AI indicator. GPS coordinates and camera make/model are treated as authenticity signals (reduce score). PNG iCCP and tEXt chunks containing parameters, steps, or cfg scale fields are direct Stable Diffusion fingerprints.
Weight auto-reduces to 0% for videos (metadata stripped by codec) and screenshots (no EXIF from screen capture).
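The metadata checks above can be sketched roughly as follows. This is a minimal illustration rather than the engine's actual code: the function names, the shortened signature list, and the naive byte search for the APP1 marker are all assumptions made for the example.

```typescript
// Hypothetical, shortened signature list; the real list is longer and not given in the text.
const AI_SIGNATURES: readonly string[] = [
  "stable diffusion", "midjourney", "comfyui", "runway", "kling",
];

// Decode the first `limit` bytes as Latin-1 so every byte maps to exactly one character.
function headerText(bytes: Uint8Array, limit: number = 64 * 1024): string {
  const end = Math.min(bytes.length, limit);
  let text = "";
  for (let i = 0; i < end; i++) {
    text += String.fromCharCode(bytes[i]);
  }
  return text.toLowerCase();
}

// Return every signature string that occurs in the file header.
function findAiSignatures(bytes: Uint8Array): string[] {
  const text = headerText(bytes);
  return AI_SIGNATURES.filter((sig) => text.indexOf(sig) !== -1);
}

// Naive APP1 check: marker FF E1, two length bytes, then the ASCII tag "Exif".
// A full parser would walk the JPEG segments instead of scanning byte by byte.
function hasExifApp1(bytes: Uint8Array, limit: number = 64 * 1024): boolean {
  const end = Math.min(bytes.length, limit) - 7;
  for (let i = 0; i < end; i++) {
    if (
      bytes[i] === 0xff && bytes[i + 1] === 0xe1 &&
      bytes[i + 4] === 0x45 && bytes[i + 5] === 0x78 &&
      bytes[i + 6] === 0x69 && bytes[i + 7] === 0x66
    ) {
      return true;
    }
  }
  return false;
}
```

In a browser, the input bytes would typically come from reading the first 64KB of the selected file into a Uint8Array.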
Divides the image into non-overlapping 8×8 pixel patches and computes luminance variance per patch. Diffusion models produce unnaturally smooth mid-frequency regions — patches with variance < 15 are flagged. A smooth ratio above 45% of all patches triggers a high-confidence AI signal.
Also measures RGB inter-channel balance. AI images tend to have artificially equal R/G/B channel distributions (average channel difference < 18) because they synthesise colour globally rather than inheriting sensor noise. A variance-of-variance metric checks whether patch variance is itself too uniform across the image — another diffusion hallmark.
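As an illustration of the patch-variance step, the sketch below computes the share of smooth 8×8 patches from a luminance buffer. The function name and the row-major buffer layout are assumptions for the example; only the patch size and the variance threshold of 15 come from the description above.

```typescript
// Fraction of non-overlapping 8x8 patches whose luminance variance is below `threshold`.
// `luma` holds one luminance value (0-255) per pixel in row-major order.
function smoothPatchRatio(
  luma: ArrayLike<number>,
  width: number,
  height: number,
  threshold: number = 15,
): number {
  const size = 8;
  const n = size * size;
  let smooth = 0;
  let total = 0;
  for (let py = 0; py + size <= height; py += size) {
    for (let px = 0; px + size <= width; px += size) {
      let sum = 0;
      let sumSq = 0;
      for (let y = py; y < py + size; y++) {
        for (let x = px; x < px + size; x++) {
          const v = luma[y * width + x];
          sum += v;
          sumSq += v * v;
        }
      }
      const mean = sum / n;
      const variance = sumSq / n - mean * mean;
      if (variance < threshold) smooth++;
      total++;
    }
  }
  return total === 0 ? 0 : smooth / total;
}
```

A returned ratio above 0.45 would correspond to the high-confidence condition described above.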
Applies a row-wise 1D Fast Fourier Transform to a 128×128 centre crop of the image. Computes three ratios: DC energy concentration (very smooth image), periodic peak energy at quarter-frequency (GAN upsampling fingerprint), and high-frequency energy ratio (over-smoothing signal).
This layer's weight has been deliberately reduced from 20% to 8%. GAN-based generators (StyleGAN, BigGAN) leave strong periodic spectral peaks, but modern diffusion models (Stable Diffusion XL, Flux, Grok) do not. Keeping the weight high causes false confidence for diffusion images. It remains useful as a tiebreaker for older GAN-based synthetic media.
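The three spectral ratios can be illustrated on a single image row. The exact definitions used by the engine are not given here, so the sketch assumes each ratio is a share of total one-sided spectral power, and it uses a plain DFT instead of an FFT to stay short. All names are hypothetical.

```typescript
interface SpectrumRatios {
  dc: number;          // share of power in the zero-frequency bin
  quarterPeak: number; // share of power in the bin at one quarter of the row length
  highFreq: number;    // share of power above the quarter-frequency bin
}

// One-sided power spectrum ratios for one row of luminance values.
function rowSpectrumRatios(row: ArrayLike<number>): SpectrumRatios {
  const n = row.length;
  const half = Math.floor(n / 2);
  const power: number[] = [];
  for (let k = 0; k <= half; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (-2 * Math.PI * k * t) / n;
      re += row[t] * Math.cos(angle);
      im += row[t] * Math.sin(angle);
    }
    power.push(re * re + im * im);
  }
  const total = power.reduce((a, b) => a + b, 0);
  if (total === 0) return { dc: 0, quarterPeak: 0, highFreq: 0 };
  const quarter = Math.floor(n / 4);
  let high = 0;
  for (let k = quarter + 1; k <= half; k++) high += power[k];
  return { dc: power[0] / total, quarterPeak: power[quarter] / total, highFreq: high / total };
}
```

In a row-wise pass over the 128×128 crop, these ratios would be computed per row and then averaged; that averaging step is also an assumption.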
Loads a 224×224 canvas slice and converts it to a normalised TensorFlow.js tensor in the range [−1, 1]. Computes four statistical signals without any trained classification model:
- Global mean — diffusion outputs are over-normalised, producing means very close to 0.0 (|mean| < 0.05 → strong signal)
- Inter-channel variance balance — AI images have suspiciously equal R/G/B variances (< 0.015 difference → strong signal)
- Kurtosis — diffusion images produce flat pixel distributions (kurtosis < 2.2 is highly indicative; real photos typically show kurtosis > 3.0)
- Local contrast entropy — divides the image into an 8×8 grid of 28px blocks; low mean contrast (< 60) combined with low block std dev (< 20) indicates artificially uniform local detail
This approach requires no labelled training data and generalises well across generator architectures. TF.js is loaded lazily — only when detection begins — and the tensor is disposed immediately after stats are extracted.
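The global mean and kurtosis signals can be sketched without TensorFlow.js, using plain arrays. This is an illustration under stated assumptions: the function names are hypothetical, and kurtosis is taken as the non-excess fourth standardised moment (a normal distribution gives 3.0), which matches the thresholds quoted above.

```typescript
interface PixelStats {
  mean: number;
  variance: number;
  kurtosis: number;
}

// Map 0-255 pixel values to the range [-1, 1].
function normalise(pixels: ArrayLike<number>): number[] {
  const out: number[] = [];
  for (let i = 0; i < pixels.length; i++) out.push(pixels[i] / 127.5 - 1);
  return out;
}

// Mean, population variance, and non-excess kurtosis of the normalised values.
function pixelStats(values: ArrayLike<number>): PixelStats {
  const n = values.length;
  if (n === 0) return { mean: 0, variance: 0, kurtosis: 0 };
  let sum = 0;
  for (let i = 0; i < n; i++) sum += values[i];
  const mean = sum / n;
  let m2 = 0;
  let m4 = 0;
  for (let i = 0; i < n; i++) {
    const d = values[i] - mean;
    m2 += d * d;
    m4 += d * d * d * d;
  }
  m2 /= n;
  m4 /= n;
  const kurtosis = m2 === 0 ? 0 : m4 / (m2 * m2);
  return { mean, variance: m2, kurtosis };
}
```

Running pixelStats once per colour channel would also give the per-channel variances needed for the inter-channel balance signal.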
For JPEG files: computes bytes-per-pixel ratio (unusually small → over-processed AI image) and scans DCT block boundaries at every 8th row for blocking artifacts. Counterintuitively, very clean DCT boundaries (ratio < 0.02) in a JPEG also indicate AI generation, because diffusion models produce images without the block-boundary errors of real-camera encoding.
For PNG files: checks whether dimensions match common AI output sizes (512, 768, 1024, 1536, 2048px) or are exact multiples of 64, the latent-space tile size of diffusion models. Also checks for exact standard aspect ratios (16:9, 4:3, 1:1, etc.), which are far more common in AI outputs than in real photography.
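The PNG dimension checks are simple integer tests. The sketch below is a hedged illustration: the names are hypothetical, a size match on either side is assumed to count, both sides are assumed to need to be multiples of 64, and only the three aspect ratios named above are included.

```typescript
interface DimensionSignals {
  commonSize: boolean;
  multipleOf64: boolean;
  standardAspect: boolean;
}

const COMMON_AI_SIZES: readonly number[] = [512, 768, 1024, 1536, 2048];
const STANDARD_ASPECTS: ReadonlyArray<[number, number]> = [[1, 1], [4, 3], [16, 9]];

function dimensionSignals(width: number, height: number): DimensionSignals {
  // Assumption: a match on either side counts as a common AI output size.
  const commonSize =
    COMMON_AI_SIZES.indexOf(width) !== -1 || COMMON_AI_SIZES.indexOf(height) !== -1;
  // Assumption: both sides must be exact multiples of 64.
  const multipleOf64 = width % 64 === 0 && height % 64 === 0;
  // Cross-multiplication avoids floating-point error; both orientations are accepted.
  const standardAspect = STANDARD_ASPECTS.some(
    ([a, b]) => width * b === height * a || width * a === height * b,
  );
  return { commonSize, multipleOf64, standardAspect };
}
```

Note that a 1920×1080 screen capture also matches 16:9 exactly, which is one reason the aspect-ratio check is only a weak signal on its own.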
Re-encodes the canvas as JPEG at quality 75 using canvas.toDataURL('image/jpeg', 0.75), reloads the output as an image, and computes the absolute per-pixel difference between original and re-encoded versions.
The key metric is the Coefficient of Variation (CoV) of the error distribution: CoV = std(errors) / mean(errors). Authentic photographs show non-uniform error maps — edges and high-detail areas have higher re-encoding error than smooth regions. AI diffusion images are generated at globally consistent quality, producing unnaturally uniform error maps (CoV < 0.5 = strong signal; CoV < 0.8 = moderate signal).
Additionally, patch-level ELA uniformity is checked: the image is divided into 8×8 coarse patches and the standard deviation of patch-level mean errors is computed. Real photos show high variation between patches; AI images show minimal patch-to-patch error variation.
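The Coefficient of Variation metric can be sketched as below, operating on two equally sized arrays of pixel values (original and re-encoded). The function names are assumptions; the 0.5 and 0.8 thresholds are the ones quoted above. The browser-side re-encode via canvas.toDataURL is left out so the arithmetic stays self-contained.

```typescript
// CoV = std(errors) / mean(errors), where errors are absolute per-pixel differences.
function errorCoV(original: ArrayLike<number>, reencoded: ArrayLike<number>): number {
  const n = Math.min(original.length, reencoded.length);
  if (n === 0) return 0;
  let sum = 0;
  for (let i = 0; i < n; i++) sum += Math.abs(original[i] - reencoded[i]);
  const mean = sum / n;
  // Degenerate case: identical inputs have no error to analyse.
  if (mean === 0) return 0;
  let sq = 0;
  for (let i = 0; i < n; i++) {
    const d = Math.abs(original[i] - reencoded[i]) - mean;
    sq += d * d;
  }
  return Math.sqrt(sq / n) / mean;
}

type ElaSignal = "strong" | "moderate" | "none";

// Map a CoV value to the signal strengths described in the text.
function elaSignal(cov: number): ElaSignal {
  if (cov < 0.5) return "strong";
  if (cov < 0.8) return "moderate";
  return "none";
}
```

A caller would want to handle the identical-input case separately, since a CoV of 0 from zero error is not evidence of anything.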
Computes the average luminance variance across non-overlapping patches at three scales: 4px, 16px, and 32px. In authentic photographs, variance is self-similar across scales — a property arising from the fractal nature of natural scenes. The 16px/4px variance ratio in real photos typically falls between 0.7 and 1.3.
AI diffusion models over-smooth at medium and coarse scales, producing a sharp variance drop-off: 16px/4px ratio < 0.35 contributes 40 score points. A 32px/16px ratio < 0.4 contributes 30 additional points. This layer is effective even after JPEG recompression because the multi-scale structure is a property of image content, not encoding artifacts.
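The point contributions described above reduce to two ratio tests. How each per-scale variance is computed and normalised is not detailed beyond the description, so this sketch takes the three averages as inputs; the interface and function names are hypothetical.

```typescript
// Average patch variance at each of the three scales (inputs, not computed here).
interface ScaleVariances {
  v4: number;
  v16: number;
  v32: number;
}

// Score points contributed by the two variance ratios.
function multiScalePoints(v: ScaleVariances): number {
  let points = 0;
  if (v.v4 > 0 && v.v16 / v.v4 < 0.35) points += 40;  // 16px/4px drop-off
  if (v.v16 > 0 && v.v32 / v.v16 < 0.4) points += 30; // 32px/16px drop-off
  return points;
}
```

Any further contributions to this layer's score, and its cap, are outside what this sketch covers.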
Dynamic Weight System
Base weights are only a starting point. Before scoring begins, a dynamicWeights(context) function selects the optimal weight profile based on detected file type. This is the key advance in v2 — different formats carry fundamentally different AI fingerprint distributions, and a one-size-fits-all weight table is a common failure mode in AI detectors.
| Layer | Default | JPEG | PNG | Video | Screenshot |
|---|---|---|---|---|---|
| Metadata | 10% | 10% | 14% | 0% | 0% |
| Visual Artifact | 18% | 16% | 16% | 25% | 20% |
| FFT | 8% | 8% | 8% | 5% | 5% |
| ML Analysis | 30% | 28% | 28% | 40% | 40% |
| Compression | 10% | 10% | 12% | 5% | 0% |
| ELA (new) | 12% | 16% | 6% | 0% | 5% |
| Multi-Scale (new) | 12% | 12% | 16% | 25% | 30% |
| Total | 100% | 100% | 100% | 100% | 100% |
Weights for video apply per extracted frame. Metadata analysis on video is run once at the file level and shared across frames.
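The table maps naturally onto a lookup keyed by file context. The sketch below shows one possible shape for dynamicWeights(context); the type names, layer keys, and the plain lookup are assumptions, while the percentages are copied from the table.

```typescript
type FileContext = "default" | "jpeg" | "png" | "video" | "screenshot";

interface WeightProfile {
  metadata: number;
  visualArtifact: number;
  fft: number;
  ml: number;
  compression: number;
  ela: number;
  multiScale: number;
}

// Percent weights per layer, one profile per detected context.
const WEIGHT_PROFILES: Record<FileContext, WeightProfile> = {
  default:    { metadata: 10, visualArtifact: 18, fft: 8, ml: 30, compression: 10, ela: 12, multiScale: 12 },
  jpeg:       { metadata: 10, visualArtifact: 16, fft: 8, ml: 28, compression: 10, ela: 16, multiScale: 12 },
  png:        { metadata: 14, visualArtifact: 16, fft: 8, ml: 28, compression: 12, ela: 6,  multiScale: 16 },
  video:      { metadata: 0,  visualArtifact: 25, fft: 5, ml: 40, compression: 5,  ela: 0,  multiScale: 25 },
  screenshot: { metadata: 0,  visualArtifact: 20, fft: 5, ml: 40, compression: 0,  ela: 5,  multiScale: 30 },
};

function dynamicWeights(context: FileContext): WeightProfile {
  return WEIGHT_PROFILES[context];
}
```

Keeping the profiles in one constant makes it easy to verify that every column still sums to 100 after a weight is retuned.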
Scoring Algorithm
The final AI probability is a dynamic weighted average of layer scores:
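The formula itself is not reproduced in this section. A minimal sketch, assuming a normalised weighted mean over the layers that carry non-zero weight, with the 99% display clamp described under False Positive Prevention, might look like this. The function name and the record-based inputs are hypothetical.

```typescript
// Weighted mean of layer scores (0-100), clamped to a maximum of 99.
// `scores` and `weights` are keyed by layer name; layers with zero weight are skipped.
function aggregateScore(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const layer of Object.keys(weights)) {
    const w = weights[layer];
    const s = scores[layer];
    if (w > 0 && typeof s === "number") {
      weighted += w * s;
      totalWeight += w;
    }
  }
  if (totalWeight === 0) return 0;
  return Math.min(99, weighted / totalWeight);
}
```

Dividing by the sum of the weights actually used keeps the result on the 0–100 scale even if a layer failed to produce a score.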
Confidence (Low / Moderate / High) is determined by the number of individual layers that returned high confidence: High = 3+ layers, Moderate = 1–2 layers, Low = 0 layers.
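The confidence rule is a simple count of high-confidence layers. A short sketch, with hypothetical type and function names:

```typescript
type LayerConfidence = "low" | "medium" | "high";
type OverallConfidence = "Low" | "Moderate" | "High";

// Overall confidence from the per-layer confidence ratings.
function overallConfidence(layers: LayerConfidence[]): OverallConfidence {
  const highCount = layers.filter((c) => c === "high").length;
  if (highCount >= 3) return "High";
  if (highCount >= 1) return "Moderate";
  return "Low";
}
```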
Score thresholds for the final verdict:
Video Temporal Pipeline
Video analysis adds a critical dimension unavailable to image detectors: temporal consistency. Real video has natural variance in exposure, noise grain, motion blur, and scene content across frames. AI video generators produce frames that score too similarly — a statistical signature that survives heavy recompression.
1. Frame Extraction
Up to 20 frames are extracted from the first 90 seconds of the video using HTML5 video.currentTime seek operations. Frames are evenly distributed starting at 0.5s (to skip potential black leader frames). Each frame is drawn to a 320×320 canvas for analysis — sufficient resolution for statistical features while keeping per-frame compute under 100ms.
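The seek positions could be computed as in the sketch below. This is one plausible reading of "evenly distributed", not a confirmed implementation: the function name is hypothetical, and it assumes a fixed number of frames spaced equally between the 0.5s offset and the end of the 90-second window (or the end of the video, if shorter).

```typescript
// Timestamps (in seconds) at which frames would be sampled via video.currentTime.
function frameTimestamps(
  durationSec: number,
  maxFrames: number = 20,
  windowSec: number = 90,
  startSec: number = 0.5,
): number[] {
  const end = Math.min(durationSec, windowSec);
  if (end <= startSec) return [];
  const step = (end - startSec) / maxFrames;
  const times: number[] = [];
  for (let i = 0; i < maxFrames; i++) {
    times.push(startSec + i * step);
  }
  return times;
}
```

Each timestamp would then be assigned to video.currentTime, with the frame drawn to the 320×320 canvas once the seek completes.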
2. Per-Frame Scoring
Visual artifact detection, FFT analysis, multi-scale patch, and ML tensor statistics run on each frame with the video weight profile (metadata=0%, ELA=0%, ML=40%, multi-scale=25%). ML analysis runs on the first 4 frames only; remaining frames use the prior average score to maintain performance budget.
3. Temporal Multiplier
After per-frame scoring, the standard deviation of all frame scores is computed. The multiplier table:
| σ condition | Multiplier | Interpretation |
|---|---|---|
| σ < 3% | × 1.4 | Extremely consistent — near-certain AI video |
| σ < 6% | × 1.2 | Low variance — strong AI signal |
| σ 6–12% | × 1.0 | Normal variance — inconclusive |
| σ > 12% | × 0.9 | High variance — suggests authentic content |
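The table translates directly into a lookup on the standard deviation of the per-frame scores. In this sketch the function names are hypothetical and the population standard deviation is assumed, since the text does not say which form is used.

```typescript
// Population standard deviation of the per-frame scores (0-100).
function frameScoreSigma(scores: number[]): number {
  if (scores.length === 0) return 0;
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) * (s - mean), 0) / scores.length;
  return Math.sqrt(variance);
}

// Multiplier rows from the table, evaluated top to bottom.
function temporalMultiplier(sigma: number): number {
  if (sigma < 3) return 1.4;
  if (sigma < 6) return 1.2;
  if (sigma <= 12) return 1.0;
  return 0.9;
}
```

For example, frame scores of 45 and 55 give σ = 5 and a multiplier of 1.2.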
4. Final Video Score
Layer results are averaged across all frames. The temporal multiplier is applied to all layer scores before the weighted aggregate is computed. The σ value is shown as a badge in the UI. Metadata analysis from the file level is injected into the averaged layer set.
Screenshot Detection Heuristic
Screenshots of AI images are a common evasion vector: the original AI image is screenshotted and shared, stripping EXIF data and adding compression artifacts that confuse detectors. The detectScreenshot() function identifies likely screenshots using four independent signals:
- Filename pattern — screenshot, screen_, or img_e prefix
- Screen-standard dimensions — width/height matching common device screen resolutions (1920×1080, 2560×1440, 390×844, etc.)
- Vignette absence — real camera photos exhibit corner darkening; screenshots have uniform corner luminance (corner/centre diff < 8 lum units)
- Noise floor — pixel-to-pixel luminance variance in a sampled interior region; screenshots have near-zero noise floor (< 1.5 avg diff) compared to camera sensor noise
When screenshot mode activates, metadata and compression weights drop to 0%, and ML feature analysis and multi-scale patch weights are boosted to compensate. A blue notice is shown in the result UI.
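The four signals can be evaluated independently, as in the sketch below. It is illustrative only: the function name, the three-entry resolution list, and the decision to return the raw signals are assumptions, because the text does not state how detectScreenshot() combines them. The vignette difference and noise floor are taken as pre-computed inputs.

```typescript
interface ScreenshotSignals {
  filename: boolean;
  dimensions: boolean;
  noVignette: boolean;
  lowNoise: boolean;
}

// Hypothetical, shortened list; only the resolutions named in the text are included.
const SCREEN_SIZES: ReadonlyArray<[number, number]> = [
  [1920, 1080], [2560, 1440], [390, 844],
];

function screenshotSignals(
  name: string,
  width: number,
  height: number,
  cornerCentreDiff: number, // corner vs centre luminance difference
  noiseFloor: number,       // average pixel-to-pixel luminance difference
): ScreenshotSignals {
  const lower = name.toLowerCase();
  const filename = ["screenshot", "screen_", "img_e"].some((p) => lower.indexOf(p) === 0);
  // Either orientation of a known screen resolution counts as a match.
  const dimensions = SCREEN_SIZES.some(
    ([w, h]) => (w === width && h === height) || (w === height && h === width),
  );
  return {
    filename,
    dimensions,
    noVignette: Math.abs(cornerCentreDiff) < 8,
    lowNoise: noiseFloor < 1.5,
  };
}
```

A caller could then require some minimum number of true signals before switching to the screenshot weight profile; that threshold is not specified here.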
Why 100% Accuracy Is Impossible
All AI detection techniques — including this one — operate on statistical distributions, not deterministic rules. Several fundamental challenges prevent 100% accuracy:
⚠ Adversarial evasion
AI model developers actively study and counteract detection methods. Diffusion models already produce images with more natural pixel distributions than early GANs. Each generation of models reduces the statistical gap between AI and real images.
⚠ Distribution overlap
Some authentic photographs share statistical properties with AI images — extremely clean studio photography, digitally processed news photos, and heavily edited social media content can all produce high scores. Conversely, some AI images (especially post-processed or recompressed ones) show scores well below 50%.
⚠ Recompression degradation
Heavy JPEG recompression (common on social media platforms) destroys most spatial artifacts and changes pixel statistics. Our ELA and multi-scale patch layers are designed to survive recompression, but extreme compression (quality < 40) can reduce signal fidelity significantly.
⚠ Novel architectures
A detector trained (explicitly or implicitly) on signals from current generators may yield poor results on new architectures released after its calibration. The system is updated regularly but cannot anticipate future AI models.
⚠ Edge case diversity
Infrared photography, macro photography, satellite imagery, medical scans, and other non-standard authentic image types can produce unusual statistical profiles that elevate scores.
Do not use this tool as sole evidence in any consequential decision. Use it as one signal among several, alongside reverse image search, OSINT source verification, and cross-referencing with established news organisations.
False Positive Prevention
Several design decisions specifically reduce false positives on legitimate authentic content:
- ✓ Camera fingerprints as negative signals — detected camera make/model, GPS coordinates, and EXIF photometric interpretation are all treated as authenticity evidence and subtract from the metadata layer score.
- ✓ High kurtosis discount — pixel distributions with kurtosis > 4.5 (highly peaky, typical of natural scene photography) reduce the ML layer score rather than adding to it.
- ✓ High local contrast variance reward — images with high local contrast variation (lcStd > 50) receive a score reduction in the ML layer, reflecting the diversity of natural scene content.
- ✓ High ELA mean penalisation — images with very high mean error level (meanErr > 12) receive a score reduction in the ELA layer, consistent with high-detail authentic photography.
- ✓ Capped layer scores — most layers cap at 90–95%, ensuring that even multiple strong signals from a single layer cannot dominate the final probability.
- ✓ Max output of 99% — the probability display is clamped to 99%. A score of 100% would imply mathematical certainty, which no classifier can provide.
Browser-Only Architecture & Privacy
Every computation described in this document executes inside your browser tab using standard Web APIs. Here is a technical account of what happens — and what doesn't: