Computer Vision in Web Applications: TensorFlow.js, Image Recognition & OCR

Computer vision has transformed from a research curiosity into a practical tool that powers some of the most innovative web applications today. From virtual try-on features in e-commerce to document scanning apps that extract text from images, visual AI is becoming an essential capability for modern web developers. According to MarketsandMarkets, the computer vision market is projected to reach $41.11 billion by 2030, with web and mobile applications driving significant growth.

In this comprehensive guide, we'll explore how to implement visual AI features directly in web browsers using TensorFlow.js and MediaPipe. You'll learn to build image recognition systems, real-time object detection, OCR capabilities, and visual search features that run entirely client-side or integrate with cloud vision APIs for production-grade applications.

Understanding Computer Vision in the Browser

Browser-based computer vision has become viable thanks to WebGL acceleration and optimized machine learning frameworks. Running vision models client-side offers significant advantages: privacy preservation (images never leave the device), reduced latency, offline capability, and lower server costs. However, it also brings constraints around model size, memory usage, and processing power that require careful optimization.
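Which acceleration path a device actually gets varies widely, so it helps to separate the selection policy from feature detection. A minimal sketch (the capability flags would come from real probes such as checking `navigator.gpu` or creating a trial WebGL2 context; the type and function names here are illustrative):

```typescript
// Ranked choice of acceleration backend for in-browser inference.
type Acceleration = 'webgpu' | 'webgl' | 'wasm' | 'cpu';

interface BrowserCapabilities {
    webgpu: boolean;   // navigator.gpu is available
    webgl2: boolean;   // a WebGL2 context could be created
    wasmSimd: boolean; // WebAssembly SIMD validated
}

function pickAcceleration(caps: BrowserCapabilities): Acceleration {
    if (caps.webgpu) return 'webgpu'; // fastest where supported
    if (caps.webgl2) return 'webgl';  // broadly available GPU path
    if (caps.wasmSimd) return 'wasm'; // SIMD-accelerated CPU fallback
    return 'cpu';                     // always works, slowest
}
```

The TensorFlow.js setup class later in this guide applies the same ranking idea when trying backends in order of preference.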

Architecture Options

When implementing computer vision features, you have three main architectural approaches:

// computer-vision-architecture.ts

// 1. Client-Side Only - Privacy-first, works offline
interface ClientSideVision {
    framework: 'tensorflow.js' | 'mediapipe' | 'onnx-runtime';
    model: 'pre-trained' | 'custom-converted';
    acceleration: 'webgl' | 'wasm' | 'webgpu';
    pros: ['privacy', 'low-latency', 'offline'];
    cons: ['model-size-limits', 'device-dependent-performance'];
}

// 2. Server-Side Processing - Maximum accuracy
interface ServerSideVision {
    provider: 'google-cloud-vision' | 'aws-rekognition' | 'azure-cognitive';
    deployment: 'api' | 'custom-model-inference';
    pros: ['high-accuracy', 'large-models', 'consistent-performance'];
    cons: ['latency', 'privacy-concerns', 'costs', 'requires-internet'];
}

// 3. Hybrid Approach - Best of both worlds
interface HybridVision {
    clientTasks: string[]; // Quick, privacy-sensitive tasks
    serverTasks: string[]; // Complex, accuracy-critical tasks
    fallback: 'client' | 'server';
}

// Example hybrid configuration
const hybridConfig: HybridVision = {
    clientTasks: [
        'face-detection',      // Real-time preview
        'image-preprocessing', // Crop, resize before upload
        'basic-classification' // Quick categorization
    ],
    serverTasks: [
        'document-ocr',        // Complex text extraction
        'product-recognition', // Large catalog matching
        'content-moderation'   // Policy compliance
    ],
    fallback: 'server'
};
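A config like this needs a small dispatcher at runtime. A sketch only (the interface is re-declared so the snippet stands alone, and the routing rule is a plain lookup with fallback; task names match the example above):

```typescript
interface HybridVision {
    clientTasks: string[];
    serverTasks: string[];
    fallback: 'client' | 'server';
}

// Decide where a vision task should run; tasks not present
// in either list go to the configured fallback.
function routeTask(config: HybridVision, task: string): 'client' | 'server' {
    if (config.clientTasks.includes(task)) return 'client';
    if (config.serverTasks.includes(task)) return 'server';
    return config.fallback;
}
```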

Setting Up TensorFlow.js for Computer Vision

TensorFlow.js provides the most comprehensive framework for running ML models in the browser. Here's a production-ready setup with performance optimization:

// tensorflow-setup.ts
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgl';
import '@tensorflow/tfjs-backend-wasm';
import { setWasmPaths } from '@tensorflow/tfjs-backend-wasm';

interface TFSetupConfig {
    preferredBackend: 'webgl' | 'wasm' | 'cpu';
    wasmPath?: string;
    enableProfiling?: boolean;
}

class TensorFlowSetup {
    private initialized: boolean = false;
    private currentBackend: string = '';

    async initialize(config: TFSetupConfig): Promise<void> {
        if (this.initialized) return;

        // Configure WASM paths if using WASM backend
        if (config.wasmPath) {
            setWasmPaths(config.wasmPath);
        }

        // Try backends in order of preference
        const backends = this.getBackendOrder(config.preferredBackend);

        for (const backend of backends) {
            try {
                await tf.setBackend(backend);
                await tf.ready();
                this.currentBackend = backend;
                console.log(`TensorFlow.js initialized with ${backend} backend`);
                break;
            } catch (error) {
                console.warn(`Failed to initialize ${backend} backend:`, error);
            }
        }

        // Skip debug checks for better performance unless profiling is needed
        if (!config.enableProfiling) {
            tf.enableProdMode();
        }

        // Memory management settings
        tf.env().set('WEBGL_DELETE_TEXTURE_THRESHOLD', 0);
        tf.env().set('WEBGL_FORCE_F16_TEXTURES', true); // Use FP16 for memory savings

        this.initialized = true;
    }

    private getBackendOrder(preferred: string): string[] {
        const order = [preferred];
        const allBackends = ['webgl', 'wasm', 'cpu'];
        allBackends.forEach(b => {
            if (!order.includes(b)) order.push(b);
        });
        return order;
    }

    getBackendInfo(): { backend: string; features: string[] } {
        const features: string[] = [];

        if (this.currentBackend === 'webgl') {
            const gl = document.createElement('canvas').getContext('webgl2');
            if (gl) {
                features.push('WebGL 2.0');
                const ext = gl.getExtension('WEBGL_debug_renderer_info');
                if (ext) {
                    features.push(gl.getParameter(ext.UNMASKED_RENDERER_WEBGL));
                }
            }
        }

        return { backend: this.currentBackend, features };
    }

    // Memory-safe tensor operations. Note that tf.tidy cannot track tensors
    // created in async callbacks, so the callback must be synchronous.
    runInference<T extends tf.TensorContainer>(fn: () => T): T {
        return tf.tidy(fn);
    }

    // Get memory usage
    getMemoryInfo(): tf.MemoryInfo {
        return tf.memory();
    }

    // Clean up tensors
    dispose(): void {
        tf.disposeVariables();
    }
}

export const tfSetup = new TensorFlowSetup();

Image Classification with Pre-trained Models

Image classification is the foundation of many computer vision features. Using MobileNet, you can classify images into 1000 categories with excellent performance on mobile devices:

// image-classifier.ts
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';

interface ClassificationResult {
    className: string;
    probability: number;
}

interface ClassifierConfig {
    version?: 1 | 2;
    alpha?: 0.25 | 0.50 | 0.75 | 1.0;
    topK?: number;
}

class ImageClassifier {
    private model: mobilenet.MobileNet | null = null;
    private config: ClassifierConfig;

    constructor(config: ClassifierConfig = {}) {
        this.config = {
            version: config.version || 2,
            alpha: config.alpha || 1.0, // Accuracy vs speed tradeoff
            topK: config.topK || 5
        };
    }

    async load(): Promise<void> {
        console.log('Loading MobileNet model...');
        const startTime = performance.now();

        this.model = await mobilenet.load({
            version: this.config.version,
            alpha: this.config.alpha
        });

        console.log(`Model loaded in ${(performance.now() - startTime).toFixed(0)}ms`);
    }

    async classify(
        input: HTMLImageElement | HTMLVideoElement | HTMLCanvasElement | ImageData
    ): Promise<ClassificationResult[]> {
        if (!this.model) {
            throw new Error('Model not loaded. Call load() first.');
        }

        const predictions = await this.model.classify(input, this.config.topK);

        return predictions.map(p => ({
            className: p.className,
            probability: p.probability
        }));
    }

    // Get intermediate features for custom classification
    async getFeatures(
        input: HTMLImageElement | HTMLVideoElement | HTMLCanvasElement
    ): Promise<tf.Tensor> {
        if (!this.model) {
            throw new Error('Model not loaded');
        }

        // Get the embedding from the model's internal layer
        const embedding = this.model.infer(input, true);
        return embedding;
    }

    // Custom classification with your own labels
    async classifyWithCustomLabels(
        input: HTMLImageElement,
        referenceEmbeddings: Map<string, tf.Tensor>,
        threshold: number = 0.7
    ): Promise<{ label: string; similarity: number } | null> {
        const inputEmbedding = await this.getFeatures(input);

        let bestMatch: { label: string; similarity: number } | null = null;

        for (const [label, refEmbedding] of referenceEmbeddings) {
            const similarity = tf.tidy(() => {
                // Cosine similarity
                const dotProduct = tf.sum(tf.mul(inputEmbedding, refEmbedding));
                const normA = tf.norm(inputEmbedding);
                const normB = tf.norm(refEmbedding);
                return dotProduct.div(normA.mul(normB)).dataSync()[0];
            });

            if (similarity > threshold && (!bestMatch || similarity > bestMatch.similarity)) {
                bestMatch = { label, similarity };
            }
        }

        inputEmbedding.dispose();
        return bestMatch;
    }

    dispose(): void {
        if (this.model) {
            // MobileNet doesn't have explicit dispose, but we clear the reference
            this.model = null;
        }
    }
}

// React hook for image classification
import { useState, useEffect, useCallback } from 'react';

export function useImageClassifier(config?: ClassifierConfig) {
    const [classifier, setClassifier] = useState<ImageClassifier | null>(null);
    const [isLoading, setIsLoading] = useState(true);
    const [error, setError] = useState<Error | null>(null);

    useEffect(() => {
        let clf: ImageClassifier | null = null;
        const loadClassifier = async () => {
            try {
                clf = new ImageClassifier(config);
                await clf.load();
                setClassifier(clf);
            } catch (err) {
                setError(err as Error);
            } finally {
                setIsLoading(false);
            }
        };
        loadClassifier();

        // Dispose the instance this effect created; the `classifier` state
        // value would still be null inside this closure
        return () => clf?.dispose();
    }, []);

    const classify = useCallback(async (
        input: HTMLImageElement | HTMLVideoElement | HTMLCanvasElement
    ) => {
        if (!classifier) return [];
        return classifier.classify(input);
    }, [classifier]);

    return { classify, isLoading, error };
}

Real-Time Object Detection with COCO-SSD

Object detection identifies and localizes multiple objects within an image. The COCO-SSD model can detect 80 different object categories with bounding boxes:

// object-detector.ts
import * as cocoSsd from '@tensorflow-models/coco-ssd';

interface DetectedObject {
    class: string;
    score: number;
    bbox: {
        x: number;
        y: number;
        width: number;
        height: number;
    };
}

interface DetectionConfig {
    base?: 'mobilenet_v1' | 'mobilenet_v2' | 'lite_mobilenet_v2';
    maxDetections?: number;
    scoreThreshold?: number;
}

class ObjectDetector {
    private model: cocoSsd.ObjectDetection | null = null;
    private config: DetectionConfig;
    private isProcessing: boolean = false;

    constructor(config: DetectionConfig = {}) {
        this.config = {
            base: config.base || 'lite_mobilenet_v2', // Fastest option
            maxDetections: config.maxDetections || 20,
            scoreThreshold: config.scoreThreshold || 0.5
        };
    }

    async load(): Promise<void> {
        this.model = await cocoSsd.load({
            base: this.config.base
        });
    }

    async detect(
        input: HTMLImageElement | HTMLVideoElement | HTMLCanvasElement
    ): Promise<DetectedObject[]> {
        if (!this.model || this.isProcessing) return [];

        this.isProcessing = true;

        try {
            const predictions = await this.model.detect(
                input,
                this.config.maxDetections
            );

            return predictions
                .filter(p => p.score >= this.config.scoreThreshold!)
                .map(p => ({
                    class: p.class,
                    score: p.score,
                    bbox: {
                        x: p.bbox[0],
                        y: p.bbox[1],
                        width: p.bbox[2],
                        height: p.bbox[3]
                    }
                }));
        } finally {
            this.isProcessing = false;
        }
    }

    // Real-time video detection with frame rate control
    async detectFromVideo(
        video: HTMLVideoElement,
        onDetection: (objects: DetectedObject[]) => void,
        targetFPS: number = 30
    ): Promise<() => void> {
        let animationId: number;
        let lastTime = 0;
        const interval = 1000 / targetFPS;

        const detectFrame = async (currentTime: number) => {
            if (currentTime - lastTime >= interval) {
                lastTime = currentTime;
                const objects = await this.detect(video);
                onDetection(objects);
            }
            animationId = requestAnimationFrame(detectFrame);
        };

        animationId = requestAnimationFrame(detectFrame);

        // Return cleanup function
        return () => {
            cancelAnimationFrame(animationId);
        };
    }
}

// Drawing utilities for detection visualization
class DetectionVisualizer {
    private ctx: CanvasRenderingContext2D;
    private colors: Map<string, string> = new Map();

    constructor(canvas: HTMLCanvasElement) {
        this.ctx = canvas.getContext('2d')!;
    }

    private getColor(className: string): string {
        if (!this.colors.has(className)) {
            // Generate consistent color based on class name
            const hash = className.split('').reduce(
                (acc, char) => char.charCodeAt(0) + ((acc << 5) - acc), 0
            );
            const hue = Math.abs(hash) % 360;
            this.colors.set(className, `hsl(${hue}, 70%, 50%)`);
        }
        return this.colors.get(className)!;
    }

    draw(objects: DetectedObject[], sourceWidth: number, sourceHeight: number): void {
        const scaleX = this.ctx.canvas.width / sourceWidth;
        const scaleY = this.ctx.canvas.height / sourceHeight;

        objects.forEach(obj => {
            const color = this.getColor(obj.class);
            const x = obj.bbox.x * scaleX;
            const y = obj.bbox.y * scaleY;
            const width = obj.bbox.width * scaleX;
            const height = obj.bbox.height * scaleY;

            // Draw bounding box
            this.ctx.strokeStyle = color;
            this.ctx.lineWidth = 3;
            this.ctx.strokeRect(x, y, width, height);

            // Draw label background
            const label = `${obj.class} ${(obj.score * 100).toFixed(0)}%`;
            this.ctx.font = 'bold 14px Inter, sans-serif';
            const textWidth = this.ctx.measureText(label).width;

            this.ctx.fillStyle = color;
            this.ctx.fillRect(x, y - 24, textWidth + 12, 24);

            // Draw label text
            this.ctx.fillStyle = 'white';
            this.ctx.fillText(label, x + 6, y - 7);
        });
    }

    clear(): void {
        this.ctx.clearRect(0, 0, this.ctx.canvas.width, this.ctx.canvas.height);
    }
}

// React component for real-time object detection
import { useState, useEffect, useRef } from 'react';

function ObjectDetectionCamera() {
    const videoRef = useRef<HTMLVideoElement>(null);
    const canvasRef = useRef<HTMLCanvasElement>(null);
    const [detector, setDetector] = useState<ObjectDetector | null>(null);
    const [objects, setObjects] = useState<DetectedObject[]>([]);

    useEffect(() => {
        const initDetector = async () => {
            const det = new ObjectDetector({ scoreThreshold: 0.6 });
            await det.load();
            setDetector(det);
        };
        initDetector();
    }, []);

    useEffect(() => {
        if (!detector || !videoRef.current) return;

        let cleanup: (() => void) | null = null;

        const startDetection = async () => {
            const stream = await navigator.mediaDevices.getUserMedia({
                video: { facingMode: 'environment', width: 640, height: 480 }
            });
            videoRef.current!.srcObject = stream;
            await videoRef.current!.play();

            cleanup = await detector.detectFromVideo(
                videoRef.current!,
                setObjects,
                15 // 15 FPS for detection
            );
        };
        startDetection();

        return () => {
            cleanup?.();
            const stream = videoRef.current?.srcObject as MediaStream;
            stream?.getTracks().forEach(t => t.stop());
        };
    }, [detector]);

    useEffect(() => {
        if (!canvasRef.current || !videoRef.current) return;
        const { videoWidth, videoHeight } = videoRef.current;
        if (!videoWidth || !videoHeight) return; // video metadata not loaded yet
        const visualizer = new DetectionVisualizer(canvasRef.current);
        visualizer.clear();
        visualizer.draw(objects, videoWidth, videoHeight);
    }, [objects]);

    return (
        <div className="detection-container">
            <video ref={videoRef} style={{ display: 'none' }} />
            <canvas ref={canvasRef} width={640} height={480} />
            <div className="detection-stats">
                Detected: {objects.length} objects
            </div>
        </div>
    );
}

Face Detection and Facial Landmarks with MediaPipe

MediaPipe provides highly optimized solutions for face detection, hand tracking, and pose estimation. Here's how to implement face detection with facial landmarks for features like virtual try-on or face filters:

// face-detection.ts
import { FaceDetector, FilesetResolver } from '@mediapipe/tasks-vision';

interface FaceDetectionResult {
    boundingBox: {
        x: number;
        y: number;
        width: number;
        height: number;
    };
    keypoints: Array<{
        x: number;
        y: number;
        name: string;
    }>;
    confidence: number;
}

class MediaPipeFaceDetector {
    private detector: FaceDetector | null = null;

    async initialize(): Promise<void> {
        const vision = await FilesetResolver.forVisionTasks(
            'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm'
        );

        this.detector = await FaceDetector.createFromOptions(vision, {
            baseOptions: {
                modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/face_detector/blaze_face_short_range/float16/1/blaze_face_short_range.tflite',
                delegate: 'GPU'
            },
            runningMode: 'VIDEO',
            minDetectionConfidence: 0.5
        });
    }

    detect(
        video: HTMLVideoElement,
        timestampMs: number
    ): FaceDetectionResult[] {
        if (!this.detector) {
            throw new Error('Detector not initialized');
        }

        // The detector was created with runningMode 'VIDEO', so every call
        // needs a frame timestamp; to process still images, create a separate
        // detector with runningMode 'IMAGE' and use detect() instead
        const results = this.detector.detectForVideo(video, timestampMs);

        return results.detections.map(detection => ({
            boundingBox: {
                x: detection.boundingBox!.originX,
                y: detection.boundingBox!.originY,
                width: detection.boundingBox!.width,
                height: detection.boundingBox!.height
            },
            keypoints: detection.keypoints?.map(kp => ({
                x: kp.x,
                y: kp.y,
                name: kp.name || 'unknown'
            })) || [],
            confidence: detection.categories?.[0]?.score || 0
        }));
    }

    close(): void {
        this.detector?.close();
    }
}

// Face mesh for detailed facial landmarks (468 points)
import { FaceLandmarker, FaceLandmarkerResult } from '@mediapipe/tasks-vision';

class FaceMeshDetector {
    private landmarker: FaceLandmarker | null = null;

    async initialize(): Promise<void> {
        const vision = await FilesetResolver.forVisionTasks(
            'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm'
        );

        this.landmarker = await FaceLandmarker.createFromOptions(vision, {
            baseOptions: {
                modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task',
                delegate: 'GPU'
            },
            runningMode: 'VIDEO',
            numFaces: 1,
            minFaceDetectionConfidence: 0.5,
            minTrackingConfidence: 0.5,
            outputFaceBlendshapes: true, // For expression detection
            outputFacialTransformationMatrixes: true // For 3D positioning
        });
    }

    detectLandmarks(
        video: HTMLVideoElement,
        timestampMs: number
    ): FaceLandmarkerResult {
        if (!this.landmarker) {
            throw new Error('Landmarker not initialized');
        }
        return this.landmarker.detectForVideo(video, timestampMs);
    }

    // Get facial expressions from blend shapes
    getExpressions(result: FaceLandmarkerResult): Map<string, number> {
        const expressions = new Map<string, number>();

        if (result.faceBlendshapes && result.faceBlendshapes.length > 0) {
            const blendshapes = result.faceBlendshapes[0].categories;

            // Map common expressions
            const expressionMap: Record<string, string[]> = {
                'smile': ['mouthSmileLeft', 'mouthSmileRight'],
                'frown': ['mouthFrownLeft', 'mouthFrownRight'],
                'surprise': ['browInnerUp', 'jawOpen'],
                'eyesClosed': ['eyeBlinkLeft', 'eyeBlinkRight']
            };

            for (const [expr, shapeNames] of Object.entries(expressionMap)) {
                const values = shapeNames
                    .map(name => blendshapes.find(b => b.categoryName === name)?.score || 0);
                expressions.set(expr, values.reduce((a, b) => a + b, 0) / values.length);
            }
        }

        return expressions;
    }

    close(): void {
        this.landmarker?.close();
    }
}
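The averaged scores from getExpressions range from 0 to 1, so driving UI from them is a thresholding exercise. A minimal picker (a sketch; the 0.5 default is an arbitrary cutoff to tune per application):

```typescript
// Return the strongest expression whose score exceeds the threshold,
// or null when nothing is confident enough.
function dominantExpression(
    expressions: Map<string, number>,
    threshold: number = 0.5
): string | null {
    let best: string | null = null;
    let bestScore = threshold;
    for (const [name, score] of expressions) {
        if (score > bestScore) {
            best = name;
            bestScore = score;
        }
    }
    return best;
}
```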

// Virtual try-on component example
function VirtualTryOn({ overlayImage }: { overlayImage: string }) {
    const videoRef = useRef<HTMLVideoElement>(null);
    const canvasRef = useRef<HTMLCanvasElement>(null);
    const [detector] = useState(() => new FaceMeshDetector());

    useEffect(() => {
        let animationId: number;

        const init = async () => {
            await detector.initialize();

            const stream = await navigator.mediaDevices.getUserMedia({
                video: { width: 640, height: 480, facingMode: 'user' }
            });
            videoRef.current!.srcObject = stream;
            await videoRef.current!.play();

            const overlay = new Image();
            overlay.src = overlayImage;
            await overlay.decode();

            const ctx = canvasRef.current!.getContext('2d')!;

            const processFrame = () => {
                const result = detector.detectLandmarks(
                    videoRef.current!,
                    performance.now()
                );

                // Draw video frame
                ctx.drawImage(videoRef.current!, 0, 0);

                if (result.faceLandmarks && result.faceLandmarks.length > 0) {
                    const landmarks = result.faceLandmarks[0];

                    // Get face bounding landmarks for positioning overlay
                    const leftEye = landmarks[33];
                    const rightEye = landmarks[263];
                    const nose = landmarks[1];

                    // Calculate overlay position and size
                    const eyeDistance = Math.sqrt(
                        Math.pow((rightEye.x - leftEye.x) * 640, 2) +
                        Math.pow((rightEye.y - leftEye.y) * 480, 2)
                    );

                    const overlayWidth = eyeDistance * 2.5;
                    const overlayHeight = overlayWidth * (overlay.height / overlay.width);
                    const overlayX = nose.x * 640 - overlayWidth / 2;
                    const overlayY = nose.y * 480 - overlayHeight * 0.6;

                    ctx.drawImage(overlay, overlayX, overlayY, overlayWidth, overlayHeight);
                }

                animationId = requestAnimationFrame(processFrame);
            };

            processFrame();
        };

        init();

        return () => {
            cancelAnimationFrame(animationId);
            detector.close();
        };
    }, [overlayImage]);

    return (
        <div className="try-on-container">
            <video ref={videoRef} style={{ display: 'none' }} />
            <canvas ref={canvasRef} width={640} height={480} />
        </div>
    );
}

OCR (Optical Character Recognition) Implementation

OCR extracts text from images, enabling features like document scanning, receipt processing, and ID verification. Tesseract.js provides browser-based OCR, while cloud APIs offer higher accuracy for production use:

// ocr-service.ts
import Tesseract from 'tesseract.js';

interface OCRResult {
    text: string;
    confidence: number;
    blocks: TextBlock[];
    processingTime: number;
}

interface TextBlock {
    text: string;
    confidence: number;
    bbox: { x: number; y: number; width: number; height: number };
    words: Word[];
}

interface Word {
    text: string;
    confidence: number;
    bbox: { x: number; y: number; width: number; height: number };
}

class BrowserOCR {
    private worker: Tesseract.Worker | null = null;
    private languages: string[];

    constructor(languages: string[] = ['eng']) {
        this.languages = languages;
    }

    async initialize(): Promise<void> {
        this.worker = await Tesseract.createWorker(this.languages, 1, {
            logger: m => {
                if (m.status === 'recognizing text') {
                    console.log(`OCR Progress: ${(m.progress * 100).toFixed(0)}%`);
                }
            }
        });

        // Configure for better accuracy
        await this.worker.setParameters({
            tessedit_pageseg_mode: Tesseract.PSM.AUTO, // Automatic page segmentation
            preserve_interword_spaces: '1'
        });
    }

    async recognize(
        image: HTMLImageElement | HTMLCanvasElement | File | Blob | string
    ): Promise<OCRResult> {
        if (!this.worker) {
            throw new Error('OCR worker not initialized');
        }

        const startTime = performance.now();
        const result = await this.worker.recognize(image);
        const processingTime = performance.now() - startTime;

        return {
            text: result.data.text.trim(),
            confidence: result.data.confidence,
            blocks: result.data.blocks?.map(block => ({
                text: block.text,
                confidence: block.confidence,
                bbox: {
                    x: block.bbox.x0,
                    y: block.bbox.y0,
                    width: block.bbox.x1 - block.bbox.x0,
                    height: block.bbox.y1 - block.bbox.y0
                },
                words: block.words?.map(word => ({
                    text: word.text,
                    confidence: word.confidence,
                    bbox: {
                        x: word.bbox.x0,
                        y: word.bbox.y0,
                        width: word.bbox.x1 - word.bbox.x0,
                        height: word.bbox.y1 - word.bbox.y0
                    }
                })) || []
            })) || [],
            processingTime
        };
    }

    // Preprocess image for better OCR accuracy
    preprocessImage(canvas: HTMLCanvasElement): HTMLCanvasElement {
        const ctx = canvas.getContext('2d')!;
        const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
        const data = imageData.data;

        // Convert to grayscale, then binarize with a fixed global threshold
        for (let i = 0; i < data.length; i += 4) {
            const gray = 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2];
            // Hard-coded threshold at 128; uneven lighting may call for an
            // adaptive method such as Otsu's instead
            const enhanced = gray > 128 ? 255 : 0;
            data[i] = data[i + 1] = data[i + 2] = enhanced;
        }

        ctx.putImageData(imageData, 0, 0);
        return canvas;
    }

    async terminate(): Promise<void> {
        await this.worker?.terminate();
    }
}

// Cloud Vision API integration for production use
class CloudVisionOCR {
    private apiKey: string;
    private endpoint: string;

    // Note: recognize() below sends Google Cloud Vision's request format;
    // the Azure endpoint would need its own payload and auth header
    constructor(apiKey: string, provider: 'google' | 'azure' = 'google') {
        this.apiKey = apiKey;
        this.endpoint = provider === 'google'
            ? 'https://vision.googleapis.com/v1/images:annotate'
            : 'https://your-region.api.cognitive.microsoft.com/vision/v3.2/ocr';
    }

    async recognize(imageBase64: string): Promise<OCRResult> {
        const startTime = performance.now();

        const response = await fetch(`${this.endpoint}?key=${this.apiKey}`, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({
                requests: [{
                    image: { content: imageBase64 },
                    features: [
                        { type: 'DOCUMENT_TEXT_DETECTION' },
                        { type: 'TEXT_DETECTION' }
                    ]
                }]
            })
        });

        const data = await response.json();
        const annotation = data.responses[0].fullTextAnnotation;

        return {
            text: annotation?.text || '',
            confidence: 0.95, // Cloud APIs typically don't return overall confidence
            blocks: this.parseBlocks(annotation?.pages || []),
            processingTime: performance.now() - startTime
        };
    }

    private parseBlocks(pages: any[]): TextBlock[] {
        const blocks: TextBlock[] = [];

        for (const page of pages) {
            for (const block of page.blocks || []) {
                const blockText = block.paragraphs
                    ?.flatMap((p: any) => p.words?.map((w: any) =>
                        w.symbols?.map((s: any) => s.text).join('')
                    ).join(' '))
                    .join('\n') || '';

                blocks.push({
                    text: blockText,
                    confidence: block.confidence || 0.9,
                    bbox: this.verticesToBbox(block.boundingBox?.vertices),
                    words: []
                });
            }
        }

        return blocks;
    }

    private verticesToBbox(vertices: any[]): { x: number; y: number; width: number; height: number } {
        if (!vertices || vertices.length < 4) {
            return { x: 0, y: 0, width: 0, height: 0 };
        }
        return {
            x: vertices[0].x || 0,
            y: vertices[0].y || 0,
            width: (vertices[1].x || 0) - (vertices[0].x || 0),
            height: (vertices[2].y || 0) - (vertices[0].y || 0)
        };
    }
}
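One practical detail: the `content` field in the request above expects raw base64, while `canvas.toDataURL()` returns a string prefixed with `data:image/...;base64,`. A small helper to strip the prefix (a sketch; the function name is ours):

```typescript
// Strip the "data:<mime>;base64," prefix from a data URL, leaving the raw
// base64 payload that vision APIs expect; pass through if there is no prefix.
function dataUrlToBase64(dataUrl: string): string {
    const comma = dataUrl.indexOf(',');
    return comma >= 0 ? dataUrl.slice(comma + 1) : dataUrl;
}
```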

// Document scanner component
import { useState, useEffect, useRef } from 'react';

function DocumentScanner() {
    const [ocr] = useState(() => new BrowserOCR(['eng']));
    const [result, setResult] = useState<OCRResult | null>(null);
    const [isProcessing, setIsProcessing] = useState(false);
    const canvasRef = useRef<HTMLCanvasElement>(null);

    useEffect(() => {
        ocr.initialize();
        return () => { ocr.terminate(); };
    }, []);

    const handleImageUpload = async (file: File) => {
        setIsProcessing(true);

        const img = new Image();
        img.src = URL.createObjectURL(file);
        await img.decode();

        // Draw and preprocess
        const canvas = canvasRef.current!;
        canvas.width = img.width;
        canvas.height = img.height;
        const ctx = canvas.getContext('2d')!;
        ctx.drawImage(img, 0, 0);

        // Preprocess for better accuracy
        ocr.preprocessImage(canvas);

        // Run OCR
        const ocrResult = await ocr.recognize(canvas);
        setResult(ocrResult);
        setIsProcessing(false);
    };

    return (
        <div className="scanner-container">
            <input
                type="file"
                accept="image/*"
                onChange={e => e.target.files?.[0] && handleImageUpload(e.target.files[0])}
            />
            <canvas ref={canvasRef} style={{ maxWidth: '100%' }} />
            {isProcessing && <div>Processing...</div>}
            {result && (
                <div className="ocr-result">
                    <h3>Extracted Text (Confidence: {result.confidence.toFixed(0)}%)</h3>
                    <pre>{result.text}</pre>
                    <p>Processing time: {result.processingTime.toFixed(0)}ms</p>
                </div>
            )}
        </div>
    );
}
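One practical note before wiring this up: Tesseract slows down sharply on multi-megapixel camera photos, so it's worth capping the canvas size before drawing. A small sketch (the helper name `fitWithin` and the 2000px cap are illustrative choices, not part of any library):

```typescript
// Illustrative helper: compute dimensions that fit an image inside a
// maximum edge length while preserving aspect ratio. Returns the
// original size unchanged when it already fits.
function fitWithin(
    width: number,
    height: number,
    maxEdge: number
): { width: number; height: number; scale: number } {
    const longest = Math.max(width, height);
    if (longest <= maxEdge) {
        return { width, height, scale: 1 };
    }
    const scale = maxEdge / longest;
    return {
        width: Math.round(width * scale),
        height: Math.round(height * scale),
        scale
    };
}

const scaled = fitWithin(4000, 3000, 2000);
// scaled → { width: 2000, height: 1500, scale: 0.5 }
```

You would then size the canvas from the returned dimensions and pass them as the destination size in `ctx.drawImage(img, 0, 0, width, height)`.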

Building Visual Search

Visual search enables users to find products or content using images instead of text. This requires extracting feature embeddings from each image and comparing them efficiently at query time:

// visual-search.ts
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';

interface SearchResult {
    id: string;
    similarity: number;
    metadata: Record<string, any>;
}

class VisualSearchEngine {
    private model: mobilenet.MobileNet | null = null;
    private index: Map<string, { embedding: Float32Array; metadata: Record<string, any> }> = new Map();

    async initialize(): Promise<void> {
        this.model = await mobilenet.load({ version: 2, alpha: 1.0 });
    }

    // Extract embedding from image
    async extractEmbedding(image: HTMLImageElement | HTMLCanvasElement): Promise<Float32Array> {
        if (!this.model) throw new Error('Model not initialized');

        const embedding = tf.tidy(() => {
            const tensor = this.model!.infer(image, true) as tf.Tensor;
            // Normalize the embedding
            const normalized = tensor.div(tensor.norm());
            return normalized;
        });

        const data = await embedding.data() as Float32Array;
        embedding.dispose();
        return data;
    }

    // Add item to search index
    async indexItem(
        id: string,
        image: HTMLImageElement | HTMLCanvasElement,
        metadata: Record<string, any> = {}
    ): Promise<void> {
        const embedding = await this.extractEmbedding(image);
        this.index.set(id, { embedding, metadata });
    }

    // Search for similar images
    async search(
        queryImage: HTMLImageElement | HTMLCanvasElement,
        topK: number = 10
    ): Promise<SearchResult[]> {
        const queryEmbedding = await this.extractEmbedding(queryImage);
        const results: SearchResult[] = [];

        for (const [id, item] of this.index) {
            const similarity = this.cosineSimilarity(queryEmbedding, item.embedding);
            results.push({ id, similarity, metadata: item.metadata });
        }

        return results
            .sort((a, b) => b.similarity - a.similarity)
            .slice(0, topK);
    }

    private cosineSimilarity(a: Float32Array, b: Float32Array): number {
        let dotProduct = 0;
        for (let i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
        }
        return dotProduct; // Embeddings are already normalized
    }

    // Export index for persistence
    exportIndex(): string {
        const data: Record<string, { embedding: number[]; metadata: Record<string, any> }> = {};

        for (const [id, item] of this.index) {
            data[id] = {
                embedding: Array.from(item.embedding),
                metadata: item.metadata
            };
        }

        return JSON.stringify(data);
    }

    // Import index from persisted data
    importIndex(jsonData: string): void {
        const data = JSON.parse(jsonData);

        for (const [id, item] of Object.entries(data)) {
            const typedItem = item as { embedding: number[]; metadata: Record<string, any> };
            this.index.set(id, {
                embedding: new Float32Array(typedItem.embedding),
                metadata: typedItem.metadata
            });
        }
    }
}
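The dot-product shortcut in `cosineSimilarity` only works because `extractEmbedding` L2-normalizes every vector first. A tiny self-contained check of that identity:

```typescript
// L2-normalize a vector so that its dot product with another normalized
// vector equals their cosine similarity.
function l2Normalize(v: number[]): Float32Array {
    const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0)) || 1;
    return new Float32Array(v.map(x => x / norm));
}

function dot(a: Float32Array, b: Float32Array): number {
    let sum = 0;
    for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
}

const a = l2Normalize([1, 0, 0]);
const b = l2Normalize([1, 1, 0]);
// dot(a, a) → 1 (identical vectors)
// dot(a, b) → ~0.707 (vectors 45 degrees apart)
```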

// Product visual search component
function ProductVisualSearch({ products }: { products: Product[] }) {
    const [searchEngine] = useState(() => new VisualSearchEngine());
    const [results, setResults] = useState<SearchResult[]>([]);
    const [isIndexing, setIsIndexing] = useState(true);

    useEffect(() => {
        const initSearch = async () => {
            await searchEngine.initialize();

            // Index all products
            for (const product of products) {
                const img = new Image();
                img.crossOrigin = 'anonymous';
                img.src = product.imageUrl;
                await img.decode();
                await searchEngine.indexItem(product.id, img, {
                    name: product.name,
                    price: product.price
                });
            }

            setIsIndexing(false);
        };
        initSearch();
    }, [products]);

    const handleSearch = async (file: File) => {
        const img = new Image();
        const objectUrl = URL.createObjectURL(file);
        img.src = objectUrl;
        await img.decode();
        URL.revokeObjectURL(objectUrl); // avoid leaking the blob URL

        const searchResults = await searchEngine.search(img, 5);
        setResults(searchResults);
    };

    return (
        <div className="visual-search">
            {isIndexing ? (
                <p>Indexing products...</p>
            ) : (
                <>
                    <input
                        type="file"
                        accept="image/*"
                        onChange={e => e.target.files?.[0] && handleSearch(e.target.files[0])}
                    />
                    <div className="search-results">
                        {results.map(result => (
                            <div key={result.id} className="result-item">
                                <span>{result.metadata.name}</span>
                                <span>Similarity: {(result.similarity * 100).toFixed(1)}%</span>
                            </div>
                        ))}
                    </div>
                </>
            )}
        </div>
    );
}
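Because `extractEmbedding` is the expensive step, the `exportIndex`/`importIndex` pair lets you persist embeddings (for example in localStorage or IndexedDB) instead of re-indexing every product on each page load. A standalone sketch of that round-trip with the storage layer left out, showing that `Float32Array` embeddings survive JSON serialization intact:

```typescript
// Mirrors the exportIndex / importIndex serialization: Float32Array
// embeddings become plain number arrays for JSON and are restored on
// import. Where the string is stored (localStorage, IndexedDB, a file)
// is up to the host app.
type IndexEntry = { embedding: Float32Array; metadata: Record<string, any> };

function serialize(index: Map<string, IndexEntry>): string {
    const data: Record<string, { embedding: number[]; metadata: Record<string, any> }> = {};
    for (const [id, item] of index) {
        data[id] = { embedding: Array.from(item.embedding), metadata: item.metadata };
    }
    return JSON.stringify(data);
}

function deserialize(json: string): Map<string, IndexEntry> {
    const index = new Map<string, IndexEntry>();
    for (const [id, item] of Object.entries(JSON.parse(json)) as Array<[string, any]>) {
        index.set(id, {
            embedding: new Float32Array(item.embedding),
            metadata: item.metadata
        });
    }
    return index;
}

const original = new Map<string, IndexEntry>([
    ['p1', { embedding: new Float32Array([0.1, 0.2]), metadata: { name: 'Mug' } }]
]);
const restored = deserialize(serialize(original));
```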

Content Moderation with AI

Content moderation ensures uploaded images comply with platform policies. Here's how to implement NSFW detection and content filtering:

// content-moderation.ts
import * as nsfwjs from 'nsfwjs';

interface ModerationResult {
    safe: boolean;
    scores: {
        neutral: number;
        drawing: number;
        hentai: number;
        porn: number;
        sexy: number;
    };
    flaggedCategories: string[];
}

class ContentModerator {
    private model: nsfwjs.NSFWJS | null = null;
    private thresholds: Record<string, number>;

    constructor(thresholds?: Record<string, number>) {
        this.thresholds = thresholds || {
            porn: 0.8,
            hentai: 0.8,
            sexy: 0.9
        };
    }

    async initialize(): Promise<void> {
        // Use the inception v3 model for better accuracy
        this.model = await nsfwjs.load(
            'https://nsfwjs-model.s3.amazonaws.com/inception_v3/model.json',
            { size: 299 }
        );
    }

    async moderate(
        image: HTMLImageElement | HTMLCanvasElement | HTMLVideoElement
    ): Promise<ModerationResult> {
        if (!this.model) throw new Error('Model not initialized');

        const predictions = await this.model.classify(image);

        const scores = {
            neutral: 0,
            drawing: 0,
            hentai: 0,
            porn: 0,
            sexy: 0
        };

        predictions.forEach(p => {
            scores[p.className.toLowerCase() as keyof typeof scores] = p.probability;
        });

        const flaggedCategories: string[] = [];

        if (scores.porn >= this.thresholds.porn) flaggedCategories.push('porn');
        if (scores.hentai >= this.thresholds.hentai) flaggedCategories.push('hentai');
        if (scores.sexy >= this.thresholds.sexy) flaggedCategories.push('sexy');

        return {
            safe: flaggedCategories.length === 0,
            scores,
            flaggedCategories
        };
    }

    // Moderate video by sampling frames
    async moderateVideo(
        video: HTMLVideoElement,
        sampleRate: number = 1 // frames per second
    ): Promise<{ safe: boolean; flaggedFrames: number[] }> {
        const flaggedFrames: number[] = []; // timestamps (seconds) of flagged frames
        const duration = video.duration;
        const interval = 1 / sampleRate;

        for (let time = 0; time < duration; time += interval) {
            await new Promise<void>(resolve => {
                video.addEventListener('seeked', () => resolve(), { once: true });
                video.currentTime = time;
            });

            const result = await this.moderate(video);
            if (!result.safe) {
                flaggedFrames.push(time);
            }
        }

        return {
            safe: flaggedFrames.length === 0,
            flaggedFrames
        };
    }
}
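The flagging step inside `moderate()` is just a threshold comparison, and factoring it into a pure function makes moderation policies easy to unit-test without loading the model. A hedged sketch (`flagCategories` is an illustrative helper, not part of nsfwjs):

```typescript
// Illustrative pure helper mirroring the threshold logic in moderate():
// returns every category whose score meets or exceeds its configured
// threshold. Categories without a threshold (e.g. neutral, drawing)
// are never flagged.
function flagCategories(
    scores: Record<string, number>,
    thresholds: Record<string, number>
): string[] {
    return Object.entries(thresholds)
        .filter(([category, threshold]) => (scores[category] ?? 0) >= threshold)
        .map(([category]) => category);
}

const scores = { neutral: 0.05, drawing: 0.02, hentai: 0.01, porn: 0.85, sexy: 0.4 };
const flagged = flagCategories(scores, { porn: 0.8, hentai: 0.8, sexy: 0.9 });
// flagged → ['porn']
```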

// Image upload with moderation
function ModeratedUpload({ onUpload }: { onUpload: (file: File) => void }) {
    const [moderator] = useState(() => new ContentModerator());
    const [status, setStatus] = useState<'idle' | 'checking' | 'approved' | 'rejected'>('idle');

    useEffect(() => {
        moderator.initialize();
    }, []);

    const handleFileSelect = async (file: File) => {
        setStatus('checking');

        const img = new Image();
        img.src = URL.createObjectURL(file);
        await img.decode();

        const result = await moderator.moderate(img);

        if (result.safe) {
            setStatus('approved');
            onUpload(file);
        } else {
            setStatus('rejected');
            alert(`Image rejected: Contains ${result.flaggedCategories.join(', ')}`);
        }
    };

    return (
        <div className="moderated-upload">
            <input
                type="file"
                accept="image/*"
                onChange={e => e.target.files?.[0] && handleFileSelect(e.target.files[0])}
                disabled={status === 'checking'}
            />
            {status === 'checking' && <span>Checking content...</span>}
            {status === 'approved' && <span className="success">Image approved</span>}
            {status === 'rejected' && <span className="error">Image rejected</span>}
        </div>
    );
}

Performance Optimization Strategies

Running computer vision models efficiently requires careful optimization:

// cv-performance.ts
import * as tf from '@tensorflow/tfjs';

// 1. Model quantization and compression
interface OptimizationConfig {
    quantization: '16bit' | '8bit' | 'none';
    inputSize: number;
    batchSize: number;
}

// 2. Efficient frame processing
class FrameProcessor {
    private frameSkip: number;
    private frameCount: number = 0;
    private lastResults: any = null;

    constructor(targetFPS: number, processingFPS: number) {
        this.frameSkip = Math.max(1, Math.floor(targetFPS / processingFPS));
    }

    shouldProcess(): boolean {
        this.frameCount++;
        return this.frameCount % this.frameSkip === 0;
    }

    getLastResults(): any {
        return this.lastResults;
    }

    setResults(results: any): void {
        this.lastResults = results;
    }
}
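To see how the skip logic plays out: with a 30 fps camera feed and a model that sustains roughly 10 fps, `frameSkip` works out to 3, so one frame in three runs inference while the other two reuse the cached results. A minimal copy of the skip logic demonstrating the cadence:

```typescript
// Minimal reproduction of FrameProcessor's skip logic to show which
// frames in a stream actually trigger inference.
class SkipGate {
    private frameSkip: number;
    private frameCount = 0;

    constructor(targetFPS: number, processingFPS: number) {
        this.frameSkip = Math.max(1, Math.floor(targetFPS / processingFPS));
    }

    shouldProcess(): boolean {
        this.frameCount++;
        return this.frameCount % this.frameSkip === 0;
    }
}

const gate = new SkipGate(30, 10); // frameSkip = 3
const pattern = Array.from({ length: 6 }, () => gate.shouldProcess());
// pattern → [false, false, true, false, false, true]
```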

// 3. Web Worker for off-main-thread processing
// cv-worker.ts
self.onmessage = async (e: MessageEvent) => {
    const { type, imageData, config } = e.data;

    switch (type) {
        case 'classify': {
            // classifyImage is a placeholder for your model inference call
            const result = await classifyImage(imageData, config);
            self.postMessage({ type: 'result', result });
            break;
        }
    }
};

// 4. Memory management
class MemoryManager {
    private tensors: tf.Tensor[] = [];
    private maxTensors: number = 100;

    track(tensor: tf.Tensor): tf.Tensor {
        this.tensors.push(tensor);

        if (this.tensors.length > this.maxTensors) {
            const toDispose = this.tensors.splice(0, this.tensors.length - this.maxTensors);
            toDispose.forEach(t => t.dispose());
        }

        return tensor;
    }

    cleanup(): void {
        this.tensors.forEach(t => t.dispose());
        this.tensors = [];
    }

    getStats(): { numTensors: number; numBytes: number } {
        return tf.memory();
    }
}

// 5. Lazy loading and code splitting
const loadMobileNet = () => import('@tensorflow-models/mobilenet');
const loadCocoSsd = () => import('@tensorflow-models/coco-ssd');
const loadNSFWJS = () => import('nsfwjs');
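Dynamic `import()` already caches the resolved module, but wrapping the loaders in a small memoizer guarantees a single shared promise even across concurrent first calls, and gives you one place to hook a loading spinner. A sketch (`loadOnce` is an illustrative helper; the fake factory stands in for a real model import):

```typescript
// Illustrative memoizing wrapper: the factory runs at most once, and
// every caller shares the same promise (and thus the same module).
function loadOnce<T>(factory: () => Promise<T>): () => Promise<T> {
    let cached: Promise<T> | null = null;
    return () => (cached ??= factory());
}

// e.g. const getMobileNet = loadOnce(() => import('@tensorflow-models/mobilenet'));
let calls = 0;
const getModel = loadOnce(async () => { calls++; return { name: 'fake-model' }; });
```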

Key Takeaways

Remember These Points

  • Choose the right architecture: Use client-side for privacy-sensitive, real-time applications; server-side for accuracy-critical tasks; hybrid for best of both
  • TensorFlow.js for flexibility: Supports custom models and a wide range of pre-trained models for various CV tasks
  • MediaPipe for performance: Offers highly optimized solutions for face, hand, and pose detection with real-time performance
  • OCR tradeoffs: Tesseract.js works offline with 85-95% accuracy; cloud APIs offer 95-99% accuracy for complex documents
  • Visual search requires indexing: Extract embeddings once, compare efficiently at query time
  • Content moderation is essential: Implement before any user-generated image upload feature
  • Optimize aggressively: Use quantized models, skip frames, leverage Web Workers, and manage memory carefully

Conclusion

Computer vision capabilities have become accessible to web developers through powerful frameworks like TensorFlow.js and MediaPipe. Whether you're building virtual try-on features, document scanners, visual search engines, or content moderation systems, the tools and patterns covered in this guide provide a solid foundation for implementing production-grade visual AI features.

For further learning, explore the TensorFlow.js Models Gallery for pre-trained models, the MediaPipe Solutions documentation, and Google Cloud Vision API for enterprise-grade accuracy. Consider also exploring Hugging Face's vision models which can be converted for browser use.

The key to success is matching the right tool to your specific use case while carefully considering the tradeoffs between accuracy, latency, privacy, and cost. Start with simple implementations, measure real-world performance, and iterate based on user feedback.