Visual & Multimodal Retrieval System

Vision-text retrieval with FAISS, learned reranking, Ray Serve latency optimization, and responsible AI evaluation

Overview

Built a multimodal retrieval system that combines visual and textual encoders with dense FAISS indexing over 1M+ vectors. The system is designed for visual search workflows where relevance, latency, and grounded failure analysis all matter.
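The dense-retrieval core can be sketched as exact top-k inner-product search over L2-normalized embeddings. The numpy version below mirrors what an exact FAISS index (faiss.IndexFlatIP) computes at search time; the corpus size, dimensionality, and the search helper are illustrative stand-ins, not the production 1M+-vector index:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: 10k L2-normalized embeddings (stand-in for 1M+ vectors).
dim = 64
corpus = rng.standard_normal((10_000, dim)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def search(query: np.ndarray, k: int = 100):
    """Exact inner-product top-k, mirroring faiss.IndexFlatIP.search."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q                       # cosine similarity (unit vectors)
    topk = np.argpartition(-scores, k)[:k]    # unordered top-k candidates
    order = np.argsort(-scores[topk])         # sort candidates by score
    return topk[order], scores[topk[order]]

ids, scores = search(rng.standard_normal(dim).astype("float32"))
```

At this scale brute force is fine; the point of FAISS is that the same search interface scales to millions of vectors with approximate (ANN) index types.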

Key Work

  • Designed a vision-text embedding pipeline with transformer-based visual encoders, dense retrieval, and learned reranking.
  • Improved retrieval quality by 8–12% in Recall@100 through encoder and reranking experiments.
  • Reduced p95 latency by 15–20% with GPU-optimized Ray Serve batching.
  • Integrated claim-level faithfulness checks and responsible AI evaluation harnesses to compare quality, latency, and failure modes across model variants.
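The retrieve-then-rerank flow from the bullets above can be illustrated with a minimal sketch: first-stage retriever scores are blended with a second-stage learned score, and candidates are reordered by the blended score. The linear blend, its weights, and the score arrays are hypothetical stand-ins for a trained cross-encoder reranker:

```python
import numpy as np

def rerank(candidate_ids, retriever_scores, cross_scores, weights=(0.3, 0.7)):
    """Rescore first-stage candidates and return them in reranked order.

    A fixed linear blend stands in for a learned reranker here; in practice
    cross_scores would come from a trained cross-encoder over (query, doc).
    """
    w_retrieve, w_cross = weights
    final = w_retrieve * retriever_scores + w_cross * cross_scores
    order = np.argsort(-final)                # descending by blended score
    return [candidate_ids[i] for i in order], final[order]

ids = list(range(5))
retr = np.array([0.9, 0.8, 0.7, 0.6, 0.5])    # first-stage (bi-encoder) scores
cross = np.array([0.1, 0.9, 0.8, 0.95, 0.2])  # hypothetical reranker scores
reranked, scores = rerank(ids, retr, cross)
```

The key property this illustrates: reranking only touches the top-k candidate set, so an expensive learned scorer can improve ordering without paying its cost over the full corpus.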

Stack

  • Python, PyTorch, Hugging Face Transformers
  • FAISS / ANN indexing, multimodal embeddings, ranking and reranking
  • Ray Serve, GPU batching, evaluation pipelines
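Ray Serve exposes dynamic request batching via its @serve.batch decorator; the toy asyncio batcher below is a hypothetical sketch of the underlying idea (queue individual requests, flush when the batch fills or a wait timeout expires, run one batched model call per flush), not Ray Serve's implementation. The MicroBatcher name and its parameters are illustrative:

```python
import asyncio

class MicroBatcher:
    """Toy dynamic micro-batcher: one batched model call per flush."""

    def __init__(self, model_fn, max_batch_size=4, batch_wait_s=0.01):
        self.model_fn = model_fn
        self.max_batch_size = max_batch_size
        self.batch_wait_s = batch_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes = []              # record flush sizes for inspection

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut                   # resolved when the batch runs

    async def run(self):
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.batch_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(batch))
            results = self.model_fn([item for item, _ in batch])  # one batched call
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

async def main():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])  # stand-in "model"
    worker = asyncio.ensure_future(batcher.run())
    out = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return out, batcher.batch_sizes

outputs, batch_sizes = asyncio.run(main())
```

On a GPU, amortizing one forward pass over a full batch is what drives the tail-latency and throughput gains; the batch_wait_s knob trades a small queueing delay for larger, better-utilized batches.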

Impact

The project connects core visual search techniques with production-facing constraints: reliable retrieval, measurable ranking quality, fast serving, and responsible evaluation.