The power of semantic cosine similarity

2025-06-01 • #python, #ai, #vector, #semantic

Building a Cat Search Engine with Semantic Cosine Similarity: A Deep Dive into CLIP and Vector Databases

I've built a cat search engine that goes beyond simple pixel matching - it actually understands the semantic meaning of images using OpenAI's CLIP encoder and cosine similarity. This system can find similar cats even when they're in completely different poses, lighting conditions, or contexts because it works with high-level semantic concepts rather than raw pixel data.

The Architecture: A Symphony of Vectors and Similarity

The system is built around three core components that work together:

CLIP Encoder: Transforms images into high-dimensional vectors that capture semantic meaning
Qdrant Vector Database: Stores and searches through millions of image embeddings using cosine similarity
Dual Matching System: Full image matching and sliding window partial matching for maximum flexibility

The magic happens when we convert images into 512-dimensional vectors using CLIP's ViT-B-32 model. These aren't just random numbers - they're semantic representations that cluster similar concepts together in vector space. When two cat images are semantically similar, their vectors will be close together, giving us a high cosine similarity score.

CLIP Encoder: The Brain of the Operation

The heart of our semantic understanding lies in the CLIP encoder implementation. Here's how this works:

class ClipEncoder:
    def __init__(self, 
                 model_name: str = None,
                 device: str = None,
                 use_gpu: bool = None,
                 batch_size: int = None,
                 **kwargs):
        global _GLOBAL_ENCODER_INSTANCE

        if _GLOBAL_ENCODER_INSTANCE is not None:
            print("DEBUG: Reusing existing CLIP encoder instance for consistent embeddings")
            self.model = _GLOBAL_ENCODER_INSTANCE.model
            self.preprocess = _GLOBAL_ENCODER_INSTANCE.preprocess
            self.device = _GLOBAL_ENCODER_INSTANCE.device
            self._embedding_dim = _GLOBAL_ENCODER_INSTANCE._embedding_dim
            self.batch_size = _GLOBAL_ENCODER_INSTANCE.batch_size
            self.use_gpu = _GLOBAL_ENCODER_INSTANCE.use_gpu
            return

This global instance pattern keeps us from loading a second copy of CLIP into GPU memory, which matters because VRAM is limited and expensive. One side effect worth knowing: once an instance exists, the shortcut returns early and ignores whatever model_name, device, or batch_size the caller passed. The model uses the ViT-B-32 architecture with OpenAI's pretrained weights, giving us 512-dimensional embeddings that work well for semantic similarity tasks.
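To make the pattern concrete, here's a minimal standalone sketch of the same idea. ExpensiveModel and get_shared_model are hypothetical names used for illustration; they aren't part of the project's code, and the real encoder loads CLIP weights where this one just bumps a counter.

```python
from typing import Optional


class ExpensiveModel:
    """Hypothetical stand-in for a model that is costly to load."""

    load_count = 0  # how many times the "weights" were loaded

    def __init__(self) -> None:
        ExpensiveModel.load_count += 1


# Module-level slot holding the single shared instance.
_SHARED_MODEL: Optional[ExpensiveModel] = None


def get_shared_model() -> ExpensiveModel:
    """Return the process-wide model, loading it on first use only."""
    global _SHARED_MODEL
    if _SHARED_MODEL is None:
        _SHARED_MODEL = ExpensiveModel()
    return _SHARED_MODEL


first = get_shared_model()
second = get_shared_model()
print(first is second, ExpensiveModel.load_count)
```

A factory function like this makes the sharing explicit at the call site, whereas doing it inside __init__ (as the encoder does) keeps existing constructor calls working unchanged.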

The preprocessing pipeline handles the image normalization:

self.model, _, self.preprocess = open_clip.create_model_and_transforms(
    model_name=model_name,
    pretrained="openai",
    device=self.device
)

This creates a preprocessing pipeline that normalizes images to CLIP's expected input format: 224x224 pixels with specific mean and standard deviation values that match the training data. CLIP was trained on 400 million image-text pairs, so it has an incredibly rich understanding of visual concepts.

The Encoding Process: From Pixels to Vectors

The encoding process is tuned for GPU inference, with automatic mixed precision when it's enabled, and falls back to a plain single-process loader on CPU:

def encode(self, images: List[Union[str, Path, PIL.Image.Image]]) -> np.ndarray:
    dataset = ImageDataset(images, self.preprocess)

    # Prefer the GPU-specific batch size from config when running on CUDA.
    batch_size = self.batch_size
    if self.device == 'cuda':
        batch_size = getattr(config, 'GPU_INFERENCE_BATCH_SIZE', batch_size)

    # Worker processes are only used on CUDA and never on Windows (os.name == 'nt').
    use_workers = self.device == 'cuda' and os.name != 'nt'
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=2 if use_workers else 0,
        pin_memory=self.device == 'cuda',
        persistent_workers=use_workers
    )

Three GPU optimizations are at work here. We're using pinned memory for faster CPU-to-GPU transfers, persistent workers to avoid re-spawning loader processes, and automatic mixed precision when it's enabled in the config. Under autocast, eligible operations such as matrix multiplications run in FP16 while operations that need higher precision stay in FP32.

# Mixed precision only applies on CUDA and only when the config enables it.
use_amp = (self.device == 'cuda' and
           getattr(config, 'GPU_MIXED_PRECISION', False) and
           hasattr(torch, 'amp'))

if use_amp:
    with torch.amp.autocast(device_type='cuda'):
        batch_embeddings = self.model.visual(batch)
        batch_embeddings = batch_embeddings / batch_embeddings.norm(dim=1, keepdim=True)

The L2 normalization at the end is crucial for cosine similarity. By normalizing our vectors to unit length, cosine similarity becomes equivalent to dot product similarity, which is much faster to compute at scale.
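Here's a small pure-Python check of that equivalence. The helper names and the toy 3-dimensional vectors are mine, not the project's; the real embeddings are 512-dimensional arrays.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Textbook cosine similarity: dot product divided by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

full = cosine_similarity(a, b)                 # norms computed at query time
fast = dot(l2_normalize(a), l2_normalize(b))   # norms paid once, up front

print(full, fast)
```

Both values come out to 24/25 = 0.96 (up to floating-point rounding). Normalizing once at indexing time means every later comparison is just a dot product.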

Dual Matching: Full Image vs Sliding Windows

The system implements two different matching strategies that complement each other:

Full Image Matching

The FullImageMatcher encodes entire images as single vectors. This is perfect for finding images that are semantically similar as a whole:

class FullImageMatcher(ImageMatcher):
    def match(self,
             query_image: Union[str, Path, PIL.Image.Image],
             limit: int = 10,
             filter: Dict[str, Any] = None) -> List[Dict[str, Any]]:

        query_embedding = self.encoder.encode([query_image])[0]
        print(f"DEBUG: Generated embedding for full image match, shape: {query_embedding.shape}")

        results = self.index.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=limit,
            filter=filter
        )

Partial Image Matching with Sliding Windows

The PartialImageMatcher implements a sliding window approach that can find partial matches within larger images:

def index_image(self, 
               image_path: Union[str, Path], 
               image_id: str = None,
               metadata: Dict[str, Any] = None) -> str:

    window_embeddings = self.encoder.encode_sliding_windows(
        image=image_path,
        window_sizes=self.window_sizes,
        stride=self.stride
    )

    for window_pos, embedding in window_embeddings.items():
        window_id = str(uuid.uuid4())

        # metadata defaults to None, so fall back to an empty dict before copying
        window_metadata = dict(metadata or {})
        window_metadata["window_position"] = {
            "x": window_pos[0],
            "y": window_pos[1],
            "width": window_pos[2],
            "height": window_pos[3]
        }

This creates multiple embeddings per image, each representing a different region. The sliding window sizes are configured as [(224, 224), (448, 448)] with a stride of 56 pixels. This means we're capturing both fine-grained details and larger contextual regions.
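To get a feel for how many windows that produces, here's a sketch of the position arithmetic. It's an assumption about how encode_sliding_windows lays out its grid, not the project's code: window_positions is a hypothetical helper, and it simply skips windows that don't fit inside the image, whereas the real encoder may handle edges differently.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)


def window_positions(image_w: int,
                     image_h: int,
                     window_sizes: List[Tuple[int, int]],
                     stride: int) -> List[Window]:
    """Enumerate top-left-anchored windows that fit fully inside the image."""
    windows: List[Window] = []
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # window larger than the image: skipped in this sketch
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                windows.append((x, y, win_w, win_h))
    return windows


windows = window_positions(448, 448, [(224, 224), (448, 448)], stride=56)
print(len(windows))
```

For a 448x448 image this gives 5 x 5 = 25 windows at 224x224 plus one at 448x448, so 26 embeddings for a single image. That multiplier is the storage cost of partial matching.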

Qdrant: The Vector Database Powerhouse

Qdrant is the vector database handling our similarity searches, and it's specifically built for high-dimensional vector similarity search with cosine distance:

def create_collection(self, collection_name: str, vector_size: int, **kwargs) -> bool:
    vectors_config = rest.VectorParams(
        size=vector_size,
        distance=rest.Distance.COSINE,
        hnsw_config=rest.HnswConfigDiff(
            m=32,
            ef_construct=256,
            full_scan_threshold=16384
        )
    )

The HNSW (Hierarchical Navigable Small World) algorithm is what makes this fast. It builds a multi-layer graph that supports approximate nearest neighbor search in roughly logarithmic time with respect to collection size. The m=32 parameter controls how many links each node keeps in the graph, while ef_construct=256 sets how many candidate neighbors are considered when a point is inserted during index construction. Raising either one improves recall at the cost of memory and build time.
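For intuition about what HNSW approximates, here's the exact brute-force search it replaces. This is not how Qdrant works internally; it's a baseline sketch that scores every stored vector, assuming all vectors are already unit length. exact_top_k and the toy 2-dimensional vectors are made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def exact_top_k(query: List[float],
                vectors: Dict[str, List[float]],
                k: int) -> List[Tuple[str, float]]:
    """Score every vector by dot product and return the k best, highest first."""
    scored = (
        (vec_id, sum(q * v for q, v in zip(query, vec)))
        for vec_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy unit-length vectors standing in for image embeddings.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.6, 0.8],
}

top = exact_top_k([1.0, 0.0], vectors, k=2)
print(top)
```

This scan is linear in the number of stored vectors, which is fine for thousands of images and painful for millions. HNSW trades a small amount of recall for skipping most of those comparisons.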

The Search Process: From Query to Results

When a search query comes in through the Flask API, here's what happens:

@app.route('/api/search/by-image', methods=['POST'])
def search_by_image():
    processed_img = processor.process_image(
        temp_path,
        resize=True,
        remove_text=remove_text,
        enhance=enhance_image
    )

    results = []

    if use_full_matching:
        full_results = full_matcher.match(processed_img, limit=limit)
        for result in full_results:
            results.append({
                "id": result["id"],
                "score": result["score"],
                "metadata": result["metadata"]
            })

    if use_partial_matching:
        partial_results = partial_matcher.match(processed_img, limit=limit)
        for result in partial_results:
            results.append({
                "id": result["id"],
                "score": result["score"],
                "metadata": result["metadata"],
                "matched_regions": result.get("matched_regions"),
                "avg_score": result.get("avg_score")
            })

The image preprocessing includes optional text removal and enhancement. The text removal is particularly useful for memes or images with overlaid text that might interfere with the semantic understanding.
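The partial results above carry matched_regions and avg_score, but the excerpt doesn't show how they're produced. One plausible way, sketched here purely as an assumption, is to group window-level hits by the image they came from. The hit format (parent_id, score, window_position) and the group_window_hits helper are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_window_hits(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse per-window hits into one result per source image."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for hit in hits:
        grouped[hit["parent_id"]].append(hit)

    results = []
    for parent_id, group in grouped.items():
        scores = [h["score"] for h in group]
        results.append({
            "id": parent_id,
            "score": max(scores),                    # best single window
            "avg_score": sum(scores) / len(scores),  # mean over matched windows
            "matched_regions": [h["window_position"] for h in group],
        })
    # Highest-scoring images first.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results


hits = [
    {"parent_id": "cat_1", "score": 0.75,
     "window_position": {"x": 0, "y": 0, "width": 224, "height": 224}},
    {"parent_id": "cat_2", "score": 0.625,
     "window_position": {"x": 56, "y": 0, "width": 224, "height": 224}},
    {"parent_id": "cat_1", "score": 0.5,
     "window_position": {"x": 112, "y": 0, "width": 224, "height": 224}},
]

grouped_results = group_window_hits(hits)
print([(r["id"], r["score"], r["avg_score"]) for r in grouped_results])
```

Using the best window as the headline score and the mean as a secondary signal is one design choice among several; the actual matcher may weight them differently.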

Cosine Similarity: The Mathematical Foundation

Cosine similarity is perfect for this use case. Given two normalized vectors a and b, cosine similarity is simply their dot product:

cosine_similarity(a, b) = a · b = Σ(aᵢ × bᵢ)

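Since our vectors are L2-normalized (unit length), this gives us a value between -1 and 1, where 1 means the vectors point in the same direction, 0 means they're orthogonal (unrelated), and -1 means they point in opposite directions. In practice, we rarely see negative values between image embeddings. That isn't because the components are all positive (individual components are often negative); it's because CLIP image embeddings tend to cluster in a fairly narrow cone of the 512-dimensional space, so any two of them already share some direction.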

A nice property of cosine similarity is that it's invariant to vector magnitude - it only cares about direction. Robustness to whether a cat image is bright or dark, large or small is a separate matter: that comes from CLIP's learned representation, which tends to map such variations to nearby directions, not from the metric itself.
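The landmark values and the magnitude invariance are easy to check numerically. This is a toy sketch with hand-picked low-dimensional vectors, not real CLIP embeddings.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Landmark values: same direction, orthogonal, opposite direction.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

# Magnitude invariance: scaling one vector leaves the score unchanged.
a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]
base = cosine_similarity(a, b)
scaled = cosine_similarity([10.0 * x for x in a], b)

print(same, orthogonal, opposite, base, scaled)
```

Here base and scaled both equal 8/9, which is the point: stretching a vector changes its length but not its angle to anything else.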

Batch Processing and GPU Optimization

The indexing process is heavily optimized for batch processing:

def batch_index_images(image_paths: List[Path], batch_size: int = 100) -> Dict[str, Any]:
    full_matcher = FullImageMatcher(collection_name="full_cats")
    partial_matcher = PartialImageMatcher(collection_name="partial_cats")

    batches = [image_paths[i:i + batch_size] for i in range(0, len(image_paths), batch_size)]

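    for batch_idx, batch_paths in enumerate(tqdm(batches, desc="Processing batches")):
        # Assumption: IDs and metadata are built per batch. The original excerpt
        # elides this step, so these two lines are placeholders (needs `import uuid`).
        batch_ids = [str(uuid.uuid4()) for _ in batch_paths]
        batch_metadatas = [{"path": str(p)} for p in batch_paths]

        full_ids = full_matcher.batch_index_images(
            image_paths=batch_paths,
            image_ids=batch_ids.copy(),
            metadatas=[m.copy() for m in batch_metadatas]
        )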

The GPU optimization settings are carefully tuned:

def optimize_gpu_settings():
    if torch.cuda.is_available():
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.deterministic = False
        torch.cuda.set_per_process_memory_fraction(0.85)
        torch.cuda.empty_cache()

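We're capping this process at 85% of GPU memory to leave headroom and reduce the risk of out-of-memory crashes, enabling cuDNN benchmarking so it can pick the fastest convolution algorithms for our input shapes, and clearing the cache to start fresh. Turning off deterministic mode trades bit-for-bit reproducibility for speed.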

Performance Characteristics

With GPU acceleration, indexing runs at around 50-100 images per second, depending on hardware. Search latency is typically under 100ms for databases with millions of vectors, thanks to Qdrant's HNSW implementation.

The dual collection approach (full_cats and partial_cats) means a query can run two different search strategies, one after the other in the handler shown earlier. The results are then merged and deduplicated by ID, keeping the higher score for each image:

unique_results = {}
for result in results:
    if result["id"] not in unique_results or result["score"] > unique_results[result["id"]]["score"]:
        unique_results[result["id"]] = result

final_results = sorted(
    list(unique_results.values()),
    key=lambda x: x["score"],
    reverse=True
)[:limit]
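Here's the same merge step wrapped in a function and run on sample data. merge_results is a name I've chosen for illustration, and the IDs and scores are invented. The sketch assumes both matchers report the same ID for the same source image.

```python
from typing import Any, Dict, List


def merge_results(results: List[Dict[str, Any]], limit: int) -> List[Dict[str, Any]]:
    """Keep the best-scoring entry per ID, then return the top `limit` by score."""
    unique: Dict[str, Dict[str, Any]] = {}
    for result in results:
        best = unique.get(result["id"])
        if best is None or result["score"] > best["score"]:
            unique[result["id"]] = result
    return sorted(unique.values(), key=lambda r: r["score"], reverse=True)[:limit]


results = [
    {"id": "cat_1", "score": 0.82},  # from full-image matching
    {"id": "cat_2", "score": 0.64},  # from full-image matching
    {"id": "cat_1", "score": 0.91},  # from partial matching, same image
    {"id": "cat_3", "score": 0.70},  # from partial matching
]

merged = merge_results(results, limit=2)
print([(r["id"], r["score"]) for r in merged])
```

cat_1 appears once with its higher score of 0.91, followed by cat_3 at 0.70; cat_2 falls outside the limit.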

The Configuration: Tuning for Perfection

The system is highly configurable through the config file:

EMBEDDING_MODEL = "ViT-B-32"
EMBEDDING_DIMENSION = 512
EMBEDDING_BATCH_SIZE = 64

SLIDING_WINDOW_SIZES = [(224, 224), (448, 448)]
SLIDING_WINDOW_STRIDE = 56
MIN_WINDOW_DIMENSION = 112

PARTIAL_MATCH_THRESHOLD = 0.5
FULL_MATCH_THRESHOLD = 0.5

The sliding window configuration is particularly important. The 224x224 window matches CLIP's native input size, while the 448x448 window captures larger contextual regions. With a stride of 56 pixels, adjacent 224-pixel windows overlap by 75% (168 of 224 pixels) and adjacent 448-pixel windows by 87.5% (392 of 448 pixels), so features at window boundaries are unlikely to be missed.
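The overlap between adjacent windows follows directly from stride and window size: overlap = 1 - stride / window. A quick check, using a helper name of my own choosing:

```python
def window_overlap(window: int, stride: int) -> float:
    """Fraction of a window shared with its neighbor along one axis."""
    return 1.0 - stride / window


# Overlap along one axis for each configured window size at stride 56.
overlaps = {window: window_overlap(window, 56) for window in (224, 448)}
print(overlaps)
```

The larger window overlaps more at the same stride, which means it also produces more redundant embeddings per image. A stride that scales with window size would be one way to even that out.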

Real-World Performance and Scalability

In practice, this system scales beautifully. I've tested it with databases containing over a million cat images, and search performance remains consistently fast. The secret is in the combination of:

Efficient vector representations: 512 dimensions balances expressiveness against storage and compute cost
Optimized indexing: HNSW with carefully tuned parameters
Smart batching: GPU-optimized batch processing for both indexing and inference
Memory management: Careful CUDA memory management to avoid OOM errors

The Flask API: Making It All Accessible

The Flask API provides a clean interface for both image upload and URL-based searches:

@app.route('/api/search/by-url', methods=['POST'])
def search_by_url():
    temp_path = asyncio.run(download_image(url))
    processed_img = processor.process_image(
        temp_path,
        resize=True,
        remove_text=remove_text,
        enhance=enhance_image
    )

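One caveat: asyncio.run blocks the request's worker thread until the download finishes, so this handler is still synchronous from Flask's point of view. Writing download_image as a coroutine lets it use an async HTTP client, but concurrency across requests comes from running multiple workers or threads, not from this call.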

Future Improvements and Extensions

This architecture is incredibly extensible. Some ideas for future enhancements:

Similarity clustering: Automatically grouping similar cats for exploration
Real-time indexing: Stream processing for continuous index updates

Conclusion: The Beauty of Semantic Understanding

The real beauty is in how cosine similarity captures the essence of visual similarity. When you upload a picture of a fluffy orange tabby, the system doesn't just look for other orange cats - it understands "fluffiness," "cat-ness," and "orange-ness" as semantic concepts and finds images that share these high-level features.

This represents the future of search: not matching pixels, but understanding meaning. The fact that we can build this kind of semantic understanding with a few hundred lines of Python code and some clever mathematics is genuinely beautiful.

Whether you're building your own semantic search system or just curious about how modern AI works under the hood, this deep dive should give you insights into the beautiful complexity of vector similarity search.

The code is open source on GitHub.