The power of semantic cosine similarity
2025-08-07 • #python, #ai, #vector, #semantic
Building a Cat Search Engine with Semantic Cosine Similarity: A Deep Dive into CLIP and Vector Databases
I've built a cat search engine that goes beyond simple pixel matching - it actually understands the semantic meaning of images using OpenAI's CLIP encoder and cosine similarity. This system can find similar cats even when they're in completely different poses, lighting conditions, or contexts because it works with high-level semantic concepts rather than raw pixel data.
The Architecture: A Symphony of Vectors and Similarity
The system is built around three core components that work together:
- CLIP Encoder: Transforms images into high-dimensional vectors that capture semantic meaning
- Qdrant Vector Database: Stores and searches through millions of image embeddings using cosine similarity
- Dual Matching System: Full image matching and sliding window partial matching for maximum flexibility
The magic happens when we convert images into 512-dimensional vectors using CLIP's ViT-B-32 model. These aren't just random numbers - they're semantic representations that cluster similar concepts together in vector space. When two cat images are semantically similar, their vectors will be close together, giving us a high cosine similarity score.
CLIP Encoder: The Brain of the Operation
The heart of our semantic understanding lies in the CLIP encoder implementation. Here's how this works:
class ClipEncoder:
    def __init__(self,
                 model_name: str = None,
                 device: str = None,
                 use_gpu: bool = None,
                 batch_size: int = None,
                 **kwargs):
        global _GLOBAL_ENCODER_INSTANCE
        if _GLOBAL_ENCODER_INSTANCE is not None:
            print("DEBUG: Reusing existing CLIP encoder instance for consistent embeddings")
            self.model = _GLOBAL_ENCODER_INSTANCE.model
            self.preprocess = _GLOBAL_ENCODER_INSTANCE.preprocess
            self.device = _GLOBAL_ENCODER_INSTANCE.device
            self._embedding_dim = _GLOBAL_ENCODER_INSTANCE._embedding_dim
            self.batch_size = _GLOBAL_ENCODER_INSTANCE.batch_size
            self.use_gpu = _GLOBAL_ENCODER_INSTANCE.use_gpu
            return
This global instance pattern is clever engineering - every new ClipEncoder reuses the model that is already loaded, so we never hold multiple CLIP models in GPU memory, which would be wasteful since VRAM is limited and expensive. The model uses the ViT-B-32 architecture with OpenAI's pretrained weights, giving us 512-dimensional embeddings that are well suited to semantic similarity tasks.
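The same "load once, reuse everywhere" idea can be sketched with the standard library alone. This is a minimal illustration, not the project's code: DummyEncoder and get_encoder are hypothetical names, and the dummy class only counts constructions instead of loading a real model.

```python
from functools import lru_cache


class DummyEncoder:
    """Hypothetical stand-in for an expensive model; counts constructions."""

    instances_created = 0

    def __init__(self, model_name: str) -> None:
        DummyEncoder.instances_created += 1
        self.model_name = model_name


@lru_cache(maxsize=None)
def get_encoder(model_name: str) -> DummyEncoder:
    """Build the encoder once per model name and hand back the same object afterwards."""
    return DummyEncoder(model_name)


first = get_encoder("ViT-B-32")
second = get_encoder("ViT-B-32")
print(first is second)                 # True: both names refer to one object
print(DummyEncoder.instances_created)  # 1: the constructor ran only once
```

Compared with copying attributes from a global instance, a cached factory keeps the sharing logic out of the constructor, at the cost of callers having to go through get_encoder.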
The preprocessing pipeline handles the image normalization:
self.model, _, self.preprocess = open_clip.create_model_and_transforms(
    model_name=model_name,
    pretrained="openai",
    device=self.device
)
This creates a preprocessing pipeline that normalizes images to CLIP's expected input format: 224x224 pixels with specific mean and standard deviation values that match the training data. CLIP was trained on 400 million image-text pairs, so it has an incredibly rich understanding of visual concepts.
The Encoding Process: From Pixels to Vectors
The encoding process is optimized for both CPU and GPU inference with automatic mixed precision:
def encode(self, images: List[Union[str, Path, PIL.Image.Image]]) -> np.ndarray:
    dataset = ImageDataset(images, self.preprocess)
    batch_size = self.batch_size
    if self.device == 'cuda':
        # Prefer the GPU-specific batch size when the config defines one
        batch_size = getattr(config, 'GPU_INFERENCE_BATCH_SIZE', batch_size)
    # Worker processes are only used on CUDA and on non-Windows hosts
    use_workers = self.device == 'cuda' and os.name != 'nt'
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=2 if use_workers else 0,
        pin_memory=self.device == 'cuda',
        persistent_workers=use_workers
    )
The GPU optimizations here are substantial. We're using pinned memory for faster CPU-to-GPU transfers, persistent workers to avoid process spawning overhead, and automatic mixed precision when available. The mixed precision is particularly effective because it uses FP16 for forward passes while keeping FP32 for operations that need higher precision.
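use_amp = (self.device == 'cuda' and
           hasattr(config, 'GPU_MIXED_PRECISION') and
           config.GPU_MIXED_PRECISION and
           hasattr(torch, 'amp'))
if use_amp:
    with torch.amp.autocast(device_type='cuda'):
        batch_embeddings = self.model.visual(batch)
else:
    # Full-precision fallback (this branch is not shown in the original excerpt)
    batch_embeddings = self.model.visual(batch)
# L2-normalize each row so every embedding has unit length
batch_embeddings = batch_embeddings / batch_embeddings.norm(dim=1, keepdim=True)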
The L2 normalization at the end is crucial for cosine similarity. By normalizing our vectors to unit length, cosine similarity becomes equivalent to dot product similarity, which is much faster to compute at scale.
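To make that equivalence concrete, here is a small dependency-free sketch. The helper names (l2_normalize, dot, cosine_similarity) and the toy 3-dimensional vectors are illustrative assumptions; the real system does this on 512-dimensional tensors.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Full cosine formula: dot product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

full = cosine_similarity(a, b)                 # divides by the norms on every call
fast = dot(l2_normalize(a), l2_normalize(b))   # norms were paid for once, up front
print(full, fast)  # both are 0.96 up to floating-point rounding
```

Normalizing once at indexing time means every later comparison is a bare dot product, which is the part that gets repeated millions of times during search.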
Dual Matching: Full Image vs Sliding Windows
The system implements two different matching strategies that complement each other:
Full Image Matching
The FullImageMatcher encodes entire images as single vectors. This is perfect for finding images that are semantically similar as a whole:
class FullImageMatcher(ImageMatcher):
    def match(self,
              query_image: Union[str, Path, PIL.Image.Image],
              limit: int = 10,
              filter: Dict[str, Any] = None) -> List[Dict[str, Any]]:
        query_embedding = self.encoder.encode([query_image])[0]
        print(f"DEBUG: Generated embedding for full image match, shape: {query_embedding.shape}")
        results = self.index.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=limit,
            filter=filter
        )
Partial Image Matching with Sliding Windows
The PartialImageMatcher implements a sliding window approach that can find partial matches within larger images:
def index_image(self,
                image_path: Union[str, Path],
                image_id: str = None,
                metadata: Dict[str, Any] = None) -> str:
    window_embeddings = self.encoder.encode_sliding_windows(
        image=image_path,
        window_sizes=self.window_sizes,
        stride=self.stride
    )
    for window_pos, embedding in window_embeddings.items():
        window_id = str(uuid.uuid4())
        # metadata defaults to None, so fall back to an empty dict before copying
        window_metadata = dict(metadata or {})
        window_metadata["window_position"] = {
            "x": window_pos[0],
            "y": window_pos[1],
            "width": window_pos[2],
            "height": window_pos[3]
        }
This creates multiple embeddings per image, each representing a different region. The sliding window sizes are configured as [(224, 224), (448, 448)] with a stride of 56 pixels. This means we're capturing both fine-grained details and larger contextual regions.
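To get a feel for how many embeddings this produces, here is a minimal sketch of enumerating window positions. The function window_positions is hypothetical, and it assumes windows that do not fit inside the image are skipped with no padding; the project's encode_sliding_windows may treat edges differently.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)


def window_positions(image_width: int,
                     image_height: int,
                     window_sizes: List[Tuple[int, int]],
                     stride: int) -> List[Window]:
    """List every window that fits fully inside the image, stepping by stride."""
    positions: List[Window] = []
    for win_w, win_h in window_sizes:
        if win_w > image_width or win_h > image_height:
            continue  # assumption: windows larger than the image are skipped
        for y in range(0, image_height - win_h + 1, stride):
            for x in range(0, image_width - win_w + 1, stride):
                positions.append((x, y, win_w, win_h))
    return positions


windows = window_positions(448, 448, [(224, 224), (448, 448)], stride=56)
print(len(windows))  # 26: a 5x5 grid of 224px windows plus one 448px window
```

Even a modest 448x448 image yields 26 vectors under these assumptions, which is why the partial collection grows much faster than the full-image one.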
Qdrant: The Vector Database Powerhouse
Qdrant is the vector database handling our similarity searches. It's built for high-dimensional vector similarity search, and we configure the collection to use cosine distance:
def create_collection(self, collection_name: str, vector_size: int, **kwargs) -> bool:
    vectors_config = rest.VectorParams(
        size=vector_size,
        distance=rest.Distance.COSINE,
        hnsw_config=rest.HnswConfigDiff(
            m=32,
            ef_construct=256,
            full_scan_threshold=16384
        )
    )
The HNSW (Hierarchical Navigable Small World) algorithm is what makes this blazingly fast. It builds a multi-layer graph that supports approximate nearest neighbor search in roughly logarithmic time. The m=32 parameter sets how many links each node keeps to its neighbors, while ef_construct=256 sets how many candidate neighbors are considered when a vector is inserted during index construction. Higher values of either improve recall at the cost of memory and build time.
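For contrast, this is the exact search that HNSW approximates: score every stored vector against the query and keep the best k. It is a dependency-free sketch with a hypothetical exact_top_k function and made-up 2-dimensional unit vectors, not Qdrant's implementation.

```python
import heapq
from typing import Dict, List, Tuple


def exact_top_k(query: List[float],
                vectors: Dict[str, List[float]],
                k: int) -> List[Tuple[str, float]]:
    """Linear scan: dot product against every stored unit vector, keep the k best."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), vec_id)
        for vec_id, vec in vectors.items()
    )
    best = heapq.nlargest(k, scored)
    return [(vec_id, score) for score, vec_id in best]


index = {
    "tabby": [1.0, 0.0],
    "siamese": [0.6, 0.8],
    "teapot": [0.0, 1.0],
}
print(exact_top_k([1.0, 0.0], index, k=2))  # [('tabby', 1.0), ('siamese', 0.6)]
```

The scan touches every vector on every query, so its cost grows linearly with the collection. The graph index avoids that by visiting only a small neighborhood, accepting that it may occasionally miss a true nearest neighbor.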
The Search Process: From Query to Results
When a search query comes in through the Flask API, here's what happens:
@app.route('/api/search/by-image', methods=['POST'])
def search_by_image():
    processed_img = processor.process_image(
        temp_path,
        resize=True,
        remove_text=remove_text,
        enhance=enhance_image
    )
    results = []
    if use_full_matching:
        full_results = full_matcher.match(processed_img, limit=limit)
        for result in full_results:
            results.append({
                "id": result["id"],
                "score": result["score"],
                "metadata": result["metadata"]
            })
    if use_partial_matching:
        partial_results = partial_matcher.match(processed_img, limit=limit)
        for result in partial_results:
            results.append({
                "id": result["id"],
                "score": result["score"],
                "metadata": result["metadata"],
                "matched_regions": result.get("matched_regions"),
                "avg_score": result.get("avg_score")
            })
The image preprocessing includes optional text removal and enhancement. The text removal is particularly useful for memes or images with overlaid text that might interfere with the semantic understanding.
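The partial results in the handler carry matched_regions and avg_score fields. One plausible way to derive them is to group window-level hits by their parent image, sketched below. The function group_window_hits and the "image_id" metadata key are assumptions made for illustration; only "window_position" appears in the indexing code above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_window_hits(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse per-window hits into one result per parent image."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for hit in hits:
        # "image_id" is a hypothetical key linking a window to its source image
        grouped[hit["metadata"]["image_id"]].append(hit)

    results = []
    for image_id, image_hits in grouped.items():
        scores = [h["score"] for h in image_hits]
        results.append({
            "id": image_id,
            "score": max(scores),                   # best single window
            "avg_score": sum(scores) / len(scores),  # mean over matched windows
            "matched_regions": [h["metadata"]["window_position"] for h in image_hits],
        })
    return sorted(results, key=lambda r: r["score"], reverse=True)


hits = [
    {"score": 0.9, "metadata": {"image_id": "cat_1",
                                "window_position": {"x": 0, "y": 0, "width": 224, "height": 224}}},
    {"score": 0.7, "metadata": {"image_id": "cat_1",
                                "window_position": {"x": 56, "y": 0, "width": 224, "height": 224}}},
    {"score": 0.8, "metadata": {"image_id": "cat_2",
                                "window_position": {"x": 0, "y": 0, "width": 448, "height": 448}}},
]
for row in group_window_hits(hits):
    print(row["id"], row["score"], len(row["matched_regions"]))
```

Ranking by the best window rewards a single strong regional match, while avg_score shows whether the match is consistent across regions.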
Cosine Similarity: The Mathematical Foundation
Cosine similarity is perfect for this use case. Given two normalized vectors a and b, cosine similarity is simply their dot product:
cosine_similarity(a, b) = a · b = Σ(aᵢ × bᵢ)
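Since our vectors are L2-normalized (unit length), this gives us a value between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions. In practice, we rarely see negative values between cat images. Individual CLIP embedding components can be positive or negative, but image embeddings tend to cluster in a fairly narrow cone of the 512-dimensional space, so pairwise similarities are mostly positive.
The beauty of cosine similarity is that it's invariant to vector magnitude - it only cares about direction. The robustness to lighting, scale, and pose comes from CLIP itself: the encoder maps images with the same semantic content to nearby directions, so whether a cat image is bright or dark, large or small, the semantic content is what drives the similarity score.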
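For reference, the general definition and the magnitude-invariance property can be written out as follows. The symbols a, b, and the positive scalar c are just notation for this sketch.

```latex
% General cosine similarity between two non-zero vectors a and b
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}

% Scaling a by any c > 0 leaves the similarity unchanged
\cos(c\,a, b) = \frac{c \, (a \cdot b)}{c \, \lVert a \rVert \, \lVert b \rVert}
              = \cos(a, b)

% With unit-length vectors the denominator is 1
\lVert a \rVert = \lVert b \rVert = 1 \;\Longrightarrow\; \cos(a, b) = a \cdot b
```

The last line is the special case the system relies on after L2 normalization.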
Batch Processing and GPU Optimization
The indexing process is heavily optimized for batch processing:
def batch_index_images(image_paths: List[Path], batch_size: int = 100) -> Dict[str, Any]:
    full_matcher = FullImageMatcher(collection_name="full_cats")
    partial_matcher = PartialImageMatcher(collection_name="partial_cats")
    batches = [image_paths[i:i + batch_size] for i in range(0, len(image_paths), batch_size)]
    for batch_idx, batch in enumerate(tqdm(batches, desc="Processing batches")):
        # batch_paths, batch_ids and batch_metadatas are built from `batch`;
        # that step is omitted from this excerpt.
        full_ids = full_matcher.batch_index_images(
            image_paths=batch_paths,
            image_ids=batch_ids.copy(),
            metadatas=[m.copy() for m in batch_metadatas]
        )
The GPU optimization settings are carefully tuned:
def optimize_gpu_settings():
    if torch.cuda.is_available():
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.deterministic = False
        torch.cuda.set_per_process_memory_fraction(0.85)
        torch.cuda.empty_cache()
We cap this process at 85% of GPU memory to leave headroom for the rest of the system, enable cuDNN benchmarking so it can pick the fastest convolution algorithms for our input shapes (at the cost of strict determinism), and clear the cache to start fresh.
Performance Characteristics
This system is designed for serious performance. With GPU acceleration, we can process around 50-100 images per second during indexing, depending on hardware. The search latency is typically under 100ms for databases with millions of vectors, thanks to Qdrant's HNSW implementation.
The dual collection approach (full_cats and partial_cats) means we run two different search strategies side by side for each query, then merge and deduplicate the results:
unique_results = {}
for result in results:
    if result["id"] not in unique_results or result["score"] > unique_results[result["id"]]["score"]:
        unique_results[result["id"]] = result
final_results = sorted(
    list(unique_results.values()),
    key=lambda x: x["score"],
    reverse=True
)[:limit]
The Configuration: Tuning for Perfection
The system is highly configurable through the config file:
EMBEDDING_MODEL = "ViT-B-32"
EMBEDDING_DIMENSION = 512
EMBEDDING_BATCH_SIZE = 64
SLIDING_WINDOW_SIZES = [(224, 224), (448, 448)]
SLIDING_WINDOW_STRIDE = 56
MIN_WINDOW_DIMENSION = 112
PARTIAL_MATCH_THRESHOLD = 0.5
FULL_MATCH_THRESHOLD = 0.5
The sliding window configuration is particularly important. The 224x224 window matches CLIP's native input size, while the 448x448 window captures larger contextual regions. The stride of 56 pixels means adjacent 224x224 windows overlap by 75% and adjacent 448x448 windows by 87.5%, ensuring we don't miss important features at window boundaries.
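The overlap between neighboring windows follows directly from the stride and the window size. A quick check, using a hypothetical helper window_overlap:

```python
def window_overlap(window_size: int, stride: int) -> float:
    """Fraction of a window shared with its neighbour one stride away along one axis."""
    return max(0.0, 1.0 - stride / window_size)


for size in (224, 448):
    print(size, window_overlap(size, 56))
# 224 0.75
# 448 0.875
```

A smaller stride raises the overlap and the number of windows per image, so it trades index size and indexing time for better coverage of features near window edges.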
Real-World Performance and Scalability
In practice, this system scales beautifully. I've tested it with databases containing over a million cat images, and search performance remains consistently fast. The secret is in the combination of:
- Efficient vector representations: 512 dimensions is the sweet spot between expressiveness and computational efficiency
- Optimized indexing: HNSW with carefully tuned parameters
- Smart batching: GPU-optimized batch processing for both indexing and inference
- Memory management: Careful CUDA memory management to avoid OOM errors
The Flask API: Making It All Accessible
The Flask API provides a clean interface for both image upload and URL-based searches:
@app.route('/api/search/by-url', methods=['POST'])
def search_by_url():
    temp_path = asyncio.run(download_image(url))
    processed_img = processor.process_image(
        temp_path,
        resize=True,
        remove_text=remove_text,
        enhance=enhance_image
    )
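The download helper is a coroutine, so the synchronous Flask view drives it with asyncio.run(). That call returns only once the image has been fetched, so the request itself still waits on the download. Other requests are served in the meantime only if the server runs multiple threads or workers.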
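Where a coroutine-based downloader does pay off is when several downloads can overlap. The sketch below simulates that with asyncio.sleep. Both fake_download and download_all are hypothetical stand-ins, and no real network request is made.

```python
import asyncio
from typing import List


async def fake_download(url: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a real download; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"/tmp/{url.rsplit('/', 1)[-1]}"


async def download_all(urls: List[str]) -> List[str]:
    """Start every download at once and wait until all of them have finished."""
    return await asyncio.gather(*(fake_download(u) for u in urls))


urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
paths = asyncio.run(download_all(urls))  # blocks the caller until both are done
print(paths)  # ['/tmp/a.jpg', '/tmp/b.jpg']
```

The two simulated downloads wait concurrently, so the total time is close to one delay rather than two. The asyncio.run() call still blocks its caller until everything has finished.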
Future Improvements and Extensions
This architecture is incredibly extensible. Some ideas for future enhancements:
- Similarity clustering: Automatically grouping similar cats for exploration
- Real-time indexing: Stream processing for continuous index updates
Conclusion: The Beauty of Semantic Understanding
The real beauty is in how cosine similarity captures the essence of visual similarity. When you upload a picture of a fluffy orange tabby, the system doesn't just look for other orange cats - it understands "fluffiness," "cat-ness," and "orange-ness" as semantic concepts and finds images that share these high-level features.
This represents the future of search: not matching pixels, but understanding meaning. The fact that we can build this kind of semantic understanding with a few hundred lines of Python code and some clever mathematics is genuinely beautiful.
Whether you're building your own semantic search system or just curious about how modern AI works under the hood, this deep dive should give you insights into the beautiful complexity of vector similarity search.
The code is open source on GitHub.