Multi-Modal Spaces

Every modern AI model, whether it processes images or text, lives in a world of Embeddings.

A Simple View of CNNs or Transformers

Vision Models

Consider a Convolutional Neural Network (CNN) or a Vision Transformer of the kind commonly used for image classification. Conceptually, you can split it into two distinct functional units:

  • The Feature Extractor (The Encoder): Consider images of dogs: they come in different breeds, sizes, and colors. They can be photographed from different angles, in different lighting, or even obscured by noise and clutter. However, they all correspond to a single concept, “dog,” despite large variance in their pixels and how those pixels are arranged.

    The goal of a robust Encoder is to ignore these superficial variations and focus on the semantic content. It must map a blurry, top-down photo of a Golden Retriever and a high-resolution side profile of a Husky to high-dimensional vectors that, when viewed in a geometric space (the embedding space), cluster together into groups representing semantic concepts. In essence, the encoder performs a mathematical “distillation,” casting semantically equivalent inputs into the same cluster.

    In the chart below, see how the “Dogs” cluster stays together despite being “Noisy” (spread out), while the “Vehicles” cluster remains distinct. This demonstrates how the encoder handles variance while maintaining semantic grouping:

{
  "data": [
    {
      "x": [1.1, 1.5, 0.8, 1.2, 1.4, 0.9, 1.3, 1.1],
      "y": [5.1, 5.5, 4.9, 5.2, 5.6, 4.8, 5.3, 5.0],
      "mode": "markers",
      "type": "scatter",
      "name": "Images of Dogs",
      "marker": { "color": "rgb(200, 16, 46)", "size": 10 }
    },
    {
      "x": [4.1, 4.5, 3.8, 4.2, 4.4, 3.9, 4.3, 4.1],
      "y": [1.1, 1.5, 0.9, 1.2, 1.6, 0.8, 1.3, 1.0],
      "mode": "markers",
      "type": "scatter",
      "name": "Images of Vehicles",
      "marker": { "color": "rgb(0, 123, 255)", "size": 10 }
    }
  ],
  "layout": {
    "title": "Robust Semantic Clustering in 2D",
    "xaxis": { "title": "Feature Dimension 1", "range": [0, 6] },
    "yaxis": { "title": "Feature Dimension 2", "range": [0, 7] },
    "shapes": [
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 0.6, "y0": 4.5, "x1": 1.7, "y1": 5.8,
        "line": { "color": "rgb(200, 16, 46)", "dash": "dash" },
        "opacity": 0.3
      },
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 3.6, "y0": 0.6, "x1": 4.7, "y1": 1.8,
        "line": { "color": "rgb(0, 123, 255)", "dash": "dash" },
        "opacity": 0.3
      }
    ],
    "annotations": [
      {
        "x": 1.1, "y": 6.2, "text": "Canine Neighborhood", "showarrow": false, "font": { "color": "rgb(200, 16, 46)" }
      },
      {
        "x": 4.1, "y": 2.1, "text": "Vehicle Neighborhood", "showarrow": false, "font": { "color": "rgb(0, 123, 255)" }
      }
    ]
  }
}
  • The Classifier (The Head): The Classifier is the “Decision Maker.” It takes the high-dimensional embedding and computes the probability that it belongs to a particular class (e.g., Dog vs. Car).

    However, there is a fundamental rule in machine learning: the Classifier can only be as good as the embedding features it is given. If the Encoder is weak, the “Canine” and “Vehicle” embeddings will overlap in a messy, inseparable cloud, and no matter how complex your Classifier is, it will struggle to draw a boundary between them. But if the Encoder is robust, it produces distinct semantic clusters that even a simple Classifier can separate with good accuracy.
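
To make the split concrete, here is a minimal PyTorch sketch that separates a network into an Encoder and a Classifier head. It assumes torchvision’s pretrained resnet18 purely as an example backbone; the same idea applies to any vision model.

import torch
import torch.nn as nn
from torchvision import models

# Encoder: a pretrained ResNet-18 with its original classification head removed.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()            # output is now a 512-D embedding
backbone.eval()

# Classifier (the Head): a simple linear decision maker, e.g. dog vs. car.
classifier = nn.Linear(512, 2)

images = torch.randn(4, 3, 224, 224)   # a dummy batch of 4 RGB images
with torch.no_grad():
    embeddings = backbone(images)      # (4, 512): the Encoder's semantic coordinates
    logits = classifier(embeddings)    # (4, 2): raw class scores
    probs = logits.softmax(dim=-1)     # class probabilities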

Embedding Spaces for Retrieval

Once we have a robust Embedding Space, we can do more than just classify images. We can use the geometry of the space for Matching and Retrieval.

In a retrieval system, we don’t ask the model “What object is in this image?” Instead, we ask, “What other images contain this object?” To achieve this, we take a “Query Image,” find its coordinate, and look for its “Nearest Neighbors” in the database. This allows us to find visually and semantically similar items even if they don’t have a label.

Imagine you are building a search engine for images in a database. A user uploads a query photo of a German Shepherd they saw on the street.

  • The Extraction: Your Encoder processes the photo and generates a unique embedding vector—a specific coordinate in your “Canine” neighborhood.

  • The Search: The system calculates the mathematical distance (often using Cosine Similarity) between the query’s coordinate and every other image embedding in your database.

  • The Result: The images with the “closest” coordinates are returned. Because your space is semantically robust, the top results won’t be vehicles, or just any dogs; they will be German Shepherds with similar frame geometries, even if those specific attributes were never manually tagged by a human.
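
The following toy sketch walks through these three steps with NumPy. The file names and 2-D embeddings are made up for illustration; in practice the vectors would come from the image Encoder and have hundreds of dimensions.

import numpy as np

# Hypothetical database of image embeddings (2-D here only so the numbers
# match the chart below; real embeddings are much higher dimensional).
database = {
    "gs_alpha.jpg": np.array([1.5, 5.2]),
    "gs_beta.jpg":  np.array([1.6, 5.3]),
    "gs_gamma.jpg": np.array([1.4, 5.1]),
    "truck.jpg":    np.array([4.0, 1.5]),
}
query = np.array([1.55, 5.25])  # embedding of the user's German Shepherd photo

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank every database image by similarity to the query (highest first).
ranked = sorted(database.items(),
                key=lambda item: cosine_similarity(query, item[1]),
                reverse=True)
for name, emb in ranked:
    print(name, round(cosine_similarity(query, emb), 3))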

{
  "data": [
    {
      "x": [1.5, 1.6, 1.4, 4.0],
      "y": [5.2, 5.3, 5.1, 1.5],
      "mode": "markers+text",
      "type": "scatter",
      "name": "Database Images",
      "text": ["GS Alpha", "GS Beta", "GS Gamma", "Truck"],
      "textposition": "bottom center",
      "marker": { "color": "rgba(0, 0, 0, 0.3)", "size": 10 }
    },
    {
      "x": [1.55],
      "y": [5.25],
      "mode": "markers+text",
      "type": "scatter",
      "name": "Query: German Shepherd",
      "text": ["User Query"],
      "textposition": "top center",
      "marker": { "color": "rgb(200, 16, 46)", "size": 14}
    }
  ],
  "layout": {
    "title": "Retrieval: Finding a German Shepherd",
    "xaxis": { "title": "Feature Dimension 1", "range": [0, 5] },
    "yaxis": { "title": "Feature Dimension 2", "range": [0, 7] },
    "shapes": [
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 1.1, "y0": 4.7, "x1": 1.9, "y1": 5.8,
        "line": { "color": "rgb(200, 16, 46)", "dash": "dot" },
        "label": { "text": "Canine Neighborhood", "font": { "size": 10 } }
      }
    ]
  }
}

Textual Models

While Vision models were learning to group pixels, Natural Language Processing (NLP) was undergoing a similar revolution. Instead of looking at pixels, models like BERT or GPT-style Transformers look at textual corpora to learn relationships between words, sentences, and contexts.

The goal remains the same: to learn an embedding space where semantically similar words/sentences are grouped together. For example,

  • The embeddings corresponding to “Dog” and “Puppy” should be close to each other, and far from an unrelated concept such as “Truck.”
  • “Canine” and “German Shepherd” should fall into the same “neighborhood,” even if they don’t share a single letter in common.

This allows for Semantic Search. If you search a database for the word “canine,” a robust text space knows to retrieve documents containing “dog,” because the model has learned that these concepts live in the same geometric region of the embedding map.
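
Here is a short sketch of such a semantic search. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint purely as an example text encoder; any sentence-embedding model would behave the same way.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Adopting a dog: what new owners should know",
    "Puppy training tips for the first month",
    "How to choose the right pickup truck",
]
query = "canine care"

doc_emb = model.encode(documents)   # one embedding vector per document
q_emb = model.encode([query])[0]    # embedding of the query string

# Cosine similarity between the query and each document.
scores = doc_emb @ q_emb / (np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(q_emb))
best = int(np.argmax(scores))
print(documents[best])  # returns a dog-related document despite no shared keywords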

{
  "data": [
    {
      "x": [4.2, 4.4, 4.0, 1.2],
      "y": [1.2, 1.5, 0.9, 5.2],
      "mode": "markers+text",
      "type": "scatter",
      "name": "Text Embeddings",
      "text": ["Dog", "Puppy", "Canine", "Truck"],
      "textposition": "top center",
      "marker": { "color": "rgb(128, 0, 128)", "size": 12, "symbol": "star"  }
    }
  ],
  "layout": {
    "title": "The Semantic Text Space (Unaligned)",
    "xaxis": { "title": "Language Dimension 1", "range": [0, 5] },
    "yaxis": { "title": "Language Dimension 2", "range": [0, 7] },
    "shapes": [
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 3.7, "y0": 0.6, "x1": 4.7, "y1": 1.8,
        "line": { "color": "rgb(128, 0, 128)", "dash": "dot" }
      }
    ],
    "annotations": [
      {
        "x": 4.2, "y": 2.2, "text": "Linguistic 'Canine' Cluster", "showarrow": false, "font": { "color": "rgb(128, 0, 128)" }
      }
    ]
  }
}

Multi-Modal Spaces

At this point, we have built two incredibly powerful geometric spaces: a Visual Space where images are organized by their semantic content, and a Textual Space where words/sentences with the same semantic meaning are grouped together.

The Cross-Modal Search Problem

We already know how to use these spaces individually:

  • We can use an image of a dog to search for other images of dogs in the Visual embedding space (Image-to-Image).

  • Similarly, we can use keywords to search for related documents or sentences in the Textual space (Text-to-Text).

🔍 What if we want to search for all images of a dog by specifying a keyword such as “German Shepherd”?

Multimodal Spaces

To solve the cross-modal search problem, we need a joint space where image and text embeddings can live in harmony: a unified geometric space where an image of a German Shepherd and the text string “German Shepherd” are mapped to the same (or very similar) coordinates.

{
  "data": [
    {
      "x": [1.2, 1.4, 0.9],
      "y": [5.2, 5.5, 4.9],
      "mode": "markers",
      "type": "scatter",
      "name": "Image Embeddings (Dogs)",
      "marker": { "color": "rgb(200, 16, 46)", "size": 12, "symbol": "circle" }
    },
    {
      "x": [1.25, 1.45, 0.85],
      "y": [5.15, 5.45, 4.85],
      "mode": "markers",
      "type": "scatter",
      "name": "Text Embeddings (Dogs)",
      "marker": { "color": "rgb(128, 0, 128)", "size": 14, "symbol": "star" }
    },
    {
      "x": [4.2, 4.4, 4.0],
      "y": [1.2, 1.5, 0.9],
      "mode": "markers",
      "type": "scatter",
      "name": "Image Embeddings (Vehicles)",
      "marker": { "color": "rgb(0, 123, 255)", "size": 12, "symbol": "circle" }
    },
    {
      "x": [4.25, 4.45, 4.05],
      "y": [1.15, 1.45, 0.85],
      "mode": "markers",
      "type": "scatter",
      "name": "Text Embeddings (Vehicles)",
      "marker": { "color": "rgb(0, 128, 128)", "size": 14, "symbol": "star" }
    }
  ],
  "layout": {
    "title": "A Harmonious Multimodal Space",
    "xaxis": { "title": "Unified Dimension 1", "range": [0, 6] },
    "yaxis": { "title": "Unified Dimension 2", "range": [0, 7] },
    "legend": { "orientation": "h", "y": -0.2 }
  }
}
Challenges to Achieving Harmony
  • The Dimensionality Mismatch: Vision and text models are often trained independently using different architectures. This means their “outputs” (the embedding vectors) rarely share the same shape.
    • A ResNet or Vision Transformer might produce a 512-dimensional vector.
    • A BERT or GPT-based text encoder might produce a 768- or 1024-dimensional vector.

    Mathematically, you cannot compare a 512-D point to a 768-D point; they don’t even exist in the same space. (A common fix, learned projection heads, is sketched after this list.)
  • The Semantic Disconnect: Even if we force both models to output the same number of dimensions, there is no reason for independently trained models to align.
    • The Vision model might learn that the top-left region of its space corresponds to dogs.
    • The Text model might learn that the bottom-right region of its space corresponds to dogs.

    Without a shared training objective, an image of a German Shepherd and the word “German Shepherd” will be scattered in completely different parts of their respective maps.

  • Scaling & Generalization: Even if we solve the dimensionality and alignment issues, training on small, curated datasets (like ImageNet) often fails to produce a truly “generalized” embedding space. To learn a space where any text can map to any image, we need hundreds of millions of diverse image-text pairs. Furthermore, aligning these “web-scale” datasets requires massive computational resources to process the billions of contrastive relationships during training.
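
As a concrete illustration of the first two challenges, here is a minimal PyTorch sketch of learned projection heads that map a 512-D image embedding and a 768-D text embedding into a shared 256-D space (all three dimensions are illustrative assumptions). Matching the shapes is the easy part; without a shared training objective, such as a contrastive loss, the projected spaces still will not align semantically.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Projection heads: map each encoder's native output into a shared 256-D space.
image_proj = nn.Linear(512, 256)   # vision encoder output -> shared space
text_proj = nn.Linear(768, 256)    # text encoder output   -> shared space

image_emb = torch.randn(1, 512)    # stand-in for a vision encoder's output
text_emb = torch.randn(1, 768)     # stand-in for a text encoder's output

z_img = F.normalize(image_proj(image_emb), dim=-1)
z_txt = F.normalize(text_proj(text_emb), dim=-1)

# The two vectors now live in the same space and can be compared directly,
# but with randomly initialized (untrained) projections the similarity is
# meaningless; a shared training objective is what creates the alignment.
similarity = (z_img * z_txt).sum(dim=-1)
print(similarity.item())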

CLIP (Contrastive Language-Image Pre-training)

CLIP is a foundational model from OpenAI that addresses these challenges with elegance. By leveraging Contrastive Learning, it bridges the gap between vision and language to build a truly robust multi-modal embedding space.
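
To give a flavor of what this looks like in practice, here is a hedged sketch of the cross-modal search from earlier, assuming the Hugging Face transformers implementation of CLIP and the openai/clip-vit-base-patch32 checkpoint; the image file names are hypothetical placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["photo1.jpg", "photo2.jpg"]]  # hypothetical files
query = "a photo of a German Shepherd"

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**img_inputs)  # image coordinates
    txt_emb = model.get_text_features(**txt_inputs)   # text coordinate, same space

# Cosine similarity between the text query and every image in the "database".
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (img_emb @ txt_emb.T).squeeze()
print(scores)  # higher score = better match for the keyword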

In our next post, we will look at exactly how CLIP’s architecture works and why “Contrastive” is the secret sauce for its success.



