<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://pmantini.github.io/profile/feed.xml" rel="self" type="application/atom+xml"/><link href="https://pmantini.github.io/profile/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-05T19:03:37+00:00</updated><id>https://pmantini.github.io/profile/feed.xml</id><title type="html">Pranav Mantini</title><subtitle>Senior Researcher and Lecturer at the University of Houston. Specializing in Computer Vision and Machine Learning. </subtitle><entry><title type="html">Multi-Modal Spaces</title><link href="https://pmantini.github.io/profile/blog/2026/clip/" rel="alternate" type="text/html" title="Multi-Modal Spaces"/><published>2026-04-03T14:24:00+00:00</published><updated>2026-04-03T14:24:00+00:00</updated><id>https://pmantini.github.io/profile/blog/2026/clip</id><content type="html" xml:base="https://pmantini.github.io/profile/blog/2026/clip/"><![CDATA[<ul id="markdown-toc"> <li><a href="#a-simple-view-of-cnns-or-transformers" id="markdown-toc-a-simple-view-of-cnns-or-transformers">A Simple View of CNNs or Transformers</a> <ul> <li><a href="#vision-models" id="markdown-toc-vision-models">Vision Models</a> <ul> <li><a href="#embedding-spaces-for-retrieval" id="markdown-toc-embedding-spaces-for-retrieval">Embedding Spaces for Retrieval</a> <ul> <li><a href="#example-image-based-search" id="markdown-toc-example-image-based-search">Example: Image-Based Search</a></li> </ul> </li> </ul> </li> <li><a href="#textual-models" id="markdown-toc-textual-models">Textual Models</a></li> </ul> </li> <li><a href="#mult-modal-spaces" id="markdown-toc-mult-modal-spaces">Mult-Modal Spaces</a> <ul> <li><a href="#the-cross-modal-search-problem" id="markdown-toc-the-cross-modal-search-problem">The Cross-Modal Search Problem</a></li> <li><a href="#multimodal-spaces" id="markdown-toc-multimodal-spaces">Multimodal spaces</a> <ul> <li><a href="#challenges-to-achieving-harmony" id="markdown-toc-challenges-to-achieving-harmony">Challenges to Achieving Harmony</a></li> </ul> </li> </ul> </li> <li><a href="#clip-contrastive-language-image-pre-training" id="markdown-toc-clip-contrastive-language-image-pre-training">CLIP (Contrastive Language-Image Pre-training)</a></li> </ul> <p>Every modern AI model, whether it processes images or text, lives in a world of <strong>Embeddings</strong>.</p> <h3 id="a-simple-view-of-cnns-or-transformers">A Simple View of CNNs or Transformers</h3> <h4 id="vision-models">Vision Models</h4> <p>If you consider a Convolutional Neural Network (CNN) or a Vision Transformer that are commonly used for image classification, you can conceptually split it into two distinct functional units:</p> <ul> <li> <p><strong>The Feature Extractor (The Encoder):</strong> Consider images of dogs: they come in different breeds, sizes, and colors. They can be photographed from different angles, in different lighting, or even obscured by noise and clutter. However, they all relate to a specific concept “dog” despite having large variance in pixels and their orgnizations.</p> <p>The goal of a robust Encoder is to ignore these superficial variations and focus on the semantic content. It must map a blurry or a top-down photo of a Golden Retriever, or a high-resolution side profile of a Husky to high-dimensional vectors. 
When viewed in a geometric space (the embedding space), these vectors cluster together into groups representing semantic concepts. The encoder, in essence, performs a mathematical “distillation” that casts semantically equivalent concepts into the same cluster.</p> <p>In the chart below, see how the “Dogs” cluster stays together despite being “Noisy” (spread out), while the “Vehicles” cluster remains distinct. This demonstrates how the encoder handles variance while maintaining semantic grouping:</p> </li> </ul>
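<p>To make this concrete, here is a minimal sketch of an image encoder built from a pretrained CNN. It assumes PyTorch and torchvision are installed and uses ResNet-50 purely as an example backbone; any network that maps an image to a fixed-length vector plays the same role:</p>

<pre><code class="language-python">import torch
from PIL import Image
import torchvision.models as models
from torchvision.models import ResNet50_Weights

# Load a ResNet-50 pretrained on ImageNet and drop its classification head,
# keeping only the feature extractor (the "Encoder").
weights = ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1])
encoder.eval()

preprocess = weights.transforms()  # resize, crop, and normalize as the model expects

# "golden_retriever.jpg" is a placeholder path for any query image.
img = Image.open("golden_retriever.jpg").convert("RGB")
with torch.no_grad():
    embedding = encoder(preprocess(img).unsqueeze(0)).flatten(1)  # shape: (1, 2048)
</code></pre>

<p>Images of the same concept should land near each other in this 2048-dimensional space, which is exactly the clustering behavior sketched in the chart below.</p>

<pre><code class="language-plotly">{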
  "data": [
    {
      "x": [1.1, 1.5, 0.8, 1.2, 1.4, 0.9, 1.3, 1.1],
      "y": [5.1, 5.5, 4.9, 5.2, 5.6, 4.8, 5.3, 5.0],
      "mode": "markers",
      "type": "scatter",
      "name": "Images of Dogs",
      "marker": { "color": "rgb(200, 16, 46)", "size": 10 }
    },
    {
      "x": [4.1, 4.5, 3.8, 4.2, 4.4, 3.9, 4.3, 4.1],
      "y": [1.1, 1.5, 0.9, 1.2, 1.6, 0.8, 1.3, 1.0],
      "mode": "markers",
      "type": "scatter",
      "name": "Images of Vehicles",
      "marker": { "color": "rgb(0, 123, 255)", "size": 10 }
    }
  ],
  "layout": {
    "title": "Robust Semantic Clustering in 2D",
    "xaxis": { "title": "Feature Dimension 1", "range": [0, 6] },
    "yaxis": { "title": "Feature Dimension 2", "range": [0, 7] },
    "shapes": [
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 0.6, "y0": 4.5, "x1": 1.7, "y1": 5.8,
        "line": { "color": "rgb(200, 16, 46)", "dash": "dash" },
        "opacity": 0.3
      },
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 3.6, "y0": 0.6, "x1": 4.7, "y1": 1.8,
        "line": { "color": "rgb(0, 123, 255)", "dash": "dash" },
        "opacity": 0.3
      }
    ],
    "annotations": [
      {
        "x": 1.1, "y": 6.2, "text": "Canine Neighborhood", "showarrow": false, "font": { "color": "rgb(200, 16, 46)" }
      },
      {
        "x": 4.1, "y": 2.1, "text": "Vehicle Neighborhood", "showarrow": false, "font": { "color": "rgb(0, 123, 255)" }
      }
    ]
  }
}
</code></pre> <ul> <li> <p><strong>The Classifier (The Head):</strong> The Classifier is the “Decision Maker.” It takes the high-dimensional embedding and computes the probability that it belongs to a particular class (e.g., Dog vs. Car).</p> <p>However, there is a fundamental rule in machine learning: the Classifier can only be as good as the robustness of the embedding features. If the Encoder is weak, the “Canine” and “Vehicle” embeddings will overlap in a messy, inseparable cloud. No matter how complex your Classifier is, it will struggle to draw a boundary between them. But if the Encoder is robust, it creates distinct “Semantic clusters” that allow for good accuracy.</p> </li> </ul> <h5 id="embedding-spaces-for-retrieval">Embedding Spaces for Retrieval</h5> <p>Once we have a robust Embedding Space, we can do more than just classify images. We can use the geometry of the space for Matching and Retrieval.</p> <p>In a retrieval system, we don’t ask the model “What object is in this image?” Instead, we ask, “What other images contain this object?” To achieve this, we take a “Query Image,” find its coordinate, and look for its “Nearest Neighbors” in the database. This allows us to find visually and semantically similar items even if they don’t have a label.</p> <h6 id="example-image-based-search">Example: Image-Based Search</h6> <p>Imagine you are building a search engine for images in a database. A user uploads a query photo of a German Shepherd they saw on the street.</p> <ul> <li> <p>The Extraction: Your Encoder processes the photo and generates a unique embedding vector—a specific coordinate in your “Canine” neighborhood.</p> </li> <li> <p>The Search: The system calculates the mathematical distance (often using Cosine Similarity) between the query’s coordinate and every other image embedding in your database.</p> </li> <li> <p>The Result: The images with the “closest” coordinates are returned. Because your space is semantically robust, the top results won’t be vehicles or just any dog; they will be German Shepherds with similar frame geometries, even if those specific attributes were never manually tagged by a human.</p> </li> </ul>
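<p>A minimal sketch of that search step is shown below. It assumes the database and query embeddings have already been produced by an encoder (such as the ResNet-50 sketch earlier); the file names and the choice of NumPy are illustrative:</p>

<pre><code class="language-python">import numpy as np

def cosine_similarity(query, database):
    """Cosine similarity between one query vector and every row of a database matrix."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q

# Hypothetical precomputed files: (N, 2048) database embeddings and a (2048,) query embedding.
db_embeddings = np.load("db_embeddings.npy")
query_embedding = np.load("query_embedding.npy")

scores = cosine_similarity(query_embedding, db_embeddings)
top_k = np.argsort(-scores)[:5]  # indices of the 5 nearest neighbors
print(top_k, scores[top_k])
</code></pre>

<pre><code class="language-plotly">{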
  "data": [
    {
      "x": [1.5, 1.6, 1.4, 4.0],
      "y": [5.2, 5.3, 5.1, 1.5],
      "mode": "markers+text",
      "type": "scatter",
      "name": "Database Images",
      "text": ["GS Alpha", "GS Beta", "GS Gamma", "Truck"],
      "textposition": "bottom center",
      "marker": { "color": "rgba(0, 0, 0, 0.3)", "size": 10 }
    },
    {
      "x": [1.55],
      "y": [5.25],
      "mode": "markers+text",
      "type": "scatter",
      "name": "Query: German Shepherd",
      "text": ["User Query"],
      "textposition": "top center",
      "marker": { "color": "rgb(200, 16, 46)", "size": 14}
    }
  ],
  "layout": {
    "title": "Retrieval: Finding a German Shepherd",
    "xaxis": { "title": "Feature Dimension 1", "range": [0, 5] },
    "yaxis": { "title": "Feature Dimension 2", "range": [0, 7] },
    "shapes": [
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 1.1, "y0": 4.7, "x1": 1.9, "y1": 5.8,
        "line": { "color": "rgb(200, 16, 46)", "dash": "dot" },
        "label": { "text": "Canine Neighborhood", "font": { "size": 10 } }
      }
    ]
  }
}
</code></pre> <h4 id="textual-models">Textual Models</h4> <p>While Vision models were learning to group pixels, Natural Language Processing (NLP) was undergoing a similar revolution. Instead of looking at pixels, models like BERT or GPT-style Transformers look at textual corpus to understand relationships between words, sentences, and contexts.</p> <p>The goal remains the same: to learn an embedding space where semantically similar words/sentences are grouped together. For example,</p> <ul> <li>The embeddings correspoding to “Dog” and “Puppy” should be close to each other compared to a non-related concept such as a Truck. (or)</li> <li>“Canine” and “German Shepherd” should fall into the same “neighborhood,” even if they don’t share a single letter in common.</li> </ul> <p>This allows for Semantic Search. If you search a database for the word “canine,” a robust text space knows to retrieve documents containing “dog,” because the model has learned that these concepts live in the same geometric region of the embedding map.</p> <pre><code class="language-plotly">{
  "data": [
    {
      "x": [4.2, 4.4, 4.0, 1.2],
      "y": [1.2, 1.5, 0.9, 5.2],
      "mode": "markers+text",
      "type": "scatter",
      "name": "Text Embeddings",
      "text": ["Dog", "Puppy", "Canine", "Truck"],
      "textposition": "top center",
      "marker": { "color": "rgb(128, 0, 128)", "size": 12, "symbol": "star"  }
    }
  ],
  "layout": {
    "title": "The Semantic Text Space (Unaligned)",
    "xaxis": { "title": "Language Dimension 1", "range": [0, 5] },
    "yaxis": { "title": "Language Dimension 2", "range": [0, 7] },
    "shapes": [
      {
        "type": "circle",
        "xref": "x", "yref": "y",
        "x0": 3.7, "y0": 0.6, "x1": 4.7, "y1": 1.8,
        "line": { "color": "rgb(128, 0, 128)", "dash": "dot" }
      }
    ],
    "annotations": [
      {
        "x": 4.2, "y": 2.2, "text": "Linguistic 'Canine' Cluster", "showarrow": false, "font": { "color": "rgb(128, 0, 128)" }
      }
    ]
  }
}
</code></pre> <h3 id="mult-modal-spaces">Mult-Modal Spaces</h3> <p>At this point, we have built two incredibly powerful geometric spaces: a Visual Space where images are orgnaized based on semantic content and a Textual Space where words/sentences with same semantic meaning are grouped together.</p> <h5 id="the-cross-modal-search-problem">The Cross-Modal Search Problem</h5> <p>We already know how to use these spaces individually:</p> <ul> <li> <p>We can use an image of a dog to search for other images of dogs in the Visual embedding space (Image-to-Image).</p> </li> <li> <p>Similarly, we can use keywords to search for related documents or sentences in the Textual space (Text-to-Text).</p> </li> </ul> <blockquote> <p>🔍What if we want to search for all images of a dog by specifying a keyword such as <strong>“German Shepherd”</strong>?</p> </blockquote> <h4 id="multimodal-spaces">Multimodal spaces</h4> <p>To solve the cross-modal search problem, We will need a joint space where image and text embedding can live in harmony. This is a unified geometric space where an image of a German Shepherd and the text string “German Shepherd” are mapped to the same (or very similar) coordinates.</p> <pre><code class="language-plotly">{
  "data": [
    {
      "x": [1.2, 1.4, 0.9],
      "y": [5.2, 5.5, 4.9],
      "mode": "markers",
      "type": "scatter",
      "name": "Image Embeddings (Dogs)",
      "marker": { "color": "rgb(200, 16, 46)", "size": 12, "symbol": "circle" }
    },
    {
      "x": [1.25, 1.45, 0.85],
      "y": [5.15, 5.45, 4.85],
      "mode": "markers",
      "type": "scatter",
      "name": "Text Embeddings (Dogs)",
      "marker": { "color": "rgb(128, 0, 128)", "size": 14, "symbol": "star" }
    },
    {
      "x": [4.2, 4.4, 4.0],
      "y": [1.2, 1.5, 0.9],
      "mode": "markers",
      "type": "scatter",
      "name": "Image Embeddings (Vehicles)",
      "marker": { "color": "rgb(0, 123, 255)", "size": 12, "symbol": "circle" }
    },
    {
      "x": [4.25, 4.45, 4.05],
      "y": [1.15, 1.45, 0.85],
      "mode": "markers",
      "type": "scatter",
      "name": "Text Embeddings (Vehicles)",
      "marker": { "color": "rgb(0, 128, 128)", "size": 14, "symbol": "star" }
    }
  ],
  "layout": {
    "title": "A Harmonious Multimodal Space",
    "xaxis": { "title": "Unified Dimension 1", "range": [0, 6] },
    "yaxis": { "title": "Unified Dimension 2", "range": [0, 7] },
    "legend": { "orientation": "h", "y": -0.2 }
  }
}
</code></pre> <h6 id="challenges-to-achieving-harmony">Challenges to Achieving Harmony</h6> <ul> <li><strong>The Dimensionality Mismatch</strong>: Vision and text models are often trained independently using different architectures. This means their “outputs” (the embedding vectors) rarely share the same shape. <ul> <li>A ResNet or Vision Transformer might produce a 512-dimensional vector.</li> <li>A BERT or GPT-based text encoder might produce a 768 or 1024-dimensional vector. Mathematically, you cannot compare a 512D point to a 768D point. They don’t even exist in the same space.</li> </ul> </li> <li><strong>The Semantic Disconnect</strong>: Even if we force both models to output the same number of dimensions, there is no reason for independently trained models to align. <ul> <li>The Vision model might learn that top-left space correspond to dogs.</li> <li>The Text model might learn bottom-right space correspond to dogs.</li> </ul> <p>Without a shared training objective, an image of a German Shepherd and the word “German Shepherd” will be scattered in completely different parts of their respective maps.</p> </li> <li><strong>Scaling &amp; Generalization</strong>: Even if we solve the dimensionality and alignment issues, training on small, curated datasets (like ImageNet) often fails to produce a truly “generalized” embedding space. To learn a space where any text can map to any image, we need hundreds of millions of diverse image-text pairs. Furthermore, Aligning these “Web-Scale” datasets requires massive computational resources to process the billions of contrastive relationships during training.</li> </ul> <hr/> <h3 id="clip-contrastive-language-image-pre-training">CLIP (Contrastive Language-Image Pre-training)</h3> <p>CLIP is a foundational model from OpenAI that addresses these challenges with elegance. By leveraging Contrastive Learning, it bridges the gap between vision and language to build a truly robust multi-modal embedding space.</p> <p><em>In our next post, we will look at exactly how CLIP’s architecture works and why “Contrastive” is the secret sauce for its success.</em></p>]]></content><author><name></name></author><category term="clip"/><category term="clip"/><summary type="html"><![CDATA[Motivation for Multi-Modal Spaces.]]></summary></entry><entry><title type="html">BiCLIP</title><link href="https://pmantini.github.io/profile/blog/2026/biclip/" rel="alternate" type="text/html" title="BiCLIP"/><published>2026-03-08T00:00:00+00:00</published><updated>2026-03-08T00:00:00+00:00</updated><id>https://pmantini.github.io/profile/blog/2026/biclip</id><content type="html" xml:base="https://pmantini.github.io/profile/blog/2026/biclip/"><![CDATA[<p>Pranav Mantini<em>, Shishir K. Shah+</em>University of Houston, +The University of OklahomaBiCLIP realigns visual features to the textual manifold.We present BiCLIP, a novel approach for few-shot domain adaptation. By utilizing a structured geometric transformation via an upper-triangular matrix, BiCLIP achieves superior canonicalization across diverse datasets… Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. 
We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at this https URL Figure 1. The BiCLIP Adaptation Framework. Unlike standard CLIP which relies on a fixed dot product, BiCLIP introduces a trainable, structured transformation matrix W between the image and text modalities. As shown in the schematic, standard Prompts and Images are projected through their respective encoders. Unlike standard CLIP, which uses a direct dot product (creating a static similarity matrix), BiCLIP introduces a learnable Structured Upper-Triangular Matrix (W).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    This matrix is applied between the image features (Ii) and text features (Tj).
The final matrix of bilinear products (shown on the right) represents the geometrically realigned similarity scores, where the term Ii W Tj is computed for every pair.
The matrix W is learned with the standard loss functions used for CLIP and SigLIP.

Main Results (16-Shot Performance): Performance comparison of BiCLIP across 11 diverse datasets using 16-shot adaptation. BiCLIP consistently outperforms zero-shot baselines, with particularly large gains on specialized domains like EuroSAT and DTD.

We conduct experiments across the standard 1-, 2-, 4-, 8-, and 16-shot settings. The figure below illustrates the performance curves of BiCLIP and BiSigLIP compared to five state-of-the-art baselines, including classic Linear Probe adaptation, the prompt-tuning variants CoOp and CoCoOp, and more recent multimodal prompt learning techniques such as MaPLe and PromptSRC.

To better understand the effectiveness of BiCLIP, we analyze the angular distribution between positive and negative image-text pairs. A smaller overlap between these distributions indicates better alignment and better class discriminability.

Recent research (Gupta et al., 2026) into Vision-Language Models suggests that independently trained modalities are related by a shared orthogonal map. Our analysis confirms that the BiCLIP transformation matrix W preserves this property, maintaining near-orthogonality even after convergence. We quantify this by computing the normalized Frobenius norm deviation for a D x D matrix W as ||WᵀW - I|| / D.

BiCLIP achieves an average improvement of +15.24% over zero-shot CLIP and +8.69% over SigLIP. The most significant gains occur in specialized domains, such as +42.15% on EuroSAT, demonstrating robustness to extreme distribution shifts. The learned transformation W remains nearly orthogonal, with an average normalized error of only 0.022, validating that BiCLIP performs a structured rotation rather than arbitrary warping. By learning a single structured matrix, BiCLIP also requires fewer parameters than state-of-the-art CLIP adaptation methods.

The empirical results support our hypothesis: domain shift in VLMs can be recovered by a canonical geometric transformation, bridging the gap between pre-trained multimodal spaces and specialized visual manifolds. The near-orthogonality of W aligns with the Multimodal Canonicalization theory (Gupta et al., 2026), indicating that BiCLIP restores the relative alignment intended during pre-training.
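
A minimal, illustrative sketch of the bilinear similarity described above (not the released code),
using random NumPy arrays as stand-ins for L2-normalized CLIP image/text features:

import numpy as np

rng = np.random.default_rng(0)
N, M, D = 4, 3, 512                      # toy sizes; D is the shared feature dimension
I = rng.normal(size=(N, D))
I = I / np.linalg.norm(I, axis=1, keepdims=True)   # image features
T = rng.normal(size=(M, D))
T = T / np.linalg.norm(T, axis=1, keepdims=True)   # text features

W = np.triu(np.eye(D) + 0.01 * rng.normal(size=(D, D)))  # learnable upper-triangular matrix
S = I @ W @ T.T                                           # S[i, j] = Ii W Tj
# In training, S is fed to the standard CLIP / SigLIP contrastive loss so that the
# few-shot anchors (matched image-text pairs) receive the highest scores.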
</code></pre></div></div>]]></content><author><name></name></author><category term="external-posts"/><category term="clip"/><summary type="html"><![CDATA[BiCLIP: Domain Canonicalization via Structured Geometric Transformation.]]></summary></entry><entry><title type="html">Visual, Audio, and Textual Datasets</title><link href="https://pmantini.github.io/profile/blog/2026/deep-dive-into-datasets/" rel="alternate" type="text/html" title="Visual, Audio, and Textual Datasets"/><published>2026-03-03T14:24:00+00:00</published><updated>2026-03-03T14:24:00+00:00</updated><id>https://pmantini.github.io/profile/blog/2026/deep-dive-into-datasets</id><content type="html" xml:base="https://pmantini.github.io/profile/blog/2026/deep-dive-into-datasets/"><![CDATA[<ul id="markdown-toc"> <li><a href="#annotations" id="markdown-toc-annotations">Annotations</a> <ul> <li><a href="#visual-data" id="markdown-toc-visual-data">Visual Data</a> <ul> <li><a href="#coarse-annotations" id="markdown-toc-coarse-annotations">Coarse annotations</a></li> <li><a href="#fine-grained-annotations" id="markdown-toc-fine-grained-annotations">Fine-grained annotations</a></li> </ul> </li> <li><a href="#audio-data" id="markdown-toc-audio-data">Audio Data</a> <ul> <li><a href="#coarse-annotations-1" id="markdown-toc-coarse-annotations-1">Coarse Annotations</a></li> <li><a href="#fine-grained-annotations-1" id="markdown-toc-fine-grained-annotations-1">Fine-Grained Annotations</a></li> </ul> </li> <li><a href="#textual-data" id="markdown-toc-textual-data">Textual Data</a> <ul> <li><a href="#coarse-annotations-2" id="markdown-toc-coarse-annotations-2">Coarse Annotations</a></li> <li><a href="#fine-grained-annotations-2" id="markdown-toc-fine-grained-annotations-2">Fine-Grained Annotations</a></li> </ul> </li> <li><a href="#influential-datasets" id="markdown-toc-influential-datasets">Influential Datasets</a> <ul> <li><a href="#visual-understanding-problems" id="markdown-toc-visual-understanding-problems">Visual understanding problems</a></li> <li><a href="#audio-understanding-problems" id="markdown-toc-audio-understanding-problems">Audio understanding problems</a></li> <li><a href="#textual-analysis-problems" id="markdown-toc-textual-analysis-problems">Textual analysis problems</a></li> </ul> </li> <li><a href="#emerging-directions-of-research" id="markdown-toc-emerging-directions-of-research">Emerging directions of research</a> <ul> <li><a href="#example-datasets-and-applications" id="markdown-toc-example-datasets-and-applications">Example datasets and applications</a></li> </ul> </li> <li><a href="#some-insights" id="markdown-toc-some-insights">Some insights.</a></li> </ul> </li> </ul> <p>Datasets are the foundational component of modern machine learning. They provide the raw data and corresponding labels that allow AI models to learn and make predictions. The following content explores the nature of these datasets, examining the different types of annotations—from simple, coarse-grained labels to complex, fine-grained ones—and highlighting influential datasets that have driven breakthroughs in visual, audio, and textual AI.</p> <h3 id="annotations">Annotations</h3> <p>Datasets consist of raw data and accompanying annotations per sample. 
In general, the raw data is the independent variable, and the annotations are the dependent variable; the most common objective of an AI/ML model is to learn a function over the independent variable to predict/estimate the dependent variable.</p> <p>While annotations are specific to a given problem, it is worthwhile to review a few examples to get better insight into them.</p> <h4 id="visual-data">Visual Data</h4> <p>Visual data can be either images or videos. Sometimes visual data are pseudo-color images, or volumetric cubes like seismic data in O&amp;G, PET scans in healthcare, etc.</p> <p>Annotations for videos and images do not vary much. Annotations can be coarse or fine-grained.</p> <h5 id="coarse-annotations">Coarse annotations</h5> <p>In general, they are per sample and do not provide spatial information within the image. Common examples are the annotations used for image classification, scene classification, video classification, action recognition, etc. The dataset format will include the following information:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>		Image1: class a
		Image2: class b
		Image3: class a
		…
</code></pre></div></div> <p><em>Image classification</em> is the problem where images with a single object (person, animal, etc.) are presented, and the objective is to identify what is present. The dataset for this type of problem contains images and one corresponding label for each image. Sometimes multiple labels are assigned per image, as in multi-label classification problems. Examples:</p> <ul> <li>CIFAR10/CIFAR100: <a href="https://www.cs.toronto.edu/~kriz/cifar.html">Link</a></li> <li>Imagenet: <a href="https://www.image-net.org/">Link</a></li> <li>Bird Species Classification: <a href="https://www.kaggle.com/datasets/kedarsai/bird-species-classification-220-categories">Link</a></li> </ul> <p><em>Video Classification</em> is similar in the sense that the input is a video, and the objective is to predict the label for the video. The label can be the video genre, or the action performed by a human in action recognition. Examples:</p> <ul> <li>Youtube8M: <a href="https://research.google.com/youtube8m/index.html">Link</a></li> <li>UCF101 (action recognition): <a href="https://www.crcv.ucf.edu/data/UCF101.php">Link</a></li> </ul> <h5 id="fine-grained-annotations">Fine-grained annotations</h5> <p>In the context of images, datasets may include spatial annotations. For example, bounding box annotations, where each object in an image is labeled with a box. The dataset can include the following information:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	Image1: class a, x, y, height, width
	Image2: class b, x, y, height, width
	Image3: class a, x, y, height, width
</code></pre></div></div> <div style="text-align: center; margin: 20px 0;"> <img src="https://cdn.labellerr.com/image-annotation-services/image-annotation-services%20for%20data%20modeling/2D%20Bounding%20Boxes.webp" style="width: 100%; max-width: 400px; height: auto; border-radius: 8px;"/> <p style="font-style: italic; font-size: 0.9em; margin-top: 10px; color: #666;"> Figure: Image with multiple bounding box annotations per image </p> </div> <p>Another example is mask annotations, where each image is accompanied by a binary mask. The mask indicates what pixels in the image correspond to an object. The dataset can include the following information:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	Image1: class a, binary mask
	Image2: class b, binary mask
	Image3: class a, binary mask
</code></pre></div></div> <div style="text-align: center; margin: 20px 0;"> <img src="https://cdn.prod.website-files.com/614c82ed388d53640613982e/63f498f8d4fe7da3b3a60cc2_semantic%20segmentation%20vs%20instance%20segmentation.jpg" style="width: 100%; max-width: 400px; height: auto; border-radius: 8px;"/> <p style="font-style: italic; font-size: 0.9em; margin-top: 10px; color: #666;"> Figure: Image with multiple mask annotations per image (source: [Link](https://www.superannotate.com/blog/guide-to-semantic-segmentation)) </p> </div> <p>Object detection is a popular example of a problem that requires bounding box annotations. Given an image, all the objects of interest are labeled with a bounding box. The task of object detection is to take an image as input and to identify and localize objects in the image. Examples:</p> <ul> <li>COCO: <a href="https://cocodataset.org/#home">Link</a></li> <li>PASCAL VOC: <a href="https://www.kaggle.com/datasets/gopalbhattrai/pascal-voc-2012-dataset">Link</a></li> </ul> <p>Segmentation is an example where mask annotations are needed. Some segmentation problems are well defined; examples include background segmentation, object segmentation, scene segmentation, etc. Example</p> <ul> <li>COCO (also provides masks for objects): <a href="https://cocodataset.org/#home">Link</a></li> </ul> <p>Fine-grained annotations, in the context of videos, can simply include bounding box annotations or mask annotations for each image in the video.</p> <p>Other types of annotation may include annotations for temporal video segmentation, where a video is divided into distinct scenes. Or labels for anomaly detection that can be both fine-grained or coarse-grained, etc. Other examples of video datasets:</p> <ul> <li>UCF-crime: <a href="https://www.kaggle.com/datasets/minhajuddinmeraj/anomalydetectiondatasetucf">Link</a></li> </ul> <p>Accurate fine-grained annotations are of great value for many problems, but are also extremely time-consuming and tedious to generate.</p> <h4 id="audio-data">Audio Data</h4> <p>Audio data, like visual data, can be annotated to train machine learning models. The annotations range from coarse, whole-audio labels to fine-grained, time-specific annotations. The choice of annotation type depends on the specific problem being addressed, such as music genre classification, speech recognition, or sound event detection.</p> <h5 id="coarse-annotations-1">Coarse Annotations</h5> <p>Coarse annotations for audio data typically apply to the entire audio clip and don’t provide temporal information. This is similar to how image classification provides a single label for a whole image.</p> <p>Audio classification is a broad category where the goal is to assign a single label to an entire audio clip. This can include: Music Genre Classification, where the objective is to identify the genre of a song (e.g., rock, classical, jazz). The dataset for this problem consists of audio files and one corresponding genre label per file. Examples:</p> <ul> <li>GTZAN: <a href="https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification">Link</a></li> <li>Million Song Dataset: <a href="http://millionsongdataset.com/">Link</a></li> </ul> <p>Speech/Speaker Identification determines who is speaking or if the audio contains speech. 
The label could be the speaker’s identity or simply “speech” or “non-speech.” Examples:</p> <ul> <li>LibriSpeech: <a href="https://www.openslr.org/12">Link</a></li> <li>VoxCeleb: <a href="https://www.robots.ox.ac.uk/~vgg/data/voxceleb/">Link</a></li> </ul> <h5 id="fine-grained-annotations-1">Fine-Grained Annotations</h5> <p>Fine-grained annotations for audio data include temporal information, pinpointing where specific sounds or events occur within an audio file. This is crucial for tasks that require precise localization. This type of annotation provides timestamps for specific events or segments within an audio file. An audio file of a busy street might have annotations like:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.5s - 2.0s: 'car horn'
3.2s - 4.5s: 'siren'
6.0s - 7.5s: 'dog bark'
</code></pre></div></div> <p>Sound Event Detection (SED) is a good example of a problem that requires fine-grained annotations. The goal is to identify and localize sounds in an audio clip by providing start and end times for each sound event. This is similar to object detection in images. Examples:</p> <ul> <li>DCASE Datasets: <a href="https://dcase.community/">Link</a></li> <li>AudioSet: <a href="https://research.google.com/audioset/">Link</a></li> </ul> <p>Like fine-grained visual annotations, creating high-quality fine-grained audio annotations is extremely time-consuming and labor-intensive. It requires human annotators to listen carefully and mark the precise start and end points of events, making it a valuable but costly part of dataset creation.</p> <h4 id="textual-data">Textual Data</h4> <p>Just like images, video, and audio, textual data requires annotation to become useful for training AI models, especially in the field of Natural Language Processing (NLP). The process, often called text annotation or data labeling, involves tagging or labeling text to give it structure and meaning that a machine can understand.</p> <h5 id="coarse-annotations-2">Coarse Annotations</h5> <p>Similar to image or audio classification, coarse annotations for text involve assigning a single label to an entire document, paragraph, or sentence. This type of annotation is used for problems where the overall meaning or characteristic of the text is the primary focus. Many problems fall under the broad area of text classification. Sentiment Analysis: This is a very common example. An entire block of text, like a product review or a social media post, is labeled as “positive,” “negative,” “neutral,” or even with a more specific emotion like “joy,” “anger,” or “sadness.” Example: Social Media Sentiment: <a href="https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset">Link</a> Amazon Reviews: <a href="https://www.kaggle.com/datasets/bittlingmayer/amazonreviews">Link</a></p> <h5 id="fine-grained-annotations-2">Fine-Grained Annotations</h5> <p>Fine-grained annotations for text are more detailed, providing information about specific words, phrases, or relationships within the text. This is analogous to bounding boxes or masks in visual data. Named Entity Recognition (NER): This is one of the most fundamental fine-grained text annotation tasks. It involves identifying and labeling proper nouns or other specific entities. Example annotation: In the sentence “Steve Jobs co-founded Apple in Cupertino,” the annotations would be:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Steve Jobs": PERSON
"Apple": ORGANIZATION
"Cupertino": LOCATION
</code></pre></div></div> <p>Example:</p> <ul> <li>NER: <a href="https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus">Link</a></li> </ul> <h4 id="influential-datasets">Influential Datasets</h4> <p>Not all datasets are the same; some have such a deep influence on today’s advancements that the majority of the AI/ML models that one may find in multimedia research are usually pretrained on these datasets. Let’s consider the following:</p> <h5 id="visual-understanding-problems">Visual understanding problems</h5> <p>Imagenet is a large-scale dataset with 14 million images spanning across 20k categories. The dataset initially contained coarse labels, one category name for each image, and was later expanded to have fine-grained bounding box annotations. It was accomplished by crowd-sourcing the annotation work using Amazon Mechanical Turk, and it took nearly 2 years and around 50K participants to have the first version of the dataset.</p> <p>The majority of the visual recognition tasks today, be it for video or images, use neural nets trained on Imagenet as a backbone to extract features to use them in downstream tasks.</p> <p>COCO from Microsoft is also a cornerstone dataset for object detection, segmentation, and image captioning. It contains over 330,000 images with detailed annotations for 80 object categories. Also annotated using crowdsourcing.</p> <h5 id="audio-understanding-problems">Audio understanding problems</h5> <p><em>AudioSet</em> is a massive dataset of over 5 million YouTube videos, each annotated with labels from an ontology of 527 sound events. It’s the most common dataset used for pre-training models for tasks like sound event detection and audio classification.</p> <p><em>LibriSpeech</em> is the definitive dataset for speech recognition. It’s a corpus of nearly 1,000 hours of English speech from audiobooks, all with corresponding text transcripts. Models like Wav2Vec and its successors, which have revolutionized speech recognition, are frequently pre-trained on this and similar large, unlabeled speech corpora to learn general speech features before being fine-tuned for specific tasks.</p> <p>While these are large-scale datasets, they are not annotated via crowdsourcing. AudioSet is manually labeled by human annotators, and LibriSpeech is derived from existing audiobooks.</p> <h5 id="textual-analysis-problems">Textual analysis problems</h5> <p>BooksCorpus (800 million words) and the English Wikipedia (2.5 billion words) are two large-scale datasets. BERT (Bidirectional Encoder Representations from Transformers) by Google was a breakthrough in NLP. Fine-tuning a pre-trained BERT model has become a standard approach for a wide range of tasks, from question answering to sentiment analysis.</p> <p>The Common Crawl dataset contains petabytes of web data and a variety of books, articles, and other internet texts. OpenAI’s GPT series is trained on these datasets and is often used as a starting point for developing new language models.</p> <p>Textual datasets like these are annotations in themselves, or do not need annotations for tasks like training LLMs.</p> <h4 id="emerging-directions-of-research">Emerging directions of research</h4> <p>A few directions of research that may be of interest to us. Multimodal AI AI models that are trained on multiple modalities (text, audio, video) are becoming increasingly relevant. 
Research in these areas focuses.</p> <ul> <li>Methods to extract common features from two or more modalities with the same semantic meaning.</li> <li>Methods to fuse features from various modalities.</li> <li>Cross-attention models that dynamically understand the importance of features across modalities. etc.</li> </ul> <h5 id="example-datasets-and-applications">Example datasets and applications</h5> <p>COCO’s VQA dataset is a text and image modal dataset for AI. These datasets allow for solving Vision-and-Language (V&amp;L) tasks like Visual Question Answering (VQA), multimedia data retrieval, image captioning, etc.</p> <p>YouTube-8M and MuSe (Multimodal Sentiment Analysis), which combines video and audio to analyze emotions Human-Centric AI Datasets are becoming more focused on capturing complex human behavior, interactions, and safety-critical scenarios. This is particularly relevant for applications in social robotics, healthcare, and safety systems.</p> <p>Example datasets and applications Egocentric Video Datasets (e.g., Ego4D) are collected from a first-person perspective (e.g., using a camera on a headset). They are crucial for understanding human-object interaction and for training models that can assist with daily tasks</p> <h4 id="some-insights">Some insights.</h4> <ul> <li>The most influential datasets are generally large, containing hundreds of thousands to millions of samples, and often rely on human annotators for fine-grained labels in image and audio modalities. Historically, these influential datasets have been made publicly available, fostering widespread research and innovation.</li> <li>However, it’s crucial to note that domain-specific, application-specific proprietary datasets can hold significant commercial value. The entities interested in these datasets are typically found in the private, commercial, retail, and public sectors, as they are tailored for specific business problems rather than general academic research.</li> </ul>]]></content><author><name></name></author><category term="datasets"/><category term="datasets"/><summary type="html"><![CDATA[Some insights on datasets!!]]></summary></entry><entry><title type="html">Imagenet-16</title><link href="https://pmantini.github.io/profile/blog/2026/imagenet-16/" rel="alternate" type="text/html" title="Imagenet-16"/><published>2026-02-03T14:24:00+00:00</published><updated>2026-02-03T14:24:00+00:00</updated><id>https://pmantini.github.io/profile/blog/2026/imagenet-16</id><content type="html" xml:base="https://pmantini.github.io/profile/blog/2026/imagenet-16/"><![CDATA[<ul id="markdown-toc"> <li><a href="#imagenet-16-a-lightweight-benchmark-for-few-shot-learning" id="markdown-toc-imagenet-16-a-lightweight-benchmark-for-few-shot-learning">ImageNet-16: A Lightweight Benchmark for Few-Shot Learning</a> <ul> <li><a href="#what-is-few-shot-classification" id="markdown-toc-what-is-few-shot-classification">What is Few-Shot Classification?</a></li> <li><a href="#the-standard-protocol-1-2-4-8-and-16-shots" id="markdown-toc-the-standard-protocol-1-2-4-8-and-16-shots">The Standard Protocol: 1, 2, 4, 8, and 16 Shots</a></li> <li><a href="#the-imagenet-hurdle" id="markdown-toc-the-imagenet-hurdle">The ImageNet Hurdle</a></li> <li><a href="#introducing-imagenet-16" id="markdown-toc-introducing-imagenet-16">Introducing ImageNet-16</a></li> <li><a href="#license-and-integrity" id="markdown-toc-license-and-integrity">License and Integrity</a></li> <li><a href="#why-scale-matters-for-multimodal-alignment" 
id="markdown-toc-why-scale-matters-for-multimodal-alignment">Why Scale Matters for Multimodal Alignment</a></li> <li><a href="#citation" id="markdown-toc-citation">Citation</a></li> </ul> </li> </ul> <div style="text-align: center; margin: 30px 0;"> <a href="https://drive.google.com/file/d/11ekx77JyDoY2-KluCNrgLwqULWXtT_bV/view?usp=sharing" class="btn btn-outline-primary btn-lg" target="_blank" rel="noopener noreferrer"> <i class="fas fa-download"></i> Download ImageNet-16 (1.8 GB) </a> </div> <h2 id="imagenet-16-a-lightweight-benchmark-for-few-shot-learning">ImageNet-16: A Lightweight Benchmark for Few-Shot Learning</h2> <h4 id="what-is-few-shot-classification">What is Few-Shot Classification?</h4> <p><strong>Image classification</strong> is a fundamental problem in computer vision, where the objective is to categorize images into predefined semantic classes. In standard deep learning, we typically assume that “data is cheap”. To teach a model to recognize a “Golden Retriever,” we feed it thousands of labeled examples. However, in the real world, data is often expensive, rare, or private.</p> <p><strong>Few-shot classification</strong> is a more restrictive formulation where only a limited number of labeled samples are available for training—a constraint that is far more representative of real-world scenarios.</p> <h4 id="the-standard-protocol-1-2-4-8-and-16-shots">The Standard Protocol: 1, 2, 4, 8, and 16 Shots</h4> <p>To measure how well an algorithm learns from limited data, the community has settled on a standardized “shot” protocol.</p> <ul> <li> <p>k-shot: This refers to the number of training samples available per class.</p> </li> <li> <p>The Benchmarking Ladder: Most papers, including our recent work on <a href="https://arxiv.org/pdf/2603.08942">BiCLIP</a>, evaluate models across 1, 2, 4, 8, and 16 shots.</p> </li> </ul> <p>By testing across this spectrum, we can see the “learning curve” of an architecture. Does the model accuracy improve from 1 shot to 16 shot? This delta tells us how efficiently the model utilizes every new piece of information.</p> <h4 id="the-imagenet-hurdle">The ImageNet Hurdle</h4> <p>ImageNet (specifically ImageNet-1K) remains the “Gold Standard” for internalizing visual concepts. However, it presents a massive logistical hurdle for many:</p> <ul> <li> <p>Size: The full dataset is ~150GB.</p> </li> <li> <p>Time: Downloading and extracting it can take hours, and training even a simple linear probe on the full set requires significant GPU memory and time.</p> </li> <li> <p>Redundancy: For few-shot research, you don’t actually need 1,300 images per class if you are only ever going to use 16 for training.</p> </li> </ul> <p>Most researchers end up writing their own scripts to “subsample” ImageNet, but these scripts vary, leading to slight inconsistencies in which specific images are used for the 16-shot sets.</p> <h4 id="introducing-imagenet-16">Introducing ImageNet-16</h4> <p>ImageNet-16 addresses this issue.</p> <p><strong>What is it?</strong></p> <p>It is a curated subset of the original ImageNet-1K. It contains exactly 16 samples per class for all 1,000 classes.</p> <p>Total Images: 16,000 (compared to 1.2 million).</p> <p>Total Size: Under 2GB (compared to 150GB).</p> <p>Evaluation: The evaluation is performed on the original ImageNet validation set. 
This ensures that your few-shot results are directly comparable to the state-of-the-art results reported in major CVPR/ICCV papers.</p> <h4 id="license-and-integrity">License and Integrity</h4> <p>It is important to note that ImageNet-16 is a derivative work.</p> <p>License: This subset is distributed under the same non-commercial research license dictated by the original ImageNet (Stanford/Princeton).</p> <p>Fixed Seeds: The 16 samples were selected using a fixed random seed to ensure that anyone using this dataset is training on the exact same “anchors.”</p> <h4 id="why-scale-matters-for-multimodal-alignment">Why Scale Matters for Multimodal Alignment</h4> <p>This dataset was born out of my work on <a href="https://arxiv.org/pdf/2603.08942">BiCLIP</a>. In that paper, we hypothesized that the gap between different domains (like satellite imagery vs. generic objects) can be bridged using a geometric transformation.</p> <p>To recover this transformation, we need “anchors”—the few-shot samples. By using a lightweight dataset like ImageNet-16, we can iterate on these alignment algorithms in minutes rather than days, proving that you don’t need “Big Data” to achieve “Big Results” in cross-modal alignment.</p> <div style="background: #f8f9fa; border-left: 5px solid #007bff; padding: 20px; border-radius: 5px; margin: 25px 0;"> <h4 style="margin-top: 0;"><i class="fas fa-database"></i> Access the Dataset</h4> <p>The 16-shot ImageNet subset is hosted externally for easy access:</p> <ul> <li><strong>Dataset:</strong> <a href="https://drive.google.com/file/d/11ekx77JyDoY2-KluCNrgLwqULWXtT_bV/view?usp=sharing">Download ImageNet-16 (.zip)</a></li> <li><strong>Labels:</strong> <a href="https://github.com/HoldenCaulfieldRye/caffe/blob/master/data/ilsvrc12/synset_words.txt">synset_words.txt</a></li> <li><strong>Paper:</strong> <a href="https://arxiv.org/abs/2603.08942">BiCLIP on arXiv</a></li> <li><strong>Imagenet-1K:</strong> <a href="https://www.image-net.org/about.php">Official Imagenet website</a></li> </ul> <p style="font-size: 0.85em; color: #666; margin-bottom: 0;"> <em>Note: Please ensure you comply with the original ImageNet Research License.</em> </p> </div> <h4 id="citation">Citation</h4> <p>Cite the original Imagenet dataset if using this dataset</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{deng2009imagenet,
  title={ImageNet: A Large-Scale Hierarchical Image Database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
  pages={248--255},
  year={2009}
}
</code></pre></div></div> <p>Additionally, if this specific 16-shot sampling was useful for your work, please consider citing the BiCLIP paper where this subset was first introduced:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{mantini2026biclipdomaincanonicalizationstructured,
      title={BiCLIP: Domain Canonicalization via Structured Geometric Transformation}, 
      author={Pranav Mantini and Shishir K. Shah},
      year={2026},
      eprint={2603.08942},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08942}, 
}
</code></pre></div></div>]]></content><author><name></name></author><category term="datasets"/><category term="datasets"/><summary type="html"><![CDATA[16-Shot Imagenet Dataset]]></summary></entry><entry><title type="html">ADOC | Pranav Mantini</title><link href="https://pmantini.github.io/profile/blog/2026/adoc-pranav-mantini/" rel="alternate" type="text/html" title="ADOC | Pranav Mantini"/><published>2026-01-08T00:00:00+00:00</published><updated>2026-01-08T00:00:00+00:00</updated><id>https://pmantini.github.io/profile/blog/2026/adoc--pranav-mantini</id><content type="html" xml:base="https://pmantini.github.io/profile/blog/2026/adoc-pranav-mantini/"><![CDATA[<p>A Day on Campus Dataset. An Anomaly Detection Dataset for Events in a Single Camera.Visual examples from the ADOC dataset showing diverse events under varying conditions.Golf Cart on WalkwayPerson on GrassPerson on SkateboardHaving a ConversationRiding Bike (Day)Person with UmbrellaMobility ScooterRiding Bike (Night)The ADOC dataset is a foundational resource for anomaly detection in single-camera surveillance. We acquired 24 continuous hours of 1080p, H.264 video from a live university campus. The data encapsulates extreme variations in illumination, background clutter, and crowd density.We define anomalies as statistically “rare” events (low-frequency occurrences). Our data is meticulously annotated with 875 events, including high-frequency actions and specific rare activities like “golf cart on walkway” or “person with a mobility scooter.”</p>]]></content><author><name></name></author><category term="internal-posts"/><category term="datasets"/><summary type="html"><![CDATA[A Day on Campus Dataset. An Anomaly Detection Dataset for Events in a Single Camera.]]></summary></entry><entry><title type="html">UHCTD | Pranav Mantini</title><link href="https://pmantini.github.io/profile/blog/2026/uhctd-pranav-mantini/" rel="alternate" type="text/html" title="UHCTD | Pranav Mantini"/><published>2026-01-08T00:00:00+00:00</published><updated>2026-01-08T00:00:00+00:00</updated><id>https://pmantini.github.io/profile/blog/2026/uhctd--pranav-mantini</id><content type="html" xml:base="https://pmantini.github.io/profile/blog/2026/uhctd-pranav-mantini/"><![CDATA[<p>Camera Tampering Dataset. A Comprehensive Dataset for Synthetic Camera Tampering Detection.The University of Houston Camera Tampering Dataset (UHCTD) is a large-scale resource designed to train and evaluate algorithms for detecting unauthorized or accidental changes in surveillance views. Tampering—whether caused by natural phenomena like sunlight reflection and fog, or malicious human intent like spray painting and lens blocking—compromises the integrity of public safety systems.UHCTD features over 26GB of data captured from two high-resolution outdoor cameras (2048x1536 and 1280x960). We utilized advanced image processing techniques to synthesize three primary categories of tampering into real-world surveillance footage:The project includes a comprehensive evaluation of standard deep learning architectures for tampering detection, including:</p>]]></content><author><name></name></author><category term="internal-posts"/><category term="datasets"/><summary type="html"><![CDATA[Camera Tampering Dataset. A Comprehensive Dataset for Synthetic Camera Tampering Detection.]]></summary></entry></feed>