ImageNet-16: A Lightweight Benchmark for Few-Shot Learning
What is Few-Shot Classification?
Image classification is a fundamental problem in computer vision, where the objective is to categorize images into predefined semantic classes. In standard deep learning, we typically assume that “data is cheap”. To teach a model to recognize a “Golden Retriever,” we feed it thousands of labeled examples. However, in the real world, data is often expensive, rare, or private.
Few-shot classification is a more restrictive formulation where only a limited number of labeled samples are available for training—a constraint that is far more representative of real-world scenarios.
The Standard Protocol: 1, 2, 4, 8, and 16 Shots
To measure how well an algorithm learns from limited data, the community has settled on a standardized “shot” protocol.
- k-shot: This refers to the number of labeled training samples available per class.
- The Benchmarking Ladder: Most papers, including our recent work on BiCLIP, evaluate models across 1, 2, 4, 8, and 16 shots.
By testing across this spectrum, we can trace the "learning curve" of an architecture. How much does accuracy improve from 1 shot to 16 shots? This delta tells us how efficiently the model utilizes every new piece of information.
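As a concrete illustration, the ladder above is just a loop over k values, sampling a k-shot support set per class and (in a real run) fitting a probe at each rung. The `sample_k_shot` helper and the toy pool below are hypothetical stand-ins for illustration, not part of any released ImageNet-16 tooling:

```python
import random

def sample_k_shot(pool, k, seed=0):
    """Pick k training samples per class from a labeled pool.

    `pool` maps class name -> list of sample ids. Hypothetical helper
    used only to sketch the protocol.
    """
    rng = random.Random(seed)
    return {cls: rng.sample(ids, k) for cls, ids in pool.items()}

# Toy pool: 3 classes with 16 candidate "images" each.
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(16)] for c in range(3)}

# The standard benchmarking ladder: 1, 2, 4, 8, and 16 shots.
for k in [1, 2, 4, 8, 16]:
    support = sample_k_shot(pool, k)
    assert all(len(v) == k for v in support.values())
    # ...train a linear probe on `support`, evaluate on the held-out val set...
```

Fixing the seed makes the support sets reproducible, which is exactly the property the curated subset bakes in.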
The ImageNet Hurdle
ImageNet (specifically ImageNet-1K) remains the gold standard for benchmarking visual recognition. However, it presents a massive logistical hurdle for many:
- Size: The full dataset is ~150GB.
- Time: Downloading and extracting it can take hours, and training even a simple linear probe on the full set requires significant GPU memory and time.
- Redundancy: For few-shot research, you don't actually need 1,300 images per class if you are only ever going to use 16 for training.
Most researchers end up writing their own scripts to “subsample” ImageNet, but these scripts vary, leading to slight inconsistencies in which specific images are used for the 16-shot sets.
Introducing ImageNet-16
ImageNet-16 addresses these issues.
What is it?
It is a curated subset of the original ImageNet-1K. It contains exactly 16 samples per class for all 1,000 classes.
- Total Images: 16,000 (compared to 1.2 million).
- Total Size: Under 2GB (compared to 150GB).
- Evaluation: The evaluation is performed on the original ImageNet validation set. This ensures that your few-shot results are directly comparable to the state-of-the-art results reported in major CVPR/ICCV papers.
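A quick sanity check after unpacking is to count the files per class and confirm the 16-per-class invariant. The sketch below assumes an ImageFolder-style layout (one subfolder per class) and demonstrates the check on a synthetic two-class tree; verify the actual archive structure before relying on this:

```python
import pathlib
import tempfile

def count_per_class(root):
    """Count files under each class subfolder of `root`.

    Assumes one-folder-per-class layout; adjust if the zip differs.
    """
    root = pathlib.Path(root)
    return {d.name: sum(1 for f in d.iterdir() if f.is_file())
            for d in root.iterdir() if d.is_dir()}

# Demo on a tiny synthetic tree standing in for the real dataset.
tmp = pathlib.Path(tempfile.mkdtemp())
for wnid in ["n01440764", "n01443537"]:
    (tmp / wnid).mkdir()
    for i in range(16):
        (tmp / wnid / f"{wnid}_{i}.JPEG").touch()

counts = count_per_class(tmp)
assert all(n == 16 for n in counts.values())  # 16 shots per class
```

On the real download you would point `count_per_class` at the extracted root and expect 1,000 classes of 16 images each.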
License and Integrity
It is important to note that ImageNet-16 is a derivative work.
- License: This subset is distributed under the same non-commercial research license as the original ImageNet (Stanford/Princeton).
- Fixed Seeds: The 16 samples were selected using a fixed random seed, so anyone using this dataset trains on the exact same "anchors."
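The fixed-seed guarantee boils down to a simple pattern: sort the candidate filenames first (so ordering is machine-independent), then sample with a seeded RNG. The snippet below is illustrative only; the seed value and procedure shown are assumptions, not necessarily those used to build the released subset:

```python
import random

def select_shots(filenames, k=16, seed=42):
    """Deterministically pick k files from a class.

    Sorting before sampling removes any dependence on filesystem
    enumeration order, so the same seed yields the same picks everywhere.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(filenames), k)

# Hypothetical candidate pool for one synset.
files = [f"n01440764_{i}.JPEG" for i in range(1300)]
a = select_shots(files)
b = select_shots(files)
assert a == b  # same seed -> identical "anchors" on every machine
```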
Why Scale Matters for Multimodal Alignment
This dataset was born out of my work on BiCLIP. In that paper, we hypothesized that the gap between different domains (like satellite imagery vs. generic objects) can be bridged using a geometric transformation.
To recover this transformation, we need “anchors”—the few-shot samples. By using a lightweight dataset like ImageNet-16, we can iterate on these alignment algorithms in minutes rather than days, proving that you don’t need “Big Data” to achieve “Big Results” in cross-modal alignment.
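To make the anchor idea concrete, here is a minimal sketch of recovering a transformation between two embedding spaces from 16 paired anchors. Note this uses a plain linear map fit by least squares on synthetic data; BiCLIP's actual structured geometric transformation is different, and everything below (dimensions, the `W_true` setup) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "anchors": 16 paired embeddings from two domains. The
# ground-truth map W_true is random here; in practice the pairs would
# come from few-shot samples passed through frozen encoders.
d = 8
W_true = rng.standard_normal((d, d))
X = rng.standard_normal((16, d))   # source-domain embeddings
Y = X @ W_true.T                   # target-domain embeddings

# Recover the map by least squares: minimize ||X @ W.T - Y||^2.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
W_hat = B.T

assert np.allclose(W_hat, W_true, atol=1e-6)
```

With only 16 anchors the fit is cheap (milliseconds), which is why a 2GB subset is enough to iterate on alignment algorithms quickly.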
Access the Dataset
The 16-shot ImageNet subset is hosted externally for easy access:
- Dataset: Download ImageNet-16 (.zip)
- Labels: synset_words.txt
- Paper: BiCLIP on arXiv
- ImageNet-1K: Official ImageNet website
Note: Please ensure you comply with the original ImageNet Research License.
Citation
If you use this dataset, please cite the original ImageNet paper:
@inproceedings{deng2009imagenet,
title={ImageNet: A Large-Scale Hierarchical Image Database},
author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
pages={248--255},
year={2009}
}
Additionally, if this specific 16-shot sampling was useful for your work, please consider citing the BiCLIP paper where this subset was first introduced:
@misc{mantini2026biclipdomaincanonicalizationstructured,
title={BiCLIP: Domain Canonicalization via Structured Geometric Transformation},
author={Pranav Mantini and Shishir K. Shah},
year={2026},
eprint={2603.08942},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.08942},
}