The Zero benchmark is entirely open and freely accessible.

WARNING: This large-scale benchmark is built for research purposes only, to make large-scale model training accessible to a broad range of researchers and other interested communities. It is not suitable for any real-world production or application use.


Zero, a large-scale Chinese cross-modal benchmark, contains two pre-training datasets (Zero-Corpus and Zero-Corpus-Sub) and five downstream datasets.

Pre-training datasets

  • 23 million dataset (Zero-Corpus). Zero-Corpus is collected from a search engine and contains images with their corresponding textual descriptions. It is filtered from 5 billion image-text pairs by user click-through rate.
  • 2.3 million dataset (Zero-Corpus-Sub). A sub-dataset of Zero-Corpus. Because training vision-language pre-training (VLP) models on Zero-Corpus may demand overwhelming GPU resources, a sub-dataset with 10% of the image-text pairs is also provided for research purposes.
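
The two steps described above (filtering by click-through rate, then keeping 10% of the pairs) can be illustrated with a minimal sketch. This is not the pipeline used to build Zero-Corpus: the `ImageTextPair` record, its `ctr` field, the `min_ctr` threshold, and the use of uniform random sampling for the subset are all assumptions made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageTextPair:
    # Hypothetical record layout; the real data format may differ.
    image_url: str
    text: str
    ctr: float  # assumed click-through rate in [0, 1]


def filter_by_ctr(pairs, min_ctr):
    """Keep only pairs whose click-through rate is at least min_ctr."""
    return [p for p in pairs if p.ctr >= min_ctr]


def sample_subset(pairs, fraction, seed=0):
    """Draw a uniform random subset holding about `fraction` of the pairs.

    Uniform sampling is an assumption; the source does not say how
    Zero-Corpus-Sub was selected from Zero-Corpus.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    k = round(len(pairs) * fraction)
    return rng.sample(pairs, k)
```

With these helpers, `sample_subset(filter_by_ctr(raw_pairs, min_ctr), 0.1)` would yield a reproducible 10% subset of the filtered pairs, where `raw_pairs` and `min_ctr` are placeholders supplied by the caller.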

Downstream datasets

  • ICM. Curated for the image-text matching task. It contains 400,000 image-text pairs: 200,000 positive cases and 200,000 negative cases.
  • IQM. Also a dataset for the image-text matching task. Unlike ICM, it uses the search query instead of the detailed description text. Like ICM, IQM contains 200,000 positive cases and 200,000 negative cases.
  • ICR. Contains 200,000 image-text pairs collected for the image-to-text retrieval and text-to-image retrieval tasks.
  • IQR. Also proposed for the image-text retrieval task. We randomly select 200,000 queries and the corresponding images as the annotated image-query pairs, similar to IQM.
  • Flickr30k-CNA. We gathered professional English and Chinese linguists to meticulously re-translate all data of Flickr30k and double-check each sentence. Beijing Magic Data Technology Co., Ltd. contributed to the translation of this dataset.


