Homepage

Description

This site is the homepage for the PicasaWeb image dataset. The goal is to build a large-scale dataset with many images in each category, which requires both a very large pool of images and a set of tags that occur frequently in image collections.

Image Collection

The PicasaWeb dataset is built from 6.8 million PicasaWeb images that users shared under a Creative Commons license for non-commercial purposes.

Tag Selection

A data-driven scheme is applied for tag selection. 30,000 images are randomly selected from the dataset and annotated manually by 40 raters. Each rater examines each image and assigns as many tags as possible. The tags are then counted and sorted by frequency, so the top-ranked tags are the most frequent ones for the selected images. After manual filtering to remove meaningless tags, the remaining tags are processed in this frequency order in the following steps.
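
The counting and sorting step can be sketched as follows. This is only an illustration, not the authors' code: the function name rank_tags and the input layout (one list of tag strings per annotated image) are assumptions, as is the choice to count a tag at most once per image.

```python
from collections import Counter


def rank_tags(annotations):
    """Return (tag, count) pairs sorted by descending frequency.

    `annotations` is a hypothetical layout: one iterable of tag strings
    per annotated image. Assumption: a tag counts once per image.
    """
    counts = Counter()
    for tags in annotations:
        counts.update(set(tags))
    # Sort by count (descending), then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The manual filtering of meaningless tags would then be applied to the head of this ranked list.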

Image Annotation for Each Tag

The most frequent tag is annotated first. Among the images associated with the tag in the annotated 30,000-image preprocessing set, up to 500 are selected as “seeds”. Color histograms are extracted and used in a k-nearest-neighbor manner to sort all the remaining images in the 6.8 million image dataset. The raters then scan these images one by one and check whether each is a positive appearance of the tag. Once 11,000 images containing the tag have been found, the tag is marked as “successful” and we move on to the next tag with the same procedure. Otherwise, if 200,000 images have been scanned without reaching 11,000 positive ones, we give up on this tag and move on to the next one. Note that we assume that negative appearances of a tag always outnumber positive ones, which is also the case in practice. Therefore, we have 11,000 positive images and at least 11,000 negative images for each “successful” tag.
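
The per-tag procedure above can be sketched roughly as follows. The page does not specify the histogram layout, the distance measure, or how the seeds are combined, so the 8-bin per-channel histogram, the L1 distance, and the nearest-seed ranking here are all assumptions. The function names and the is_positive callback (standing in for a human rater) are hypothetical. Only the thresholds (11,000 positives, 200,000 scanned images) come from the description.

```python
POSITIVE_TARGET = 11_000  # positives needed for a "successful" tag
SCAN_LIMIT = 200_000      # give up after scanning this many images


def color_histogram(pixels, bins=8):
    """Normalized per-channel histogram of (r, g, b) pixels in 0..255.

    Assumption: `bins` equal-width bins per channel, concatenated.
    """
    hist = [0.0] * (3 * bins)
    count = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + value * bins // 256] += 1.0
        count += 1
    total = 3 * count
    return [h / total for h in hist] if total else hist


def l1_distance(a, b):
    """Sum of absolute differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_seed_distance(seed_hists, candidate_hists):
    """Return candidate indices ordered by distance to the nearest seed."""
    def nearest(index):
        return min(l1_distance(candidate_hists[index], s) for s in seed_hists)
    return sorted(range(len(candidate_hists)), key=nearest)


def annotate_tag(ranked_ids, is_positive,
                 target=POSITIVE_TARGET, limit=SCAN_LIMIT):
    """Scan images in ranked order; return (successful, positives, negatives).

    `is_positive` is a hypothetical stand-in for the rater's judgment.
    """
    positives, negatives = [], []
    for scanned, image_id in enumerate(ranked_ids, start=1):
        (positives if is_positive(image_id) else negatives).append(image_id)
        if len(positives) >= target:
            return True, positives, negatives
        if scanned >= limit:
            break
    return False, positives, negatives
```

At the scale of 6.8 million images a brute-force ranking like this would be slow. The sketch is meant to show the control flow of the stopping rule rather than a practical implementation.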


As a result, each image is manually annotated with one or more tags. The tags come from WordNet and are selected with a data-driven scheme from the PicasaWeb image annotation results.

Downloads

This dataset can be downloaded for non-commercial use. The first released version contains only 10 tags as a demo. The full version is expected soon.

Tags

The tag list of this dataset can be downloaded here. Each tag occupies 3 lines: the tag ID (from 1 to 540), the tag, and the tag description from WordNet.
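
A minimal sketch of reading the tag list, assuming the file is plain text with exactly three lines per tag and no blank separator lines between records. The function name parse_tag_list is hypothetical, and the encoding of the actual file is not stated on this page.

```python
def parse_tag_list(lines):
    """Parse 3-line records: tag ID, tag, WordNet description.

    Returns a dict mapping tag ID to (tag, description).
    Assumption: records are contiguous; only trailing blank lines are dropped.
    """
    rows = [line.rstrip("\n") for line in lines]
    while rows and not rows[-1].strip():
        rows.pop()
    if len(rows) % 3 != 0:
        raise ValueError("expected a multiple of 3 lines")
    tags = {}
    for start in range(0, len(rows), 3):
        tag_id = int(rows[start].strip())
        tags[tag_id] = (rows[start + 1].strip(), rows[start + 2].strip())
    return tags
```

Because the function accepts any iterable of lines, it can be given an open text file directly.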

Images

The 6.8 million images total 1.8 TB. Due to the website's file size limit, we only provide URLs here (part1, part2). Each line contains the ID of the image (from 1 to 6779241) and the URL from which to download the image.
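
The URL lists could be read along these lines. The separator between the ID and the URL is not stated on this page, so splitting on whitespace is an assumption; the function names are hypothetical.

```python
def parse_url_line(line):
    """Split one line into (image_id, url).

    Assumption: the ID and the URL are separated by whitespace.
    """
    image_id, url = line.split(maxsplit=1)
    return int(image_id), url.strip()


def iter_image_urls(lines):
    """Yield (image_id, url) pairs, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_url_line(line)
```

Reading the lists lazily like this avoids holding all 6.8 million entries in memory at once.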

Tag-Image Correlation

One file is given for each tag, containing exactly 6779241 bytes. Each byte is one of 'P' (positive), 'N' (negative), or '0' (not annotated), and the bytes correspond to the image IDs in increasing order. The files can be downloaded here.
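
Since image IDs start at 1 and each file holds one byte per image, the label for image ID i sits at byte offset i - 1. A minimal sketch of reading such a file follows; the function names are hypothetical, and the file path is whatever name the downloaded file has.

```python
NUM_IMAGES = 6779241  # file size in bytes, one byte per image ID


def load_labels(path, expected_size=NUM_IMAGES):
    """Read a per-tag label file and check its size."""
    with open(path, "rb") as handle:
        labels = handle.read()
    if len(labels) != expected_size:
        raise ValueError("unexpected label file size")
    return labels


def label_for_image(labels, image_id):
    """Return 'P', 'N', or '0' for a 1-based image ID."""
    if not 1 <= image_id <= len(labels):
        raise IndexError("image ID out of range")
    return chr(labels[image_id - 1])


def summarize_labels(labels):
    """Count positive, negative, and unannotated images for one tag."""
    return {
        "P": labels.count(b"P"),
        "N": labels.count(b"N"),
        "0": labels.count(b"0"),
    }
```

For a “successful” tag, summarize_labels would be expected to report 11,000 positives and at least as many negatives, with the remaining images unannotated.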

References

[1] Zhiyu Wang, Fangtao Li, Edward Y. Chang, Shiqiang Yang. A Data-Driven Study on Image Feature Extraction and Fusion. PDF

Contacts

zhiyuwang@google.com