1. California-ND: An annotated dataset for near-duplicates in personal photo-collections

One challenge in near-duplicate detection research is the lack of annotated datasets taken from real personal photo-collections. Existing datasets comprise frames taken from news clips, movies, sports events, buildings, objects, etc. Such images can be very different from those in personal photo-collections, which mostly show people travelling, in family moments, or in activities with friends. In many existing datasets, artificial degradations such as cropping, blurring, or other kinds of filtering are applied to the original images in order to create variations, with the originals serving as ground truth (GT). Such degradations, however, may not reflect reality. For example, extensive cropping is used to simulate the zooming that takes place in real-life conditions. This is not accurate, since by the time the camera lens has zoomed and focused, the scene may have changed. Furthermore, a cropped image is inevitably less sharp than a genuinely zoomed one. The most important omission in existing datasets is the lack of annotation by multiple observers, even though near-duplicate detection is a highly subjective process. In summary, existing datasets have the following shortcomings:

  1. Their images are very different from those found in personal photo-collections.

  2. They include artificial degradations, which may differ from what happens in real-life conditions.

  3. They provide a binary ground truth, which is not adequate for the subjective task of near-duplicate detection.

To address these issues, we have created a new dataset, called California-ND, to help researchers test algorithms for detecting near-duplicates in personal photo libraries. California-ND is derived directly from an actual personal travel photo collection. It contains many difficult cases and types of near-duplicates. More importantly, to deal with the inevitable ambiguity of near-duplicate cases, the dataset is annotated by 10 different subjects. These annotations are combined into a non-binary GT, which indicates the probability that a pair of images would be considered near-duplicates by an observer. The following image depicts some of the cases included in the dataset.

Apart from the images, EXIF and face recognition data are included, along with the individual annotations by each of the 10 observers.
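As a rough illustration of how individual annotations can be combined into a non-binary GT, the sketch below takes the binary votes of several observers for each image pair and uses the fraction of positive votes as the near-duplicate probability. This is only an assumption about the combination rule; the file names, the `votes` layout, and the function name are hypothetical and do not describe the actual format of the California-ND files.

```python
def soft_ground_truth(votes):
    """Combine per-observer binary labels into a probability per image pair.

    votes: dict mapping (image_a, image_b) -> list of 0/1 labels,
           one label per observer (1 = "near-duplicate").
    Returns a dict mapping each pair to the fraction of observers
    who labelled it as a near-duplicate.
    """
    return {pair: sum(labels) / len(labels) for pair, labels in votes.items()}


# Hypothetical votes from 10 observers for two image pairs.
votes = {
    ("img_001.jpg", "img_002.jpg"): [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    ("img_001.jpg", "img_003.jpg"): [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}

for pair, probability in soft_ground_truth(votes).items():
    print(pair, probability)
```

A detector's similarity score for each pair can then be compared against these probabilities, instead of against a single yes/no label.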

California-ND is freely available on this website:

2. Phos: The first public dataset for evaluating illumination invariance under real lighting conditions

In computer vision, many datasets focus on different viewpoints, rotations, and zoom levels of a scene, in order to test the invariance of systems to these transformations. However, very little attention is given to the actual illumination conditions that may exist outdoors. The vast majority of previously presented benchmarks for illumination invariance are created by manually adjusting image brightness with image processing software. This approach, however, is far from realistic. A brightness adjustment in image processing software does not necessarily produce the same result as a change in camera exposure under real imaging conditions. This is even more true if one considers that the signal-to-noise ratio is always lower in underexposed image regions. Consequently, underexposed image regions captured by a real camera will have more noise than the same regions in images with manually adjusted brightness.
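The noise argument above can be made concrete with a small calculation. The sketch below assumes a simplified sensor model (Poisson shot noise plus Gaussian read noise); the electron counts and the read-noise value are hypothetical and are not taken from any particular camera or from the Phos dataset.

```python
import math


def snr(signal_electrons, read_noise_electrons=2.0):
    """Signal-to-noise ratio under a simplified sensor model:
    shot noise variance equals the signal, read noise adds in quadrature."""
    noise = math.sqrt(signal_electrons + read_noise_electrons ** 2)
    return signal_electrons / noise


well_exposed = 10_000.0  # hypothetical electrons per pixel at normal exposure
darkening = 1 / 16       # four stops darker

# Darkening in software multiplies signal and noise by the same factor,
# so the pixel keeps the SNR of the well-exposed capture.
snr_software = snr(well_exposed)

# A real underexposure collects fewer photons, so the SNR itself drops.
snr_camera = snr(well_exposed * darkening)

print(f"software-darkened SNR: {snr_software:.1f}")
print(f"camera-underexposed SNR: {snr_camera:.1f}")
```

Under these assumptions, the region underexposed by the camera ends up with roughly a quarter of the SNR of the software-darkened one, which is the kind of difference that artificially generated benchmarks miss.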

To fill this gap in the existing benchmark databases, a new dataset is proposed for realistically testing the illumination invariance of algorithms. The dataset is called "Phos" (which means "light" in Greek) and contains various scenes under different combinations of uniform and non-uniform illumination. More specifically, each of the 15 scenes in the database contains 15 different images: 9 images captured under various strengths of uniform illumination, and 6 images under different degrees of non-uniform illumination. The images contain objects of different shapes, colors and textures. Moreover, the objects are positioned at random locations inside the scene.

The Phos dataset is freely available at the following address:

3. TM-DIED: The Most Difficult Image Enhancement Dataset

As the name implies, this is an extended collection of 222 of my personal travel photos, constituting some of the most challenging cases for image enhancement and tone-mapping algorithms. Most of the photos include strongly under/overexposed regions alongside correctly exposed ones. The challenge is to automatically enhance these regions without affecting the correctly exposed ones and without introducing visible halos or other artifacts. If you want to check how your algorithm performs in real-life conditions, this is the dataset to try.

Here are the details of the TM-DIED dataset:

  • 222 high quality JPEG photos

  • Full EXIF data (except GPS)

  • Many different cameras (point and shoot, mobile phones, DSLR)

  • Different aspect ratios and sizes (e.g. panoramas, portrait, landscape)

  • Both underexposed/overexposed and correctly exposed image regions

  • Many different types of intensity transitions between under/overexposed and correctly exposed image regions

  • Different lighting conditions (night, sunset, day, cloudy, sunlight etc.)

  • No visible identifiable faces (you can post them without issues of exposing someone's identity)

  • Free for any type of research (academic or not)! Just mention the source :)
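To get a rough idea of how much of a photo falls into under/overexposed versus correctly exposed regions, one can count pixels by luminance. The sketch below is only an illustration: the thresholds `low` and `high` are arbitrary assumptions, the function name is hypothetical, and the input is taken to be a flat sequence of 8-bit luminance values obtained with any image library.

```python
def exposure_fractions(luma, low=30, high=225):
    """Return the fractions of underexposed, overexposed and
    correctly exposed pixels in a flat sequence of 8-bit luminance values.

    low / high are assumed thresholds, not values defined by the dataset.
    """
    total = len(luma)
    under = sum(1 for value in luma if value <= low)
    over = sum(1 for value in luma if value >= high)
    well = total - under - over
    return under / total, over / total, well / total


# Tiny hypothetical "image": 3 dark, 2 blown-out and 5 mid-tone pixels.
pixels = [0, 10, 20, 250, 255, 90, 100, 128, 128, 200]
under, over, well = exposure_fractions(pixels)
print(f"under: {under:.0%}, over: {over:.0%}, well exposed: {well:.0%}")
```

A global count like this says nothing about where the regions are or how sharp the transitions between them are, which is exactly what makes these photos difficult for enhancement algorithms.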

The dataset is hosted on Flickr and you can download it from the following link:

Also: Free Python code for processing this type of image is available in this repository.

4. Busting image enhancement and tone-mapping algorithms: A collection of the most challenging cases

(OUTDATED! Please check the following dataset: TM-DIED)

You think that your enhancement or tone-mapping algorithm is very good? Wait until you try the following images.

Local exposure correction and enhancement is my passion. Over the years, I have collected the most difficult cases I came across. The important characteristic of these images is that they have a part that is correctly exposed and a part that is severely under/overexposed. And here is the tricky part: a good enhancement algorithm has to enhance the under/overexposed regions, leave the correctly exposed ones unaffected, avoid halo artifacts or gradient reversals, and retain good local contrast at the same time (while being reasonably fast)! Most of the cases I have seen in papers so far are images that are uniformly under/overexposed, without a correctly exposed region. These are the easy cases! The difficult part is to differentiate between the correctly exposed regions and the degraded ones. Here are some examples from this dataset.

If your algorithm gives good results for these images, without manual tuning, then congratulations! It rocks! :)

To download the dataset, click here.

Free Python code for processing this type of image is available in this repository.