I have broad experience in different aspects of Computer Vision, Image Processing and Machine Learning. I like creating algorithms and systems that work in real-life conditions and solve real-world problems. As such, most of my research has resulted in systems that are (or have been) part of actual products. This is a list of the topics that I have been working on:
Transforming categorical to dimensional emotion datasets using morphing (Arousal/Valence estimation)
Emotion recognition and understanding is a vital component in human-machine interaction. Dimensional models of affect, such as those using Valence and Arousal, have advantages over traditional categorical ones due to the complexity of emotional states in humans. However, dimensional emotion annotations are difficult and expensive to collect, so they are not as prevalent in the affective computing community. To address these issues, we propose a method that uses face morphing to generate synthetic images from existing categorical emotion datasets, together with dimensional labels in the circumplex space. The method offers full control over the resulting sample distribution, while achieving augmentation factors of at least 20x.
Our main contributions can be summarized as follows:
A new dataset augmentation framework that can transform a typical categorical facial expression dataset into a balanced, augmented dimensional dataset.
The framework can generate hundreds of different expressions per subject with full user control over their distribution.
The augmented dataset comes with automatically generated, highly consistent Valence/Arousal annotations of continuous dimensional affect.
(On the left): Proposed dataset augmentation framework based on face morphing. Intensity of expression is represented with size and color saturation. Outlined shapes indicate apex expressions.
Examples of the 2 types of face morphings utilized in the proposed augmentation framework, using images from the Radboud Faces Dataset. In this example, all images are synthesized out of 4 given images from the original dataset (outlined in black).
Top: Neutral to Apex (Happy) morphing. Approximating different intensities of expression.
Middle: Apex1 (Sad) to Apex2 (Disgusted) morphing. Approximating mixtures of expressions.
Bottom: Neutral to mix Apex (50% Apex1 + 50% Apex2). Approximating different intensities of mixtures of expressions.
In the following animations, all frames are synthesized except for the first and the last. The synthesized frames are generated through morphing and have automatically generated Arousal-Valence annotations.
Neutral -> Angry
Neutral -> Disgusted
Neutral -> Afraid
Neutral -> Happy
Neutral -> Sad
Neutral -> Surprised
Happy -> Surprised
Sad -> Disgusted
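As a rough illustration of the three morphing types listed above, here is a minimal sketch in Python. It makes strong simplifying assumptions: the images are already aligned, the morph is a plain cross-dissolve (a real face morph also warps the facial geometry), and the Valence/Arousal anchor values in ANCHORS are hypothetical placeholders, not the ones used in MorphSet. The point is only that each synthesized frame inherits a label interpolated by the same factor as the image.

```python
import numpy as np

# Hypothetical (valence, arousal) anchors, chosen only for illustration.
ANCHORS = {
    "neutral": np.array([0.0, 0.0]),
    "happy": np.array([0.8, 0.5]),
    "sad": np.array([-0.7, -0.4]),
    "disgusted": np.array([-0.6, 0.3]),
}


def blend(image_a, image_b, label_a, label_b, alpha):
    """Cross-dissolve two aligned images and interpolate their labels by the same factor."""
    image = (1.0 - alpha) * image_a + alpha * image_b
    label = (1.0 - alpha) * label_a + alpha * label_b
    return image, label


def morph_sequence(image_a, image_b, label_a, label_b, steps):
    """Return `steps` (image, label) pairs going from A (alpha=0) to B (alpha=1)."""
    return [blend(image_a, image_b, label_a, label_b, a)
            for a in np.linspace(0.0, 1.0, steps)]


# Tiny stand-in "images" so the sketch runs without any dataset.
neutral_img = np.zeros((4, 4))
happy_img = np.ones((4, 4))
sad_img = np.full((4, 4), 0.2)
disgusted_img = np.full((4, 4), 0.6)

# Type 1: Neutral to Apex (different intensities of one expression).
frames = morph_sequence(neutral_img, happy_img,
                        ANCHORS["neutral"], ANCHORS["happy"], steps=5)

# Type 2: Apex1 to Apex2 (mixtures of expressions).
mixtures = morph_sequence(sad_img, disgusted_img,
                          ANCHORS["sad"], ANCHORS["disgusted"], steps=5)

# Type 3: Neutral to a 50/50 mix of two apex expressions.
mix_img, mix_label = blend(sad_img, disgusted_img,
                           ANCHORS["sad"], ANCHORS["disgusted"], 0.5)
mixed_intensities = morph_sequence(neutral_img, mix_img,
                                   ANCHORS["neutral"], mix_label, steps=5)
```

The number of steps per morph is what would control the augmentation factor and the sample distribution in such a scheme.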
V. Vonikakis, D. Neo Yuan Rong, S. Winkler. (2021). MorphSet: Augmenting categorical emotion datasets with dimensional affect labels using face morphing. To appear in Proc. ICIP2021, Alaska, USA, September 2021.
Repo demonstrating the morphing process: https://github.com/dexterdley/MorphSet
Repo demonstrating how to build a facial expression analysis system from such morphing dataset: https://github.com/bbonik/facial-expression-analysis
Facial landmarks are important in many forms of face analysis, such as face recognition and facial expression analysis. State-of-the-art landmark estimation methods are usually based on Deep Learning and estimate 3D coordinates. As such, apart from landmark detection, they also solve the frontalization problem, because a simple transformation can project the 3D landmarks to a frontal view. Although effective, these methods are computationally demanding, which makes them difficult to use in real-time applications where computational resources do not allow the use of GPUs. In these cases, 2D facial landmark estimation methods may still be very useful, since they can achieve over 60fps on commodity hardware. The downside of 2D facial landmark techniques is that they require a separate frontalization step (estimating how the points would look if viewed from a frontal orientation of the face). Face frontalization is very important because it improves many of the subsequent algorithms, such as facial expression analysis (estimating expressions on a non-frontal face is far more challenging and inaccurate than on a frontal one, especially when the training datasets contain approximately frontal faces).
We propose a fully data-driven frontalization technique for 2D facial landmarks, designed to aid in the analysis of facial expressions. It employs a new normalization strategy that aims to minimize identity variations by displacing groups of facial landmarks to standardized locations. The technique operates directly on 2D landmark coordinates and does not require additional feature extraction, so it is computationally light. It is based on the assumption that the coordinates of the frontal view can be approximated as a linear combination of the coordinates of any non-frontal view. We combine several datasets which include different faces captured at the same time under many viewpoints (different Yaw/Pitch combinations), and fit a linear model that maps all non-frontal views of a face to the frontal one. Essentially, we learn a set of weights that, given non-frontal 2D landmark coordinates, estimate the frontal view as a linear combination of the input coordinates.
The approach is fully data-driven and learns directly from datasets (different viewpoints of the same face). It is computationally very "light", since it entails only one vector-matrix multiplication. In the image and video below you can see results of the proposed approach. The video shows the trained system running in real-time (in Matlab), processing a live video stream from a web camera and frontalizing the non-frontal input 2D landmarks. This work has been accepted at the IEEE ICIP2020 conference.
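To make the "one vector-matrix multiplication" idea concrete, here is a minimal sketch using ordinary least squares on synthetic data. Everything in it is an assumption made for illustration: the 68-point layout (as produced by DLIB's standard predictor), the synthetic "pose" distortion, the bias term, and the function names. It omits the normalization step described above and is not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_landmarks = 68   # assumed landmark layout
n_samples = 500
dim = 2 * n_landmarks

# Synthetic stand-in for training pairs. Each row is a flattened set of
# (x, y) coordinates; "pose" is simulated as a fixed linear distortion.
frontal = rng.normal(size=(n_samples, dim))
distortion = np.eye(dim) + 0.02 * rng.normal(size=(dim, dim))
non_frontal = frontal @ distortion


def add_bias(x):
    """Append a constant 1 to every row (assumption: the model may learn an offset)."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


# Learn the weights that map non-frontal coordinates to frontal ones.
weights, *_ = np.linalg.lstsq(add_bias(non_frontal), frontal, rcond=None)


def frontalize(landmarks_xy):
    """Map an (n_landmarks, 2) array to its estimated frontal view."""
    flat = landmarks_xy.reshape(1, -1)
    return (add_bias(flat) @ weights).reshape(-1, 2)


estimate = frontalize(non_frontal[0].reshape(-1, 2))
```

Once the weights are learned, inference is a single small matrix product per face, which is why such a model can run in real time on a CPU.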
Python code is freely available here. The function takes landmarks detected by the DLIB library and can return the frontalized version of them.
V. Vonikakis, S. Winkler. (2020). Identity Invariant Facial Landmark Frontalization for Facial Expression Analysis. Proc. ICIP2020, Abu Dhabi, October 2020.
Repo demonstrating the frontalization process: https://github.com/bbonik/facial-landmark-frontalization
Data is abundant nowadays. Advances in big data analytics have contributed to the notion that "bigger is better". However, little attention is usually given to feature distributions in the dataset. Many times, datasets can be highly unbalanced: some values/categories may be over-represented, while others may be under-represented. Such imbalance may have a negative impact on many machine learning techniques: the learning algorithm may be very accurate for the over-represented classes, while exhibiting a very high error for the under-represented ones. Oversampling (replicating the under-represented classes) or undersampling (reducing the over-represented classes) are two typical approaches to address this problem.
We introduce a new undersampling MILP-based dataset shaping technique. The proposed optimization leverages the (possible) redundancies in a large dataset to generate a more compact version of the original dataset with a specified target distribution across each dimension, while simultaneously minimizing linear correlations among dimensions.
In summary, given a large dataset and a required target distribution, our MILP optimisation method creates a compact subset of the original dataset by finding the optimal combination of datapoints that:
Enforces the target distribution across all dimensions.
Minimizes linear correlations between dimensions.
As such, our technique can be seen as complementary to dimensionality reduction: instead of reducing feature dimensions while maintaining the number of observations, we reduce the number of observations while imposing distributional constraints on the dimensions.
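The selection problem can be sketched as a small mixed-integer linear program. The sketch below uses `scipy.optimize.milp` (SciPy 1.9 or later) and covers only the first objective, matching a target histogram in every dimension; the correlation term is left out for brevity. The function names, the binning scheme and the synthetic data are assumptions made for this example, not the published formulation.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp


def bin_membership(data, n_bins):
    """Return a (dims * n_bins, n_points) 0/1 matrix marking each point's bin per dimension."""
    n_points, n_dims = data.shape
    member = np.zeros((n_dims * n_bins, n_points))
    for dim in range(n_dims):
        column = data[:, dim]
        inner_edges = np.linspace(column.min(), column.max(), n_bins + 1)[1:-1]
        bins = np.digitize(column, inner_edges)  # values in 0 .. n_bins - 1
        member[dim * n_bins + bins, np.arange(n_points)] = 1.0
    return member


def shape_subset(member, targets, subset_size):
    """Pick `subset_size` points whose per-bin counts are as close as possible to `targets`."""
    n_rows, n_points = member.shape
    # Variables: one binary selector per point, then positive and negative
    # deviations from the target count for every (dimension, bin) row.
    cost = np.concatenate([np.zeros(n_points), np.ones(2 * n_rows)])
    # member @ x + pos - neg == targets, so pos + neg equals |count - target|.
    hist = np.hstack([member, np.eye(n_rows), -np.eye(n_rows)])
    size = np.concatenate([np.ones(n_points), np.zeros(2 * n_rows)])[None, :]
    constraints = [
        LinearConstraint(hist, targets, targets),
        LinearConstraint(size, np.array([subset_size]), np.array([subset_size])),
    ]
    integrality = np.concatenate([np.ones(n_points), np.zeros(2 * n_rows)])
    lower = np.zeros(n_points + 2 * n_rows)
    upper = np.concatenate([np.ones(n_points), np.full(2 * n_rows, np.inf)])
    result = milp(cost, integrality=integrality,
                  bounds=Bounds(lower, upper), constraints=constraints)
    return result.x[:n_points] > 0.5


# Skewed synthetic data, shaped towards a uniform histogram in both dimensions.
rng = np.random.default_rng(0)
data = rng.beta(2.0, 5.0, size=(300, 2))
n_bins, per_bin = 4, 5
member = bin_membership(data, n_bins)
targets = np.full(member.shape[0], float(per_bin))
mask = shape_subset(member, targets, subset_size=n_bins * per_bin)
deviation = np.abs(member @ mask - targets).sum()
```

Other target shapes (Gaussian, Weibull) would only change the `targets` vector; the constraints stay the same.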
The following figure depicts covariance scatter plots for a 6-dimensional dataset with 11K data points. Distribution for each dimension is given by a histogram, while Pearson correlation rho between dimensions and corresponding p-value (in parentheses) are mentioned for each scatter plot. Dimension 6 (D6) is a linear combination of D1 and D4. Three subsets of 1K datapoints are generated with our data shaping technique, so as to have Uniform, Gaussian and Weibull distributions, while minimising correlations between different dimensions.
Our approach may be used for dataset shaping in many different domains, such as in:
Machine learning to (i) create balanced training subsets, with uniform distributions, out of larger unbalanced datasets, or (ii) evaluate performance of an algorithm across datasets with different distributions.
Crowdsourcing, to compile compact-yet-representative subsets of data for crowd workers to interact with. Fewer items would make the workers' task manageable, improving the quality of the results while reducing the cost of the study.
Vonikakis, V., Subramanian, R., Arnfred, J., & Winkler, S. (2017). A Probabilistic Approach to People-Centric Photo Selection and Sequencing. IEEE Transactions on Multimedia. DOI: 10.1109/TMM.2017.2699859.
Vonikakis, V., Subramanian, R., Winkler, S. (2016). Shaping datasets: Optimal data selection for specific target distributions across dimensions. Proc. IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, Sept. 25-28.
Facial Expression Analysis - Realtime Continuous Estimation of Emotion Attributes
Human decisions/actions can be influenced by emotions. Emotions are conveyed more by non-verbal cues (e.g. facial expressions) than by actual words, and can reveal the true impressions, thoughts or internal state of a person. To this end, we have developed a prototype system that automatically analyzes human facial expressions in images or videos and provides quantitative measures of their characteristics.
Most approaches follow the categorical approach, identifying 7 prototypical discrete emotions/expressions (happy, sad, surprised, angry, neutral, disgusted, afraid). In everyday interactions however, people exhibit non-basic, subtle and rather complex affective states. As such, a single label (or any number of discrete classes) may not reflect this complexity. In the following examples, it is evident that these methods do not differentiate between variations in intensity (slightly happy, very happy etc.) or variations in emotion (positively surprised or negatively surprised).
On the other hand, we follow the dimensional paradigm, used in the psychology domain. We estimate emotional states according to 3 continuous emotion attributes.
As such, emotions are represented as points on the Arousal-Valence plane, and instead of classification we approach the problem as a regression. We believe that this approach is a “richer” way of describing emotions, because it can differentiate between subtle intensity expressions and subtle expression variations.
In order to implement this dimensional approach for facial expression analysis, we utilized a new dataset augmentation/transformation approach, which converts categorical emotion datasets to dimensional datasets with automatic Arousal, Valence and Intensity annotations. The approach offers an augmentation factor of 44x. We combined 3 datasets of 2597 original categorical images, which resulted in a dimensional dataset of 114268 training images. This approach has been described in one of the sections above.
In order to address differences in headpose, we also employed a landmark-based frontalization. For that we used 5 different datasets in order to train the frontalization module. This approach also has been described in one of the sections above.
The dimensional estimation of emotions is based solely on geometrical features, learned by training a Partial Least Squares (PLS) model.
In the above image we see the top 500 positive and negative weights (distances) for Arousal, Valence and Intensity, learnt using the PLS approach. Thicker and darker lines indicate larger weights.
Arousal features: Distances that encode the position of the eyebrows in relation to the eyes, seem to be associated more with high Arousal. The inner parts of the eyebrow seem to be more important than the outer parts.
Valence features: Increases in Valence are linked to lengthening of distances that encode the eccentricity of the mouth, especially in relation to the mouth corners. Additionally, increasing the distance between the middle of the eyebrows, as well as slightly lifting them (relative to the eyes), also contributes to higher Valence. Interestingly, this coincides with previous reports about discriminating genuine from fake smiles, showing that the major difference lies in muscular activity around the eyes. On the other hand, distances that lead to a squarish opening of the mouth, along with increasing the outer corner-to-corner distances between mouth-eyebrow pairs (either by lowering the mouth corners, or raising the outer eyebrow corners), are associated with negative Valence.
Intensity features: Intensity seems to be a combination of the characteristics of Arousal and Valence. Again, the inner part of the eyebrows, in relation to the eyes, seems to be important, along with the eccentricity of the mouth.
Here are 2 videos demonstrating the system that we developed. The system is able to analyse faces in images or in videos (in real-time) and give estimations about their Arousal, Valence and Intensity of expression. It is robust to head movements, it can estimate the gaze of a person and has minimal hardware requirements (a typical PC and a simple webcam). The second video demonstrates the results of the algorithm in natural, non-posed expressions taken from an interview.
Possible applications include:
Interactive advertising - “How many people are actually looking at the board?”
Crowd affective analytics. Estimating the emotional distribution in a specific place/room - “What is the vibe in this room/space?”
Emotional feedback on mobile apps (news, social networks, online shopping) - “What is the reaction of the users using the app?”
Customer satisfaction/profiling - “What is the customer mood before/after the counter?”
Also, an article describing this work can be found here: http://adsc.illinois.edu/news/newly-developed-software-can-interpret-emotions-real-time
Case Study: Analysis of the 1st presidential debate (Sep 26, 2016)
We downloaded the video of the 1st presidential debate between Hillary Clinton and Donald Trump from YouTube and analyzed it.
We separately analyzed the emotion profile of each candidate. The following images depict the results:
The above image depicts the emotion distribution of each candidate throughout the duration of the debate. The distribution is depicted in the Arousal-Valence space and it is superimposed with the locations of emotion-related adjectives, as defined by the paper:
G. Paltoglou and M. Thelwall, "Seeing Stars of Valence and Arousal in Blog Posts," in IEEE Transactions on Affective Computing, 4(1), pp. 116-123, 2013.
Donald Trump exhibits a distinctive emotion distribution, resembling almost a straight line in the negative part of the Arousal-Valence space. Hillary Clinton exhibits a distribution closer to neutral, with a small part extending to the positive side of the space. The above case study is a simple demonstration of the usefulness of the proposed dimensional facial expression analysis approach.
The cartoon character used in the interface is provided by: http://studiolab.ide.tudelft.nl/studiolab/pmri/
Vastenburg MH, Romero NA, van Bel DT, Desmet PMA (2011) PMRI: development of a pictorial mood reporting instrument. In: Proceedings of CHI 2011, Vancouver, BC, Canada.
V. Vonikakis, S. Winkler. (2020). Identity Invariant Facial Landmark Frontalization for Facial Expression Analysis. Proc. ICIP2020, Abu Dhabi, October 2020.
Free Python code repo for Dimensional Facial Expression Analysis with DLIB. https://github.com/bbonik/facial-expression-analysis
Continuous Happiness Intensity Estimation
This research is our submission to the 2016 EmotiW competition for the Group Happiness assessment sub-challenge. Our team was ranked 2nd among 7 other teams. We were the only winning team that did not use deep learning techniques and still got competitive results.
The objective of the Group Happiness assessment is to estimate the happiness of a group of people and quantify its intensity into 6 levels, ranging from 0 (neutral) up to 5 (thrilled). The challenge is threefold. First, we have to deal with "in the wild" conditions of faces, which means a lot of variability in the data (different ages, races, headposes, illumination, resolutions etc). Second, the provided dataset (HAPPEI) is quite unbalanced: 2 happiness intensity annotations occupy ~80% of the whole dataset. Third, the dataset is small: the training and validation sets combined amount to only ~2.6K samples.
Our approach is based on geometric facial features. Geometric features are largely overlooked in the facial expression analysis community in favor of appearance-based features (LBPs, Gabor filters or even plain pixel values). This is because geometric features require one additional step of facial landmark detection, which until recently was not very accurate. The problem with appearance-based features is that they are affected by appearance :) . This means that they are affected by age, gender, race, facial hair, illumination, texture and small occlusions like hair or spectacles. Facial expression analysis is about analyzing facial expressions, not identity (as in face recognition). Consequently, the aim is to be invariant to appearance and to all the above variations. To achieve this with appearance-based features, you need very large datasets that include many instances of the above appearance variations, so that your machine learning algorithm can generalize correctly and learn facial expressions which are invariant to appearance. On the other hand, if high quality facial landmarks are available, then the problem of appearance variations is largely gone! Good quality landmarks are mostly invariant to age, gender, race, texture or even small occlusions. This means that you don't need large datasets to train your machine learning algorithm.
Our geometric features are based on 49 facial landmarks. Once the points are detected, we normalize them to a common scale and compute the Euclidean distances between all possible pairs of points. This results in a feature vector of 1176 dimensions (49 x 48 / 2), which is invariant to rotation (roll) and to scale. The following picture depicts these 1176 distances.
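The feature extraction described above can be sketched in a few lines of NumPy. The particular normalization used here (centering on the centroid and dividing by the RMS distance from it) is an assumption for the sake of the example; the text only states that the points are brought to a common scale.

```python
import numpy as np


def distance_features(landmarks):
    """Return all pairwise Euclidean distances of scale-normalized landmarks."""
    centered = landmarks - landmarks.mean(axis=0)
    scale = np.sqrt((centered ** 2).sum(axis=1).mean())  # RMS distance from centroid
    normalized = centered / scale
    rows, cols = np.triu_indices(len(landmarks), k=1)  # every unordered pair once
    return np.linalg.norm(normalized[rows] - normalized[cols], axis=1)


rng = np.random.default_rng(0)
points = rng.normal(size=(49, 2))
features = distance_features(points)

# The same face rotated in-plane (roll), scaled and shifted gives the same features.
angle = np.deg2rad(30.0)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
transformed = 2.5 * points @ rotation.T + np.array([3.0, -1.0])
features_transformed = distance_features(transformed)
```

Distances are unchanged by rotation and translation, and the normalization removes scale, which is where the roll and scale invariance comes from.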
We use Partial Least Squares (PLS) regression to learn a mapping between the geometric features and the provided happiness annotations of the ground truth faces. PLS has 2 main advantages. First, it performs dimensionality reduction and estimation in a single step. Second, it works by projecting the data to latent components, which is very useful for discovering underlying factors and relations.
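For readers unfamiliar with PLS, here is a compact NumPy sketch of single-output PLS regression (the classic NIPALS-style PLS1 recursion). It is a generic textbook version run on synthetic data, not the code used for the submission; the function name and the toy data are assumptions.

```python
import numpy as np


def pls1_fit(x, y, n_components):
    """Fit single-output PLS; return (coefficients, intercept) for y ~ x @ coef + intercept."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    x_res, y_res = x - x_mean, y - y_mean
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = x_res.T @ y_res                # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = x_res @ w                      # latent component (scores)
        tt = t @ t
        p = x_res.T @ t / tt               # how x loads on the component
        q = y_res @ t / tt                 # how y loads on the component
        x_res = x_res - np.outer(t, p)     # deflate before the next component
        y_res = y_res - q * t
        weights.append(w)
        loadings.append(p)
        y_loadings.append(q)
    w_mat = np.column_stack(weights)
    p_mat = np.column_stack(loadings)
    coef = w_mat @ np.linalg.solve(p_mat.T @ w_mat, np.array(y_loadings))
    return coef, y_mean - x_mean @ coef


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = x @ true_coef + 2.0 + 0.01 * rng.normal(size=200)

# With as many components as features, PLS coincides with ordinary least squares.
coef_full, intercept_full = pls1_fit(x, y, n_components=5)
# With fewer components, the fit is restricted to a low-dimensional latent space.
coef_small, intercept_small = pls1_fit(x, y, n_components=2)
```

In the setting described above, `x` would hold the 1176 distances per face and `y` the happiness annotation, with far fewer latent components than features.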
In order to deal with the unbalanced HAPPEI dataset, we use a Mixed-Integer Linear Programming (MILP) technique that creates balanced subsets of data. The main idea is that, instead of directly training on the given unbalanced dataset, we train on a balanced subset of it. The intuition is that, since we do not know the distribution of the test set, it is best to train with a uniform distribution. This way you may achieve lower errors in the under-represented classes. The following image depicts the balanced subset of HAPPEI with which we trained our system.
After training on a balanced subset of the HAPPEI dataset, our system learned to estimate a real number in the interval [0,5], which reflects the happiness intensity of a face. The important thing here is that, if there is enough headpose variation in the data, the system will learn to estimate happiness intensity independently of headpose.
Our approach can also give important insights regarding the contribution of each particular facial distance to the whole perceived happiness. The following image depicts the top 100 facial distances that contribute positively and negatively to the perception of happiness. It seems that distances that encode at the same time mouth openness and eccentricity contribute a lot to the perception of happiness on a face. Other than that, the internal points of the eyebrows seem also important.
Once trained, the method is very fast and runs in real-time even as a Matlab implementation. The following video is a screen capture of the actual system. Our estimation is very tolerant with regard to headpose, rotation and scaling.
Vonikakis, V., Subramanian, R., Winkler, S. (2016). Shaping datasets: Optimal data selection for specific target distributions across dimensions. Proc. IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, Sept. 25-28. (Matlab code available here)
Vonikakis, V., Yazici, Y., V. D. Nguyen, Winkler, S. (2016). Group Happiness Assessment Using Geometric Features and Dataset Balancing. Proc. ICMI2016, Tokyo, Japan, Nov. 9-13. 2nd Place in EmotiW2016 Group Happiness Challenge.
Emotion Recognition using Deep Learning (Convolutional Neural Networks)
This research is our submission to the 2015 EmotiW competition for the Static image sub-challenge. Our team was ranked 3rd among 17 other teams. Our best performance was 55.6%.
The challenge of the EmotiW dataset is twofold. First, the images are "in the wild", meaning that they have large variability in terms of illumination, headpose, gender, age, race, resolution and intensity of expression. Second, the dataset is very small, comprising 1.5K images. In this context, we used a Deep Learning approach, training Convolutional Neural Networks (CNNs), in order to classify unseen images of faces into the 7 prototypical universal emotions. In order to avoid overfitting, we employed Transfer Learning and supervised fine-tuning. Instead of training the whole network with the EmotiW dataset, we pre-trained first with generic images (from the ImageNet competition) and then fine-tuned the network weights with emotion-related images. Fine-tuning was done in 2 consecutive stages, first with the FER2013 facial expression dataset and then with the EmotiW2015 dataset.
More specifically, we start with a pre-trained CNN and randomly initialize only the last fully connected layer. Then we train with the FER2013 dataset using a very low learning rate, which diminishes over time. This is done in order to avoid changing the weights of the whole network too aggressively. Once the loss plateaus, we train again with the EmotiW dataset using the same learning rate profile.
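The two-stage schedule can be pictured with a small framework-free sketch. It assumes a step decay (the rate is multiplied by a constant factor every fixed number of iterations); the base rate, decay factor, step size and iteration counts are hypothetical values chosen for illustration, not the ones used in the submission.

```python
def step_decay(base_lr, gamma, step_size, iteration):
    """Learning rate at `iteration`: multiplied by `gamma` every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)


# Hypothetical plan: each stage restarts the same decaying profile.
STAGES = [("FER2013", 3000), ("EmotiW", 3000)]


def schedule(base_lr=1e-3, gamma=0.1, step_size=1000):
    """Yield (dataset, iteration, learning_rate) for every fine-tuning step."""
    for dataset, n_iterations in STAGES:
        for iteration in range(n_iterations):
            yield dataset, iteration, step_decay(base_lr, gamma, step_size, iteration)


plan = list(schedule())
```

In practice the switch between stages would be triggered by the loss plateauing rather than by a fixed iteration count.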
The pre-trained CAFFE models of our submission can be freely downloaded from here.
H.-W. Ng, V. D. Nguyen, V. Vonikakis, S. Winkler. (2015). Deep learning for emotion recognition on small datasets using transfer learning. Proc. ICMI2015, Seattle, WA, Nov. 9-13.
FashionMatch - Fashion Optimized Image Retrieval in the Wild
This work is part of an actual, working real-life system. The use case is the following:
The user sees a garment that she likes, either in a photo, window, or on another person. She snaps a photo of it with her phone. The photo is uploaded to the server and the server returns visually similar items, along with links (or GPS directions) of where to buy the garment.
Since the system is designed to work "in the wild", it incorporates many state-of-the-art technologies:
Human pose estimation in order to identify the skeleton of the person wearing the item
Automatic garment/background segmentation; no human intervention is needed (e.g. drawing a bounding box)
Extraction of fashion optimized features: sleeve length, dress length, neckline profile, general proportions, shape, pattern, color
Shadow compensation for underexposed image regions: recovering the actual color of the object in real-life imaging conditions
Adjustable metric for defining the "similarity" between images
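One simple way to picture an adjustable similarity metric is a weighted distance over per-attribute features, where the weights decide what "similar" means for a given query. The sketch below is purely illustrative: the feature subset, the [0, 1] encoding, the weights and the function name are all assumptions, and the metric of the actual system may differ.

```python
import numpy as np

# Hypothetical subset of the fashion features, each encoded in [0, 1].
FEATURES = ["sleeve_length", "dress_length", "color"]


def rank_items(query, database, weights):
    """Return database indices sorted from most to least similar to the query."""
    w = np.array([weights[name] for name in FEATURES])
    scores = np.abs(database - query) @ w  # weighted L1 distance per item
    return np.argsort(scores)


query = np.array([0.9, 0.5, 0.2])
database = np.array([
    [0.9, 0.5, 0.9],   # item 0: same cut as the query, different color
    [0.1, 0.5, 0.2],   # item 1: same color as the query, different sleeves
])

by_color = rank_items(query, database,
                      {"sleeve_length": 0.1, "dress_length": 0.1, "color": 1.0})
by_shape = rank_items(query, database,
                      {"sleeve_length": 1.0, "dress_length": 1.0, "color": 0.1})
```

Changing the weights reorders the same database without recomputing any features.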
The system consists of a front-end part (either a browser or an iOS/Android application) and a back-end server. Searching for similar items within a database of 150,000 items takes approximately 4 seconds on a typical PC with 8 GB of RAM. The following video is an actual recording of the working system (iOS app).
The technology has already been licensed to 3 start-up companies and has received media exposure.
Near-Duplicate Detection in Personal Photo-Collections
Most of us have at least one digital camera with us at all times (our smartphones). It is so easy and affordable nowadays to take digital images that we often overdo it: we tend to capture more than one picture of the same scene, in order to increase the chances of having a good-quality shot.
This however, has led to a constant increase of photolibrary size and has introduced a new important problem: our photo-collections are cluttered with images that are slightly different, but depict the same or almost the same scene. These images are known as “near-duplicates” (NDs) and have a negative impact not only on the size of photolibraries, but generally on the quality of photo-managing and browsing experience.
Take a look at the above near-duplicate photos. Would you like to include all 4 in your photolibrary (if these were your photos), or would you rather select only one of them (probably the one with the highest image quality)? Before you can select between them, however, there is a very important step: identifying which images in a photo-collection are near-duplicates.
Identifying near-duplicates is a challenging task. Existing methods end up taking binary decisions: two images either are or are not near-duplicates; no intermediate cases can exist. However, this task is highly subjective. Our user studies have shown that all observers totally agree on whether or not a pair of photos are near-duplicates for only 18% of the images (QoMEX2013). This essentially means that if an algorithm identifies a pair of images as near-duplicates, there is an 82% chance that some users will disagree. For example, take a look at the following 4 images. Most of you would agree that images A and B are near-duplicates. But what about A and C, or A and D? There would be a lot of disagreement for these cases.
We have developed a new algorithm, named PhotoCluster (to appear VISAPP2014), which attempts to tackle this problem: it estimates the probability that a pair of images may be considered near-duplicates by users. First, PhotoCluster partitions the photolibrary into groups of semantically similar photos, using only global features (timestamps, color histograms, geotagging etc.). Then, a multiple clustering step is applied to the images within these groups, using a combination of global and local features. Computationally expensive comparisons between local features are applied only to a limited part of the library, resulting in a low overall computational cost. Tests have shown that PhotoCluster exhibits promising results compared to existing methods, especially in identifying ambiguous near-duplicate cases, with faster execution times.
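The computational saving of the coarse-to-fine idea can be illustrated with a toy sketch. It partitions photos using timestamps only (one of the global features mentioned above) and then lists the pairs that would still need an expensive local-feature comparison. The gap threshold, the function names and the sample timestamps are hypothetical, and the real PhotoCluster partitioning uses more features than this.

```python
from itertools import combinations


def partition_by_time(timestamps, max_gap):
    """Group photo indices, starting a new group when the time gap exceeds `max_gap` seconds."""
    if not timestamps:
        return []
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    groups, current = [], [order[0]]
    for previous, following in zip(order, order[1:]):
        if timestamps[following] - timestamps[previous] > max_gap:
            groups.append(current)
            current = []
        current.append(following)
    groups.append(current)
    return groups


def candidate_pairs(groups):
    """Only pairs inside the same group go on to the expensive comparison."""
    return [pair for group in groups for pair in combinations(group, 2)]


timestamps = [0, 5, 8, 600, 604, 3000]          # capture times in seconds
groups = partition_by_time(timestamps, max_gap=60)
pairs = candidate_pairs(groups)
all_pairs = list(combinations(range(len(timestamps)), 2))
```

Here only 4 of the 15 possible pairs remain candidates, and the gap between the two counts widens quickly as the library grows.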
The above image includes the results of PhotoCluster, in comparison to other methods and to ground truth (GT). PhotoCluster not only identifies the obvious near-duplicate pair (AB), but also predicts the ambiguous cases (AC, AD, BC, BD).
One challenging issue in this kind of research is the lack of annotated datasets for near-duplicates, taken from real personal photo-collections. Existing datasets comprise frames taken from news clips, movies, sports events, buildings, objects etc. These images can be very different from personal photo-collections, in which images mostly include people travelling, family moments or activities with friends. In many of the existing datasets, artificial degradations such as cropping, blurring or other kinds of filtering are applied to the original set of images in order to create variations of the originals, with the latter serving as ground truth (GT). These kinds of degradations, however, may differ from the real ones.
We have created a new dataset, called California-ND, that may assist researchers in testing algorithms for the detection of near-duplicates in personal photo libraries. California-ND is derived directly from an actual personal travel photo collection. It contains many difficult cases and types of near-duplicates. More importantly, in order to deal with the inevitable ambiguity that the near-duplicate cases exhibit, the dataset is annotated by 10 different subjects. These annotations are combined into a non-binary GT, which indicates the probability that a pair of images may be considered a near-duplicate by an observer. The following image depicts some of the cases included in the dataset.
Vonikakis, V., Jinda-Apiraksa, A., & Winkler, S. (2014). PhotoCluster: A Multi-clustering Technique for Near-duplicate Detection in Personal Photo Collections. VISAPP 2014. (pp. 153-161). Lisbon, Portugal.
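A non-binary ground truth of this kind can be sketched as follows, assuming the simplest possible combination rule: the probability for a pair is the fraction of subjects who marked it as a near-duplicate. The vote matrix is made up for the example, and the exact combination used for California-ND may differ.

```python
import numpy as np

# Made-up votes: rows are the 10 subjects, columns are 4 image pairs.
# 1 means "these two images are near-duplicates".
yes_votes_per_pair = [10, 7, 3, 0]
votes = np.array([[1 if subject < count else 0 for count in yes_votes_per_pair]
                  for subject in range(10)])

# Non-binary ground truth: fraction of subjects who said "near-duplicate".
probability = votes.mean(axis=0)

# Pairs on which every subject agrees (all "yes" or all "no").
unanimous = (probability == 0.0) | (probability == 1.0)
full_agreement_rate = unanimous.mean()
```

In this toy example only the first and last pairs are unanimous; the two in the middle are exactly the ambiguous cases that a binary label cannot express.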
Jinda-Apiraksa, A., Vonikakis, V. & Winkler, S. (2013). California-ND: An annotated dataset for near-duplicate detection in personal photo collections. QoMEX 2013. (pp. 142-147). Klagenfurt, Austria. (Dataset available)
California-ND is freely available from this website: http://vintage.winklerbros.net/californiaND.html
Illumination invariant feature detection: iiDoG (illumination invariant Difference of Gaussians)
Although almost all computer vision tasks are shifting towards deep learning nowadays, SIFT is still used in some cases where complete control is required and low computational resources are available, e.g. registration tasks. Although SIFT has been designed to exhibit a degree of illumination invariance, it falls short when uneven illumination conditions exist. This can be the case in many outdoor scenes, where HDR imaging conditions are usual. This problem can be summarized in the following figure (right click, and 'open in a new tab' in order to see a larger version of the image).
This is a simple scene captured under 3 different illumination conditions: a. strong uniform illumination (well-exposed image), b. low uniform illumination (underexposed image) and c. non-uniform illumination (partially underexposed image). Since it is exactly the same scene, if SIFT were indeed illumination invariant, it should extract approximately the same number of features in all cases. The results are depicted for 5 different detector thresholds. There are clearly considerable differences between most of the cases. The reason for this discrepancy can be seen in the following figure.
This is a scene containing 2 color checkers (yes, there is a color checker in the shadow if you look carefully). The scene is designed to have strong uneven illumination. You can see the image intensity (B), as well as the result of the DoG operator (C), for a single scanline passing through the achromatic boxes of the color checkers. The magnitude of the gradient in the shadowed area is considerably smaller than in the well-exposed region. The SIFT detector is based on the local minima and maxima of the gradient extracted by the DoG operator. The local minima and maxima are indeed invariant to illumination changes. However, in practice, a global threshold is used in order to filter out the points corresponding to noise and not to surface properties.
Now take a look at image C of the above figure. Can you find a global threshold that will include the correct features in the shadowed area, as well as in the well-exposed one? It is very difficult indeed, because the threshold value would have to be set very low, and this would inevitably result in the extraction of noisy features. More importantly, as soon as the camera moved and no shadow was visible, this low threshold would be totally inappropriate for the new scene. Now try to find a global threshold for the operator depicted in image D. It is much easier, since the gradient magnitude in the shadowed region is carefully amplified. This makes it easier to set a global threshold in order to filter out any noisy features, even in the shadowed areas.
In conclusion: the local minima and maxima of the SIFT detector are indeed invariant to illumination changes. However, the global threshold used to filter out the noisy feature points is greatly affected by the magnitude of the gradient. Setting a global threshold will inevitably result in either not detecting keypoints in the shadowed areas, or detecting many noisy points.
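The threshold problem can be reproduced on a synthetic scanline. The following is a minimal sketch, not the code behind the figures: the reflectance edge, the 10x illumination drop, the scales and the threshold value are all made-up illustration values. Because the DoG is linear, the same edge produces a response that is exactly 10x weaker in the shadow, so a single threshold cannot treat both cases alike.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(j * j) / (2 * sigma * sigma)) for j in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma):
    """Gaussian smoothing of a 1-D signal; border samples are replicated."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [
        sum(kernel[j + radius] * signal[min(max(i + j, 0), last)]
            for j in range(-radius, radius + 1))
        for i in range(len(signal))
    ]

def dog_peak(signal, sigma=1.0, k=1.6):
    """Largest absolute Difference-of-Gaussians response along the scanline."""
    fine, coarse = smooth(signal, sigma), smooth(signal, k * sigma)
    return max(abs(f - c) for f, c in zip(fine, coarse))

# The same reflectance edge, once under strong light and once in a shadow.
reflectance = [0.2] * 20 + [0.8] * 20
lit = [1.0 * r for r in reflectance]
shadow = [0.1 * r for r in reflectance]   # assumed: 10x less light

threshold = 0.03                          # hypothetical global detector threshold
peak_lit = dog_peak(lit)
peak_shadow = dog_peak(shadow)

print(f"lit edge:    peak = {peak_lit:.4f}, kept = {peak_lit > threshold}")
print(f"shadow edge: peak = {peak_shadow:.4f}, kept = {peak_shadow > threshold}")
```

Lowering the threshold far enough to keep the shadowed edge would also keep any noise of comparable amplitude in the well-exposed part of the image.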
The proposed iiDoG operator is a combination of the classic DoG and the non-linear responses of the center-surround cells of the Human Visual System (here referred to as normalized DoG - nDoG).
The most important advantages of iiDoG are:
The magnitude of the gradient is normalized. As a result, one global detector threshold can be used for filtering out the noisy feature points. This results in the extraction of keypoints in both the underexposed and the correctly exposed image regions.
The improvements are specifically targeted ONLY at the shadowed image regions. The response is exactly the same as that of DoG in the well-exposed areas. Thus, no unpredictable departures from the classic DoG operator will occur in the correctly exposed regions.
It is very simple to implement (as the above figure indicates), since it is based on simple operations in the Gaussian pyramid, already computed by SIFT.
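To illustrate the first advantage, here is a minimal sketch of a contrast-normalized DoG, in which the difference of the two Gaussian responses is divided by their sum. This ratio form is only a generic stand-in for the nDoG idea and is an assumption of this example; it is not the exact iiDoG formula from the paper, and it does not reproduce the property that the response stays identical to DoG in well-exposed regions. The edge, the illumination factor and the threshold are made-up values.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(j * j) / (2 * sigma * sigma)) for j in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma):
    """Gaussian smoothing of a 1-D signal; border samples are replicated."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [
        sum(kernel[j + radius] * signal[min(max(i + j, 0), last)]
            for j in range(-radius, radius + 1))
        for i in range(len(signal))
    ]

def normalized_dog_peak(signal, sigma=1.0, k=1.6, eps=1e-12):
    """Peak of (G1 - G2) / (G1 + G2): a ratio, so a global light level cancels out."""
    fine, coarse = smooth(signal, sigma), smooth(signal, k * sigma)
    return max(abs(f - c) / (f + c + eps) for f, c in zip(fine, coarse))

reflectance = [0.2] * 20 + [0.8] * 20
lit = [1.0 * r for r in reflectance]
shadow = [0.1 * r for r in reflectance]   # assumed: 10x less light

threshold = 0.05                          # hypothetical global detector threshold
peak_lit = normalized_dog_peak(lit)
peak_shadow = normalized_dog_peak(shadow)

print(f"lit edge:    peak = {peak_lit:.4f}, kept = {peak_lit > threshold}")
print(f"shadow edge: peak = {peak_shadow:.4f}, kept = {peak_shadow > threshold}")
```

With the normalized response, the lit and the shadowed edge give practically the same peak, so one global threshold keeps or rejects both together.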
You can see a comparison between a DoG-based SIFT system and an iiDoG-based SIFT. For any threshold value, the proposed operator detects more correct feature points. Additionally, there are fewer incorrect matches. More information on this research can be found in the paper "A biologically inspired scale-space for illumination invariant feature detection" which is available here.
In the field of computer vision, there are many datasets focusing on different viewpoints, rotation and zooming of the scenes, in order to test the invariance of systems in these categories. However, very little attention is given to the actual illumination conditions, which may exist outdoors. The vast majority of previously presented benchmarks, regarding illumination invariance, are done by manually adjusting image brightness with image processing software. This approach, however, is far from realistic. The algorithm that adjusts the brightness in image processing software, does not necessarily exhibit the same results as those resulting from the exposure of a camera under real imaging conditions. This is even more true, if one considers the fact that the signal to noise ratio is always lower in underexposed image regions. Consequently, underexposed image regions captured by a real camera will have more noise, compared to images with manually-adjusted brightness.
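The difference between a real underexposure and a software brightness adjustment can be sketched with a toy noise model. The example below assumes a flat gray patch, photon shot noise approximated by a Gaussian with variance equal to the mean, and a fixed read noise; the photon counts and the read-noise value are hypothetical and do not describe any particular sensor.

```python
import math
import random
import statistics

READ_NOISE = 2.0  # electrons, hypothetical sensor value

def capture(mean_photons, n_pixels, rng):
    """Simulate a flat patch: shot noise (Gaussian approximation of Poisson) plus read noise."""
    return [
        rng.gauss(mean_photons, math.sqrt(mean_photons)) + rng.gauss(0.0, READ_NOISE)
        for _ in range(n_pixels)
    ]

def snr(samples):
    """Signal-to-noise ratio of a flat patch: mean over standard deviation."""
    return statistics.mean(samples) / statistics.stdev(samples)

rng = random.Random(0)
well_exposed = capture(1000.0, 5000, rng)
under_exposed = capture(10.0, 5000, rng)             # real capture with 100x less light
software_dark = [v / 100.0 for v in well_exposed]    # brightness reduced in software

snr_well = snr(well_exposed)
snr_under = snr(under_exposed)
snr_software = snr(software_dark)

print(f"well exposed:         SNR = {snr_well:.1f}")
print(f"really underexposed:  SNR = {snr_under:.1f}")
print(f"darkened in software: SNR = {snr_software:.1f}")
```

Scaling the pixel values scales signal and noise together, so the software-darkened patch keeps the SNR of the well-exposed capture, whereas the genuinely underexposed patch is much noisier.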
In order to fill this gap in the existing benchmark databases, a new dataset is proposed to realistically test the illumination invariance of algorithms. The dataset is called "Phos" (which means "light" in Greek) and contains various scenes under different combinations of uniform and non-uniform illumination. More particularly, every one of the 15 scenes of the database contains 15 different images: 9 images captured under various strengths of uniform illumination, and 6 images under different degrees of non-uniform illumination. The images contain objects of different shapes, colors and textures. Moreover, the objects are positioned in random locations inside the scene.
Vonikakis, V., Chrysostomou, D., Kouskouridas, R., & Gasteratos, A. (2013). A biologically inspired scale-space for illumination invariant feature detection. Measurement Science and Technology, 24(7), p. 074024 (13pp).
Vonikakis, V., Chrysostomou, D., Kouskouridas, R., & Gasteratos, A. (2012). Improving the Robustness in Feature Detection by Local Contrast Enhancement. IEEE International Conference on Imaging Systems and Techniques (IST 2012). (pp.158-163). Manchester, UK.
The Phos dataset is freely available at the following address: http://robotics.pme.duth.gr/phos2.html
Capturing High Dynamic Range (HDR) scenes with contemporary imaging technology inevitably results in degraded images, since there is no single exposure time adequate for both the light and the dark regions. This not only degrades the artistic quality of images but can also have a severe effect on many machine vision algorithms (e.g. stereo, feature extraction, motion estimation etc.).
On the contrary, the Human Visual System (HVS) can distinguish many visual details both in the shadows and highlights. Recent neurophysiological data imply that the shunting characteristics of the retinal ganglion cells play an important role in this mechanism. Based on these findings, a new enhancement function was introduced, which adjusts the intensity of every pixel according to its surround, similarly to the responses of the ganglion cells. As a result, the proposed function compensates for the under/overexposed image regions, without affecting the correctly exposed ones. Possible applications include consumer electronics (cameras, mobile phones, digital TV) or preprocessing for other vision algorithms.
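The idea of a surround-dependent mapping can be sketched on a 1-D scanline. The curve below is a generic Naka-Rushton-style function chosen for illustration; it is not the published enhancement function, it only handles underexposed regions, and the names `a_min` and `gain` as well as all numeric values are assumptions of this example.

```python
def local_mean(signal, radius):
    """Moving average used as a crude estimate of each pixel's surround."""
    last = len(signal) - 1
    window = 2 * radius + 1
    return [
        sum(signal[min(max(i + j, 0), last)] for j in range(-radius, radius + 1)) / window
        for i in range(len(signal))
    ]

def enhance(signal, radius=5, a_min=0.1, gain=50.0):
    """Map each pixel (range 0..1) with a curve whose shape depends on the surround.

    out = v * (1 + a) / (v + a): a small `a` gives a strong boost of dark values,
    a large `a` leaves the value almost unchanged.
    """
    surround = local_mean(signal, radius)
    out = []
    for value, s in zip(signal, surround):
        a = a_min + gain * s * s          # dark surround -> small a -> strong boost
        out.append(value * (1.0 + a) / (value + a))
    return out

shadow = [0.03, 0.06] * 15                # faint texture in an underexposed region
lit = [0.5, 0.7] * 15                     # texture in a correctly exposed region
scanline = shadow + lit
result = enhance(scanline)

print("shadow pixel:", scanline[10], "->", round(result[10], 3))
print("lit pixel:   ", scanline[45], "->", round(result[45], 3))
```

In this sketch the shadowed pixels are lifted substantially while the well-exposed ones move only slightly, which is the qualitative behaviour described above.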
As is evident, the proposed method compensates both for the underexposed and overexposed image regions, caused by various HDR scenes. One very important characteristic of the proposed algorithm is that it does not affect the correctly exposed image regions. This algorithm is part of the Orasis imaging software, which you can freely download from the Software page.
During the 3rd CREATE conference in Bristol (14 - 17 October 2008), an important psychophysical experiment was conducted, which attempted to measure the sensation of appearance, generated by the HVS, in both Low Dynamic Range (LDR) and HDR scenes. This experiment utilized two identical scenes, with 100 wooden facets, painted with 11 different pigments, illuminated in two different ways: a diffuse light (LDR scene) and a directional light (HDR scene). The scene was captured by different cameras. Additionally, an artist painted in watercolors the appearance that she experienced while observing the two scenes. An extensive description of the experiment, the available measurements, as well as its potential use for analyzing Spatial Rendering algorithms can be found here.
These data are very important because they can be used in order to evaluate the performance of algorithms which attempt to mimic human vision. The visual comparison between the original scenes, the processing of Orasis and the appearance of the scene is depicted in the following image:
It is evident that Orasis transforms the original image to a version which is closer to the scene's appearance, thus justifying the motto "Bridging the gap between what you see and what the camera captures".
The following videos are actual recordings of the Orasis application for iOS devices and the desktop version for Windows.
Vonikakis, V., Andreadis, I., & Gasteratos, A. (2007). Fast Dynamic Range Compression for Grey Scale Images. International Workshop on Advanced Image Technology (IWAIT). (pp. 35-39). Bangkok, Thailand.
Python repo demonstrating all related techniques: https://github.com/bbonik/image_enhancement
iOS app for image enhancement based on this set of techniques: https://www.orasisapp.com/
High Dynamic Range Imaging (Exposure Fusion)
Although spatial image enhancement is a promising approach for dealing with under/overexposed regions, it has 2 major drawbacks. First, since it uses only one exposure, its results are bounded by the quality of this image. If the dynamic range of the scene is much greater than the dynamic range of the camera, there will be virtually no visual information available in the under/overexposed regions. Second, since the underexposed regions are low-light areas, their signal-to-noise ratio is always low. Thus, any enhancement of these regions will bring out this embedded noise.
In cases where these 2 drawbacks matter, the only alternative is capturing different exposures of the scene and combining them into one final image that uses the correctly exposed regions from all of them. The difficult part in this approach is to identify which regions are correctly exposed, and to combine them without introducing halo artifacts while maintaining good local contrast in the final result. We have developed a new technique that uses illumination estimation in order to identify the correctly exposed regions from multiple exposures and combine them into one final visually pleasing result. Its block diagram is visible in the following image.
For every exposure, the estimated illumination gives an indication of the type of region each pixel belongs to. A set of membership functions transforms these estimations into weight maps, which are then used to combine all pixels from all exposures into the final result. Each pixel of the final image is a unique combination of all the corresponding pixels from all the exposures. The method can be used with any number of exposures and does not exhibit halo artifacts. Experimental results demonstrate that it gives very natural looking results, with increased local contrast. Here are some results of the proposed method.
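The weighting scheme described above can be sketched in a few lines for a 1-D scanline. This is a simplified illustration under several assumptions: the illumination is estimated with a plain moving average, a single Gaussian membership function centred on mid-gray is used for every exposure, and the scene radiances and exposure gains are made-up values. The actual method uses its own illumination estimation and set of membership functions.

```python
import math

def local_mean(signal, radius):
    """Moving average used here as a crude illumination estimate."""
    last = len(signal) - 1
    window = 2 * radius + 1
    return [
        sum(signal[min(max(i + j, 0), last)] for j in range(-radius, radius + 1)) / window
        for i in range(len(signal))
    ]

def membership(illumination, center=0.5, width=0.2):
    """Hypothetical membership function: high for mid-range (well-exposed) illumination."""
    return math.exp(-((illumination - center) ** 2) / (2 * width ** 2))

def fuse(exposures, radius=2):
    """Per-pixel weighted average of all exposures, with weights normalized to sum to 1."""
    weights = [[membership(s) for s in local_mean(e, radius)] for e in exposures]
    fused = []
    for i in range(len(exposures[0])):
        total = sum(w[i] for w in weights)
        fused.append(sum(w[i] * e[i] for w, e in zip(weights, exposures)) / total)
    return fused

def expose(radiance, gain):
    """Simulate one exposure: scale the scene radiance and clip at sensor saturation."""
    return [min(1.0, gain * r) for r in radiance]

radiance = [0.05] * 20 + [2.0] * 20       # dark half and bright half of an HDR scene
short = expose(radiance, 0.3)             # dark half is underexposed
long_ = expose(radiance, 8.0)             # bright half is saturated
fused = fuse([short, long_])

print("dark region:  ", short[5], long_[5], "->", round(fused[5], 3))
print("bright region:", short[35], long_[35], "->", round(fused[35], 3))
```

In each region the fused value is dominated by whichever exposure is closer to mid-range there, so the dark half comes mostly from the long exposure and the bright half mostly from the short one.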
The proposed method is implemented in the Orasis Windows desktop software which is freely available here.
Free MATLAB code of this HDR exposure fusion technique is available here (Matlab File Exchange).
A Python function implementing this technique is also available in this GitHub repository.
Vonikakis, V., Bouzos, O. & Andreadis, I. (2011). Multi-Exposure Image Fusion Based on Illumination Estimation, SIPA2011 (pp.135-142), Heraklion, Crete, Greece.
Extraction of Salient Contours
The edges of an image can be a valuable tool in its analysis and processing. Nevertheless, their vast number can pose a problem to an automatic system.
On the contrary, biological visual systems can easily distinguish between the most important contours in a scene and create its primal sketch. In this context, a new method is presented for the extraction of salient contours, based on the non‐classical receptive field of the simple cells of the primary visual cortex. Instead of using Gabor filters for the modeling of simple cells, a new set of linear filters is employed, which improves the convolution time by a factor of 20.
A new type of neural network analyzes the local spatial relations of the orientation filters and enhances any continuous exponential curves. The proposed neural network also has the ability to enhance certain degrees of curvature, in order to output only one type of contour (e.g. circles or straight lines).
Possible applications include robotics and autonomous systems, in which the quick calculation of the primal sketch is crucial.
In addition to the above results, the proposed method exhibits some of the fastest execution times among similar algorithms. Unfortunately, there is not yet a GUI version of this method. However, the non-GUI version is available upon request.
Vonikakis, V., Gasteratos, A., & Andreadis, I. (2006). Enhancement of Perceptually Salient Contours using a Parallel Artificial Cortical Network. Biological Cybernetics, 94(3), 192-214.
Vonikakis, V., Andreadis, I., & Gasteratos, A. (2006). Extraction of Salient Contours in Color Images. 4th Panhellenic Conference of Artificial Intelligence (SETN 2006), Lecture Notes in Computer Science. (pp. 400-410), Vol. 3955. Heraklion, Greece.
Document Binarization with OFF center-surround cells
The automatic segmentation between document characters and the background plays a crucial role in optical character recognition systems. A novel document binarization method is introduced, employing the basic biological mechanisms subserving the perception of brightness and darkness in the HVS. The method uses OFF center‐surround receptive fields and a new response function, in order to increase its sensitivity in the dark image regions, reducing the negative effects of shadows or smears. Two different scales are employed in this processing; each one specialized for different frequencies. The proposed method outperforms all the existing document binarization methods, exhibiting superior response to noise, shadows and low contrast. Possible applications include binarization of degraded historical documents, documents captured by digital cameras, outdoor OCR systems (e.g. license plate recognition).
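The benefit of an OFF center-surround response over a global threshold can be sketched on a synthetic scanline of a page with a smooth shadow. The response used here, (surround - center) / (surround + center), is a generic normalized contrast chosen for illustration and is not the response function of the paper; the sketch also uses a single scale instead of two. The ink positions, reflectances, illumination gradient and thresholds are made-up values.

```python
def local_mean(signal, radius):
    """Moving average used as the surround of each pixel."""
    last = len(signal) - 1
    window = 2 * radius + 1
    return [
        sum(signal[min(max(i + j, 0), last)] for j in range(-radius, radius + 1)) / window
        for i in range(len(signal))
    ]

def off_response(signal, radius=4, eps=1e-6):
    """Positive where the center is darker than its surround, relative to local brightness."""
    surround = local_mean(signal, radius)
    return [(s - c) / (s + c + eps) for c, s in zip(signal, surround)]

n = 60
ink_positions = {10, 11, 30, 31, 50, 51}                          # three 2-pixel strokes
illumination = [1.0 - 0.7 * i / (n - 1) for i in range(n)]        # light fades to 30%
page = [(0.2 if i in ink_positions else 0.9) * illumination[i] for i in range(n)]

global_ink = {i for i, v in enumerate(page) if v < 0.5}           # fixed global threshold
off_ink = {i for i, r in enumerate(off_response(page)) if r > 0.2}

print("global threshold marks", len(global_ink), "pixels as ink")
print("OFF response marks    ", sorted(off_ink))
```

The global threshold labels the whole shadowed part of the page as ink, while the relative OFF response picks out only the strokes, including the one deep in the shadow.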
Unfortunately, there is not yet a GUI version implemented by me. However, this method has been implemented as part of a larger document binarization software package that includes many binarization methods (the method is named 'Bonikakis' in this software), which is available on the homepage of Nikos Papamarkos.
Supplementary material, including a DOS executable of the proposed method, as well as various test-images, can be found here.
Vonikakis, V., Andreadis, I., & Papamarkos, N. (2011). Robust document binarization with OFF center-surround cells. Pattern Analysis and Applications, 14(3), 219-234.
Vonikakis, V., Andreadis, I., Papamarkos, N., & Gasteratos, A. (2007). Adaptive Document Binarization: A Human Vision Approach. Int. Conference on Computer Vision Theory and Applications. (pp. 104-110). Barcelona, Spain.
Free Python code repo for Document Binarization. https://github.com/bbonik/document-binarization
Multi-Scale Local Contrast Enhancement
Images captured under low-transmittance conditions (scenes with fog or smoke, aerial photographs, microscope photographs or medical images) can pose a considerable problem in automatic imaging systems. In these cases, the dynamic range of the scene is compressed to very few luminance levels. However, a simple stretching of the image values is not adequate, since there are usually pixels that occupy the maximum and minimum values of the channel. In order to deal with this problem, an algorithm is proposed which locally enhances, to a user‐defined degree, digital images with low local contrast, revealing visual information that otherwise would not be visible to the observer. It employs the enhancement characteristics of the center‐surround cells of the HVS. In essence, it is a similar approach to section 1 (HDR imaging), but with very different results. Possible applications include contrast enhancement for digital cameras, digital TV, medical imaging or aerial/satellite images.
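Why a simple stretch is not adequate can be shown with a toy scanline. In the sketch below, a foggy signal sits in a narrow band around mid-gray, but one black and one white pixel already occupy the ends of the range, so a global min/max stretch changes nothing. The local enhancement shown for comparison is a plain "amplify the difference from the local mean" operation with a user-defined `gain`; it is a generic stand-in and not the center-surround function of the proposed algorithm. All values are made up.

```python
def local_mean(signal, radius):
    """Moving average used as the local reference level."""
    last = len(signal) - 1
    window = 2 * radius + 1
    return [
        sum(signal[min(max(i + j, 0), last)] for j in range(-radius, radius + 1)) / window
        for i in range(len(signal))
    ]

def stretch(signal):
    """Global contrast stretch to the full 0..1 range."""
    lo, hi = min(signal), max(signal)
    return [(v - lo) / (hi - lo) for v in signal]

def enhance_local(signal, radius=3, gain=4.0):
    """Amplify each pixel's deviation from its local mean by `gain`, clipped to 0..1."""
    means = local_mean(signal, radius)
    return [min(1.0, max(0.0, m + gain * (v - m))) for v, m in zip(signal, means)]

# Low-contrast texture around mid-gray, plus one black and one white outlier pixel.
scanline = [0.0] + [0.48, 0.52] * 20 + [1.0]

stretched = stretch(scanline)
enhanced = enhance_local(scanline)

before = abs(scanline[20] - scanline[21])
after_stretch = abs(stretched[20] - stretched[21])
after_local = abs(enhanced[20] - enhanced[21])
print(f"contrast before: {before:.3f}, after stretch: {after_stretch:.3f}, after local: {after_local:.3f}")
```

The stretch leaves the texture contrast untouched because the outliers already pin the minimum and maximum, while the local operation increases it by a factor controlled by `gain`.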
Medical Image Processing
Various Medical Imaging applications require processing of their dynamic range and at the same time good local contrast, in order to maximize the available visual information for a correct diagnosis. Certain combinations of 1 & 4, from the above research topics, can be used in many kinds of medical images, such as dermatoscopy, mammography or X-ray.
FPGA implementation for Real-Time Image Processing
A version of the algorithm described in (1) has been implemented in digital hardware. The proposed implementation, which is synthesized in Altera’s Stratix II GX: EP2SGX130GF1508C5 FPGA device, features a pipeline architecture, allowing the real-time rendering of color video sequences (25fps) with frame sizes up to 2.5Mpixels.
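As a back-of-the-envelope check of what these figures imply, the required pixel throughput can be computed directly. The assumption that the pipeline delivers one pixel per clock cycle is mine and is not stated above; blanking intervals and memory transfers are ignored.

```python
FRAME_PIXELS = 2_500_000      # 2.5 Mpixels per frame (from the text)
FRAMES_PER_SECOND = 25        # 25 fps (from the text)
PIXELS_PER_CLOCK = 1          # assumption: fully pipelined, one pixel per cycle

pixels_per_second = FRAME_PIXELS * FRAMES_PER_SECOND
min_clock_mhz = pixels_per_second / PIXELS_PER_CLOCK / 1e6
time_per_pixel_ns = 1e9 / pixels_per_second

print(f"required throughput: {pixels_per_second / 1e6:.1f} Mpixels/s")
print(f"minimum pixel clock: {min_clock_mhz:.1f} MHz")
print(f"time budget per pixel: {time_per_pixel_ns:.0f} ns")
```

Under that assumption, the stated frame size and frame rate correspond to a sustained rate of 62.5 Mpixels/s.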
Block Diagram of the FPGA
Pipeline of the implementation
Iakovidou, C., Vonikakis, V., & Andreadis, I. (2008). FPGA implementation of a real-time biologically inspired image enhancement algorithm. Journal of Real-Time Image Processing, 3(4), 269-287.
Iakovidou, C., Vonikakis, V., & Andreadis, I. (2008). A Hardware Module for Automatic Exposure Correction in Real-time Vision Systems. VLSI-SOC 2008. (pp. 67-72). Rhodes Island, Greece.
Vonikakis, V., Iakovidou, C., & Andreadis, I. (2010). Real-Time Biologically-Inspired Image Exposure Correction. VLSI-SoC: Design Methodologies for SoC and SiP, Springer Boston, Book Editors: C. Piguet, R. Reis, and D. Soudris, (pp. 133-153).
Comparison between camera data and accurate scene information
Do you think that your consumer electronics camera gives you an accurate representation of the scene in your photographs? The answer is a definite 'no'. Commercial cameras, no matter how expensive they are, have been designed to output "good looking images", not to accurately capture the real radiance of the scene. If you directly use these data in computer vision or image processing algorithms, you are including all the engineering errors intentionally introduced by the camera manufacturer in order to produce good looking images. In the final JPEG file that you have in your photo library, colors and edge ratios have been distorted by the algorithms that the camera engine has applied to the captured image. And if you think that shooting in RAW can save you, you are again wrong. In order to see the RAW file you will inevitably use RAW-reading/processing software. This software will apply processing algorithms (demosaicking, color transformations etc.) similar to the ones applied by the camera engine. As a result, you end up with similarly distorted scene data. The only difference is that the "damage" has now been done by the RAW-reading/processing software rather than the camera engine.
In this research we try to estimate the degree of distortion that these algorithms (either applied by the camera engine or the RAW-reading/processing software) have applied to the accurate scene data. A series of 12 RAW images of the ColorChecker have been taken under different exposures. We found out that chromaticities vary with exposure in the JPEG files (while they remain constant in reality and in film-captured photographs). To put it simply, colors change with different exposures in JPEG files. This means that scene color ratios also change. Consequently, if one directly applies an HDR, color correction or any other image processing algorithm to the JPEG image, the data that will be used will not be an accurate representation of the scene. The following picture demonstrates that chromaticity changes significantly with exposure in JPEG images. On the contrary, in linearly calibrated RAW data or film, chromaticity remains relatively constant.
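The effect can be reproduced with a toy model. In the sketch below, chromaticity is computed as each channel divided by the sum of the three channels, and the camera rendering is replaced by a hypothetical compressive tone curve, x / (x + 0.18); this curve and the patch values are made up for illustration and do not represent any real camera or the measured data.

```python
def chromaticity(rgb):
    """Normalized chromaticity: each channel divided by the sum of all three."""
    total = sum(rgb)
    return tuple(c / total for c in rgb)

def tone_curve(x):
    """Hypothetical compressive rendering curve (not any real camera's)."""
    return x / (x + 0.18)

patch = (0.40, 0.20, 0.10)        # made-up linear radiance of an orange-ish patch

linear_r = []
rendered_r = []
for exposure in (0.25, 1.0, 4.0):
    linear = tuple(exposure * c for c in patch)
    rendered = tuple(tone_curve(c) for c in linear)
    linear_r.append(chromaticity(linear)[0])
    rendered_r.append(chromaticity(rendered)[0])
    print(f"exposure x{exposure}: r (linear) = {linear_r[-1]:.3f}, r (rendered) = {rendered_r[-1]:.3f}")
```

In linear data the exposure factor cancels in the ratio, so the chromaticity stays fixed; once a non-linear rendering is applied per channel, the ratio depends on the exposure. A pure power-law (gamma) curve alone would not show this, since the exposure factor still cancels; it takes a compressive or clipping curve like the one assumed here.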
But what happens in a single exposure? The situation is the same if the illumination is non-uniform. Non-uniform illumination essentially means different 'local' exposures in different regions of the image. As such, the chromaticity of the same item can change significantly under non-uniform illumination. The following simple experiment demonstrates this. It is a simple natural scene with lemons and oranges under strong directional illumination. Different samples have been taken from lemons and oranges. Although fruits are natural objects and may not have a constant chromaticity between each other, the variability observed in the following graph is quite high and is caused by the differences in the illumination.
We developed a RAW processing methodology and incorporated it into a RAW-reading software tool. This methodology/software takes the captured RAW image as input and outputs an image that is not affected by the usual rendering algorithms applied in the image processing chain. We call this image RAW*. No demosaicking, color transformations/profiles, denoising, sharpening etc. are applied to the RAW* images. The only transformations applied are a normalization of the sensor outputs (the green filters have a higher response than the red and blue ones) and a linearisation of the responses. As a result, chromaticities remain constant for any exposure. RAW* images are not as "good looking" as the original JPEG images, but winning a "beauty contest" is not our objective. The objective is rather a representation of the scene radiances that is as accurate as possible. Consequently, one can use the RAW* files to access more accurate scene data for scientific research.
Currently, the proposed RAW-reading software is fine-tuned for one camera model (Panasonic G2) and one illuminant (sun), so it is essentially only for demonstration purposes. Later versions will allow the user to input his/her own linearisation LUTs, making it applicable to any other camera model or scene. More information can be found in the paper "Accurate information vs. looks good: Scientific vs. preferred rendering".
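The two transformations can be sketched as follows. This is only an illustration of the idea and not the actual RAW* software: the LUT entries, the sensor values of the neutral patch, the choice of green as the reference channel, and the order of the two steps (linearise first, then normalize) are all assumptions of this example.

```python
from bisect import bisect_right

# Hypothetical linearisation LUT: (sensor value, relative scene radiance) pairs.
LUT = [(0, 0.0), (64, 0.1), (128, 0.3), (255, 1.0)]

def linearise(value, lut=LUT):
    """Piecewise-linear interpolation through the LUT, clamped at both ends."""
    xs = [x for x, _ in lut]
    ys = [y for _, y in lut]
    if value <= xs[0]:
        return ys[0]
    if value >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, value)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (value - x0) / (x1 - x0)

def channel_gains(neutral_linear):
    """Gains that equalise the linearised R, G, B responses of a neutral patch."""
    r, g, b = neutral_linear
    return (g / r, 1.0, g / b)        # green is taken as the reference channel

def raw_star(pixel, gains):
    """Linearise each channel, then normalize the channel responses."""
    return tuple(gain * linearise(v) for gain, v in zip(gains, pixel))

gray_sensor = (64, 128, 96)           # made-up sensor values of a neutral patch
gains = channel_gains(tuple(linearise(v) for v in gray_sensor))
neutral = raw_star(gray_sensor, gains)

print("channel gains:", tuple(round(g, 3) for g in gains))
print("neutral patch after processing:", tuple(round(v, 3) for v in neutral))
```

After both steps the neutral patch has equal values in all three channels, which is the property that keeps chromaticities stable across exposures.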
More details can be found in John McCann's page: http://www.mccannimaging.com/
McCann, J., Vonikakis, V. (2012). Accurate Information vs. Looks Good: Scientific vs. Preferred Rendering. CGIV 2012. Amsterdam.
McCann, J., Vonikakis, V., Bonanomi, C. & Rizzi, A. (2013). Chromaticity limits in color constancy calculations. Proc. 21st IS&T Color and Imaging Conference (CIC2013). (pp. 52-60). Albuquerque, NM, USA.