I am a research scientist in the Intelligent Imaging group at TNO, the Netherlands. I am intrigued by human behaviour and how to recognize it from camera images. We have taught computers to recognize theft at Schiphol airport, violence in a prison, and stress at a service desk. We are designing novel algorithms that find the behaviour patterns of a thief, the interactions relating to a fight, and the gestures that indicate stress.

Gertjan Burghouts

We have developed several real-time prototypes for proactive camera surveillance. We do this together with end users, such as the police and health care professionals. We team up with commercial partners to deliver solutions to various markets. Our academic partners include TU Delft and the University of Twente.

Our team won a prestigious research grant from DARPA (USA) for 2010-2014, to recognize 48 human behaviours in 7,000 videos. In the Mind's Eye program, we were the top performer in the two benchmarks organized by DARPA and evaluated by MITRE. We publish in high-impact journals and our papers have been cited over 1,000 times. We are proud to have been nominated for the EARTO Innovation Prize 2015.

Check out our demos!

The materials below are presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder. 

Key Publications

Burghouts, SPIE, 2017 Automatically assessing properties of dynamic cameras for camera selection and rapid deployment of video-content-analysis tasks in large-scale ad-hoc networks
R.J.M. den Hollander, H. Bouma, J.H.C. van Rest, J.-M. ten Hove, F.B. ter Haar, G.J. Burghouts
SPIE, 2017

Video analytics is essential for managing large quantities of raw data that are produced by video surveillance systems (VSS) for the prevention, repression and investigation of crime and terrorism. Analytics is highly sensitive to changes in the scene and to changes in the optical chain, so a VSS with analytics needs careful configuration and prompt maintenance to avoid false alarms. However, there is a trend from static VSS consisting of fixed CCTV cameras towards more dynamic VSS deployments over public/private multi-organization networks, consisting of a wider variety of visual sensors, including pan-tilt-zoom (PTZ) cameras, body-worn cameras and cameras on moving platforms. This trend will lead to more dynamic scenes and more frequent changes in the optical chain, creating structural problems for analytics. If these problems are not adequately addressed, analytics will not be able to continue to meet end users’ developing needs. In this paper, we present a three-part solution for managing the performance of complex analytics deployments. The first part is a register containing metadata describing relevant properties of the optical chain, such as intrinsic and extrinsic calibration, and parameters of the scene such as lighting conditions or measures for scene complexity (e.g. number of people). A second part frequently assesses these parameters in the deployed VSS, stores changes in the register, and signals relevant changes in the setup to the VSS administrator. A third part uses the information in the register to dynamically configure analytics tasks based on VSS operator input. In order to support the feasibility of this solution, we give an overview of related state-of-the-art technologies for autocalibration (self-calibration), scene recognition and lighting estimation in relation to person detection. The presented solution allows for rapid and robust deployment of Video Content Analysis (VCA) tasks in large-scale ad-hoc networks.

Burghouts, SPIE deception, 2016 Measuring cues for stand-off deception detection based on full-body non-verbal features in body-worn cameras
H. Bouma, G.J. Burghouts, R. den Hollander, S. Van Der Zee, J. Baan, J-M. ten Hove, S. van Diepen, P. van den Haak, J. van Rest
SPIE, 2016

Deception detection is valuable in the security domain to distinguish truth from lies. It is desirable in many security applications, such as suspect and witness interviews and airport passenger screening. Interviewers are constantly trying to assess the credibility of a statement, usually based on intuition without objective technical support. However, psychological research has shown that humans can hardly perform better than random guessing. Deception detection is a multi-disciplinary research area with an interest from different fields, such as psychology and computer science. In the last decade, several developments have helped to improve the accuracy of lie detection (e.g., with a concealed information test, increasing the cognitive load, or measurements with motion capture suits) and relevant cues have been discovered (e.g., eye blinking or fiddling with the fingers). With an increasing presence of mobile phones and bodycams in society, a mobile, stand-off, automatic deception detection methodology based on various cues from the whole body would create new application opportunities. In this paper, we study the feasibility of measuring these visual cues automatically on different parts of the body, laying the groundwork for stand-off deception detection in more flexible and mobile deployable sensors, such as body-worn cameras. We give an extensive overview of recent developments in two communities: in the behavioral-science community the developments that improve deception detection with a special attention to the observed relevant non-verbal cues, and in the computer-vision community the recent methods that are able to measure these cues. The cues are extracted from several body parts: the eyes, the mouth, the head and the full-body pose. We performed an experiment using several state-of-the-art video-content-analysis (VCA) techniques to assess the quality of robustly measuring these visual cues.

Burghouts, SPIE twitcam, 2016 Long-term behavior understanding based on the expert-based combination of short-term observations in high-resolution CCTV
K. Schutte, G.J. Burghouts, N. van der Stap, V. Westerwoudt, H. Bouma, M. Kruithof, J. Baan, J-M. ten Hove
SPIE, 2016

The bottleneck in situation awareness is no longer in the sensing domain but rather in the data interpretation domain, since the number of sensors is rapidly increasing and it is not affordable to increase human data-analysis capacity at the same rate. Automatic image analysis can assist a human analyst by alerting when an event of interest occurs. However, common state-of-the-art image recognition systems learn representations in high-dimensional feature spaces, which makes them less suitable to generate a user-comprehensive message. Such data-driven approaches rely on large amounts of training data, which is often not available for quite rare but high-impact incidents in the security domain. The key contribution of this paper is that we present a novel real-time system for image understanding based on generic instantaneous low-level processing components (symbols) and flexible user-definable and user-understandable combinations of these components (sentences) at a higher level for the recognition of specific relevant events in the security domain. We show that the detection of an event of interest can be enhanced by utilizing recognition of multiple short-term preparatory actions.

Burghouts, TAC, 2015 Recognizing stress using semantics and modulation of speech and gestures
I. Lefter, G.J. Burghouts, L.M.J. Rothkrantz
IEEE Transactions on Affective Computing, 2015

We investigate how speech and gestures convey stress, and how they can be used for automatically assessing stress. As a first step, we look into how humans use speech and gestures to convey stress. In particular, for both speech and gestures, we distinguish between stress conveyed by the intended semantic message (e.g. spoken words for speech, symbolic meaning for gestures), and stress conveyed by the modulation of either speech or gestures (e.g. intonation for speech, speed and rhythm for gestures). As a second step, we use this decomposition of stress as an approach to automatically predict stress. The four components provide an intermediate representation with intrinsic meaning, which helps bridge the semantic gap between the low-level sensor representation and the high-level, context-sensitive interpretation of behavior. Our experiments are run on an audiovisual dataset with service-desk interactions. The final goal is a surveillance system that notifies when the stress level is high and extra assistance is needed. We find that speech modulation is the best-performing intermediate-level variable for automatic stress prediction. Using gestures increases the performance and is mostly beneficial when speech is lacking. The two-stage approach with intermediate variables performs better than baseline feature-level or decision-level fusion.

Burghouts, SIVP, 2014 Instantaneous Threat Detection based on a Semantic Representation of Activities, Zones and Trajectories
G.J. Burghouts, K. Schutte, R.J-M. ten Hove, S.P. van den Broek, J. Baan, O. Rajadell, J.R. van Huis, J. van Rest, P. Hanckmann, H. Bouma, G. Sanroma, M. Evans, J. Ferryman
Signal, Image and Video Processing, 2014

Threat detection is a challenging problem, because threats appear in many variations and differences from normal behaviour can be very subtle. In this paper, we consider threats on a parking lot, where theft of a truck’s cargo occurs. The theft takes place in very different forms, in the midst of many people who pose no threat. The threats range from explicit, e.g., a person attacking the truck driver, to implicit, e.g., somebody loitering and then fiddling with the exterior of the truck in order to open it. Our goal is a system that is able to recognize threats instantaneously as they develop. Typical observables of the threats are a person’s activity, presence in a particular zone, and the trajectory. The novelty of this paper is an encoding of these threat observables in a semantic, intermediate-level representation, based on low-level visual features that have no intrinsic semantic meaning themselves. The semantic representation encodes the notions of trajectories, zones and activities. The aim of this representation is to bridge the semantic gap between the low-level tracks and motion and the higher-level notion of threats. In our experiments, we demonstrate that our semantic representation is more descriptive for threat detection than directly using low-level features. It is shown that each element in our representation contributes to its overall discriminative power. For instantaneous threat detection, we identify that a person’s activities are the most important elements of this semantic representation, followed by the person’s trajectory. The proposed threat detection system is very accurate: 96.6% of the tracks are correctly interpreted, when considering the temporal context.

Burghouts, IJPRAI, 2013 Soft-Assignment Random-Forest with an Application to Discriminative Representation of Human Actions in Videos
G.J. Burghouts
International Journal of Pattern Recognition and Artificial Intelligence, 2013

The bag-of-features model is a distinctive and robust approach to detect human actions in videos. The discriminative power of this model relies heavily on the quantization of the video features into visual words. The quantization determines how well the visual words describe the human action. Random forests have proven to efficiently transform the features into distinctive visual words. A major disadvantage of the random forest is that it makes binary decisions on the feature values, and thus does not take into account uncertainties of the values. We propose a soft-assignment random forest, which is a generalization of the random forest, by substitution of the binary decisions inside the tree nodes by a sigmoid function. The slope of the sigmoid models the degree of uncertainty about a feature's value. The results demonstrate that the soft-assignment random forest significantly improves the action detection accuracy compared to the original random forest. The human actions that are hard to detect -- because they involve interactions with or manipulations of some (typically small) item -- are structurally improved. Most prominent improvements are reported for a person handing, throwing, dropping, hauling, taking, closing or opening some item. Improvements are achieved for the state-of-the-art on the IXMAS and UT-Interaction datasets by using the soft-assignment random forest.
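
To illustrate the idea of replacing a binary node decision by a sigmoid, here is a minimal sketch of a single soft-assignment tree. It is not the implementation from the paper: the tuple-based tree layout, the function names `sigmoid` and `leaf_weights`, and the example thresholds and slopes are all assumptions made for illustration. Instead of sending a feature to exactly one leaf (one visual word), each split sends a fraction of the feature's weight to both children.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def leaf_weights(node, feature, weight=1.0):
    """Return {leaf_id: weight} for one feature vector.

    A node is either ("leaf", leaf_id) or
    ("split", dim, threshold, slope, left, right).
    This layout is hypothetical, chosen only to keep the sketch short.
    """
    if node[0] == "leaf":
        return {node[1]: weight}
    _, dim, threshold, slope, left, right = node
    # A hard tree goes right iff feature[dim] > threshold.
    # The soft tree goes right with weight sigmoid(slope * (value - threshold));
    # a larger slope means less uncertainty about the feature value.
    p_right = sigmoid(slope * (feature[dim] - threshold))
    out = leaf_weights(left, feature, weight * (1.0 - p_right))
    for leaf_id, w in leaf_weights(right, feature, weight * p_right).items():
        out[leaf_id] = out.get(leaf_id, 0.0) + w
    return out


# A tiny example tree with three leaves (visual words 0, 1 and 2).
tree = ("split", 0, 0.5, 10.0,
        ("leaf", 0),
        ("split", 1, 0.0, 10.0, ("leaf", 1), ("leaf", 2)))

# feature[0] lies exactly on the first threshold, so its weight is split evenly.
weights = leaf_weights(tree, [0.5, 3.0])
```

The leaf weights always sum to one, so they can be accumulated into a bag-of-features histogram in place of the usual unit counts. With a very large slope the sigmoid approaches a step function and the sketch behaves like an ordinary hard decision tree.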

Burghouts, SPIE, 2013 Image processing in aerial surveillance and reconnaissance: from pixels to understanding
J. Dijk, A.W.M van Eekeren, O. Rajadell Rojas, G.J. Burghouts, K. Schutte
SPIE, 2013

Surveillance and reconnaissance tasks are currently often performed using an airborne platform such as a UAV. The airborne platform can carry different sensors. EO/IR cameras can be used to view a certain area from above. To support the task of the sensor analyst, different image processing techniques can be applied to the data, either in real time or for forensic applications. These algorithms aim at improving the acquired data to be able to detect objects or events and make an interpretation of those detections. There is a wide range of techniques that tackle these challenges and we group them in classes according to the goal they pursue (image enhancement, modeling the world object information, situation assessment). An overview of these different techniques and different concepts of operations for these techniques is presented in this paper.

Burghouts, PRL, 2013 A Comparative Study on Automatic Audio-Visual Fusion for Aggression Detection Using Meta-Information
I. Lefter, L.J.M. Rothkrantz, G.J. Burghouts
Pattern Recognition Letters, 2013

Multimodal fusion is a complex topic. For surveillance applications audio-visual fusion is very promising given the complementary nature of the two streams. However, drawing the correct conclusion from multi-sensor data is not straightforward. In previous work we have analysed a database with audio-visual recordings of unwanted behavior in trains (Lefter et al., 2012) and focused on a limited subset of the recorded data. We have collected multi- and unimodal assessments by humans, who have given aggression scores on a 3-point scale. We showed that there are no trivial fusion algorithms to predict the multimodal labels from the unimodal labels since part of the information is lost when using the unimodal streams. We proposed an intermediate step to discover the structure in the fusion process. This step is based upon meta-features and we find a set of five which have an impact on the fusion process. In this paper we extend the findings in Lefter et al. (2012) for the general case using the entire database. We prove that the meta-features have a positive effect on the fusion process in terms of labels. We then compare three fusion methods that encapsulate the meta-features. They are based on automatic prediction of the intermediate-level variables and multimodal aggression from state-of-the-art low-level acoustic, linguistic and visual features. The first fusion method is based on applying multiple classifiers to predict intermediate-level features from the low-level features, and to predict the multimodal label from the intermediate variables. The other two approaches are based on probabilistic graphical models, one using (Dynamic) Bayesian Networks and the other one using Conditional Random Fields. We learn that each approach has its strengths and weaknesses in predicting specific aggression classes and using the meta-features yields significant improvements in all cases.

Burghouts, MVA, 2013 Selection of Negative Samples and Two-Stage Combination of Multiple Features for Action Detection in Thousands of Videos
G.J. Burghouts, K. Schutte, H. Bouma, R.J.M. den Hollander
Machine Vision and Applications, 2013

In this paper, a system is presented that can detect 48 human actions in realistic videos, ranging from simple actions such as ‘walk’ to complex actions such as ‘exchange’. We propose a method that gives a major contribution in performance. The reason for this major improvement is related to a different approach on three themes: sample selection, two-stage classification, and the combination of multiple features. First, we show that the sampling can be improved by smart selection of the negatives. Second, we show that exploiting all 48 actions’ posteriors by two-stage classification greatly improves its detection. Third, we show how low-level motion and high-level object features should be combined. These three yield a performance improvement of a factor 2.37 for human action detection in the test set of 1,294 realistic videos. In addition, we demonstrate that selective sampling and the two-stage setup improve on standard bag-of-feature methods on the UT-Interaction dataset, and our method outperforms state-of-the-art for the IXMAS dataset.

Burghouts, PRL, 2013 Spatio-Temporal Layout of Human Actions for Improved Bag-of-Words Action Detection
G.J. Burghouts, K. Schutte
Pattern Recognition Letters, 2013

We investigate how human action recognition can be improved by considering the spatio-temporal layout of actions. From literature, we adopt a pipeline consisting of STIP features, a random forest to quantize the features into histograms, and an SVM classifier. Our goal is to detect 48 human actions, ranging from simple actions such as walk to complex actions such as exchange. Our contribution is to improve the performance of this pipeline by exploiting a novel spatio-temporal layout of the 48 actions. Here each STIP feature in the video contributes to the histogram bins not by a unity value, but rather by a weight given by its spatio-temporal probability. We propose 6 configurations of spatio-temporal layout, where the varied parameters are the coordinate system and the modeling of the action and its context. Our model of layout does not change any other parameter of the pipeline, requires no re-learning of the random forest, yields a limited increase of the size of the resulting representation by only a factor two, and comes at a minimal additional computational cost of only a handful of operations per feature. Extensive experiments show that the layout is distinctive of actions that involve trajectories, (dis)appearance, kinematics, and interactions. The visualization of each action’s layout illustrates that our approach is indeed able to model spatio-temporal patterns of each action. Each layout is experimentally shown to be optimal for a specific set of actions. Generally, the context has more effect than the choice of coordinate system. The most impressive improvements are achieved for complex actions involving items. For 43 out of 48 human actions, the performance is better or equal when spatio-temporal layout is included. In addition, we show our method outperforms state-of-the-art for the IXMAS and UT-Interaction datasets.
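
The weighting step described in this abstract can be sketched in a few lines. This is a simplified illustration, not the code used in the paper: the function name `bow_histogram`, the example visual-word indices and the `layout_prob` values are hypothetical, and concatenating the plain and the weighted histogram is only one assumed way to arrive at a representation that is twice as large.

```python
def bow_histogram(words, num_words, weights=None):
    """Accumulate visual-word indices into a histogram.

    Without weights, every feature adds 1 to its bin (standard bag-of-words).
    With weights, every feature adds its own weight instead.
    """
    hist = [0.0] * num_words
    if weights is None:
        weights = [1.0] * len(words)
    for word, weight in zip(words, weights):
        hist[word] += weight
    return hist


# Visual word assigned to each of four features (hypothetical values).
words = [0, 2, 2, 1]
# Assumed spatio-temporal probability of each feature under an action's layout.
layout_prob = [0.9, 0.1, 0.8, 0.5]

plain = bow_histogram(words, 3)
weighted = bow_histogram(words, 3, layout_prob)

# Assumption: keep both histograms side by side, doubling the size.
representation = plain + weighted
```

The weighting costs one lookup and one addition per feature, and the quantizer that produced `words` is left untouched.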

Burghouts, SMC, 2011 Reasoning about Threats: from Observables to Situation Assessment
G.J. Burghouts, J-W Marck
IEEE Transactions on Systems, Man and Cybernetics, 2011

We propose a mechanism to assess threats based on observables. Observables are properties of persons, their behavior and interaction with other persons and objects. We consider observables that can be extracted from sensor signals and intelligence. In this paper, we discuss situation assessment based on observables for threat assessment. In the experiments, the assessment is evaluated for scenarios relevant to anti-terrorism and crowd control. The experiments are performed within an evaluation framework, where the setup is such that conclusions can be drawn concerning: (i) the accuracy and robustness of an architecture to assess situations with respect to threats, and (ii) the architecture’s dependency on the underlying observables in terms of their false positive and negative rates. One of the interesting conclusions is that discriminative assessment of threatening situations can be achieved by combining generic observables. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only 8 learning examples. In a real-world experiment at a large train station, we have classified various types of crowd dynamics. Using simple video features of shape and motion, we have proposed a scheme to translate such features into observables that can be classified by a Conditional Random Field (CRF). The implemented CRF successfully classifies the crowd dynamics, with up to 80% accuracy.

Burghouts, CVIU, 2009 Performance Evaluation of Local Colour Invariants
G.J. Burghouts, J-M Geusebroek
Computer Vision and Image Understanding, 2009


Software: Windows 32 bit, Linux 32 bit, Linux 64 bit, Linux 32 bit (Mikolajczyk compute_descriptors style).

In this paper, we compare local colour descriptors to grey-value descriptors. We adopt the evaluation framework of Mikolajczyk and Schmid. We modify the framework in several ways. We decompose the evaluation framework to the level of local grey-value invariants on which common region descriptors are based. We compare the discriminative power and invariance of grey-value invariants to that of colour invariants. In addition, we evaluate the invariance of colour descriptors to photometric events such as shadows and highlights. We measure the performance over an extended range of common recording conditions including significant photometric variation. We demonstrate the intensity-normalized colour invariants and the shadow invariants to be highly distinctive, while the shadow invariants are more robust to both changes of the illumination colour, and to changes of the shading and shadows. Overall, the shadow invariants perform best: they are most robust to various imaging conditions while maintaining discriminative power. When plugged into the SIFT descriptor, they are shown to outperform other methods that have combined colour information and SIFT. The usefulness of C-colour-SIFT for realistic computer vision applications is illustrated for the classification of object categories from the VOC challenge, for which a significant improvement is reported.

Burghouts, NIPS, 2007 The Distribution Family of Similarity Distances
G.J. Burghouts, A.W.M. Smeulders, J-M Geusebroek
Neural Information Processing Systems, 2007

Assessing similarity between features is a key step in object recognition and scene categorization tasks. We argue that knowledge on the distribution of distances generated by similarity functions is crucial in deciding whether features are similar or not. Intuitively one would expect that similarities between features could arise from any distribution. In this paper, we will derive the contrary, and report the theoretical result that Lp-norms –a class of commonly applied distance metrics– from one feature vector to other vectors are Weibull-distributed if the feature values are correlated and non-identically distributed. Besides these assumptions being realistic for images, we experimentally show them to hold for various popular feature extraction algorithms, for a diverse range of images. This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms.

Burghouts, TIP, 2006 Quasi-Periodic Spatiotemporal Filtering
G.J. Burghouts, J-M Geusebroek
IEEE Transactions on Image Processing, 2006

This paper presents the online estimation of temporal frequency to simultaneously detect and identify the quasiperiodic motion of an object. We introduce color to increase discriminative power of a reoccurring object and to provide robustness to appearance changes due to illumination changes. Spatial contextual information is incorporated by considering the object motion at different scales. We combined spatiospectral Gaussian filters and a temporal reparameterized Gabor filter to construct the online temporal frequency filter. We demonstrate the online filter to respond faster and decay faster than offline Gabor filters. Further, we show the online filter to be more selective to the tuned frequency than Gabor filters. We contribute to temporal frequency analysis in that we both identify (“what”) and detect (“when”) the frequency. In color video, we demonstrate the filter to detect and identify the periodicity of natural motion. The velocity of moving gratings is determined in a real world example. We consider periodic and quasiperiodic motion of both stationary and nonstationary objects.

Burghouts, IJCV, 2005 The Amsterdam Library of Object Images
J-M Geusebroek, G.J. Burghouts, A.W.M. Smeulders
International Journal of Computer Vision, 2005


We present the ALOI collection of 1,000 objects recorded under various imaging circumstances. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. These images are made publicly available for scientific research purposes.

Other Publications

Burghouts, IMAVIS, 2014 A Unified Approach to the Recognition of Complex Actions from Sequences of Zone-Crossings
G. Sanromà, L. Patino, G.J. Burghouts, K. Schutte, J. Ferryman
Image and Vision Computing, 2014

We present a method for the recognition of complex actions. Our method combines automatic learning of simple actions and manual definition of complex actions in a single grammar. Contrary to the general trend in complex action recognition, that consists in dividing recognition into two stages, our method performs recognition of simple and complex actions in a unified way. This is performed by encoding simple action HMMs within the stochastic grammar that models complex actions. This unified approach enables a more effective influence of the higher activity layers into the recognition of simple actions which leads to a substantial improvement in the classification of complex actions. We consider the recognition of complex actions based on person transits between areas in the scene. As input, our method receives crossings of tracks along a set of zones which are derived using unsupervised learning of the movement patterns of the objects in the scene. We evaluate our method on a large dataset showing normal, suspicious and threat behavior on a parking lot. Experiments show an improvement of 30% in the recognition of both high-level scenarios and their composing simple actions with respect to a two-stage approach. Experiments with synthetic noise simulating the most common tracking failures show that our method only experiences a limited decrease in performance when moderate amounts of noise are added.

Burghouts, SPIE Newsroom, 2014 Automated recognition of human activities in video streams in real-time
S.P. van den Broek, J-M ten Hove, R. den Hollander, G.J. Burghouts
SPIE Newsroom, 2014

Early detection of human activities that indicate a possible threat is needed to protect military bases or other important infrastructure. Currently, human observers are much better than computers in detecting human activities in videos. However, in many cases human operators have limitations. For example, many cameras often cover an area, so an operator can only watch one of them at a time. Also, fatigue may limit the time in which an operator can effectively perform. In military situations, resources are limited, and a full-time operator may not be available at all. For these reasons, it is desirable that computers assist in such surveillance in the future. But for that to become reality, the computer system must be able to detect people in the scene, track them, and recognize their activities.

Burghouts, PETS, 2014 Complex Threat Detection: Learning vs. Rules, using a Hierarchy of Features
G.J. Burghouts, P. van Slingerland, R.J-M. ten Hove, R. den Hollander, K. Schutte
IEEE Advanced Video and Signal-based Surveillance, 2014

Theft of cargo from a truck or attacks against the driver are threats hindering the day-to-day operations of trucking companies. In this work we consider a system that uses surveillance cameras mounted on the truck to provide an early warning for such evolving threats. Low-level processing involves tracking people and calculating motion features. Intermediate-level processing provides kinematics and localisation, activity descriptions and threat stage estimates. At the high level, we compare threat detection performed with a statistically trained SVM-based classifier against a rule-based system. Results are promising, and show that the best system depends on the scenario.

Burghouts, JMUI, 2013 An audio-visual dataset of human-human interactions in stressful situations
I. Lefter, G.J. Burghouts, L.M.J. Rothkrantz
Journal of Multimodal User Interfaces, 2013

Stressful situations are likely to occur at human-operated service desks, as well as at human-computer interfaces used in the public domain. Automatic surveillance can help notify when extra assistance is needed. Human communication is inherently multimodal, e.g. speech, gestures, facial expressions. It is expected that automatic surveillance systems can benefit from exploiting multimodal information. This requires automatic fusion of modalities, which is still an unsolved problem. To support the development of such systems, we present and analyze audio-visual recordings of human-human interactions at a service desk. The corpus has a high degree of realism: all interactions are freely improvised by actors based on short scenarios where only the sources of conflict were provided. The recordings can be considered as a prototype for general stressful human-human interaction. The recordings were annotated on a 5-point scale on degree of stress from the perspective of surveillance operators. The recordings are very rich in hand gestures. We find that the more stressful the situation, the higher the proportion of speech that is accompanied by gestures. Understanding the function of gestures and their relation to speech is essential for good fusion strategies. Taking speech as the basic modality, one of our research questions was what the role of gestures is in addition to speech. Both speech and gestures can express emotion, so we say that they have an emotional function. They can also express non-emotional information, in which case we say that they have a semantic function. We learn that when speech and gestures have the same function, they are usually congruent, but intensities and clarity can vary. Most gestures in this dataset convey emotion. We identify classes of gestures in our recordings, and argue that some classes are clear indications of stressful situations.

Burghouts, MTAP, 2013 Requirements for Multimedia Metadata Schemes in Surveillance Applications for Security
J. van Rest, F.A. Grootjen, M. Grootjen, R. Wijn, O. Aarts, M.L. Roelofs, G.J. Burghouts, H. Bouma, L. Alic, W. Kraaij
Multimedia Tools and Applications, 2013

Surveillance for security requires communication between systems and humans, involves behavioural and multimedia research, and demands an objective benchmarking for the performance of system components. Metadata representation schemes are extremely important to facilitate (system) interoperability and to define ground truth annotations for surveillance research and benchmarks. Surveillance places specific requirements on these metadata representation schemes. This paper offers a clear and coherent terminology, and uses this to present these requirements and to evaluate them in three ways: their fitness in breadth for surveillance design patterns, their fitness in depth for a specific surveillance scenario, and their realism on the basis of existing schemes. It is also validated that no existing metadata representation scheme fulfils all requirements. Guidelines are offered to those who wish to select or create a metadata scheme for surveillance for security.

Burghouts, AVSS, 2013 Activity Recognition and Localization on a Truck Parking Lot
M. Andersson, L. Patino, G.J. Burghouts, A. Flizikowski, M. Evans, D. Gustafsson, H. Petersson, K. Schutte, J. Ferryman
IEEE Advanced Video and Signal-based Surveillance (AVSS), 2013

In this paper we present a set of activity recognition and localization algorithms that together assemble a large amount of information about activities taking place on a parking lot. The aim is to detect and recognize events that may pose a threat to truck drivers and trucks. The algorithms perform zone-based activity learning, individual action recognition and group detection. Visual sensor data, from one camera, have been recorded for 23 realistic scenarios of different complexities. The scene is complicated and causes uncertain and false position estimates. We also present a situational assessment ontology which serves the algorithms with relevant knowledge about the observed scene (e.g. information about parking objects, vulnerabilities and historic data). The algorithms are tested with real tracking data and the evaluations, with annotated data, show promising results. The accuracies are 90% for zone-based activity learning, 71% for individual action recognition and 66% for group detection (i.e. merging of people).

Burghouts, AVSS, 2013 Improved Action Recognition by Combining Multiple 2D Views in the Bag-of-Words Model
G.J. Burghouts, P. Eendebak, H. Bouma, J-M ten Hove
IEEE Advanced Video and Signal-based Surveillance (AVSS), 2013

Action recognition is a hard problem due to the many degrees of freedom of the human body and the movement of its limbs. This is especially hard when only one camera viewpoint is available and when actions involve subtle movements. For instance, when viewed from the side, checking one’s watch may look very similar to crossing one’s arms. In this paper, we investigate how much the recognition can be improved when multiple views are available. The novelty is that we explore various combination schemes within the robust and simple bag-of-words (BoW) framework, from early fusion of features to late fusion of multiple classifiers. In new experiments on the publicly available IXMAS dataset, we learn that action recognition can already be improved significantly by adding only one viewpoint. We demonstrate that the state-of-the-art on this dataset can be improved by 5% - achieving 96.4% accuracy - when multiple views are combined. Cross-view invariance of the BoW pipeline can be improved by 32% with intermediate-level fusion.
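To make the two ends of this fusion spectrum concrete, the following minimal sketch contrasts early fusion (concatenating per-view BoW histograms into one feature vector) with late fusion (scoring each view separately and averaging the class scores). The nearest-centroid scorer, all function names and the toy histograms are assumptions made for illustration; they are not the features or classifiers used in the paper.

```python
import math

def l1_normalize(hist):
    """Scale a bag-of-words histogram so that its bins sum to 1."""
    total = float(sum(hist))
    return [h / total for h in hist] if total > 0 else [float(h) for h in hist]

def early_fusion(view_hists):
    """Early fusion: concatenate the normalized per-view histograms."""
    fused = []
    for hist in view_hists:
        fused.extend(l1_normalize(hist))
    return fused

def centroid_scores(x, centroids):
    """Hypothetical stand-in classifier: normalized exp(-distance) to each class centroid."""
    weights = [math.exp(-math.dist(x, c)) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]

def late_fusion(view_hists, view_centroids):
    """Late fusion: score every view on its own, then average the class scores."""
    per_view = [centroid_scores(l1_normalize(h), c)
                for h, c in zip(view_hists, view_centroids)]
    n_classes = len(per_view[0])
    return [sum(s[k] for s in per_view) / len(per_view) for k in range(n_classes)]

# Toy data (assumed): 2 camera views, 3 codewords, 2 action classes.
views = [[8, 1, 1], [2, 7, 1]]
centroids_per_view = [
    [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],  # view 1: centroid of class 0, class 1
    [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],  # view 2: centroid of class 0, class 1
]
fused_feature = early_fusion(views)                   # one 6-dimensional vector
fused_scores = late_fusion(views, centroids_per_view)  # one score per class
predicted_class = fused_scores.index(max(fused_scores))
```

Early fusion hands a single, longer vector to one classifier, whereas late fusion keeps one classifier per view and only combines their outputs; intermediate-level schemes sit between these two.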

Burghouts, CVPR, 2013 Spatio-Temporal Saliency for Action Similarity
G.J. Burghouts, B. van den Broek, J-M ten Hove
IEEE Computer Vision and Pattern Recognition (CVPR), 2013

Human actions are spatio-temporal patterns. A popular representation is to describe the action by features at interest points. Because the interest point detection and feature description are generic processes, they are not tuned to discriminate one particular action from the other. In this paper we propose a saliency measure for each individual feature to improve its distinctiveness for a particular action. We propose a spatio-temporal saliency map, for a bag of features, that is specific to the current video and to the action of interest. The novelty is that the saliency map is derived directly from the SVM's support vectors. For the retrieval of 48 human actions from the database of 3,480 videos, we demonstrate a systematic improvement across the board of 35.3% on average and significant improvements for 25 actions. We learn that the improvements are achieved in particular for complex human actions such as giving, receiving, burying and replacing an item.

Burghouts, SPIE, 2013 Recognition and localization of relevant human behavior in videos
H. Bouma, G.J. Burghouts, L. de Penning, P. Hanckmann, J.-M. ten Hove, S. Korzec, M. Kruithof, S. Landsmeer, C. van Leeuwen, S.P. van den Broek, A. Halma, R.J. den Hollander, K. Schutte
SPIE, 2013

Surveillance is normally performed by human operators, since it requires visual intelligence. However, especially for military operations, this can be dangerous. Therefore, unmanned autonomous visual-intelligence systems are desired. In this paper, we present an improved system that can recognize actions of a human and interactions between multiple humans. Central to the new system is our agent-based architecture. The system is trained on thousands of videos and evaluated on realistic persistent surveillance data in the DARPA Mind’s Eye program, with hours of videos of challenging scenes. The results show that our system is able to track the people, detect and localize events, and discriminate between different behaviors.

Burghouts, SPIE, 2013 Behavioral profiling in CCTV cameras by combining multiple subtle suspicious observations of different surveillance operators
H. Bouma, J. Vogels, O. Aarts, C. Kruszynski, R. Wijn, G.J. Burghouts
SPIE, 2013

The complexity of the camera surveillance task and the growing importance of the prevention of incidents may lead to unnecessary bothering of innocent passers-by. When a surveillance operator recognizes subtle deviant behavior for a person it is insufficient to merit a follow-up action. However, when multiple weak observations are fused it can become a strong indication that needs intervention. In this paper, we analyze the influence of combining multiple observations/tags of different operators, the effects of the tagging instruction for these operators (many tags for weak signals or few tags for strong signals), and the performance of using a semi-automatic system for combining the different observations.

Burghouts, SPIE, 2013 GOOSE: Semantic search on internet connected sensors
K. Schutte, F. Bomhof, G.J. Burghouts, J. van Diggelen, P. Hiemstra, J. van 't Hof, W. Kraaij, H. Pasman, A. Smith, C. Versloot, J. de Wit
SPIE, 2013

The GOOSE concept has the ambition to provide the capability to search semantically for any relevant information within “all” (including imaging) sensor streams in the entire Internet of sensors, similar to the capability provided by presently available Internet search engines, which enable the retrieval of information on “all” pages on the Internet. As with current Internet search engines, any indexing services shall be utilized cross-domain. The main challenges for GOOSE are the Semantic Gap and Scalability. The paper will report on the initial GOOSE demonstrator, which consists of the MES multimedia analysis platform and the CORTEX action recognition software, and provide an outlook into future GOOSE development.

Burghouts, ICPRAM, 2013 A search engine for retrieval and inspection of events with 48 human actions in realistic videos
G.J. Burghouts, L. de Penning, J-M ten Hove, S. Landsmeer, S.P. van den Broek, R. den Hollander, P. Hanckmann, M. Kruithof, C. van Leeuwen, S. Korzec, H. Bouma, K. Schutte
International Conference on Pattern Recognition, Applications and Methods, 2013

The contribution in this paper is a demonstrator that recognizes and describes 48 human actions in realistic videos. The core algorithms have been published recently, from the early visual processing (SPIE, 2012), recognition (ICPR, 2012) and description (ECCV, 2012) of 48 human actions. We summarize the key algorithms and specify their performance. The novelty of this paper is that we demonstrate the power of these combined algorithms in a search engine that enables the user to search for particular parts in the video where specific events occurred, and to inspect the involved actors and objects. We show that events can be successfully retrieved and inspected by usage of the proposed search engine.

Wiley, 2012 Color in Computer Vision: Fundamentals and Applications
T. Gevers, A. Gijsenij, J. van de Weijer, J-M. Geusebroek (book chapter on color features by G.J. Burghouts)
Wiley, 2012

While the field of computer vision drives many of today’s digital technologies and communication networks, the topic of color has emerged only recently in most computer vision applications. One of the most extensive works to date on color in computer vision, this book provides a complete set of tools for working with color in the field of image understanding. Based on the authors’ intense collaboration for more than a decade and drawing on the latest thinking in the field of computer science, the book integrates topics from color science and computer vision, clearly linking theories, techniques, machine learning, and applications. The fundamental basics, sample applications, and downloadable versions of the software and data sets are also included.

Burghouts, SSPR, 2012 Recognition of Long-Term Behaviors by Parsing Sequences of Short-Term Actions with a Stochastic Regular Grammar
G. Sanroma, G.J. Burghouts, K. Schutte
Structural and Syntactic Pattern Recognition, 2012

Human behavior understanding from visual data has applications such as threat recognition. Many approaches are restricted to limited-time actions, which we call short-term actions. Long-term behaviors are sequences of short-term actions that are more extended in time. Our hypothesis is that they usually present some structure that can be exploited to improve recognition of short-term actions. We present an approach to model long-term behaviors using a syntactic approach. Behaviors to be recognized are hand-crafted into the model in the form of grammar rules. This is useful for cases when little (or no) training data is available, such as in threat recognition. We use a stochastic parser so that we can handle noisy inputs. The proposed method succeeds in recognizing a set of predefined long-term interactions in the CAVIAR dataset. Additionally, we show how imposing prior knowledge about the structure of the long-term behavior improves the recognition of short-term actions with respect to standard statistical approaches.
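The idea of parsing noisy short-term actions with hand-crafted stochastic rules can be sketched as a Viterbi search over a small regular grammar. Everything below is a hypothetical illustration: the grammar, the rule probabilities, the action names and the detector scores are assumptions, and the parser is a generic Viterbi decoder rather than the parser described in the paper.

```python
import math

# Hypothetical stochastic regular grammar:
# nonterminal -> [(short-term action emitted, next nonterminal, rule probability)].
# Reaching "END" completes the long-term behavior "meet and walk away together".
GRAMMAR = {
    "S": [("walk", "S", 0.6), ("meet", "MET", 0.4)],
    "MET": [("meet", "MET", 0.5), ("walk_together", "TOGETHER", 0.5)],
    "TOGETHER": [("walk_together", "TOGETHER", 0.7), ("walk_together", "END", 0.3)],
}

def parse(observations, grammar=GRAMMAR, start="S", end="END"):
    """Return (log probability, actions) of the most probable complete derivation.

    observations: one dict per time step, mapping action name -> detector likelihood.
    Returns (-inf, None) when no derivation consumes all observations and stops at `end`.
    """
    best = {start: (0.0, [])}  # nonterminal -> (log probability, actions so far)
    for obs in observations:
        nxt = {}
        for state, (logp, actions) in best.items():
            for terminal, target, prob in grammar.get(state, []):
                likelihood = obs.get(terminal, 0.0)
                if likelihood <= 0.0:
                    continue  # the detector rules this action out at this step
                score = logp + math.log(prob) + math.log(likelihood)
                if target not in nxt or score > nxt[target][0]:
                    nxt[target] = (score, actions + [terminal])
        best = nxt
    return best.get(end, (-math.inf, None))

# Noisy detector output (assumed); at the third step "walk" scores higher than
# "walk_together", but the grammar only admits "walk_together" after a meeting.
observations = [
    {"walk": 0.9, "meet": 0.1},
    {"walk": 0.4, "meet": 0.6},
    {"walk": 0.6, "walk_together": 0.4},
    {"walk_together": 0.8, "walk": 0.2},
]
log_prob, actions = parse(observations)
```

In this toy run the structural prior overrides the locally preferred "walk" label at the third step, which is the kind of correction of short-term actions that the abstract refers to.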

Burghouts, ECCV, 2012 Automated Textual Descriptions for a Wide Range of Video Events with 48 Human Actions
P. Hanckmann, K. Schutte, G.J. Burghouts
European Conference on Computer Vision, 2012
International Workshop on Video Event Categorization, Tagging and Retrieval

Presented is a hybrid method to generate textual descriptions of video based on actions. The method includes an action classifier and a description generator. The aim for the action classifier is to detect and classify the actions in the video, such that they can be used as verbs for the description generator. The aim of the description generator is (1) to find the actors (objects or persons) in the video and connect these correctly to the verbs, such that these represent the subject, and direct and indirect objects, and (2) to generate a sentence based on the verb, subject, and direct and indirect objects. The novelty of our method is that we exploit the discriminative power of a bag-of-features action detector with the generative power of a rule-based action descriptor. Shown is that this approach outperforms the approach with a homogeneous setup with the rule-based action detector and action descriptor.

Burghouts, ICPR, 2012 Correlations Between 48 Human Actions Improve Their Detection
G.J. Burghouts, K. Schutte
International Conference on Pattern Recognition, 2012

Many human actions are correlated, because of compound and/or sequential actions, and similarity. Indeed, human actions are highly correlated in human annotations of 48 actions in the 4,774 videos from We exploit such correlations to improve the detection of these 48 human actions, ranging from simple actions such as walk to complex actions such as exchange. We apply a basic pipeline of STIP features, a Random Forest to quantize the features into histograms, and an SVM classifier. First, we show that the sampling for the Random Forest can be improved by exploiting the correlations between human actions. Second, we show that exploiting all 48 actions' posteriors for detecting a particular action also improves further the detection in general. We demonstrate a 50% relative improvement for human action detection in 1,294 realistic test videos.

Burghouts, AVSS, 2012 Automatic Audio-Visual Fusion for Aggression Detection using Meta-Information
I. Lefter, G.J. Burghouts, L.J.M. Rothkrantz
IEEE Advanced Video and Signal-based Surveillance, 2012

We propose a new method for audio-visual sensor fusion and apply it to automatic aggression detection. While a variety of definitions of aggression exist, in this paper we see it as any kind of behavior that has a disturbing effect on others. We have collected multi- and unimodal assessments by humans, who have given aggression scores on a 3 point scale. There are no trivial fusion algorithms to predict the multimodal labels from the unimodal labels. We propose an intermediate step to discover the structure in the fusion process. We call these meta-features and we find a set of five which have an impact on the fusion process. We use simple state of the art low level audio and video features to predict the level of aggression in audio and video, and we also predict the three most feasible meta-features. We show the significant positive impact of adding the meta-features on predicting the multimodal label as compared to standard fusion techniques like feature and decision level fusion.

Burghouts, TSD, 2012 Aggression Detection in Speech Using Sensor and Semantic Information
I. Lefter, L.J.M. Rothkrantz, G.J. Burghouts
Text, Speech and Dialogue, 2012

By analyzing a multimodal (audio-visual) database with aggressive incidents in trains, we have observed that there are no trivial fusion algorithms to successfully predict multimodal aggression based on unimodal sensor inputs. We proposed a fusion framework that contains a set of intermediate level variables (meta-features) between the low level sensor features and the multimodal aggression detection. In this paper we predict the multimodal level of aggression and two of the meta-features: context and semantics. We do this based on the audio stream, from which we extract both nonverbal and verbal information. Given the spontaneous nature of speech in the database, we rely on a keyword spotting approach in the case of verbal information. We have found the existence of 6 semantic groups of keywords that have a positive influence on the prediction of aggression and of the two meta-features.

Burghouts, AAAI, 2012 A Neural-Symbolic Cognitive Agent with a Mind’s Eye
H.L.H. de Penning, R.J.M. den Hollander, H. Bouma, G.J. Burghouts, A.S. d'Avila Garcez
AAAI Neural Symbolic Learning and Reasoning, 2012

The DARPA Mind’s Eye program seeks to develop in machines a capability that currently exists only in animals: visual intelligence. This paper describes a Neural-Symbolic Cognitive Agent that integrates neural learning, symbolic knowledge representation and temporal reasoning in a visual intelligent system that can reason about actions of entities observed in video. Results have shown that the system is able to learn and represent the underlying semantics of the actions from observation and use this for several visual intelligent tasks, like recognition, description, anomaly detection and gap-filling.

Burghouts, Fusion, 2012 Learning the Fusion of Audio and Video Aggression Assessment by Meta-Information from Human Annotations
I. Lefter, G.J. Burghouts, L.J.M. Rothkrantz
International Conference on Information Fusion, 2012

The focus of this paper is finding a method to predict aggression using a multimodal system, given multiple unimodal features. The mechanism underlying multimodal sensor fusion is complex and not completely clear. We try to understand the process of fusion and make it more transparent. As a case study we use a database with audio-visual recordings of aggressive behavior in trains. We have collected multi- and unimodal assessments by humans, who have given aggression scores on a 3 point scale. There are no trivial fusion steps to predict the multimodal labels from the unimodal labels. We propose an intermediate step to discover the structure in the fusion process. We call these meta-features and we find a set of five which have an impact on the fusion process. Using a propositional rule based learner we show the high positive impact of the meta-features on predicting the multimodal label for the complex situations in which the labels for audio, video and multimodal do not reinforce each other. We continue with an experiment by which we prove the added value of such an approach on the whole data set.

Burghouts, OPTRO, 2012 Recognition of 48 Human Behaviors from Video
G.J. Burghouts, H. Bouma, R.J.M. den Hollander, S.P. van den Broek, K. Schutte
OPTRO, 2012

We have developed a system that recognizes 48 human behaviors from video. The essential elements are (i) inference of the actors in the scene, (ii) assessment of event-related properties of actors and between actors, (iii) exploiting the event properties to recognize the behaviors. The performance of our recognizer approaches human performance, yet the performance for unseen variations of the behaviors needs to be improved.

Burghouts, SPIE, 2012 Automatic Human Action Recognition in a Scene from Visual Inputs
H. Bouma, P. Hanckmann, J-W. Marck, L. de Penning, R. den Hollander, J-M. ten Hove, S.P. van den Broek, K. Schutte, G.J. Burghouts
SPIE, 2012

Surveillance is normally performed by humans, since it requires visual intelligence. However, it can be dangerous, especially for military operations. Therefore, unmanned visual-intelligence systems are desired. In this paper, we present a novel system that can recognize human actions. Central to the system is a break-down of high-level perceptual concepts (verbs) into simpler observable events. The system is trained on 3482 videos and evaluated on 2589 videos from DARPA, where human annotations indicate, for each video, the presence or absence of 48 verbs. The results show that our system reaches a good performance, approaching the human average response.

Burghouts, TSD, 2011 Addressing Multimodality in Overt Aggression Detection
I. Lefter, L.J.M. Rothkrantz, G.J. Burghouts, Z. Yang, P. Wiggers
Text, Speech and Dialogue, 2011

Automatic detection of aggressive situations has a high societal and scientific relevance. It has been argued that using data from multimodal sensors, for example video and sound, as opposed to unimodal data is bound to increase the accuracy of detections. We approach the problem of multimodal aggression detection from the viewpoint of a human observer and try to reproduce the observer's predictions automatically. Typically, a single ground truth for all available modalities is used when training recognizers. We explore the benefits of adding an extra level of annotations, namely audio-only and video-only. We analyze these annotations and compare them to the multimodal case in order to gain more insight into how humans reason using multimodal data. We train classifiers and compare the results when using unimodal and multimodal labels as ground truth. For both the audio and the video recognizer, the performance increases when using the unimodal labels.

Burghouts, SPIE, 2011 Increasing the Security at Vital Infrastructures: Automated Detection of Deviant Behaviors
G.J. Burghouts, R. den Hollander, K. Schutte, J-W Marck, S. Landsmeer, E. den Breejen
SPIE, 2011

This paper discusses the decomposition of hostile intentions into abnormal behaviors. A list of such behaviors has been compiled for the specific case of public transport. Some of the deviant behaviors are hard to observe by people, as they are in the midst of the crowd. Examples are deviant walking patterns, prohibited actions such as taking photos and waiting without taking the train. We discuss our visual analytics algorithms and demonstrate them on CCTV footage from the Amsterdam train station.

Burghouts, NATO MSS, 2011 A Vision towards Automatic Inference of Hostile Intent from Sensory Observations
G.J. Burghouts, K. Schutte
NATO MSS (invited speech), 2011

Burghouts, ICDP, 2009 Automated Indicators for Behavior Interpretation
G.J. Burghouts, B. van den Broek, B. G. Alefs, E. den Breejen, K. Schutte
International Conference on Crime Detection and Prevention, 2009

While sensors become distributed at an unprecedented scale, their use in the monitoring of hostile activities is very limited. Monitoring by humans is demanding and expensive. To aid in the complex and semantic task of deciding whether a situation implies hostile intent, the authors describe a set of automated behavioral indicators based on camera and radar data.

Burghouts, PRL, 2009 Material-specific adaptation of color invariant features
G.J. Burghouts, J-M Geusebroek
Pattern Recognition Letters, 2009


For the modeling of materials, the mapping of image features onto a codebook of feature representatives receives extensive treatment. For reasons of generality and simplicity, filterbank outputs are commonly used as features. The MR8 filterbank of Varma and Zisserman performed well in a recent evaluation. In this paper, we construct color invariant filter sets from the original MR8 filterbank. We evaluate several color invariant alternatives over more than 250 real-world materials recorded under a variety of imaging conditions including clutter. Our contribution is a material recognition framework that automatically learns, for each material specifically, the most discriminative filterbank combination and corresponding degree of color invariance. For a large set of materials each with different physical properties, we demonstrate the material-specific filterbank models to be preferred over models with fixed filterbanks.

Burghouts, SPIE, 2009 Discrimination of Classes of Ships for Aided Recognition in a Coastal Environment
S.P. van den Broek, M. Degache, H. Bouma, G.J. Burghouts
SPIE, 2009

For naval operations in a coastal environment, detection of boats is not sufficient. When doing surveillance near a supposedly friendly coast, or self protection in a harbor, it is important to find the one object that means harm, among many others that do not. For this, it is necessary to obtain information on the many observed targets, which in this scenario are typically small vessels. Determining the exact type of ship is not enough to declare it a threat. However, in the whole process from (multi-sensor) detection to the decision to act, classification of a ship into a more general class is already of great help, when this information is combined with other data to assist an operator. We investigated several aspects of the use of electro-optical systems. As for classification, this paper concentrates on discriminating classes of small vessels with different electro-optical systems (visual and infrared) as part of the larger process involving an operator. It addresses both selection of features (based on shape and texture) and ways of using these in a system to assess threats. Results are presented on data recorded in coastal and harbor environments for several small targets.

Burghouts, COGIS, 2007 3-D Scene Reconstruction with a Handheld Stereo Camera
W. van der Mark, G.J. Burghouts, E. den Dekker, T. ten Kate, J. Schavemaker
Cognitive Systems with Interactive Sensors (COGIS), 2007

We have developed a method for 3-D scene reconstruction with a handheld stereo camera. Unlike 3-D laser scanning devices, the software tool only requires a relatively inexpensive stereo camera and a laptop computer. Our approach does not require artificial markers or structured light. Only stereo image information is used to obtain the 3-D model. This ensures that the scene remains undisturbed during the recording session. It is also unnecessary to move the camera around at a fixed speed or in a certain pattern. Because a novel image selection method is applied, the system automatically selects the important images and removes those with redundant information. Robust methods are applied to recover the stereo camera trajectory and the surface geometry. This eliminates the need for user interaction or guidance while the 3-D model is reconstructed. Our method could therefore serve as an inexpensive and easy to use 3-D modelling tool for applications such as crime scene investigation, engineering, construction work, and the entertainment industry.

Burghouts, BMVC, 2006 Color Textons for Texture Recognition
G.J. Burghouts, J-M Geusebroek
British Machine Vision Conference, 2006


Texton models have proven to be very discriminative for the recognition of grayvalue images taken from rough textures. To further improve the discriminative power of the distinctive texton models of Varma and Zisserman (VZ model) (IJCV, vol. 62(1), pp. 61-81, 2005), we propose two schemes to exploit color information. First, we incorporate color information directly at the texton level, and apply color invariants to deal with straightforward illumination effects such as local intensity, shading and shadow. However, the learning of representatives of the spatial structure and colors of textures may be hampered by the wide variety of apparent structure-color combinations. Therefore, our second contribution is an alternative approach, where we weight grayvalue-based textons with color information in a post-processing step, leaving the original VZ algorithm intact. We demonstrate that the color-weighted textons outperform the VZ textons as well as the color invariant textons. The color-weighted textons are specifically more discriminative than grayvalue-based textons when the size of the example image set is reduced. When using 2 example images only, recognition performance is 85.6%, which is an improvement over grayvalue-based textons of 10%. Hence, incorporating color in textons facilitates the learning of textons.

Burghouts, IJIST, 2006 Color Invariant Object Recognition Using Entropic Graphs
J.C. van Gemert, G.J. Burghouts, F.J. Seinstra, J-M Geusebroek
International Journal of Imaging Systems and Technology, 2006

We present an object recognition approach using higher-order color invariant features with an entropy-based similarity measure. Entropic graphs offer an unparameterized alternative to common entropy estimation techniques, such as a histogram or assuming a probability distribution. An entropic graph estimates entropy from a spanning graph structure of sample data. We extract color invariant features from object images invariant to illumination changes in intensity, viewpoint, and shading. The Henze–Penrose similarity measure is used to estimate the similarity of two images. Our method is evaluated on the ALOI collection, a large collection of object images. This object image collection consists of 1000 objects recorded under various imaging circumstances. The proposed method is shown to be effective under a wide variety of imaging conditions.
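As a rough illustration of how a spanning graph can yield a similarity between two feature sets, the sketch below pools two samples, builds a Euclidean minimum spanning tree, and counts the tree edges that join points from different samples (the Friedman-Rafsky statistic). The scaling used to turn that count into an affinity follows the Henze-Penrose limit as commonly stated and should be treated as an assumption, as should the function names and the toy 2-D points, which stand in for color invariant features.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the vertex that is closest to the tree built so far.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges

def cross_edge_count(sample_a, sample_b):
    """Number of MST edges of the pooled sample that connect the two samples."""
    m = len(sample_a)
    pooled = list(sample_a) + list(sample_b)
    return sum(1 for i, j in mst_edges(pooled) if (i < m) != (j < m))

def henze_penrose_affinity(sample_a, sample_b):
    """Affinity estimate R * (m + n) / (2 * m * n); the scaling is an assumption."""
    m, n = len(sample_a), len(sample_b)
    return cross_edge_count(sample_a, sample_b) * (m + n) / (2.0 * m * n)

# Toy feature sets (assumed): two well-separated clusters versus two interleaved sets.
far_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
far_b = [(10, 10), (11, 10), (10, 11), (11, 11)]
mixed_a = [(0, 0), (2, 0), (4, 0), (6, 0)]
mixed_b = [(1, 0), (3, 0), (5, 0), (7, 0)]
affinity_far = henze_penrose_affinity(far_a, far_b)
affinity_mixed = henze_penrose_affinity(mixed_a, mixed_b)
```

Well-separated sets are joined by a single tree edge and score low, while interleaved sets share many cross edges and score high; no histogram binning or density model is needed, which is the appeal of the unparameterized approach.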

Burghouts, LNCS, 2005 Invariant Representations to Prepare for Content Based Image Retrieval from First Principles
J-M Geusebroek, G.J. Burghouts, J. C. van Gemert, A.W.M. Smeulders
LNCS Trends and Advances in Content-Based Image and Video Retrieval, 2005

In our view, the physical and statistical constraints on the sensory input determine the construction of content based image retrieval systems. The simplification of the sensory input by invariant representation advances towards better retrieval performance. Local features provide robustness to object occlusion and background changes. Invariance includes a low level of semantic knowledge, hence achieves a rudimentary level of visual cognition. Rather than aiming for one complete geometrical representation of the visual field, cognition based image retrieval may be based on weak description of the important features in the scene, as long as mutual correspondence between observation and objects in the world is maintained.

Burghouts, ECV, 2004 Observables and Invariance for Early Cognitive Vision
G.J. Burghouts, J-M Geusebroek, A.W.M. Smeulders
Early Cognitive Vision Workshop, 2004

This paper presents the visual measurement of physical object properties that characterize the perceived object, including size, shape, surface properties, cover reflectance properties, distance, and motion. We provide an overview of a complete set of local visual measurements. We derive photometric, geometrical, and temporal invariants to counteract unwanted transformations in the observation, including illumination spectrum and intensity, scene setting causing shadow, shading and highlight effects, and variation due to object position, pose and distance.

. Awards

Burghouts, TNO, 2011 Excellent Young Researcher of TNO, Netherlands (2011). I received this award based on winning a grant from the prestigious research agency DARPA, for the development of Visual Intelligence. At the evaluation trials at the end of 2011, i.e. the end of the first year of the research program, we achieved prominent results in the behavior recognition task. The program includes 11 other teams, among them MIT, NASA, Berkeley, and we collaborate with both US universities (e.g. CSU, Buffalo, Leeds) and with system integrators (Toyon and General Dynamics Robotics Systems).
Burghouts, KMAR, 2010 Award from Royal Marechaussee (Ministry of Defense), Netherlands, for the best innovation presented at the joint KMAR/NIDV symposium (2010). I presented the outcomes of a field trial together with the police, at Amsterdam central train station. We implemented and tested a video-based monitoring system, see also this news report at NOS, YouTube and the SPIE 2011 paper. I showed how this system can be applied in scenarios that are relevant to the Royal Marechaussee.
Burghouts, KIVI-NIRIA, 2007 Award from KIVI-NIRIA (society of engineers), Netherlands, for the best innovation invented during the innovation game of the Ministry of Defense (2007). Together with Jeroen de Jong (Thales Research) I developed a demonstrator that determines, for a dismounted soldier, which of his/her team members are nearby. The rationale was to invent a solution for the dismounted soldier to prevent fratricide. In three months, we built this demonstrator, which was interactive and had options to replay previous training-fratricide incidents from a military training in an urban environment.
Burghouts, UT, 1997 Scholarship from the University of Twente, Netherlands, for promising first-year students (1997). At the start of my master's studies in Computer Science, a consortium of industrial technical companies selected students who achieved good grades during secondary school. My specialization within CS was Artificial Intelligence (under supervision of prof. Anton Nijholt).