Large-scale data collection and advances in computational hardware have encouraged the use of highly flexible, nonparametric models in computer vision, most notably convolutional neural networks (CNNs). However, most CNN architectures are trained in a supervised setting and therefore need large amounts of annotated data, and human-guided annotation does not scale well. This issue has pushed research along two different lines: (a) Virtual Reality: using Computer Graphics (CG) generated data to train and validate the models, and (b) unsupervised/semi-supervised training: using unlabeled or partially labelled data to train the models. In this post, let us talk about the former.
In the last decade, advances in computation have also pushed the boundaries of CG on a parallel track: renderers can now produce very realistic and physically plausible images and videos (see, for example, the evolution of CG in video games from 2000 to 2016). However, there has been a long debate on the utility of CG for CV, going back to the 1980s. Some argue that many of the assumptions underlying the simulation models used in CG overlap with the models we use in CV, so they simulate ideal or near-ideal inputs to CV models. Moreover, CG takes a lot of shortcuts (approximations) to simulate photorealistic effects, especially in video games, which may show up as a domain shift between rendered data and real-world data. In scientific terms, the impact of modeling errors and computational approximations in the rendering pipeline on the transferability of CV models is still not clear.
In my thesis, we aim to address this space systematically by (a) developing the tools required for scene generation and rendering with different choices in a plug-and-play manner, and (b) quantifying the impact of generation and rendering parameters on transfer, at least for a particular task (semantic segmentation) and scene context (urban street traffic scenes). As motivation, see the following traffic scene, which is generated in a completely automated fashion and richly annotated at the pixel level.
Now let us talk about the conventional pipeline for designing a vision system. We are given some amount of labelled training data (T), collected over real-world scenes with a camera that may not even be available during the development phase. These training images are assumed to be randomly sampled from an (unknown) real-world model. We use T to select an optimal vision hypothesis (St) from the space of hypotheses defined by the system architecture: St should give minimal error on T and should generalize to unseen data. But nonparametric models such as CNNs need large amounts of labelled data, which is an obstacle for many vision applications, especially for pixel-level tasks such as semantic segmentation, optical flow and intrinsic image decomposition. CG can provide high-quality labelled data: it gives us the chance to select a virtual scene generation model (W), for example a video game, and a rendering engine (C), for example a ray tracer. Unfortunately, there is no free lunch here; no model is perfect. The question then is: how do our rendering choices impact the transfer?
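To make the selection step concrete, here is the standard empirical risk minimization view written in the notation above; the loss L and the hypothesis space H are symbols I introduce here for clarity, they are not defined in the original text.

```latex
% Hypothesis selection by empirical risk minimization over the training set T.
% L is a per-pixel loss (e.g. cross-entropy) and \mathcal{H} the hypothesis
% space defined by the architecture -- both symbols are mine.
S_T = \arg\min_{S \in \mathcal{H}} \; \frac{1}{|T|} \sum_{(x_i, y_i) \in T} L\big(S(x_i), y_i\big)
```

When T is rendered by C from scenes sampled from W, the samples come from a simulated distribution rather than the real-world one, which is exactly where the transfer question arises.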
In graphics terms, rendering engines can be divided into two classes: real-time rendering (used in video games) and physics-based rendering (used in scientific visualization and movies). How do these two kinds of renderers differ when seen through the lens of CV models? Our world model also has its own hyperparameters. How do these parameter settings impact the transfer? If we tune these parameters to given target data, will that improve the generalization of the trained model? These are the questions my work aims to address. Of course, one can partly bypass these issues by simply adding some labelled real samples to the simulated training data.
We now look at some related works which use CG data in different phases. Information from CG data can be utilized at different phases: modeling, learning, inference, and also validation. Note that there are no sharp boundaries between these phases. [Parameswaran et al., VABI, 2011] uses simulations in the modeling phase to design relevant features for the problem of queue statistics estimation. Recently, several works have used CG data to train CNNs. Another recent trend is to use a CG engine as an inference engine, also called probabilistic programming or inverse graphics. And there are works using CG simulations to validate the behavior of a system in what-if situations.
We can view the issue of domain shift from two perspectives, inspired by two different fields: one from systems science and engineering, and the other from traditional machine learning, where it is called domain adaptation.
Transfer depends on many factors, including the closeness of the virtual world model to reality, the rendering engine, the real-world test data, the vision model being trained, and the criterion function. All of these choices form a joint space. There is a subdomain in systems engineering, called modeling and simulation, which suggests propagating your uncertainty all the way from W to the output space, so that your estimates come with their uncertainty and you can decide whether to transfer them or not. But quantifying the uncertainty of conclusions drawn from virtual environments is still an active research area, and in our context it is nearly impractical, so we resort to an empirical approach.
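As an illustration of what the simplest (Monte Carlo) form of such uncertainty propagation would look like, here is a hedged sketch; sample_world_params, render_dataset and train_and_evaluate are hypothetical placeholders for the pipeline stages described above, not functions from my tool.

```python
import numpy as np

def propagate_uncertainty(sample_world_params, render_dataset,
                          train_and_evaluate, n_draws=20):
    """Crude Monte Carlo propagation of world-model uncertainty.

    sample_world_params : draws one parameter set from the prior over W
    render_dataset      : renders a labelled dataset for that parameter set
    train_and_evaluate  : trains the vision model, returns a scalar metric
                          (e.g. accuracy on held-out real data)
    All three callables are placeholders for the real pipeline.
    """
    scores = []
    for _ in range(n_draws):
        theta = sample_world_params()                # uncertainty enters at W
        dataset = render_dataset(theta)              # propagated through C
        scores.append(train_and_evaluate(dataset))   # ... to the output space
    scores = np.asarray(scores)
    # The spread of the scores is the uncertainty attached to the estimate.
    return scores.mean(), scores.std()
```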
In the ideal learning setting, we should minimize the expectation of our loss over the joint pdf of inputs and outputs. In practice, we do not have access to this joint pdf; we are given some data assumed to be drawn from it, and we replace the continuous integral by a discrete sum. The underlying assumption is that future test data also comes from the same pdf. This is not true in our context: our future data comes from a similar but slightly different pdf. A lot of work has been done on this topic under the name of domain adaptation. One line of work derives a theoretical upper bound on the domain adaptation error: for any hypothesis S, its error on P (the target distribution) is upper bounded by its error on Q (the source domain) plus a domain discrepancy measured in the feature space of S. This bound has opened up two research lines: (1) re-weighting the source samples that are more similar to the target samples, and (2) designing an invariant feature space for S in which P and Q are better aligned. We use the first concept in my work; I will come back to it later.
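Written out, the two steps above look roughly as follows. The bound is the shape of the classic Ben-David et al. result, with d(Q, P) denoting a domain discrepancy and lambda the error of the best joint hypothesis; treat it as a sketch of the form, not the exact bound used here.

```latex
% Ideal objective vs. its empirical approximation on N samples drawn from Q:
S^{*} = \arg\min_{S} \; \mathbb{E}_{(x,y) \sim Q}\big[L(S(x), y)\big]
        \;\approx\; \arg\min_{S} \; \frac{1}{N} \sum_{i=1}^{N} L\big(S(x_i), y_i\big)

% Domain adaptation bound: target error is controlled by source error,
% a discrepancy between Q and P in the feature space of S, and a constant
% \lambda (error of the best hypothesis on both domains).
\epsilon_{P}(S) \;\le\; \epsilon_{Q}(S) + d(Q, P) + \lambda
```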
My work has two components: (1) developing the tools required for large-scale annotated scene generation and rendering, and (2) a systematic analysis of the impact of rendering parameters on transfer for a specific application, namely semantic segmentation in traffic scenes.
As mentioned already, we use three different rendering engines to render the data: a very basic one, a Lambertian shader; one from the family of real-time rendering methods, a ray tracer; and one from the family of physics-inspired rendering methods, a Monte Carlo path tracer. At the end of the day, these algorithms solve the same task for different applications. We use Blender as our underlying base and integrate several rendering and annotation plugins. The following figures show the diversity of color images and annotations that our tool can generate.
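For a rough idea of what switching engines looks like in practice, here is a minimal Blender (bpy) sketch; the engine identifiers and sample settings are version dependent, and this is my own illustration rather than the actual plugin code of the tool.

```python
import bpy

def configure_renderer(engine="CYCLES", samples=100):
    """Switch Blender's render engine and, for the path tracer, its
    samples-per-pixel budget. 'CYCLES' is Blender's Monte Carlo path
    tracer; available engine names depend on the Blender version."""
    scene = bpy.context.scene
    scene.render.engine = engine
    if engine == "CYCLES":
        scene.cycles.samples = samples   # samples-per-pixel (spp)

# e.g. render the same scene state with increasing physical accuracy
for spp in (4, 10, 40, 100):
    configure_renderer("CYCLES", spp)
    bpy.context.scene.render.filepath = f"/tmp/mcpt_{spp}.png"
    bpy.ops.render.render(write_still=True)
```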
We choose a particular scene generative model to stochastically generate our scene states. It is based on marked Poisson processes (MPP) coupled with pre-downloaded CAD models. The concept is simple: treat all the objects in the scene as points with attributes. The MPP fires these marked points onto the ground plane and road network such that the object configuration meets some scene layout constraints, which are defined by factor potentials. A good use of factor potentials to incorporate scene layout constraints can be found in the SceneNet work for indoor scenes. We use factors such as bounding-box exclusion, mutual alignment and object placement. A simple scene demonstrates the effect of these factors on the object arrangement.
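To make the generative step concrete, here is a stripped-down sketch of marked-point sampling with a bounding-box-exclusion factor; the attribute ranges and the rejection scheme are my own simplifications of the factor-potential formulation, not the thesis code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scene(ground_extent=50.0, rate=0.01, min_gap=2.0):
    """Draw one scene state from a toy marked Poisson process.

    The number of objects is Poisson(rate * area); each point carries
    'marks' (class, heading, height). A bounding-box-exclusion factor is
    emulated by rejecting points too close to already placed objects.
    """
    area = ground_extent ** 2
    n = rng.poisson(rate * area)
    objects = []
    for _ in range(n):
        pos = rng.uniform(0.0, ground_extent, size=2)
        # exclusion factor: keep a minimum gap between object centres
        if any(np.linalg.norm(pos - o["pos"]) < min_gap for o in objects):
            continue
        objects.append({
            "pos": pos,
            "cls": rng.choice(["pedestrian", "car", "tree"]),
            "heading": rng.uniform(0.0, 2 * np.pi),
            "height": rng.normal(1.7, 0.3),   # pedestrian-like height prior
        })
    return objects

scene = sample_scene()
```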
We take semantic segmentation as a case study to measure the impact of rendering parameters on transfer, i.e. the generalization of a model trained on simulated data. CNNs have been widely adopted for semantic segmentation, for example DeepLab, whose code is publicly available and which has proven to be state of the art on the PASCAL dataset. So we choose it as the vision system to be trained on virtual data. DeepLab is a modified version of FCNs, replacing standard convolutions with a more generic form, atrous convolutions, and appending a CRF at test time. However, we use batch-normalization layers, which seem to work better for transfer.
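For readers unfamiliar with atrous (dilated) convolutions, the following PyTorch snippet shows the operator in isolation; it is only an illustration, not the actual DeepLab code.

```python
import torch
import torch.nn as nn

# A 3x3 atrous convolution with dilation rate r covers a (2r+1)x(2r+1)
# neighbourhood while keeping the same number of weights; padding=r
# preserves the spatial resolution.
atrous = nn.Conv2d(in_channels=256, out_channels=256,
                   kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 256, 64, 64)
y = atrous(x)          # shape stays (1, 256, 64, 64)
```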
We simulate different training sets with different rendering settings, using the Lambertian shader, the ray tracer and the path tracer as renderers. We compare the performance of models trained on the data rendered with these engines; this amounts to measuring the impact of the simulation's modeling errors on the generalization of the trained model. We also render five additional datasets with different parameter settings of the path tracer, which determine the physical accuracy of the data. We refer to these sets as MCPT_X, where X is the number of samples per pixel used. This is equivalent to measuring the impact of computational approximations on transfer.
Let us divide the rendering choices into two parts: choosing the engine itself, and setting its parameters, for example the samples per pixel (spp) in MCPT. We compare the generalization of three DeepLab models trained on three simulated sets rendered with the Lambertian shader, the ray tracer and the path tracer. By comparing these three models on the Cityscapes test data, we hope to get quantitative answers to the questions about the impact of photorealism and of its physical accuracy. The quantitative results of these experiments are shown in the following table and plots.
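The scores discussed below are intersection-over-union numbers; I assume the standard per-class IoU and its mean over classes, as is usual on Cityscapes, since the original text does not spell the metric out.

```latex
% Per-class IoU from true positives, false positives and false negatives,
% and its mean over C classes.
\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c},
\qquad
\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c
```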
Impact of modeling errors and computational approximations: Let us take a close look at the differences between the results labelled lambertian, ray_trace and mcpt_100. The difference between the Lambertian shader and the ray tracer is 20%, which tells us how important the photorealism of the data is. The difference between the ray tracer and MCPT is just 5%, but the rendering is 15 times more expensive; physical accuracy may not be that important given how computationally intensive it is. Comparing among the MCPT sets with different spp settings, performance more or less saturates after 40 samples, which means that extreme levels of photorealism may not be necessary. We need to spend our effort elsewhere for reliable transfer.
Things vs Stuff: Now let us compare object-level performance. The semantic segmentation literature refers to objects with limited spatial extent as things, and to regions without limited extent as stuff. Pedestrians and vehicles are examples of things, whereas ground, sky and vegetation are examples of stuff. The standard deviation measures tell us that transfer is more reliable for stuff than for things.
Localizing the major errors: We also try to localize the major errors in the images. For that, we make use of trimaps. A trimap is a binary map obtained by dilating the edge map with a circular structuring element of some width. We slowly increase this width and plot the corresponding IoU averaged only over the masked output. The picture (above right) clearly shows that the performance deviates most near object boundaries. We think the reason is that in a real-world camera many interesting and complex phenomena happen, depending on the type of camera used: color bleeding, chromatic aberration, lens effects and so on. So it may be important to make sure that our virtual camera is close to the real-world camera used to collect the real-world data.
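A minimal sketch of the trimap evaluation described above, assuming numpy/scipy and integer label images; the helper names are mine and the boundary detection is a simple neighbour comparison rather than whatever edge map the thesis uses.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary_mask(labels):
    """True where a pixel differs from its right or bottom neighbour."""
    edges = np.zeros_like(labels, dtype=bool)
    edges[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    edges[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return edges

def trimap(labels, radius):
    """Binary band of the given radius around object boundaries."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = x * x + y * y <= radius * radius   # circular structuring element
    return binary_dilation(boundary_mask(labels), structure=disk)

def band_iou(pred, gt, radius, cls):
    """IoU for one class, evaluated only inside the trimap band."""
    band = trimap(gt, radius)
    p, g = (pred == cls) & band, (gt == cls) & band
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else float("nan")
```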
Now let us talk about the scene generative model part. Assuming our model representations are correct, how do the choices of model parameters impact the transfer? Do we get any benefit from tuning the model parameters to given labelled real-world data? These are the questions we address now. In the experiments we have seen so far, we simply set the parameters using common sense, for example that pedestrian height should be around 1.7m +/- 0.3m. Now, we tune these parameters to the given real-world data in an adversarial manner, adapting the original GAN approach to our parametric model in a rejection framework.
Initially we place uniform distributions on the scene parameters such as light intensity, object heights, etc., and sample randomly from these pdfs to create a set of scene states. A renderer takes the scene states and renders the image data. This simulated set is given to a discriminator along with real-world data (Cityscapes). The discriminator's job is to classify real vs simulated samples, i.e. output a high probability if a sample is real and a low one if it is simulated. The goal of the generator is to produce samples closer and closer to the target data so that the discriminator gets confused. We hope this eventually leads to a parameter set which generates simulated samples that are statistically similar to the target real samples.
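Here is a rough sketch of that loop under a rejection scheme. render_batch and discriminator are hypothetical placeholders, and the range-shrinking update is my own simplification; the actual update rule in the thesis may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def tune_parameters(render_batch, discriminator, real_images,
                    n_rounds=10, n_samples=64, keep_frac=0.25):
    """Rejection-style adversarial tuning of scene-parameter ranges.

    render_batch(theta, n) -> n images rendered with parameter set theta
    discriminator          -> placeholder with fit(real, fake) and
                              prob_real(images) methods
    Each parameter is kept as a (low, high) uniform range.
    """
    ranges = {"light_intensity": (0.1, 10.0), "pedestrian_height": (1.0, 2.2)}
    for _ in range(n_rounds):
        # sample candidate parameter sets from the current uniform ranges
        thetas = [{k: rng.uniform(*v) for k, v in ranges.items()}
                  for _ in range(n_samples)]
        fakes = [render_batch(t, 1) for t in thetas]
        discriminator.fit(real_images, fakes)
        # keep the candidates the discriminator finds hardest to reject
        scores = np.array([discriminator.prob_real(f) for f in fakes])
        kept = [thetas[i] for i in np.argsort(-scores)[:int(keep_frac * n_samples)]]
        # shrink each range around the surviving candidates
        ranges = {k: (min(t[k] for t in kept), max(t[k] for t in kept))
                  for k in ranges}
    return ranges
```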
We use the Cityscapes and CamVid benchmarks as the sets coming from the target domain, tune our generative model to them, and repeat the experiments as before. We see a small improvement on both sets: around 2.28% on Cityscapes and around 3.42% on CamVid. We must admit these improvements are minimal for the amount of computational resources we gave to the adversarial tuning.
My insights from these experiments are that photorealism of the training data matters much more than strict physical accuracy, whose benefit saturates quickly and comes at a high rendering cost; that transfer is more reliable for stuff classes than for thing classes, with the largest errors concentrated near object boundaries, which points towards the importance of modelling the real camera; and that adversarially tuning the scene-generation parameters to the target data yields only modest gains relative to its computational cost.