Smooth Imitation Learning for Automated Video Editing

This page discusses my work on training a machine learning model to do automated video editing, using Smooth Imitation Learning (Simile) [Hoang M. Le et al., 2016], as well as a software library I created during this project that lets you use this algorithm with your own data. You can find the code for the Simile library here.

I start by describing the proposed problem and why Simile was a good solution for it. Then, I discuss a few details of how the algorithm works, and finally, I explain how to use the Simile library with your own application.

Before I delve into more details, let's check some of the results. The videos below contain two stacked frames: the bottom one is the raw input (wide-angle video from a static camera), while the top one is the corresponding AI-edited frame. The AI-edited videos are not only centered on the action, but also display smooth frame transitions, making the results aesthetically appealing to the viewer.

AI-Edited Video 1

AI-Edited Video 2

System Overview

This is an overview of the whole system. The system takes wide-angle videos from a static camera placed at a fixed location as input, and outputs edit parameters (Zoom, X, Y) on a frame-by-frame basis. These parameters are used to edit the original video frames and create edited videos that focus on where the action is. 

Basic Components:


In this project, I used a pre-trained CNN (SSD) to detect people and then performed some feature processing to filter out non-runner detections. Then, I used the position percentiles of the remaining detections as my final features. More specifically, I chose seven position percentiles to represent the runners' dispersion on the running field (T0, T10, T25, T50, T75, T90, T100), giving 7 features to describe the environment.
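For illustration, here is a minimal sketch of how such percentile features could be computed from the filtered detections. The function and variable names are placeholders (not part of the Simile library), and using the horizontal coordinate of each detection is an assumption on my part:

```python
import numpy as np

def frame_features(detections):
    """Summarize runner positions in one frame with seven position percentiles."""
    xs = np.array([x for x, _ in detections], dtype=float)
    if xs.size == 0:
        return np.full(7, np.nan)  # no runners detected in this frame
    return np.percentile(xs, [0, 10, 25, 50, 75, 90, 100])  # T0 ... T100
```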


A key requirement for this model is that its predictions need to be smooth, i.e., the model needs to learn to reproduce the expert demonstrations as accurately as possible while also taking into account the previous edits it has made, so that frame transitions are fluid and appealing to the viewer.

Supervised Learning Baseline

This baseline model was trained using XGBoost.  I used the same environment features described above and added some history information to the training data. 

For this case, I chose to keep track of the features from the 10 most recent frames (τ=10), such that the input space contained 7×10 = 70 features arranged as:

X = [xt, xt-1, ..., xt-9]
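Here is a rough sketch of how such a baseline could be assembled. It assumes `features` is a (num_frames, 7) array of per-frame percentile features and `targets` is a (num_frames, 3) array of expert edits [Zoom, X, Y]; names and model settings are illustrative, not the project's actual training script:

```python
import numpy as np
import xgboost as xgb

TAU = 10  # number of stacked frames of context

def stack_history(features, tau=TAU):
    """Concatenate each frame's features with those of the previous tau-1 frames."""
    rows = []
    for t in range(tau - 1, len(features)):
        window = features[t - tau + 1 : t + 1][::-1]  # current frame first
        rows.append(np.concatenate(window))
    return np.asarray(rows)  # shape: (num_frames - tau + 1, 7 * tau)

X_train = stack_history(features)
# XGBoost regressors are single-output, so one model is trained per edit parameter.
models = [
    xgb.XGBRegressor(n_estimators=200).fit(X_train, targets[TAU - 1:, k])
    for k in range(3)
]
```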

As one can see, the framing looks fairly reasonable, which indicates that the model has some predictive power; however, the frame transitions are not smooth. This happens because the model does not take into account previous actions taken by the policy as part of its decision-making. This illustrates the need for a smooth imitation learning model.

Smooth Imitation Learning

The algorithm [Hoang M. Le et al., 2016] allows one to train policies that are constrained to make smooth predictions in a continuous action space, given sequential input from an exogenous environment and the previous actions taken by the policy. The resulting policies are able to generate smooth action sequences in response to context sequences in an online fashion.

This framework fits the problem of automated video editing very well: given the context sequence, we'd like to make sequential predictions for the 3 editing parameters while also keeping these predictions close to recent previous edits. Beyond automated video editing, the algorithm also finds good applications in problems such as smooth control of self-driving vehicles for obstacle avoidance, helicopter aerobatics in the presence of turbulence, smart grid management for external energy demand, and automated camera planning [Chen et al., 2016], among others.

In this section, I'll go over a few important aspects of the paper.

Simile Problem Formulation

Let X = {x1, …, xT} ⊂ χ be a context sequence from an environment χ (in my case, the features from the video), and let A = {a1, …, aT} ⊂ A be an action sequence from an action space A (in my case, the 3 edit parameters [Zoom, X, Y]). The state space is defined in the paper as S = { st = [xt, at-1] }, such that policies can be viewed as mappings from states to actions, π : S → A. In other words, a policy π generates an action sequence A in response to a context sequence X.

The rollout of a policy is given by:

at = π(st) = π([xt, at-1])

st+1 = [xt+1, at],    ∀ t ∈ [1, …, T]

You can see from the way the rollout is formalized that the prediction at from the current state st is part of the next state st+1, which will in turn generate a prediction at+1 that becomes part of the subsequent state. This is how the sequence prediction works, and since each state st contains information about the previous action, it is possible to enforce that at stays close to at-1. To make this closeness more formal, the authors of the paper introduce the concept of a Smooth Policy Class, which I'll go over in the next section.
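The rollout above can be written as a short loop. In this sketch, `policy`, `contexts`, and `a_init` are placeholders for an arbitrary trained policy, the sequence of context features, and the initial action, not Simile library objects:

```python
import numpy as np

def rollout(policy, contexts, a_init):
    """Generate an action sequence online: each prediction becomes part of the next state."""
    actions = [np.asarray(a_init)]
    for x_t in contexts:
        s_t = np.concatenate([x_t, actions[-1]])  # s_t = [x_t, a_{t-1}]
        actions.append(policy(s_t))               # a_t = pi(s_t)
    return np.asarray(actions[1:])
```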

With that being said, the goal of this imitation learning problem is to find a policy π̂ ∈ Π that minimizes the imitation loss, considering Π to be a smooth policy class:

π̂ = argminπ∈Π 𝔼 [ ‖π(s) − a*‖ ],    where a* is the expert demonstration for state s

Smooth Policy Class

Given F, any complex supervised model class (e.g., neural networks, decision trees), and H, the space of smooth analytic functions (e.g., linear auto-regressors), the authors define the Smooth Policy Class Π ⊂ F × H such that:

Π ≜ { π = (f, h), f ∈ F, h ∈ H | π(s) is close to both f(x, a) and h(a), ∀ induced states s = [x, a] ∈ S }

In other words, the prediction of a smooth policy is bounded to be close to both the prediction of your supervised learning model and the prediction of your smooth function by definition.

To achieve this, the authors use alternating optimization between F and H, such that the prediction of the policy π can be viewed as a regularized optimization over the action space that ensures closeness of π to both f and h:

π(x, a) = argmina′∈A  ‖ f(x, a) − a′ ‖² + λ ‖ h(a) − a′ ‖²

π(x, a) = ( f(x, a) + λ h(a) ) / (1 + λ)

You can see from the resulting π(x, a) that λ trades off closeness to f against closeness to previous actions. For large λ, the policy is encouraged to make predictions that stay close to the previous actions a (and are therefore smoother), while for smaller values of λ, the policy is encouraged to make predictions that are closer to those of the supervised model (less smooth, but sometimes more accurate). You can see an example of how this formula works in Fig. 1 on the right.
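A minimal sketch of this blended prediction, assuming `f` and `h` stand in for already-trained models:

```python
def smooth_policy_predict(f, h, x, a_prev, lam=10.0):
    """Blend the supervised prediction f(x, a) with the smooth regressor h(a)."""
    f_pred = f(x, a_prev)   # complex model, conditioned on context and past actions
    h_pred = h(a_prev)      # smooth function of previous actions only
    return (f_pred + lam * h_pred) / (1.0 + lam)
```

With λ = 10, for example, the blend weights h(a) by 10/11, so the prediction stays much closer to the previous actions than to the supervised output, matching the intuition in Fig. 1.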



Fig. 1: This graph provides some intuition on how π(x, a) behaves at prediction time. a* is the expert demonstration (ground truth), f(x, a) is the output of the supervised model, h(a) is the output of a linear auto-regressor trained to predict the current action from previous actions, and π(x, a) is the final prediction of the policy, for λ = 10.

Supervised Learning Reduction and Policy Update

Fig.2 on the right shows how Simile works, step-by-step. 

At a high level, Simile is a boosting-style algorithm in which, at each training iteration (n = 1, ..., N), a new policy is trained and added to the ensemble. The new policy πn is a linear interpolation between the previous policy πn-1 and the newly learned π̂n:

πn = β π̂n + (1 − β) πn-1

The parameter β is adaptively selected based on the relative empirical loss of π and π̂ w.r.t. the ground truth {at}. Therefore, the value of β at each learning iteration reflects the quality of π̂n, which encourages the learner to disregard bad policies and converge to a good policy.
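One simple way to realize this update is to interpolate the predictions of the two policies; this is only a sketch, and the library's internal representation of the ensemble may differ:

```python
def interpolate_policies(pi_prev, pi_hat, beta):
    """Return a policy whose predictions blend pi_hat with the previous policy."""
    def pi_new(state):
        return beta * pi_hat(state) + (1.0 - beta) * pi_prev(state)
    return pi_new
```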


Moreover, the authors reduce the smooth imitation learning problem to a supervised learning problem. You can see this reduction in lines 7-8 of the algorithm, where at each training iteration there is a two-step procedure:


One important thing to note is that the algorithm shown in Fig. 2 implies the smooth regularizer is a linear auto-regressor, since it states that in steps 1 and 7 a linear model should be trained to predict the action at based on previous actions. However, this is not a requirement: you could also use simple functions of previous actions to predict at, with no training required. In fact, the type of smooth regularizer is an option you can choose in the Simile library, in which case the steps in lines 1 and 7 may or may not occur, depending on your choice.



Fig. 2 : Simile algorithm

Smooth Feedback

Fig. 3: Smooth feedback example

This step happens on line 6 of the algorithm shown in Fig. 2. Before each learning step, a "virtual" target is created and used as the target for that learning step instead of the actual expert demonstration (a*). The closeness of this virtual target (â) to the actual target (a*) is controlled by a parameter σ.

You can see intuitively how this works in Fig. 3. Consider the orange curve to be the rollout from a policy (an), and the black curve to be the expert demonstration (a*). When the rollout differs substantially from a*, the policy's mistakes lead to the formation of "bad" states, because the previous actions taken by the policy are part of the state, and this can make learning infeasible if a* is used as the target. This happens especially during early learning iterations, when the policy has not yet learned to imitate the expert well.

Using virtual targets that lie between the rollout (an) and a* makes training "easier" for the learner during early iterations and allows for more stable learning. The parameter σ should decay exponentially as n increases, so that when n is large enough, σ → 0 and, as a result, â → a*.
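As a hedged sketch, the virtual target can be thought of as an interpolation between the rollout and the expert demonstration, with σ controlling how far it sits from a*; the exact interpolation used in the library may differ:

```python
import numpy as np

def virtual_targets(rollout_actions, expert_actions, sigma):
    """Blend the rollout a_n and the expert a* into training targets a_hat."""
    a_n = np.asarray(rollout_actions)
    a_star = np.asarray(expert_actions)
    return sigma * a_n + (1.0 - sigma) * a_star
```

With σ = 0 this recovers the expert demonstrations exactly, which is the regime reached in later iterations.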

Using the SIMILE Library with Your Own Application

The Simile library allows you to use this algorithm to train your own policies on your own data. The library takes as input config files with the training parameters of your choice. You can find more information about how to prepare these config files in the library documentation.

Choosing training parameters correctly requires a good understanding of how the algorithm works, since the best choice of parameters varies according to data and application. This library is a tool that can help you train your policies and analyze your results.

I hope the previous sections gave you some intuition to help you get started with your own project. In this section, I'll go over more details of some key parameter choices.

Training Parameters

Data Parameters

With that being said, this parameter is optional. If you provide it, it will be used as the pre-defined initial value for all episodes. If not provided, the initial values are taken automatically from the data.

SIMILE Parameters

 s= [xt, xt-1, xt-2, xt-3, at-1, at-2, at-3]

σ = σ0 / 1.6^(n−1), where n is the learning iteration number.

σ0 should be a number in [0, 1]. The right choice for this parameter will depend on your data. If the rollout at the first learning iterations looks bad and does not get better after more iterations, increasing this parameter may (or may not) solve the problem by bringing the rollout toward the expert demonstrations more slowly and making the learning problem more stable.
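As a small helper, the decay schedule above could be written as:

```python
def sigma_at_iteration(sigma_0, n):
    """Decay schedule from above: sigma_0 at n = 1, approaching 0 for large n."""
    return sigma_0 / (1.6 ** (n - 1))
```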

π(x, a) = ( f(x, a) + λ h(a) ) / (1 + λ)

One important difference between these two choices is that XGBoost does not support multi-output regression. Therefore, if you have a multi-dimensional target, XGBoost will automatically train and predict one regressor per target. As a result, this choice will not leverage potential relationships between your targets that could help with prediction, so depending on your data, you may be better off using a feed-forward neural network.

Moreover, the code is modular enough that, should you wish to write your own supervised model class, you can easily do so by creating classes similar to NeuralNet (neuralnet.py) and XGBoost (parallel_tree_boosting.py): you basically need to write a "train" function and a "get_raw_prediction" function for your custom class.
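A hypothetical skeleton of such a class is sketched below. Only the "train" and "get_raw_prediction" method names come from the description above; the constructor, its arguments, and the use of a ridge regressor are assumptions for illustration:

```python
from sklearn.linear_model import Ridge

class RidgePolicyModel:
    """Hypothetical custom supervised model class exposing the two expected methods."""

    def __init__(self, alpha=1.0):
        self.model = Ridge(alpha=alpha)

    def train(self, states, targets):
        # Fit on induced states s = [x, a] and their (virtual) targets.
        self.model.fit(states, targets)

    def get_raw_prediction(self, states):
        # Return raw action predictions for a batch of states.
        return self.model.predict(states)
```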

at = at-1 + γ (at-1 − at-2) + γ² (at-2 − at-3) + γ³ (at-3 − at-4)
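A minimal sketch of this training-free regularizer, assuming past actions are available as an array ordered oldest to newest and γ is a discount factor you choose:

```python
def extrapolate_action(a, gamma=0.5):
    """Predict a_t from recent action differences, discounted by gamma (no training needed)."""
    # a[-1] = a_{t-1}, a[-2] = a_{t-2}, a[-3] = a_{t-3}, a[-4] = a_{t-4}
    return (a[-1]
            + gamma * (a[-1] - a[-2])
            + gamma ** 2 * (a[-2] - a[-3])
            + gamma ** 3 * (a[-3] - a[-4]))
```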

Operation Options

This option builds the first state using only environment features. That way, the rollout of the first policy will be very close to what you'd get using pure supervised learning (and a little smoother).

If the results of this step are not good, you should probably change some model parameters or the model architecture, or get better features, before proceeding.

Test Parameters

Data Parameters

Acknowledgement

I would like to express my deepest gratitude to Prof. Yisong Yue, Prof. Pietro Perona, and Hoang M. Le for their guidance throughout this project.

I would also like to thank Quiet Machines LLC for providing us with the labelled videos that made this project possible, as well as allowing us to release some of the video results.

Luciana Cendon

I am a Research Engineer working with Machine Learning and Computer Vision.  I particularly enjoy working at the intersection between research and real-world applications, turning state-of-the-art algorithms into practical solutions.