Smooth Imitation Learning for Automated Video Editing

This page discusses my work on training a machine learning model to do automated video editing using Smooth Imitation Learning (Simile) [Hoang M. Le et al., 2016], as well as a software library I created during this project that lets you use this algorithm with your own data. You can find the code for the Simile library here.

I start by describing the proposed problem and how Simile was a good solution for it. Then, I discuss a few details on how the algorithm works, and finally, I explain how to use the Simile library with your own application.

Before I delve into more details, let's check some of the results. The videos below contain two stacked frames: the bottom one is the raw input (wide-angle videos from a static camera), while the top one is the corresponding AI-edited frame. The AI-edited videos are not only centered on the action, but also display smooth frame transitions, making the resulting videos aesthetically appealing to the viewer.

AI-Edited Video 1

AI-Edited Video 2

System Overview

This is an overview of the whole system. The system takes wide-angle videos from a static camera placed at a fixed location as input, and outputs edit parameters (Zoom, X, Y) on a frame-by-frame basis. These parameters are used to edit the original video frames and create edited videos that focus on where the action is.
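To make the role of the edit parameters concrete, here is a minimal sketch of turning one (Zoom, X, Y) triple into a crop rectangle. The convention is my assumption for illustration only: zoom is a magnification factor of at least 1, and (x, y) is the desired crop center in pixels. The function name crop_window is hypothetical and is not part of the Simile library.

```python
def crop_window(zoom, x, y, frame_w, frame_h):
    """Return (left, top, width, height) of the crop implied by (zoom, x, y).

    Assumed convention (hypothetical): zoom >= 1 is a magnification factor
    and (x, y) is the desired crop center in pixels.
    """
    if zoom < 1:
        raise ValueError("zoom must be >= 1 under this convention")
    width = frame_w / zoom
    height = frame_h / zoom
    # Clamp the center so the crop window stays inside the original frame.
    cx = min(max(x, width / 2), frame_w - width / 2)
    cy = min(max(y, height / 2), frame_h - height / 2)
    return cx - width / 2, cy - height / 2, width, height


print(crop_window(2.0, 960, 540, 1920, 1080))
```

Applying this to every frame with the predicted parameters would give the edited video; how smooth that video looks depends entirely on how smooth the predicted sequence is.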

Basic Components:

    • Training Data:

      • Input Videos: 47 videos of ~40s each (~32min total)

      • Expert Demonstrations: 3-dimensional labels (Zoom, X, Y) for each frame, enough to reproduce the framing.

    • Feature Extraction: extracts features from the input videos on a frame-by-frame basis. These features are used by the Machine Learning model as environment features, to help with the video framing prediction (Zoom, X, Y).

In this project, I used a pre-trained CNN (SSD) to detect people and then performed some feature processing to filter out non-runner detections. I then used the position percentiles of the remaining detections as my final features. More specifically, I chose seven position percentiles to represent their dispersion on the running field (T0, T10, T25, T50, T75, T90, T100), giving 7 features to describe the environment.

    • Smooth Machine Learning Model: the model is trained using features extracted from the videos (environment information), and expert demonstrations as a target.

A key requirement for this model is that the predictions need to be smooth, i.e., the model needs to learn how to reproduce the expert demonstrations as accurately as possible while also taking into account its own previous edits, so that frame transitions are fluid and appealing to the viewer.
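The percentile features described under Feature Extraction can be sketched as follows. This is a minimal illustration, assuming each detection has already been reduced to a single 1-D position on the field and that non-runner detections were filtered out beforehand. The function name and the NaN fallback for frames without detections are my assumptions, not the project's code.

```python
import numpy as np

# The seven percentiles named in the text: T0, T10, T25, T50, T75, T90, T100.
PERCENTILES = (0, 10, 25, 50, 75, 90, 100)


def percentile_features(positions):
    """Summarize the detection positions of one frame as 7 percentile values."""
    positions = np.asarray(positions, dtype=float)
    if positions.size == 0:
        # No detections in this frame: returning NaNs is an assumption of this sketch.
        return np.full(len(PERCENTILES), np.nan)
    return np.percentile(positions, PERCENTILES)


print(percentile_features([120.0, 340.0, 355.0, 410.0, 900.0]))
```

Percentiles give a fixed-length description no matter how many people are detected, which is what lets a standard regressor consume them frame by frame.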

Supervised Learning Baseline

This baseline model was trained using XGBoost. I used the same environment features described above and added some history information to the training data.

For this case, I chose to keep track of features from the past 10 frames (τ=10), such that the input space contained 7*10=70 features arranged as:

X = [x_t, x_{t-1}, ..., x_{t-10}]

As one can see, the framing looks fairly reasonable, which indicates that the model has some predictive power. However, the frame transitions are not smooth. This happens because the model does not take into account previous actions taken by the policy as part of the decision-making. This illustrates the need for a Smooth Machine Learning model.
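The history stacking used for this baseline can be sketched like this. It is a minimal illustration: stack_history is a hypothetical helper, I assume the window counts the current frame (so n_frames=10 gives the 70 columns mentioned above), and repeating the first frame to pad early rows is also my assumption.

```python
import numpy as np


def stack_history(features, n_frames):
    """Concatenate each frame's features with those of the preceding frames.

    features: array of shape (T, d).
    Returns an array of shape (T, n_frames * d) whose row t is
    [x_t, x_{t-1}, ..., x_{t-n_frames+1}].
    """
    features = np.asarray(features, dtype=float)
    n_rows = len(features)
    lagged = []
    for lag in range(n_frames):
        keep = max(n_rows - lag, 0)
        shifted = np.empty_like(features)
        shifted[lag:] = features[:keep]
        # Early frames have no history yet; repeat the first frame (an assumption).
        shifted[:lag] = features[0]
        lagged.append(shifted)
    return np.hstack(lagged)


env = np.random.default_rng(0).random((50, 7))  # 50 frames, 7 percentile features
print(stack_history(env, 10).shape)
```

Note that the stacked input contains only environment features. Nothing in it tells the model what it predicted on the previous frame, which is why the baseline output jitters.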

Smooth Imitation Learning

The algorithm [Hoang M. Le et al., 2016] allows one to train policies that are constrained to make smooth predictions in a continuous action space, given sequential input from an exogenous environment and previous actions taken by the policy. The resulting policies are able to generate smooth action sequences in response to a context sequence in an online fashion.

This framework fits very well with the problem of automated video editing, in which we'd like to make sequential predictions for the 3 editing parameters, given a context sequence, while also making sure these predictions stay close to recent previous edits. Beyond automated video editing, the algorithm can also be applied to problems such as smooth self-driving vehicles for obstacle avoidance, helicopter aerobatics in the presence of turbulence, smart grid management for external energy demand, and automated camera planning [Chen et al., 2016], among others.

In this section, I'll go over a few important aspects of the paper.

Simile Problem Formulation

Let X = {x_1, …, x_T} ⊂ χ^T be a context sequence from an environment χ (in my case, the features from the video), and let A = {a_1, …, a_T} ⊂ A^T be an action sequence from an action space A (in my case, the 3 edit parameters [Zoom, X, Y]). The state space is defined in the paper as S = { s_t = [x_t, a_{t-1}] }, so that policies can be viewed as mapping states to actions, π : S → A. In other words, a policy π generates an action sequence A in response to a context sequence X.

The rollout of a policy is given by:

a_t = π(s_t) = π([x_t, a_{t-1}])

s_{t+1} = [x_{t+1}, a_t], ∀t ∈ [1, …, T]

You can see from the way the rollout is formalized that the prediction a_t from the current state s_t is part of the next state s_{t+1}, which will in turn generate a prediction a_{t+1} that will be part of the subsequent state. This is how the sequence prediction works, and since each state s_t contains information about the previous action, it is possible to enforce that a_t stays close to a_{t-1}. To make this closeness more formal, the authors of the paper introduce the concept of a Smooth Policy class, which I'll go over in the next section.
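The rollout equations translate almost directly into code. Below is a minimal sketch in which the policy is any callable that maps a state vector to an action vector. The function name rollout and the toy policy are hypothetical and only serve to show the feedback of a_t into s_{t+1}.

```python
import numpy as np


def rollout(policy, contexts, initial_action):
    """Roll a policy forward: a_t = policy([x_t, a_{t-1}]) for every frame.

    contexts: array of shape (T, d_x); initial_action: array of shape (d_a,).
    """
    prev_action = np.asarray(initial_action, dtype=float)
    actions = []
    for x_t in np.asarray(contexts, dtype=float):
        state = np.concatenate([x_t, prev_action])  # s_t = [x_t, a_{t-1}]
        prev_action = np.asarray(policy(state), dtype=float)
        actions.append(prev_action)  # a_t becomes part of s_{t+1}
    return np.array(actions)


def toy_policy(state):
    # Toy example: move halfway from the previous action toward the context.
    return 0.5 * state[:1] + 0.5 * state[1:]


print(rollout(toy_policy, [[2.0], [2.0], [2.0]], [0.0]))
```

Because every prediction is fed back into the next state, an early mistake changes all the states that follow. That is the reason training on expert states alone is not enough for this kind of problem.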

With that being said, the goal of this imitation learning problem is to find a policy π̂ ∈ Π that minimizes the imitation loss, considering Π to be a smooth policy class:

π̂ = argmin_{π∈Π} ‖π*(s) - π(s)‖², where π*(s) denotes the expert's action in state s.

Smooth Policy Class

Taking F to be any complex supervised model class (e.g., neural networks, decision trees), and H to be the space of smooth analytic functions (e.g., linear auto-regressors), the authors define the Smooth Policy class Π ⊂ F×H such that:

Π ≜ { π = (f, h), f ∈ F, h ∈ H | π(s) is close to both f(x, a) and h(a), ∀ induced state s = [x, a] ∈ S }

In other words, the prediction of a smooth policy is, by definition, constrained to be close to both the prediction of your supervised learning model and the prediction of your smooth function.

To achieve this, the authors use alternating optimization between F and H, such that the prediction of policy π can be viewed as a regularized optimization over the action space that ensures closeness of π to both f and h:

π(x, a) = argmin_{a′∈A} ‖f(x, a) - a′‖² + λ ‖h(a) - a′‖²

π(x, a) = (f(x, a) + λ h(a)) / (1 + λ)

You can see from the resulting π(x, a) that λ trades off closeness to f against closeness to previous actions. For large λ, the policy is encouraged to make predictions that stay close to previous actions a (therefore smoother), while for smaller values of λ, the policy is encouraged to make predictions that are closer to the predictions of the supervised model (less smooth, but sometimes more accurate). You can see an example of how that formula works in Fig. 1.
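A small numerical check of the closed form above may help build intuition. This is only a sketch: the function names and the example values for f and h are made up for illustration.

```python
import numpy as np


def smooth_prediction(f_pred, h_pred, lam):
    """Closed-form minimizer of ||f - a'||^2 + lam * ||h - a'||^2 over a'."""
    f_pred = np.asarray(f_pred, dtype=float)
    h_pred = np.asarray(h_pred, dtype=float)
    # Setting the gradient -2(f - a') - 2*lam*(h - a') to zero gives this weighted average.
    return (f_pred + lam * h_pred) / (1.0 + lam)


def objective(action, f_pred, h_pred, lam):
    """Value of the regularized objective at a candidate action."""
    action, f_pred, h_pred = (np.asarray(v, dtype=float) for v in (action, f_pred, h_pred))
    return float(np.sum((f_pred - action) ** 2) + lam * np.sum((h_pred - action) ** 2))


f_pred = np.array([1.0, 0.6, 0.2])  # made-up supervised prediction for (Zoom, X, Y)
h_pred = np.array([0.5, 0.5, 0.5])  # made-up smooth-regularizer prediction
for lam in (0.0, 1.0, 10.0):
    print(lam, smooth_prediction(f_pred, h_pred, lam))
```

With λ=0 the result equals f, and as λ grows the result moves toward h, which matches the trade-off described above.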

Fig. 1: The graph above is meant to provide some intuition on how π(x, a) works during prediction time. a* is the expert demonstration (ground truth), f(x, a) is the output of the supervised model, h(a) is the output of a linear auto-regressor trained to predict the current action based on previous actions, and π(x, a) is the final prediction of the policy, for λ=10.

Supervised Learning Reduction and Policy Update

Fig. 2 shows how Simile works, step by step.

From a high level, Simile is a boosting algorithm in which, at each training iteration (n = 1, ..., N), a new policy is trained and added to the ensemble. The new policy π_n is a linear interpolation between the previous π_{n-1} and the newly learned π̂_n:

π_n = β π̂_n + (1 - β) π_{n-1}

The parameter β is adaptively selected based on the relative empirical loss of π and π̂ with respect to the ground truth {a_t}. The value of β at each learning iteration therefore reflects the quality of π̂_n, which encourages the learner to disregard bad policies and converge to a good one.
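The interpolation step can be sketched as below. The rule used for β here, the previous policy's loss divided by the sum of both losses, is one choice consistent with "relative empirical loss". Treat it as my assumption rather than the library's exact code. All function names are hypothetical.

```python
import numpy as np


def select_beta(loss_prev, loss_new):
    """Weight for the new policy: larger when the new policy has the lower loss.

    Assumed rule: beta = loss_prev / (loss_prev + loss_new).
    """
    total = loss_prev + loss_new
    if total == 0:
        return 0.5  # both policies are perfect; any weight gives the same result
    return loss_prev / total


def interpolate(policy_new, policy_prev, beta):
    """Return pi_n(s) = beta * pi_hat_n(s) + (1 - beta) * pi_{n-1}(s)."""
    def policy(state):
        new = np.asarray(policy_new(state), dtype=float)
        prev = np.asarray(policy_prev(state), dtype=float)
        return beta * new + (1.0 - beta) * prev
    return policy


beta = select_beta(loss_prev=3.0, loss_new=1.0)
combined = interpolate(lambda s: np.array([4.0]), lambda s: np.array([0.0]), beta)
print(beta, combined(np.zeros(2)))
```

Under this rule a new policy that is much worse than the current one gets a weight close to 0, so it barely changes the ensemble.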

Moreover, the authors reduced the smooth imitation learning problem to a supervised learning problem. In fact, you can see the supervised learning reduction in lines 7-8, where at each training iteration, there's a two-step procedure:

    • Update smooth regularizer h_n

    • Train new π̂_n via supervised learning

One important thing to note is that the algorithm shown in Fig. 2 implies that the smooth regularizer is a linear auto-regressor, since lines 1 and 7 state that a linear model should be trained to predict action a_t based on previous actions. However, this is not a requirement: you could also use simple functions of previous actions to predict a_t, without any training required. In fact, the type of smooth regularizer is an option you can choose in the Simile library, and lines 1 and 7 may or may not be executed depending on your choice.

Fig. 2 : Simile algorithm

Smooth Feedback

Fig. 3: Smooth feedback example

This step happens on line 6 of the algorithm shown in Fig. 2. You can see that before each learning step, a "virtual" target is created and used as the target for the learning step instead of the actual expert demonstration (a*). The closeness of this virtual target (â) to the actual target (a*) is controlled by a parameter σ.

You can see intuitively how this works in Fig. 3. Consider the orange curve to be the rollout from a policy (a_n), and the black curve to be the expert demonstration (a*). When the rollout differs substantially from a*, the policy's mistakes lead to "bad" states, because the previous actions taken by the policy are part of the state. This can make learning infeasible if a* is used as the target. It happens especially during early learning iterations, when the policy has not yet learned to imitate the expert well.

Using virtual targets that lie between the rollout (a_n) and a* can make training "easier" for the learner during early iterations and allow for more stable learning. The parameter σ should decay exponentially as n increases, so that when n is large enough, σ→0 and, as a result, â→a*.
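Here is a minimal sketch of the smooth feedback step. I assume the virtual target is a convex combination of the rollout and the expert demonstration, which matches the description above (it lies between the two and tends to a* as σ→0). The decay base of 1.6 is the one given for the sigma parameter later on this page. Function names are hypothetical.

```python
import numpy as np


def sigma_at(sigma0, n, base=1.6):
    """Smooth-feedback weight at learning iteration n (n starts at 1)."""
    return sigma0 / base ** (n - 1)


def virtual_target(rollout_actions, expert_actions, sigma):
    """Assumed form: a_hat = sigma * a_n + (1 - sigma) * a_star."""
    rollout_actions = np.asarray(rollout_actions, dtype=float)
    expert_actions = np.asarray(expert_actions, dtype=float)
    return sigma * rollout_actions + (1.0 - sigma) * expert_actions


expert = np.array([1.0, 1.0, 1.0])
policy_rollout = np.array([0.0, 0.2, 0.4])
for n in (1, 3, 10):
    sigma = sigma_at(0.8, n)
    print(n, round(sigma, 4), virtual_target(policy_rollout, expert, sigma))
```

In early iterations the target stays close to what the policy actually produced, and by later iterations it is nearly identical to the expert demonstration.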

Using SIMILE Library with your own application

The Simile library allows you to use this algorithm to train your own policies with your own data. The library takes as input config files with training parameters of your choice. You can find more information about how to prepare these config files in the library documentation.

Choosing training parameters correctly requires a good understanding of how the algorithm works, since the best choice of parameters varies according to data and application. This library is a tool that can help you train your policies and analyze your results.

I hope the previous sections gave you some intuition to help you get started with your own project. In this section, I'll go over more details of some key parameter choices.

Training Parameters

Data Parameters

  • N_features: As environment features, I used the position percentiles of the centroids of the bounding boxes of detected people at each frame (after some outlier filtering). I chose seven position percentiles to represent their dispersion on the running field (T0, T10, T25, T50, T75, T90, T100), therefore, 7 features.

  • N_targets: my action space was 3 dimensional since I wanted to predict 'Zoom', 'X', 'Y' simultaneously. Therefore, 3 targets.

  • model_dir: directory in which to save your model files. I would strongly recommend a dedicated directory for each model you train. Multiple files get generated during training, and keeping several models in the same directory can cause conflicts and possibly overwrite important files if you're not careful.

  • init_value: when running Simile in prediction mode, you must specify the first value of the sequence. Depending on the nature of your data, you have two choices. You could train your model using a single pre-defined initial value for all episodes during rollout, which makes prediction mode easier because you just reuse that same value. Alternatively, you could take the initial values from the expert annotations. This works well for training, but you'd need to make sure that initial sequence values are available during prediction.

With that being said, this parameter is optional. If you provide it, this will be your pre-defined initial value for all episodes. If not provided, it will automatically take them from the data.

SIMILE Parameters

  • tao: time horizon (τ). Represents how many past features you wish to include as part of the state. You can see how τ affects the state formation in Fig. 2, lines 1 and 4. As an example, if τ=3, the state at time t would be defined as:

s_t = [x_t, x_{t-1}, x_{t-2}, x_{t-3}, a_{t-1}, a_{t-2}, a_{t-3}]

  • sigma: this parameter (σ_0) corresponds to the smooth feedback on the first policy. In this implementation, it decays exponentially with the number of iterations according to this formula:

σ = σ_0 / 1.6^{n-1}, where n is the learning iteration number.

σ_0 should be a number in [0, 1]. The right value depends on your data. If the rollouts in the first learning iterations look bad and do not improve after more iterations, increasing this parameter may (or may not) solve the problem by slowly bringing the rollout closer to the expert demonstrations and making the learning problem more stable.

  • lambd_smooth: regularization parameter (λ) that determines how close the prediction should be to the supervised model prediction (f) versus the smooth function (h), according to this relationship (also shown in Fig. 1):

π(x, a) = (f(x, a) + λ h(a)) / (1 + λ)

  • n_it: number of learning iterations; corresponds to the 'N' on line 3 of Fig. 2.

  • policy_type: choice of supervised model class. In this implementation, you can choose between:

    • neuralnet: simple 2-layer feedforward neural network. If you wish to add more layers (or remove layers), you can do so by changing function "build_network()" inside NeuralNet class.

    • xgboost: uses parallel tree boosting model from XGBoost library.

One important difference between these two choices is that the xgboost option does not perform multi-output regression. If you have a multi-dimensional target, it will automatically train and predict with one regressor per target. As a result, this choice will not leverage potential relationships between your targets that could help with prediction. So depending on your data, you may be better off using a feedforward neural network.

Moreover, the code is modular enough that if you wish to write your own supervised model class, you can easily do so by creating classes similar to NeuralNet and XGBoost; you'd basically need to write a "train" function and a "get_raw_prediction" function for your custom class.

  • autoreg_type: choice of smooth regularizer. You can choose between:

    • average: takes the average of previous τ-1 actions

    • linear: prediction of an auto-regressor trained to predict the next action given τ-1 actions. If this option is chosen, then lines 1 and 7 from Fig. 2 will be executed, i.e., before each training step, a linear auto-regressor will be trained to predict the action at time t given the previous τ-1 actions.

    • constant: uses the action taken immediately previous to the one it is trying to predict.

    • geometric_velocity: uses a weighted velocity relationship. For τ=5:

a_t = a_{t-1} + γ (a_{t-1} - a_{t-2}) + γ² (a_{t-2} - a_{t-3}) + γ³ (a_{t-3} - a_{t-4})
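The training-free smooth regularizers listed above (constant, average, geometric_velocity) can be sketched as follows. This is an illustration rather than the library's code: the function names are hypothetical, previous actions are passed most recent first, and for geometric_velocity I assume each term uses the next pair of consecutive actions. The linear option is omitted because it needs a fitting step.

```python
import numpy as np


def h_constant(prev_actions):
    """Repeat the most recent action a_{t-1}."""
    return np.asarray(prev_actions, dtype=float)[0]


def h_average(prev_actions):
    """Average of the previous actions."""
    return np.asarray(prev_actions, dtype=float).mean(axis=0)


def h_geometric_velocity(prev_actions, gamma):
    """a_{t-1} plus geometrically weighted past velocities.

    prev_actions: array of shape (k, d) ordered [a_{t-1}, a_{t-2}, ..., a_{t-k}].
    """
    prev_actions = np.asarray(prev_actions, dtype=float)
    # Row i holds a_{t-1-i} - a_{t-2-i}: the most recent velocity comes first.
    velocities = prev_actions[:-1] - prev_actions[1:]
    weights = gamma ** np.arange(1, len(velocities) + 1)
    return prev_actions[0] + weights @ velocities


# Four previous 1-D actions (the tau=5 case): a_{t-1}=4, a_{t-2}=3, a_{t-3}=2, a_{t-4}=1.
history = np.array([[4.0], [3.0], [2.0], [1.0]])
print(h_constant(history), h_average(history), h_geometric_velocity(history, 0.5))
```

The constant and average options pull the prediction toward where the framing already is, while the velocity option extrapolates the recent motion, which may suit footage where the action keeps moving in one direction.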

Operation Options

    • no_action_first_policy: if you look at line 1 in Fig.2, it states that the expert annotation should be part of the initial state. However, despite the fact that learning this first policy can be very easy since the target is part of the state, the rollout of this initial policy can drift very far from what it is supposed to be due to the distribution mismatch between training and prediction data, to the point of making learning unfeasible (at least this is what happened to my data).

This option will build the first state using only environment features. That way, the rollout of the first policy will be very close to what you'd get using pure supervised learning (and a little smoother).

    • look_future: this option allows the algorithm to consider τ/2 environment features in the future and τ/2 environment features in the past when making predictions at time t, instead of the original τ past environment features.

    • only_env_feat: this option performs one round of supervised learning using only the environment features. I personally use this option as an important step to assess the predictive power of the environment features before proceeding with simile.

If the results of this step are not good, you should probably change a few model parameters, model architecture, or get better features before proceeding.

    • normalize_input and normalize_output: these are especially useful if you're using a neural net as your supervised model class and your data varies a lot in range: normalizing makes the cost function easier to optimize.
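To illustrate the look_future option described above, here is a minimal sketch of which frames contribute environment features to the prediction at time t. The helper name, the inclusion of the current frame, and the clamping at the start and end of the video are my assumptions.

```python
def context_window(t, tau, n_frames, look_future=False):
    """Indices of the frames whose features feed the prediction at time t."""
    if look_future:
        half = tau // 2
        first, last = t - half, t + half  # tau/2 past and tau/2 future frames
    else:
        first, last = t - tau, t  # tau past frames plus the current one
    # Clamp to the valid range, i.e. repeat boundary frames (an assumption).
    return [min(max(i, 0), n_frames - 1) for i in range(first, last + 1)]


print(context_window(5, 4, 100))
print(context_window(5, 4, 100, look_future=True))
```

Looking into the future means those frames must already be available, so this option fits offline editing (or editing with a small delay) rather than strictly online prediction.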

Test Parameters

Data Parameters

  • policy_load: defines which policy to load during test time. From Fig. 2, you can see that Simile will train multiple policies and interpolate their results during prediction time. However, the last policy trained is not necessarily the best policy. In fact, most of the time it is not. Simply relying on mean squared error to choose the best policy is not enough either, since a policy can often be more accurate but much less smooth. Therefore, you have two options:

    • Look over your resulting plots, pick the policy you think works best and provide the policy number of your choice to be loaded.

    • best_policy : the code will try to estimate the best policy based on some measure of roughness and accuracy and automatically select one for you. This option worked pretty well for me, but it is not 100% correct every time.
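The exact measure behind best_policy is not described here, so the following is only a sketch of the idea of combining accuracy and roughness. The roughness definition (mean squared frame-to-frame change), the weighting, and all function names are my assumptions.

```python
import numpy as np


def roughness(actions):
    """Mean squared frame-to-frame change of an action sequence."""
    actions = np.asarray(actions, dtype=float)
    return float(np.mean(np.diff(actions, axis=0) ** 2))


def imitation_error(actions, expert):
    """Mean squared error against the expert demonstration."""
    diff = np.asarray(actions, dtype=float) - np.asarray(expert, dtype=float)
    return float(np.mean(diff ** 2))


def pick_policy(rollouts, expert, weight=1.0):
    """Return the index of the rollout with the lowest error + weight * roughness."""
    scores = [imitation_error(r, expert) + weight * roughness(r) for r in rollouts]
    return int(np.argmin(scores)), scores


expert = np.zeros(4)
jittery = np.array([0.1, -0.1, 0.1, -0.1])  # closer to the expert, but oscillating
steady = np.array([0.15, 0.15, 0.15, 0.15])  # slightly off, but perfectly smooth
print(pick_policy([jittery, steady], expert))
```

With the roughness term included, the steady rollout wins even though its mean squared error is higher, which is the behavior you usually want for video framing. It is still worth checking the plots yourself.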


I would like to express my deepest gratitude to Prof. Yisong Yue, Prof. Pietro Perona and Hoang M. Le for their guidance throughout this project.

I would also like to thank Quiet Machines LLC for providing us with the labelled videos that made this project possible, as well as allowing us to release some of the video results.

Luciana Cendon

I am a Research Engineer working with Machine Learning and Computer Vision. I particularly enjoy working at the intersection between research and real-world applications, turning state-of-the-art algorithms into practical solutions.