Smooth Imitation Learning for Automated Video Editing

The goal of this page is to describe a project where I successfully trained a Machine Learning model using the Smooth Imitation Learning (Simile) algorithm to do automated video editing, given expert demonstrations as labels for training. This project also comes with a library that allows anyone to use this algorithm with their own data. You can find the code here.

I start by describing a few details of the proposed problem and how Simile was a good solution for it. Later, I discuss a few details about how the algorithm works and how to adapt it to your own application using the Simile library.

But before I delve into more details, I would like to show a few results on the example videos below. The input videos to the system correspond to the bottom frames (wide-angle videos from a static camera), while top frames correspond to the edited videos, edited by the AI model in a completely automated form. You can see that the resulting edited videos not only have reasonable framing w.r.t where the action is, but also the frame transitions are smooth, so the resulting videos are aesthetically appealing to the viewer.

If you find any of this interesting, please read on for more details.

AI-Edited Video 1

AI-Edited Video 2

AI-Edited Video 3

System Overview

Let's start with an overview of how the whole system works. The system takes as input wide-angle videos from a static camera placed at a fixed location at the running track, and outputs edit parameters (Zoom, X, Y) on a frame-by-frame basis. These parameters are used to edit the original videos and create edited videos as output, focusing on where the action is.

Basic Components:

    • Training Data:
      • Input Videos: 47 videos of ~40s each (~32min total)
      • Expert Demonstrations: 3-dimensional labels (Zoom, X, Y) for each frame, enough to reproduce the framing.

    • Feature Extraction: extracts features from the input videos on a frame-by-frame basis. These features are used by the Machine Learning model as environment features, to help in the prediction of an adequate video framing (Zoom, X, Y).

In this project, I used a pre-trained CNN model for people detection, and after some processing aimed at filtering non-runner detections out of the pool, I used the position percentiles of the centroids of the bounding boxes of remaining detections at each frame as my final features. More specifically, I chose seven position percentiles to represent their dispersion on the running field (T0, T10, T25, T50, T75, T90, T100), therefore, 7 features to describe the environment.

It is worth noting that this is by no means an ideal representation of the action in the frame. Not only does it lack semantic meaning about what the action really represents, but it is also noisy, since there are multiple non-runner people in the scene. A more semantic set of features would be required to make this solution more general. However, it worked well for this specific problem as a proof of concept. A more in-depth discussion of feature representation is beyond the scope of this page.
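As a rough illustration of the percentile features described above, the sketch below turns one frame's filtered detections into seven values. The box format, the use of the horizontal centroid, the zero fallback for empty frames, and the name percentile_features are all assumptions made for this example, not details taken from the project code.

```python
import numpy as np

# Percentile levels named in the text (T0 ... T100).
PERCENTILES = (0, 10, 25, 50, 75, 90, 100)

def percentile_features(boxes):
    """Map one frame's runner boxes to 7 position-percentile features.

    `boxes` is a hypothetical (N, 4) array of [x_min, y_min, x_max, y_max]
    detections from which non-runners were already filtered out.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    if len(boxes) == 0:
        # Assumption: with no detections, fall back to zeros.
        return np.zeros(len(PERCENTILES))
    # Horizontal centroid of each box (assumes the action spreads along x).
    cx = 0.5 * (boxes[:, 0] + boxes[:, 2])
    return np.percentile(cx, PERCENTILES)

frame_boxes = [[100, 50, 140, 150], [300, 60, 340, 160], [520, 55, 560, 150]]
features = percentile_features(frame_boxes)
print(features)
```

The minimum, median and maximum capture where the group of runners is, while the intermediate percentiles describe how spread out it is.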

    • Smooth Machine Learning Model: the model is trained using the features extracted from the videos (environment information), and expert demonstrations as a target.

A key requirement for this machine learning model is that predictions need to be smooth, i.e., the model needs to learn to reproduce the expert demonstrations as accurately as possible while also taking into account the edits it previously made, so that the framing of consecutive frames stays close, resulting in fluid transitions that are appealing to the viewer.

Supervised Learning Baseline

The video on the left illustrates the need for a Smooth Machine Learning model. This baseline model was trained using simple supervised learning on the environment features with some history information.

For this case, I chose to keep track of the features of the past 10 frames (τ=10), such that the input space contained 7*10=70 features arranged as:

X = [x_t, x_{t-1}, ..., x_{t-10}]

You can see that the model has some amount of predictive power, since the framing looks fairly reasonable. However, the frame transitions are not smooth, since the model does not take into account the previous actions taken by the policy as part of its decision-making.

Smooth Imitation Learning

The algorithm was developed by Hoang M. Le, from Prof. Yisong Yue's group at Caltech. You can read all about it in their paper, which is very detailed and well written. In short, this algorithm allows one to train policies that are constrained to make smooth predictions in a continuous action space, given sequential input from an exogenous environment and the previous actions taken by the policy. The resulting policies are able to generate smooth action sequences in response to a context sequence in an online fashion.

This framework fits very well with the problem of automated video editing, in which we'd like to make sequential predictions for the 3 editing parameters while also making sure these predictions are close to recent previous edits, given the context sequence. Beyond automated video editing, the algorithm can also find good applications in problems such as smooth self-driving vehicles for obstacle avoidance, helicopter aerobatics in the presence of turbulence, smart grid management for external energy demand, automated camera planning [Chen et al., 2016], among others.

In this section, I'll go over a few important aspects of the paper. I hope this will help you get started with adapting this algorithm to your own application.

Simile Problem Formulation

Considering X = {x_1, …, x_T} ⊂ χ^T to be a context sequence from an environment χ (in my case, the features from the video), and considering A = {a_1, …, a_T} ⊂ A^T to be an action sequence from an action space A (in my case, the 3 edit parameters [Zoom, X, Y]), the state space is defined in the paper as S = { s_t = [x_t, a_{t-1}] }, such that policies can be viewed as mapping states to actions, π : S → A. In other words, a policy π generates an action sequence A in response to a context sequence X.

The rollout of a policy is given by:

a_t = π(s_t) = π([x_t, a_{t-1}])

s_{t+1} = [x_{t+1}, a_t], ∀t ∈ [1, …, T]

You can see from the way the rollout is formalized that the prediction a_t from the current state s_t is part of the next state s_{t+1}, which will in turn generate a prediction a_{t+1} that will be part of the subsequent state. This is how the sequence prediction works, and since each state s_t contains information about the previous action, it is possible to enforce that a_t stays close to a_{t-1}. To make this closeness more formal, the authors of the paper introduce the concept of a smooth policy class, which I'll go over in the next section.
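The rollout described above can be sketched in a few lines. This is only an illustration: the function name rollout, the a_init argument and the toy policy are assumptions made for the example, and the state here contains a single previous action.

```python
import numpy as np

def rollout(policy, contexts, a_init):
    """Roll a policy over a context sequence, feeding each action back in.

    policy   : callable mapping a state vector [x_t, a_{t-1}] to an action a_t
    contexts : (T, d) array of environment features x_1 ... x_T
    a_init   : (k,) action used as the "previous action" of the first state
    """
    prev = np.asarray(a_init, dtype=float)
    actions = []
    for x_t in np.asarray(contexts, dtype=float):
        s_t = np.concatenate([x_t, prev])            # s_t = [x_t, a_{t-1}]
        prev = np.asarray(policy(s_t), dtype=float)  # a_t = pi(s_t)
        actions.append(prev)
    return np.stack(actions)

# Toy policy: move halfway from the previous action toward the context value.
def toy_policy(s):
    x, a_prev = s[:1], s[1:]
    return 0.5 * x + 0.5 * a_prev

contexts = np.ones((4, 1))
actions = rollout(toy_policy, contexts, a_init=[0.0])
print(actions.ravel())
```

Because each action is fed back into the next state, an early mistake changes every state that follows it, which is why rollouts can drift away from the expert demonstration.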

With that being said, the goal of this imitation learning problem is to find a policy π̂ ∈ Π that minimizes the imitation loss with respect to the expert policy π*, considering Π to be a smooth policy class:

π̂ = argmin_{π∈Π} E[ ‖π(s) - π*(s)‖^2 ]

Smooth Policy Class

Given F to be any complex supervised model class (e.g., neural networks, decision trees), and H to be the space of smooth analytic functions (e.g., linear auto-regressors), the authors define the smooth policy class (Π) as Π ⊂ F×H such that:

Π≜ {π =(f,h), f ∈ F, h ∈ H | π(s) is close to both f(x, a) and h(a), ∀ induced state s = [x , a] ∈ S}

In other words, the prediction of a smooth policy is bounded to be close to both the prediction of your supervised learning model and the prediction of your smooth function by definition.

In order to solve that, the authors integrated alternating optimization between F and H, such that the prediction of policy π can be viewed as regularized optimization over the action space to ensure closeness of π to both f and h:

π(x, a) = argmin_{a′∈A} ‖f(x, a) - a′‖^2 + λ ‖h(a) - a′‖^2

Solving this minimization in closed form gives:

π(x, a) = (f(x, a) + λ h(a)) / (1 + λ)

You can see from the resulting π(x, a) that λ trades off closeness to f against closeness to previous actions. For large λ, the policy is encouraged to make predictions that stay close to previous actions a (therefore smoother), while for smaller values of λ, the policy is encouraged to make predictions that are closer to the predictions of the supervised model (less smooth, but sometimes more accurate). You can see an example of how that formula works in Fig. 1 on the right.
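To make the role of λ concrete: setting the gradient of the regularized objective to zero gives 2(a′ - f) + 2λ(a′ - h) = 0, so a′ is the weighted average of f and h with weights 1 and λ. The sketch below computes that average and checks numerically that it minimizes the objective. The function names and the numbers are made up for illustration.

```python
import numpy as np

def smooth_policy_action(f_pred, h_pred, lam):
    """Closed-form minimizer of ||f - a'||^2 + lam * ||h - a'||^2 over a'."""
    f_pred = np.asarray(f_pred, dtype=float)
    h_pred = np.asarray(h_pred, dtype=float)
    return (f_pred + lam * h_pred) / (1.0 + lam)

def objective(a, f_pred, h_pred, lam):
    """Value of the regularized objective at a candidate action a."""
    return np.sum((f_pred - a) ** 2) + lam * np.sum((h_pred - a) ** 2)

f_pred = np.array([1.0, 0.2, 0.8])  # supervised model output (Zoom, X, Y)
h_pred = np.array([0.0, 0.0, 1.0])  # smooth regularizer output
for lam in (0.0, 1.0, 10.0):
    print(lam, smooth_policy_action(f_pred, h_pred, lam))
```

With λ=0 the policy returns the supervised prediction unchanged, and as λ grows the output moves toward the regularizer's prediction.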

Fig. 1 : The graph above is meant to provide some intuition on how π(x, a) works during prediction time. a* is the expert demonstration (ground truth), f(x,a) is the output of the supervised model, h(a) is the output of a linear auto regressor trained to predict current action based on previous actions, and π(x,a) is the final prediction of the policy, for λ=10

Supervised Learning Reduction and Policy Update

Fig.2 on the right shows how Simile works, step-by-step.

At a high level, Simile is a boosting algorithm in which, at each training iteration (n = 1, ..., N), a new policy is trained and added to the ensemble. The new policy π_n is a linear interpolation between the previous π_{n-1} and the newly learned π̂_n:

π_n = β π̂_n + (1 - β) π_{n-1}

The parameter β is adaptively selected based on the relative empirical loss of π_{n-1} and π̂_n w.r.t. the ground truth. Therefore, the value of β at each learning iteration reflects the quality of π̂_n, which encourages the learner to disregard bad policies and converge to a good policy.
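Here is a small sketch of how such an adaptive β could behave. The text only says that β depends on the relative empirical loss of the two policies, so the specific ratio used below (previous error divided by the sum of both errors) is an assumption for this example, as are the function names. The interpolation is applied to the two policies' predictions for the same states.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between a prediction and the expert actions."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def adaptive_beta(err_new, err_prev, eps=1e-12):
    """Weight of the new policy: larger when it beats the previous one."""
    return err_prev / (err_new + err_prev + eps)

def interpolate(pred_new, pred_prev, beta):
    """beta * new + (1 - beta) * previous, applied to predictions."""
    return beta * np.asarray(pred_new) + (1.0 - beta) * np.asarray(pred_prev)

expert = np.array([0.0, 1.0, 2.0])
pred_new = expert + 0.1    # newly learned policy, small error
pred_prev = expert + 0.3   # previous policy, larger error

beta = adaptive_beta(mse(pred_new, expert), mse(pred_prev, expert))
blended = interpolate(pred_new, pred_prev, beta)
print(beta, blended)
```

In this toy case the new policy is much more accurate, so β comes out close to 1 and the blend leans heavily toward it. A poor new policy would get a β close to 0 and be mostly ignored.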

Moreover, the authors reduced the smooth imitation learning problem to a supervised learning problem. In fact, you can see the supervised learning reduction in lines 7-8, where at each training iteration, there's a two-step procedure:

    • Update smooth regularizer h_n
    • Train new π̂_n via supervised learning

One important thing to note is that the algorithm shown in Fig. 2 implies that the smooth regularizer is a linear auto-regressor, since it states that at lines 1 and 7 a linear model should be trained to predict action a_t based on previous actions. However, this is not a requirement: you could also use simple functions of previous actions to predict a_t, without any training required. In fact, the type of smooth regularizer function is an option you can choose in the Simile library, in which case the steps in lines 1 and 7 may or may not occur depending on your choice.

Fig. 2 : Simile algorithm

Smooth Feedback

Fig. 3: Smooth feedback example

This step happens on line 6 of the algorithm shown in Fig. 2. You can see that before each learning step, a "virtual" target is created and used as the target for the learning step instead of the actual expert demonstration (a*). The closeness of this virtual target (â) to the actual target (a*) is controlled by a parameter σ.

You can see intuitively how this works in Fig. 3. Consider the orange curve to be the rollout from a policy (a_n), and the black curve to be the expert demonstration (a*). When the rollout differs substantially from a*, the policy's mistakes lead to the formation of "bad" states, because the previous actions taken by the policy are part of the state, which can make learning infeasible if a* is used as the target. This can happen especially during early learning iterations, when the policy has not yet learned to imitate the expert well.

Using virtual targets that lie between the rollout (a_n) and a* can make training "easier" for the learner during early iterations and allow for more stable learning. The parameter σ should decay exponentially as n increases, such that when n is large enough, σ→0 and â→a* as a result.
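A minimal sketch of a virtual target that lies between the rollout and the expert demonstration. The linear blend below is an assumption that matches the behavior described above (σ→0 gives â→a*). The function name and the sample values are invented for the example.

```python
import numpy as np

def virtual_target(rollout_actions, expert_actions, sigma):
    """Blend the rollout with the expert: sigma=0 returns the expert itself."""
    rollout_actions = np.asarray(rollout_actions, dtype=float)
    expert_actions = np.asarray(expert_actions, dtype=float)
    return sigma * rollout_actions + (1.0 - sigma) * expert_actions

expert = np.array([0.0, 0.2, 0.4, 0.6])     # a*
rolled = np.array([0.0, 0.6, 1.0, 1.2])     # a_n, drifting away from a*
for sigma in (0.8, 0.4, 0.0):
    print(sigma, virtual_target(rolled, expert, sigma))
```

With a large σ the learner is asked for a small correction from where the rollout currently is, and as σ shrinks over the iterations the target moves toward the true expert demonstration.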

SIMILE Library and your own application

The Simile library allows you to use this algorithm to train your own policies using your own data. The library takes as input config files with training parameters of your choice as well as paths to your data. You can find more information about how to prepare these config files in the library documentation.

As you may imagine, choosing these parameters correctly requires a good understanding of how the algorithm works. Also, the best choice of parameters should vary according to the application and dataset. Therefore, this cannot be an off-the-shelf library that you can simply feed your data into and expect good results, but more like a tool that can help you train your policies and analyze your results.

I hope the previous sections gave you some intuition to help you get started with your own project. In this section, I'll go over more details of a few parameter choices.

Training Parameters

Data Parameters

  • N_features: As environment features, I used the position percentiles of the centroids of the bounding boxes of detected people at each frame (after some outlier filtering). I chose seven position percentiles to represent their dispersion on the running field (T0, T10, T25, T50, T75, T90, T100), therefore, 7 features.
  • N_targets: my action space was 3 dimensional since I wanted to predict 'Zoom', 'X', 'Y' simultaneously. Therefore, 3 targets.
  • model_dir: directory where to save your model files. I would strongly recommend a dedicated directory for each model you train. During training multiple files get generated, and having multiple models on the same directory can cause a few conflicts and possibly overwrite a few important files if you're not careful.
  • init_value: when running Simile in prediction mode, you are required to specify the first value of the sequence. With that in mind, and depending on the nature of your data, you could choose to train your model using a single pre-defined initial value for all episodes during rollout, which would make prediction mode easier since you can reuse this same pre-defined initial value, or you could choose to take the initial values from the expert annotations. The latter would work well for training, but you'd need to be careful and make sure that you have initial sequence values available during prediction.

With that being said, this parameter is optional. If you provide it, this will be your pre-defined initial value for all episodes. If not provided, it will automatically take them from the data.

SIMILE Parameters

  • tao: time horizon (τ). Represents how many features from the past you wish to include as part of the state. You can see how τ affects the state formation in Fig. 2, lines 1 and 4. As an example, if τ=3, the state at time t would be defined as:

s_t = [x_t, x_{t-1}, x_{t-2}, x_{t-3}, a_{t-1}, a_{t-2}, a_{t-3}]
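The state layout in the τ=3 example can be reproduced with simple array slicing. This is a sketch under assumptions: build_state is a hypothetical helper, row i of each array holds the value at time i, and time steps without enough history are rejected rather than padded.

```python
import numpy as np

def build_state(contexts, actions, t, tau):
    """State at time t: x_t ... x_{t-tau} followed by a_{t-1} ... a_{t-tau}."""
    if t < tau:
        raise ValueError("need at least tau past steps to build the state")
    ctx = contexts[t - tau : t + 1][::-1]  # x_t, x_{t-1}, ..., x_{t-tau}
    act = actions[t - tau : t][::-1]       # a_{t-1}, ..., a_{t-tau}
    return np.concatenate([ctx.ravel(), act.ravel()])

# 10 frames with 7 environment features and 3 edit parameters each.
contexts = np.arange(70.0).reshape(10, 7)
actions = np.arange(30.0).reshape(10, 3)
state = build_state(contexts, actions, t=5, tau=3)
print(state.shape)
```

With 7 features and 3 targets, τ=3 gives a state of 4*7 + 3*3 = 37 values, so the state size grows quickly with τ.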

  • sigma: this parameter (σ_0) corresponds to the smooth feedback on the first policy. In this implementation, this parameter decays exponentially with the number of iterations according to this formula:

σ = σ_0 / 1.6^{n-1}, where n is the learning iteration number.

σ_0 should be a number in the interval [0, 1]. Correctly choosing this parameter will depend on your data. If the rollouts at the first learning iterations look bad and do not get better after more learning iterations, increasing this parameter may (or may not) solve the problem by slowly bringing the rollout close to the expert demonstrations and making the learning problem more stable.
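To get a feel for how fast σ shrinks, the decay formula above can be tabulated directly. The helper name sigma_schedule is made up for this example, and the base 1.6 comes from the formula in the text.

```python
def sigma_schedule(sigma0, n_iterations, base=1.6):
    """Return sigma for iterations n = 1 ... n_iterations."""
    if not 0.0 <= sigma0 <= 1.0:
        raise ValueError("sigma0 should lie in [0, 1]")
    return [sigma0 / base ** (n - 1) for n in range(1, n_iterations + 1)]

schedule = sigma_schedule(0.8, 6)
for n, sigma in enumerate(schedule, start=1):
    print(n, round(sigma, 4))
```

Starting from σ_0 = 0.8, the value falls below 0.1 by the sixth iteration, so the smooth feedback mostly affects the first handful of policies.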

  • lambd_smooth: regularization parameter (λ) that determines how close the prediction should be to either the supervised model prediction (f) or smooth function (h), according to this relationship (also shown in Fig. 1):

π(x, a) = (f(x, a) + λ h(a)) / (1 + λ)

  • n_it: number of learning iterations; corresponds to the 'N' on line 3 of Fig. 2.
  • policy_type: choice of supervised model class. In this implementation, you can choose between:
    • neuralnet: simple 2-layer feedforward neural network. If you wish to add more layers (or remove layers), you can do so by changing function "build_network()" inside NeuralNet class.
    • xgboost: uses parallel tree boosting model from XGBoost library.

One important difference between these two choices is the fact that XGBoost does not support multi-output regression. Therefore, if you have a multi-dimensional target, XGBoost will automatically train and predict one regressor per target. As a result, this choice will not leverage potential relationships between your targets that could help with prediction. So depending on your data, you may be better off using a feedforward neural network.

Moreover, the code is modular enough that, in case you wish to write your own supervised model class, you can easily do so by creating classes similar to NeuralNet and XGBoost: you'd basically need to write a "train" function and a "get_raw_prediction" function for your custom class.
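As an illustration of what such a custom class might look like, here is a ridge-regression model exposing the two method names mentioned above. Only the names "train" and "get_raw_prediction" come from the text. The argument lists, the class name RidgeModel and everything inside the methods are assumptions, so check the NeuralNet and XGBoost classes in the library for the actual interface.

```python
import numpy as np

class RidgeModel:
    """Hypothetical custom model class (signatures are assumed, not the library's)."""

    def __init__(self, alpha=1e-3):
        self.alpha = alpha      # L2 regularization strength
        self.weights = None

    @staticmethod
    def _with_bias(states):
        # Append a column of ones so the model can learn an offset.
        return np.hstack([states, np.ones((len(states), 1))])

    def train(self, states, targets):
        X = self._with_bias(np.asarray(states, dtype=float))
        gram = X.T @ X + self.alpha * np.eye(X.shape[1])
        self.weights = np.linalg.solve(gram, X.T @ np.asarray(targets, dtype=float))

    def get_raw_prediction(self, states):
        return self._with_bias(np.asarray(states, dtype=float)) @ self.weights

states = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
targets = 2.0 * states + 1.0
model = RidgeModel()
model.train(states, targets)
preds = model.get_raw_prediction(states)
print(preds[:3].ravel())
```

Unlike one-regressor-per-target approaches, this kind of model fits all target dimensions in a single solve, although a linear model like this one still treats each output column independently.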

  • autoreg_type: choice of smooth regularizer. You can choose between:
    • average: takes the average of previous τ-1 actions
    • linear: prediction of an auto-regressor trained to predict the next action given τ-1 actions. If this option is chosen, then lines 1 and 7 from Fig. 2 will be executed, i.e., before each training step, a linear auto-regressor will be trained to predict the action at time t given the previous τ-1 actions.
    • constant: uses the action taken immediately previous to the one it is trying to predict.
    • geometric_velocity: uses a weighted velocity relationship. For τ=5:

a_t = a_{t-1} + γ (a_{t-1} - a_{t-2}) + γ^2 (a_{t-2} - a_{t-3}) + γ^3 (a_{t-3} - a_{t-4})
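The training-free regularizers in the list above are simple functions of the recent actions, and a sketch of three of them follows. The function names are invented, `prev` is assumed to hold the previous actions ordered from most recent to oldest, and the geometric velocity version assumes that each power γ^k weights the difference between consecutive actions (a_{t-k} - a_{t-k-1}).

```python
import numpy as np

def h_constant(prev):
    """Use the most recent action as the prediction."""
    return prev[0]

def h_average(prev):
    """Average of the previous actions."""
    return prev.mean(axis=0)

def h_geometric_velocity(prev, gamma):
    """a_{t-1} plus geometrically discounted past velocities."""
    pred = prev[0].copy()
    for k in range(1, len(prev)):
        # prev[k - 1] - prev[k] is the velocity a_{t-k} - a_{t-k-1}
        pred += gamma ** k * (prev[k - 1] - prev[k])
    return pred

# One-dimensional actions a_{t-1}=3, a_{t-2}=2, a_{t-3}=1, a_{t-4}=0.
prev = np.array([[3.0], [2.0], [1.0], [0.0]])
print(h_constant(prev), h_average(prev), h_geometric_velocity(prev, 0.5))
```

For an action that has been increasing steadily, the constant option predicts no movement, the average lags behind, and the geometric velocity option extrapolates the trend forward.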

Operation Options

    • no_action_first_policy: if you look at line 1 in Fig. 2, it states that the expert annotation should be part of the initial state. However, although learning this first policy can be very easy since the target is part of the state, the rollout of this initial policy can drift very far from what it is supposed to be due to the distribution mismatch between training and prediction data, to the point of making learning infeasible (at least this is what happened with my data).

This option will build the first state using only environment features. That way, the rollout of the first policy will be very close to what you'd get using pure supervised learning (and a little smoother).

    • look_future: this option allows the algorithm to consider τ/2 environment features in the future and τ/2 environment features in the past when making predictions at time t, instead of the original τ past environment features.
    • only_env_feat: this option performs one round of supervised learning using only the environment features. I personally use this option as an important step to assess the predictive power of the environment features before proceeding with simile.

If the results of this step are not good, you should probably change a few model parameters, model architecture, or get better features before proceeding.

    • normalize_input and normalize_output: this is especially useful if you're using a neural net as your supervised model class and your data varies a lot in range: normalizing makes the cost function easier to optimize.

Test Parameters

Data Parameters

  • policy_load: defines which policy to load during test time. From Fig. 2, you can see that Simile will train multiple policies and interpolate their results during prediction time. However, the last policy trained is not necessarily the best policy. In fact, most of the time it is not. And simply relying on mean squared error to choose the best policy is not enough, since a policy can often be more accurate but much less smooth. Therefore, you have two options:
    • Look over your resulting plots, pick the policy you think works best and provide the policy number of your choice to be loaded.
    • best_policy : the code will try to estimate the best policy based on some measure of roughness and accuracy and automatically select one for you. This option worked pretty well for me, but it is not 100% correct every time.
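To illustrate why mean squared error alone is not enough, the sketch below scores each policy's rollout by accuracy plus a roughness penalty. This is a hypothetical scoring rule written for this example. The text does not specify the measure used by best_policy, so the function names, the roughness definition and the weight are all assumptions.

```python
import numpy as np

def roughness(actions):
    """Mean squared frame-to-frame change of an action sequence."""
    return float(np.mean(np.diff(actions, axis=0) ** 2))

def select_policy(rollouts, expert, weight=1.0):
    """Return the index of the rollout with the lowest error + roughness score."""
    scores = []
    for a in rollouts:
        err = float(np.mean((a - expert) ** 2))
        scores.append(err + weight * roughness(a))
    return int(np.argmin(scores)), scores

expert = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
jitter = 0.05 * (-1.0) ** np.arange(11).reshape(-1, 1)
rough = expert + jitter    # same error as `smooth`, but it oscillates
smooth = expert + 0.05     # constant offset, no extra jitter

best_idx, scores = select_policy([rough, smooth], expert)
print(best_idx, scores)
```

Both rollouts have the same mean squared error here, but the oscillating one would look shaky on video, and the roughness term is what separates them. It is still worth checking the plots by eye before trusting any automatic score.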


I would like to express my deepest gratitude to Prof. Yisong Yue, Prof. Pietro Perona and Hoang M. Le for all their guidance and assistance throughout this project.

I would also like to thank Quiet Machines LLC for providing us with the labelled videos which made this project possible, and also for allowing us to release some of the video results.

Luciana Cendon

I am a Research Engineer with 3 years of experience in the fields of Machine Learning and Computer Vision, during which I had the opportunity to work in both industry and academic environments. I especially love working at the intersection between research and real-world applications, turning state-of-the-art algorithms into practical solutions.