Towards End-to-End Generative Modeling of Long Videos
with Memory-Efficient Bidirectional Transformers