We often batch data for training and inference. However, the items in a batch may have different lengths; sentences, for example, vary in length. The usual fix is to pad the shorter items up to the length of the longest item in the batch. In this article we will explore the probability distribution of the batched data, and ways to bucket it to reduce the variation in lengths within a batch.
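To make the cost of padding concrete, here is a minimal sketch in Python. It assumes lengths drawn uniformly from 1 to 100 and a batch size of 32, both picked purely for illustration, and it uses sort-then-chunk as a simple stand-in for bucketing. The helper name `padded_cost` is hypothetical.

```python
import random


def padded_cost(lengths, batch_size):
    """Total number of slots used when each batch is padded to its longest item.

    Batches are formed from consecutive runs of `lengths` in the given order.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every item in the batch occupies max(batch) slots after padding.
        total += max(batch) * len(batch)
    return total


random.seed(0)
# Assumption: synthetic lengths, uniform on [1, 100]; real data will differ.
lengths = [random.randint(1, 100) for _ in range(1024)]
batch_size = 32

real_tokens = sum(lengths)                           # slots holding actual data
naive = padded_cost(lengths, batch_size)             # batches in arrival order
bucketed = padded_cost(sorted(lengths), batch_size)  # similar lengths grouped

print(f"real tokens:         {real_tokens}")
print(f"padded (unsorted):   {naive}")
print(f"padded (sorted):     {bucketed}")
print(f"wasted (unsorted):   {1 - real_tokens / naive:.1%}")
print(f"wasted (sorted):     {1 - real_tokens / bucketed:.1%}")
```

Sorting the whole dataset removes the randomness between batches, which may not be acceptable for training. Coarser buckets, with shuffling inside each bucket, trade some padding for more randomness.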
Let X be the discrete random variable denoting the length of a datum.
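As one possible starting point, the padded length of a batch can be written in terms of X. The sketch below assumes that a batch holds B lengths X_1, ..., X_B drawn independently from the same distribution as X; the symbols B, M and F_X are introduced here for illustration and are not fixed by the text above.

```latex
% Assumption: X_1, \dots, X_B are i.i.d. copies of X, with CDF F_X(x) = P(X \le x).
% The padded length of the batch is the maximum length in it:
M = \max(X_1, \dots, X_B)

% The maximum is at most m exactly when every length is at most m:
P(M \le m) = \prod_{i=1}^{B} P(X_i \le m) = F_X(m)^B

% Since X takes integer values, the probability mass function of M is
P(M = m) = F_X(m)^B - F_X(m-1)^B

% Expected number of padding slots in one batch:
\mathbb{E}\left[\sum_{i=1}^{B} (M - X_i)\right] = B\left(\mathbb{E}[M] - \mathbb{E}[X]\right)
```

Under this assumption, the waste depends on how far the expected maximum sits above the expected length, which is the gap that bucketing tries to shrink.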
TODO: Finish