The following two figures show the structure of the framework and the video player.
The video player consists of three components: rate adaptation, the decoder, and the display. The rate adaptation algorithm decides which video version/layer to request, sends out HTTP requests for the segments, and saves the received segments in the segment buffer. The decoder decodes the received segments as fast as possible and saves the pictures in the picture buffer. The display simply shows the pictures on the screen at a constant frame rate.
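The three-component pipeline above can be sketched as follows. This is a minimal illustration, not the real player: the buffers are plain queues, and the segment/picture representations are invented for the example.

```python
from collections import deque

FRAMES_PER_SEGMENT = 17

segment_buffer = deque()   # filled by the rate-adaptation/downloader component
picture_buffer = deque()   # filled by the decoder, drained by the display

def download(segment_id, layers):
    """Rate adaptation: request the chosen layers over HTTP and buffer the segment."""
    segment_buffer.append({"id": segment_id, "layers": layers})

def decode():
    """Decoder: decode buffered segments into pictures as fast as possible."""
    while segment_buffer:
        seg = segment_buffer.popleft()
        for frame in range(FRAMES_PER_SEGMENT):
            picture_buffer.append((seg["id"], frame))

def display_one_frame():
    """Display: called once per frame interval (24 fps); shows one picture."""
    return picture_buffer.popleft() if picture_buffer else None  # None = stall

download(0, layers=["base"])
decode()
first = display_one_frame()
```

If the picture buffer runs dry because downloading or decoding cannot keep up, the display stalls, which is exactly what rate adaptation tries to avoid.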
We use a specific example to show the process of adaptive scalable video streaming over HTTP.
The source video "Big Buck Bunny" (1080x720) is downsampled to two smaller resolutions, 640x360 and 320x180. Each video is then split into small segments of 17 frames. Since the frame rate is 24 frames/second, each segment has a duration of about 0.7 seconds. We use the first 3400 frames.
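The segmentation numbers work out as follows (the constants come straight from the text):

```python
FPS = 24                 # frame rate of the source video
FRAMES_PER_SEGMENT = 17  # frames per segment
TOTAL_FRAMES = 3400      # portion of the video used

segment_duration = FRAMES_PER_SEGMENT / FPS        # 17/24 ≈ 0.708 s
num_segments = TOTAL_FRAMES // FRAMES_PER_SEGMENT  # 3400/17 = 200 segments
```

So the 3400-frame clip yields exactly 200 segments of roughly 0.7 seconds each.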
Each segment is encoded with the SVC encoder (JSVM). In this example there are 3 resolutions, and each resolution has 2 quality layers. The following table summarizes the bitrate and video quality of the layers. The audio stream is not included here.
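With 3 resolutions and 2 quality layers per resolution, the encoding produces 6 layers in total. A small sketch of that layer enumeration (the ordering, lowest resolution first, is an assumption for illustration):

```python
resolutions = ["320x180", "640x360", "1080x720"]  # from the example
QUALITY_LAYERS_PER_RESOLUTION = 2

# Enumerate (resolution, quality-layer) pairs, lowest resolution first.
layers = [
    (res, q)
    for res in resolutions
    for q in range(QUALITY_LAYERS_PER_RESOLUTION)
]
```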
The extractor extracts the NAL units with the same layer index from the segment frames and stores them together in a layer segment file (click here to see a list of segments, or click backup if the previous link does not work). The NAL unit structure of each segment is also analyzed and stored as a text file (click here to see a sample file, or click backup if the previous link does not work).
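The grouping step the extractor performs can be sketched like this. The NAL-unit representation, a list of (layer_index, payload) pairs, is invented for illustration; the real extractor parses the SVC bitstream produced by JSVM.

```python
def extract_layers(nal_units):
    """Group a segment's NAL units by layer index, preserving decoding order.

    nal_units: list of (layer_index, payload_bytes) pairs.
    Returns one byte stream per layer, i.e. the content of each
    layer segment file.
    """
    layer_files = {}
    for layer, payload in nal_units:
        layer_files.setdefault(layer, bytearray()).extend(payload)
    return layer_files

# Toy segment: base-layer and enhancement-layer NAL units interleaved per frame.
segment = [(0, b"base0"), (1, b"enh0"), (0, b"base1"), (1, b"enh1")]
files = extract_layers(segment)
```

The client later re-interleaves the downloaded layer segments before decoding, which is why the per-segment NAL unit structure is also stored as a text file.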
Similar to progressive download, a simple HTTP server can be used as the streaming server. We choose lighttpd as the web server, which is also used by YouTube. The only option that needs to be configured carefully is server.max-keep-alive-requests, which determines how many requests can be processed before the server closes the connection. The client uses HTTP pipelining to request layer segments.
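With pipelining, the client writes several GET requests back-to-back on one persistent connection before reading any response; this is why server.max-keep-alive-requests matters. A sketch of how such a request batch is formed (host and paths are illustrative):

```python
def pipelined_requests(host, paths):
    """Build the byte stream of several HTTP/1.1 GET requests sent
    back-to-back on one keep-alive connection (pipelining)."""
    requests = []
    for path in paths:
        requests.append(
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: keep-alive\r\n"
            "\r\n"
        )
    return "".join(requests).encode("ascii")

# Request two layer segments of the same segment in one batch.
batch = pipelined_requests("example.com", ["/seg0_layer0", "/seg0_layer1"])
```

If server.max-keep-alive-requests is smaller than the batch size, the server closes the connection mid-batch and the client must reconnect, which hurts throughput.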
When the client starts playback, it first obtains information such as the number of layers and the average bitrate of the video.
Then the player/rate adaptation algorithm chooses the most suitable layers to satisfy the bandwidth, screen size, and computation capability requirements. For example, a smartphone has a small screen, so the "low" layer with resolution 320x180 is suitable; for a laptop with a larger screen and more decoding power, layers from "low" to "high" are requested to achieve the best viewing experience.
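The device-driven part of that choice can be sketched as below. The layer labels match the example; the width thresholds and the rule "pick the highest layer the screen can actually show" are assumptions for illustration, not the framework's actual policy.

```python
# (layer width in pixels, layer label), lowest first -- from the example's
# three resolutions 320x180, 640x360, 1080x720.
LAYERS = [
    (320, "low"),
    (640, "medium"),
    (1080, "high"),
]

def highest_useful_layer(screen_width):
    """Pick the highest layer whose resolution the screen can display fully;
    anything above that wastes bandwidth without improving what is seen."""
    chosen = LAYERS[0][1]  # always at least the base layer
    for width, label in LAYERS:
        if width <= screen_width:
            chosen = label
    return chosen

phone = highest_useful_layer(320)    # small smartphone screen
laptop = highest_useful_layer(1440)  # large laptop screen
```

A real rate adaptation algorithm would combine this screen-based cap with the measured bandwidth and the decoder's capability before deciding which layers to request.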