The Holy Grail quest for Ultra Low Latency Streaming Using CMAF
It all started long ago, sometime in 2000-2002, with a company called Macromedia. The goal was to stream audio, video, and data over the Internet in a way that allowed live streaming.
Live streaming refers to online streaming media simultaneously recorded and broadcast in real time. It is often referred to simply as streaming; however, this abbreviated term is ambiguous, because “streaming” may refer to any media delivered and played back simultaneously without requiring a completely downloaded file. Non-live media such as video-on-demand, vlogs, and YouTube videos are technically streamed, but not live streamed.
Live stream services encompass a wide variety of topics, from social media to video games. Platforms such as Facebook Live, Periscope, Kuaishou, and 17 include the streaming of scheduled promotions and celebrity events as well as streaming between users, as in videotelephony. Sites such as Twitch.tv have become popular outlets for watching people play video games, such as in eSports, Let’s Play-style gaming, or speedrunning.
User interaction via chat rooms forms a major component of live streaming. Platforms often include the ability to talk to the broadcaster or participate in conversations in chat. An extreme example of viewer interfacing is the social experiment Twitch Plays Pokémon, where viewers collaborate to complete Pokémon games by typing in commands that correspond to controller inputs.
The protocol used was the Real-Time Messaging Protocol (RTMP), served from Flash Media Server. Although Flash was the dominant platform for online multimedia content, it was slowly abandoned as Adobe favored a transition to HTML5, owing to inherent security flaws and the significant resources required to maintain the platform. Apple restricted the use of Flash on iOS in 2010 due to concerns that it performed poorly on its mobile devices, had a negative impact on battery life, and was deemed unnecessary for online content.
To the rescue came HTTP Live Streaming (HLS). HTTP Live Streaming is an HTTP-based adaptive bitrate streaming communications protocol implemented by Apple Inc. as part of its QuickTime, Safari, OS X, and iOS software. HLS, however, was anything but live in the sense of real time, with delays (latency) of up to 30 seconds between broadcaster and viewer, while RTMP could manage 2 or 3 seconds.
So the two internet giants, Apple and Microsoft, joined together to solve this issue, the Holy Grail quest for Ultra Low Latency, trying to bring down the delay between the broadcaster and the viewer: the Common Media Application Format (CMAF), the Big Bang of live streaming.
What is CMAF?
In 2016, Apple announced they would start supporting CMAF (the Common Media Application Format) within HLS. As its name suggests, the format's goal is to bring the HLS and MPEG-DASH formats together in order to simplify online video delivery. Contrary to popular belief, CMAF itself does not reduce latency. It provides a low latency mode in which media segments can be divided into even smaller chunks.
CMAF leverages the ISOBMFF fragmented MP4 (fMP4) container (ISO/IEC 14496-12). In the past, however, HLS used MPEG Transport Streams, which have served the broadcast and cable industries well in delivering continuous streams of data. Segmented media delivery, however, is not one of the format's strengths, incurring overhead ratios between 5% and 15%, far higher than fMP4. On top of that, the fMP4 format is easily extensible and is already used widely in DASH, Smooth Streaming, and HDS implementations.
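To make the container structure a bit more concrete, here is a minimal sketch in Python (the file name segment.m4s is a hypothetical placeholder) that walks the top-level ISOBMFF boxes of a fragmented MP4 segment. In a CMAF low latency segment you would typically see a repeating pattern of moof/mdat pairs, one pair per chunk, rather than one large mdat.

    import struct

    def walk_top_level_boxes(path):
        """Print the top-level ISOBMFF boxes of a fragmented MP4 (CMAF) segment.

        In a CMAF chunked segment each chunk shows up as a moof/mdat pair,
        so a single segment can contain many such pairs.
        """
        with open(path, "rb") as f:
            while True:
                header = f.read(8)                   # 4-byte size + 4-byte box type
                if len(header) < 8:
                    break
                size, box_type = struct.unpack(">I4s", header)
                name = box_type.decode("ascii", "replace")
                if size == 1:                        # 64-bit "largesize" follows the type
                    size = struct.unpack(">Q", f.read(8))[0]
                    payload = size - 16
                elif size == 0:                      # box runs to the end of the file
                    print(f"{name}  (runs to end of file)")
                    break
                else:
                    payload = size - 8
                print(f"{name}  {size} bytes")
                f.seek(payload, 1)                   # skip the payload; we only want box types

    # Hypothetical file name; any fMP4/CMAF segment would do.
    walk_top_level_boxes("segment.m4s")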
HTTP Adaptive Streaming & Chunked Transfer Encoding
CMAF comes with a low latency mode that allows you to split an individual segment into smaller chunks. One could ask: why split a segment at all, rather than just making the segments themselves smaller? CMAF requires a keyframe at the start of every segment; this is not the case for a chunk. A keyframe tends to be significantly larger than non-keyframes, so reducing the segment duration increases the keyframe rate and therefore the bandwidth needed to deliver the same quality.
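To see why, here is a purely illustrative back-of-the-envelope calculation; the frame sizes below are assumptions picked for the example, not measurements. Because every segment must begin with a keyframe, halving the segment duration roughly doubles the number of keyframes per second of video, and the average bitrate creeps up even though the visual quality stays the same.

    # Illustrative only: assumed frame sizes for a 1080p stream (not measured values).
    KEYFRAME_BYTES = 120_000     # a keyframe (IDR) is typically several times larger...
    DELTA_FRAME_BYTES = 15_000   # ...than a predicted (delta) frame
    FPS = 30

    def avg_bitrate_kbps(segment_seconds):
        """Approximate bitrate when every segment must begin with a keyframe."""
        frames = int(segment_seconds * FPS)
        bytes_per_segment = KEYFRAME_BYTES + (frames - 1) * DELTA_FRAME_BYTES
        return bytes_per_segment * 8 / segment_seconds / 1000

    for duration in (6, 2, 0.5):
        print(f"{duration:>4} s segments -> ~{avg_bitrate_kbps(duration):,.0f} kbps")

With these assumed numbers, 6-second segments come out around 3.7 Mbps, 2-second segments around 4.0 Mbps, and half-second segments around 5.3 Mbps, all for the same picture quality.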
Segments usually have a duration ranging from 2 to 6 seconds. Most streaming players keep a buffer of about three segments, with a fourth usually being downloaded, which is considered optimal for avoiding playback stalls. The reasoning here is that each segment needs to be encoded, listed in the manifest, downloaded, and added to the buffer as a whole, so with 2-6 second segments the buffer alone already accounts for roughly 6-18 seconds. This often results in latencies of 10-30 seconds.
By splitting segments into chunks, a streaming server can make the chunks within a segment available before the entire segment has been completed. Players can then download the individual chunks ahead of time and fill their buffers faster, which allows the end-to-end latency to be reduced significantly.
Just splitting into chunks is not enough to reduce latency, however. While producing chunks instead of whole segments allows the packager to publish faster (and to list a segment in the manifest as soon as its first chunk is ready), the other components of the pipeline must keep up as well. In practice this means the origin should expose the chunks using HTTP/1.1 chunked transfer encoding (or a similar technique over an alternative protocol). Similarly, the CDN, which allows the stream to scale to a larger number of viewers, should mimic this behavior and expose the chunks to the player in the same way.
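As a rough illustration of what "exposing chunks" means at the origin, the sketch below uses Python's built-in http.server module to push placeholder CMAF chunks to a client with HTTP/1.1 chunked transfer encoding, writing each chunk as soon as it is "available" instead of waiting for the whole segment. The port, payloads, and timing are assumptions made up for the example; this is not how Wowza Streaming Engine or any particular CDN implements it.

    import http.server
    import time

    class ChunkedSegmentHandler(http.server.BaseHTTPRequestHandler):
        """Toy origin that exposes a growing CMAF segment using HTTP/1.1
        chunked transfer encoding, so a player can start reading the first
        CMAF chunks before the whole segment has been produced."""

        protocol_version = "HTTP/1.1"   # chunked transfer encoding requires HTTP/1.1

        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "video/mp4")
            self.send_header("Transfer-Encoding", "chunked")
            self.end_headers()

            # Stand-ins for CMAF chunks (moof+mdat pairs); in a real origin these
            # would arrive progressively from the encoder/packager.
            for cmaf_chunk in (b"moof+mdat #1", b"moof+mdat #2", b"moof+mdat #3"):
                # HTTP chunk framing: hex length, CRLF, payload, CRLF
                self.wfile.write(f"{len(cmaf_chunk):X}\r\n".encode())
                self.wfile.write(cmaf_chunk + b"\r\n")
                self.wfile.flush()
                time.sleep(0.5)          # simulate the next chunk not being ready yet

            self.wfile.write(b"0\r\n\r\n")  # terminating zero-length chunk ends the body

    if __name__ == "__main__":
        http.server.HTTPServer(("", 8080), ChunkedSegmentHandler).serve_forever()

A player (or an intermediate CDN) reading this response can start parsing the first moof/mdat pair while the later ones are still being produced, which is exactly the property that makes chunked delivery useful for low latency.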
Why Do We Need CMAF?
Competing codecs, protocols, media formats, and devices make the already complex world of live streaming infinitely more so. In addition to causing a headache, the use of different media formats increases streaming latency and costs.
What could be achieved with a single format for each rendition of a stream instead requires content distributors to create multiple copies of the same content. In other words, the streaming industry makes video delivery unnecessarily expensive and slow.
The number of different container formats alone is exhausting: .ts, .mp4, .wma, .mpeg, .mov… the list goes on. But what if competing technology providers could agree on a standard streaming format across all playback platforms? Enter the Common Media Application Format, or CMAF.
CMAF brings us closer to the ideal world of single-approach encoding, packaging, and storing. What’s more, it promises to drop end-to-end delivery time from 30-45 seconds to less than three seconds.
But at Apple's Worldwide Developers Conference (WWDC) a couple of weeks back, Apple announced the specs for a brand-new extension of its HTTP Live Streaming (HLS) protocol: Low-Latency HLS. While reducing latency for live streaming is a valiant goal (and one that we can get behind), this news interrupted an industry-wide effort to do the same via chunked transfer encoding.
But whatever the outcome, and whichever approach wins the race, we at RTMP Server have nothing to worry about. We are working with Wowza Streaming Engine, the core software on our servers, to implement either of these systems, Low-Latency HLS or the Common Media Application Format (CMAF), or even both of them. We are ready and excited to really stream live, with next to no latency, to our customers!