In a recent project at BBC R&D, the Single Operator Mixing Application, we’ve been building a vision mixer for live event production in a web browser. However, this has thrown up challenges around synchronising multiple video streams using current web technologies.
The Single Operator Mixing Application is just one interface on top of the IP Studio project, which is the R&D testbed for developing new standards, specifications and capabilities for IP media production. At the core of IP Studio is the concept that every audio and video source is independent, so they can be routed and processed separately, and then recombined at the point when they are delivered to audiences. Traditionally this is in a transmitter or server, but for new interactive forms of content, it could be as late as the audience’s web browser. This is achieved by using the Precision Time Protocol to distribute a clock for the studio to every machine on the network, and then stamping every “grain” (a frame of video, or a set of audio samples) with that timestamp. By giving everything timing information, and only recombining when needed, we can achieve high levels of flexibility and reliability in setting up our broadcasting chains.
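To make that concrete, here is a minimal sketch (with illustrative names, not the real IP Studio types) of a timestamped grain, and a comparison that lets grains from independent streams be aligned by time rather than by arrival order:

```typescript
// Illustrative sketch: a "grain" pairs media data with a PTP-derived
// timestamp, so independent streams can be recombined later by matching
// timestamps. (Names here are hypothetical, not the real IP Studio types.)

interface GrainTimestamp {
  seconds: number;      // whole seconds since the PTP epoch
  nanoseconds: number;  // fractional part, 0..999999999
}

interface Grain {
  timestamp: GrainTimestamp;
  payload: Uint8Array;  // a video frame, or a block of audio samples
}

// Compare two grain timestamps: negative if a is earlier, positive if
// later, zero if they represent the same instant.
function compareTimestamps(a: GrainTimestamp, b: GrainTimestamp): number {
  if (a.seconds !== b.seconds) return a.seconds - b.seconds;
  return a.nanoseconds - b.nanoseconds;
}
```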
In IP Studio, the user interfaces that operators control are not the same services as those which actually produce the rendered output. For the vision mixing tools, we instead use the principles of object-based composition to produce an edit decision list which can then be rendered for delivery to the user. This separation of concerns is convenient as it means we can route lower resolution video with web-safe codecs to the web browser (meaning we can use a standard desktop or laptop, instead of one on a network connection that can handle several uncompressed 4K streams and support specialist encodings), whilst still operating on the full quality version for the final render. It also means that the edit decision list uses the timing identity embedded in the streams as the timestamp of the edit, and synchronisation can happen in the final mux.
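As a rough illustration of the idea (the names here are ours, not the actual IP Studio schema), an edit decision list entry might pair a source’s identity with the stream timestamp at which the cut takes effect:

```typescript
// Illustrative edit decision list: each cut references a source by
// identity and the stream timestamp at which it takes effect, so the
// full-quality version can be assembled at render time.
// (Hypothetical names, not the real IP Studio data model.)

interface EdlEntry {
  timestamp: number;  // stream time the edit takes effect
  sourceId: string;   // identity of the source to cut to
}

// Find which source should be on air at a given time,
// assuming entries are sorted by timestamp.
function sourceAt(edl: EdlEntry[], time: number): string | undefined {
  let current: string | undefined;
  for (const e of edl) {
    if (e.timestamp <= time) current = e.sourceId;
    else break;
  }
  return current;
}
```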
When we originally built the Primer prototype as part of the Nearly Live Production project, we used MPEG-DASH to get the media into the browser. This gave us good synchronisation: we could start requesting segments from a known start time, and the streams would play out in real time. The offset between that start time and the current time is then used to compute the time at which edits should be applied in the backend. This can feel uncomfortable, as it means what you see in the tool isn’t actually the final mix, but in practice this is not a problem.
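The offset arithmetic can be sketched like this (the variable names are illustrative, and this is not the Primer code):

```typescript
// Sketch of deriving a backend edit time from a UI action.
// Playback began at a wall-clock moment that corresponds to a known
// timestamp in the source streams; the elapsed time since then maps a
// UI action back onto stream time. (Illustrative names only.)

function editTimestamp(
  streamStartMs: number,   // source-stream timestamp playback began from
  playbackStartMs: number, // wall-clock time playback began in the browser
  actionMs: number         // wall-clock time of the operator's action
): number {
  const offsetMs = actionMs - playbackStartMs;
  return streamStartMs + offsetMs; // stream time at which the edit applies
}
```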
In developing Primer, we encountered many bugs relating to synchronisation. Although in theory all the videos start at the same time and then move together in sync, nothing actually locks the streams to each other, so there is a risk that they drift apart. Another downside of MPEG-DASH is that the segmentation process adds significant latency (in the order of seconds), which can be a disconcerting experience if you’re in the same room as the event you’re mixing. For the newest iteration of the tool, we decided to use WebRTC to route video to the browser. WebRTC is designed for sending real-time signals, albeit with the use case of video and audio chat, but under the hood it uses the same transport standards that the broadcast industry is now moving towards, such as RTP.
However, with WebRTC, in-browser synchronisation becomes even harder. As it’s designed for low latency, videos are played as they are received (with a jitter buffer), with no synchronisation guarantees. For our initial deployment this is actually fine, as our pipeline delays are so small that any sync issues are imperceptible, but it is clearly not a sustainable solution. A bigger problem is that of deriving timing: when generating the edit decision list, how do you know the time at which an action in the UI should take effect? With the TR-03 and NMOS specifications for IP production video streams, the timing information is embedded in the stream using RTP header extensions. Sadly, in a web browser environment you cannot dig down deep enough into the stream to extract this information. It would be very helpful if browser vendors decided to support these header extensions and expose timing information to developers! We’ve worked around this by parsing the header extensions in our WebRTC server and using the WebRTC data channel to send them to the browser.
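As a hedged sketch of the server-side part of that workaround, assuming a 10-byte extension payload of a 48-bit seconds field followed by a 32-bit nanoseconds field, both big-endian (the layout used in NMOS in-stream signalling drafts), the parsing might look like:

```typescript
// Hedged sketch: decode a PTP origin timestamp from an RTP header
// extension payload, assuming a 10-byte big-endian layout of
// 48-bit seconds followed by 32-bit nanoseconds. This would run in the
// WebRTC server, with the result forwarded to the browser over a data
// channel. (Layout is an assumption; check the relevant NMOS drafts.)

interface PtpTimestamp {
  seconds: number;     // 48-bit seconds since the PTP epoch
  nanoseconds: number; // 32-bit nanoseconds
}

function parsePtpTimestamp(payload: Uint8Array): PtpTimestamp {
  if (payload.length < 10) throw new Error("extension payload too short");
  // 48-bit seconds: accumulate byte by byte (safe within Number's 2^53).
  let seconds = 0;
  for (let i = 0; i < 6; i++) seconds = seconds * 256 + payload[i];
  // 32-bit nanoseconds: >>> 0 keeps the top bit unsigned.
  const nanoseconds =
    ((payload[6] << 24) >>> 0) +
    (payload[7] << 16) +
    (payload[8] << 8) +
    payload[9];
  return { seconds, nanoseconds };
}
```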
But there is no good way to solve synchronisation of multiple real-time streams with the current set of browser standards.
We would like to work with browser vendors, and other developers who are trying to solve the same problems as us, in order to fill this important gap in web standards and support these new types of use case. Many media use cases have focused on delivering existing linear media to the audience, but as we work on using the web as a platform for creating media, and putting more control and personalisation into the hands of our users, media synchronisation across distinct streams is key to unlocking this rich new set of experiences.
For example, if WebRTC adopted the NMOS timing RTP header extensions, MediaStream could be extended with a synchronised mode. In this mode, any MediaStreamTrack fetched from a synchronised MediaStream would be synchronised with the others, with additional error conditions made available for when a particular track falls out of sync.
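Purely as speculation, a shape for such an API might look like the following; none of these members exist in browsers today, and the drift check is just one way the error condition could be detected:

```typescript
// Speculative sketch of a "synchronised" MediaStream mode. These
// interfaces and members are hypothetical extensions to the WebRTC
// shapes; nothing here exists in any browser today.

interface SynchronisedMediaStream /* extends MediaStream */ {
  readonly synchronised: boolean;
  // Fires when a track drifts beyond some tolerance.
  onsyncerror?: (trackId: string, driftMs: number) => void;
}

// One way a browser (or polyfill) might detect the error condition:
// compare each track's presented timestamp against a reference track.
function outOfSync(
  referenceMs: number,
  trackMs: number,
  toleranceMs: number
): boolean {
  return Math.abs(trackMs - referenceMs) > toleranceMs;
}
```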
In addition to the media production use case described here, the ability to synchronise multiple streams is useful in many other scenarios. For example, in WebVR 360° video conferencing, performance can be improved by splitting the video into a number of segments, each covering part of the circle, and disconnecting those behind the user; it would also help in delivering multi-stream experiences where media is composed of multiple elements for accessibility and localisation purposes.
Originally published at www.bbc.co.uk.