1.1 Project Background
Back in 2002 I was employed by a start-up development house that was building a prototype TV set-top box. The concept was well ahead of its time: it was to be a PVR (Personal Video Recorder) with built-in VOD (Video on Demand) capabilities. The set-top box was intended for use on existing cable TV networks. Each client would rent a unit and access the TV content through a peer-to-peer networking system. In essence, clients would share, transfer, record and view TV programs at any time on the network.
Similar set-top box and streaming technologies are already in place today, although peer-to-peer PVR and VOD schemes were still a relatively new concept in 2002. The hope was to gain a foothold in the cable TV market before competing firms could do the same. The start-up had an advantage in that respect, because its funding came from a parent company that owned cable TV networks. Unfortunately, these plans were never fully realised: the project was shelved due to extensive financial restructuring in the parent company.
1.2 Technology Overview
The system was supposed to operate on a proprietary network (owned by the pay-TV company), hence the actual sharing and transferring mechanism would be hidden from the user. The idea was to let clients save (or stream) content directly to (or from) other set-top boxes transparently. Therefore, even though each set-top box would be equipped with a storage device, a recorded TV program might not reside on the client's own set-top box at all; it could be distributed and/or stored on other set-top boxes on the network. The client would simply see a large collection of TV programs on the network, and any program recorded by one client would be visible to all the others.
A distributed networking system of this magnitude would need a centralised database to keep track of all the recorded TV programs. On the client side, the set-top box would need an appropriate menu system to let the user navigate the list of programs obtained from the server.
The actual video content was streamed at full NTSC or PAL resolution. This called for an efficient video compression algorithm, both to save bandwidth and to reduce data storage. Additionally, the video codec was required to encode one video stream in real-time while simultaneously playing back another.
The whole project was prototyped on a couple of PCs, complete with functional networking code, a menu interface, a fast video codec and a TV program database server on the side. The grand plan was to eventually migrate the system onto an embedded architecture, and perhaps even adopt an ASIC implementation further down the track.
2 Video Codec
2.1 Choice of Compression Scheme
The video stream was processed by a technique known as the fast lifting wavelet transform [1]. The wavelet transform converted image data from the spatial domain into a series of frequency sub-bands. Data compression was achieved by combining the wavelet transform with a quantizing filter, and then passing the resulting wavelet coefficients to a zero-tree entropy coder [3, 4].
The method we used in this project was based on the 2-6 wavelet transform [2], an optimised technique intended for ASIC applications. The performance of 2-6 wavelet compression was comparable to MPEG-2: it achieved similar peak signal-to-noise and data compression ratios. However, its most attractive feature was its low complexity, which made it suitable for fast software implementations. Ours was capable of encoding and decoding video streams in real-time on a Pentium 4 machine. The typical image quality achievable by this wavelet compression scheme is illustrated in Figures 1, 2 and 3.
2.2 Colour Format Conversion
The initial step for video compression was to convert the incoming RGB frames into three separate colour planes: Y (luminance), U (chrominance blue) and V (chrominance red) respectively (see Figures 4 to 6). The frames were represented in the YUV9 format, which assigns each 4×4 block of Y pixels a single pair of U and V pixels. In this form, each chrominance plane was reduced to 1⁄16th of the luminance bandwidth before compression. The only drawback of the YUV9 format was chroma bleeding: colours would leak at the boundaries of depicted objects, see Figure 3.
2.3 The Wavelet Transform
The 2-6 wavelet transform accepted a series of YUV9 colour planes, like the ones shown in Figures 4 to 6. It operated in multiple passes on each colour plane, successively dividing it into a series of sub-bands as shown in Figure 7. Each pass extracted the low and high frequency data, separating them into two regions. The process was repeated on the low frequency half, dividing it further into smaller sub-bands, and stopped when the smallest sub-band was no longer divisible. For most situations six passes were adequate. (Despite appearances, the name "2-6" refers to the transform's 2-tap low-pass and 6-tap high-pass filters, not to the number of passes.)
As shown in Figure 7, the first horizontal division extracted the low and high frequency bands into the L (left) and R (right) regions respectively. In the next pass, the low frequency data was vertically divided into L(T) and L(B) regions, forming new low and high frequency sub-bands respectively. The algorithm eventually formed a sub-band pyramid, accumulating the low frequency data towards the top-left and leaving the DC coefficients in the corner. The final transform was represented using 16-bit integers.
The transform algorithm was easy to adapt for SIMD processing. A large portion of the transform code was implemented in MMX and SSE assembly on the Intel platform, operating on four pixels at a time. This gave a four-fold speed-up over the generic C++ implementation.
Figures 8 to 10 illustrate how the YUV9 colour planes (in Figures 4 to 6) were transformed into wavelet sub-bands. It was immediately apparent that most of the sub-bands contained very little information, especially in the high frequency regions. This was exploited by applying a quantizing filter to these regions, driving most of the high frequency coefficients to zero. The resulting redundant zero data favoured the zero-tree encoder, allowing better data compression.
Figures 11 to 13 show the same wavelet data as a colour-coded map highlighting the bit-size of each wavelet coefficient. The black areas correspond to 0-bit information, while the blue gradients represent bits of increasing size. The redundant zero coefficients are further highlighted in Figures 14 to 16. Although the wavelet coefficients were stored as 16-bit values, most had values confined to the 5 least significant bits. Only about 1% of the coefficients had values wider than 8 bits; these corresponded to the low frequency and DC coefficients.
2.4 Entropy Encoding
If the collection of wavelet coefficients were split into sixteen individual bit planes, one would notice that the plane corresponding to the most significant bit would be sparse and largely zero, whereas the plane corresponding to the least significant bit would be the most populated. Furthermore, each bit plane would become sparser towards the higher frequency regions. To exploit this property, a zero-tree compression algorithm was implemented to operate on these bit planes independently.
Figure 17 shows how the zero-tree was organised in a single bit plane. The diagram on the right illustrates its equivalent quad-tree structure. The root node coincided with the location of the DC coefficients, and its descendants expanded outwards to cover the high frequency area. The tree was built by linking each parent node to the four children that contained one or more non-zero bits; parent nodes whose children contained only zero bits became leaf nodes.
In other words, entire branches were truncated at a node whose descendants were all zero. Since the high frequency coefficients contained mostly zero data, this truncation made the tree increasingly sparse at the outer branch levels, which is what made data compression feasible. The effect is illustrated in Figures 18 to 20, which show the distribution of all the parent nodes in every bit plane. For example, the sky region in Figure 1 had very little high frequency information to begin with, so most of the corresponding wavelet coefficients were quantized to zero. Figures 18 to 20 show the absence of parent nodes in the sky regions, because all of their descendant branches were zero.
Once the zero-tree was constructed, a series of bit codes was generated to represent the traversal of the entire tree. These traversal bit codes were saved in a file container as the compressed data, allowing a decoder to reconstruct the zero-tree during playback.
The whole encoding process could be summed up as follows:
Convert each RGB frame into individual YUV9 colour planes;
Apply the 2-6 wavelet transform on each of the YUV9 input colour planes;
Quantize the wavelet coefficients to zero out some of the high frequency data;
Decompose the wavelet space into sixteen bit planes;
For each bit plane, generate a parent node map for building a tree of non-zero bit coefficients;
Traverse the entire tree while ignoring (or truncating) branches containing only zero bits;
Store the traversal order in a series of bit codes.
References
[1] C. Valens, "The Fast Lifting Wavelet Transform," 1999.
[2] W. C. Lynch, K. Kolarov, W. Arrighi, R. Hoover, "WZD - A Very Low Complexity Wavelet Video Codec for an ASIC Macro," Interval Research Corporation, 1999.
[3] J. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Trans. Signal Processing, vol. 41, pp. 3445-3463, 1993.
[4] C. Valens, "Embedded Zerotree Wavelet Encoding," 1999.