3 MPEG-4: Technical Description

MPEG-4 version 1 became an International Standard in 1999. The second version, which also includes the entire technical content of the first, became an International Standard in 2000. After these two versions, more parts were developed and new tools and profiles were introduced. Perhaps most important was the development of Part 10, better known as H.264 or Advanced Video Coding (AVC), which substantially improves MPEG-4’s video compression efficiency.

The MPEG-4 standard addresses system issues such as the multiplexing and composition of audiovisual data in the Systems part [7], the decoding of the visual data in Part 2 [8] and Part 10 [9], and the decoding of the audio data in the Audio part [10]. In this chapter, we will focus on the visual parts of the standard, i.e., MPEG-4 Part 2 (MPEG-4 Video) and Part 10 (H.264/AVC).

3.1 MPEG-4 Part 2

MPEG-4 Part 2, officially known as ISO/IEC 14496-2 [8], standardizes an efficient object-based representation of video. Such representation is achieved by defining visual objects and encoding them into separate bitstream segments [6,7]. While MPEG-4 Part 2 defines only the bitstream syntax and the decoding process, the precise definitions of some compliant encoding algorithms are presented in two verification models: one for natural video coding [11], and the other for synthetic and natural hybrid video coding (SNHC) [12].

MPEG-4 Part 2 provides four types of coding tools: video object coding, for natural and/or synthetically generated, rectangular or arbitrarily shaped video objects; mesh object coding, for visual objects represented with mesh structures; model-based coding, for the synthetic representation and animation of the human face and body; and still texture coding, for the wavelet coding of still textures.

In the following sections, we first describe the object-based representation and each of the MPEG-4 Part 2 coding tools. Next, we discuss the scalability and the error resilience tools, followed by a presentation of the MPEG-4 Part 2 profiles.

3.1.1 Object-based Representation

The object-based representation in MPEG-4 Part 2 is based on the concept of the audiovisual object (AVO). An AVO consists of a visual object component, an audio object component, or a combination of these components. The characteristics of the audio and visual components of the individual AVOs can vary, such that the audio component can be (a) synthetic or natural or (b) mono, stereo, or multichannel (e.g., surround sound), and the visual component can be natural or synthetic. Some examples of AVOs include object-based representations of a person recorded by a video camera, a sound clip recorded with a microphone, and a three-dimensional (3D) image with text overlay.

MPEG-4 supports the composition of a set of audiovisual objects into a scene, also referred to as an audiovisual scene.

To allow interactivity with individual AVOs within a scene, it is essential to transmit the information that describes each AVO’s spatial and temporal coordinates. This information is referred to as the scene description information and is transmitted as a separate stream and multiplexed with AVO elementary bitstreams so that the scene can be composed at the user’s end. This functionality makes it possible to change the composition of AVOs without having to change the content of AVOs.

An example of an audiovisual scene, which is composed of natural and synthetic audio and visual objects, is presented in Fig. 1. AVOs can be organized in a hierarchic fashion. Elementary AVOs, such as the blue head and the associated voice, can be combined to form a compound AVO (i.e., a talking head). It is possible to change the position of the AVOs, delete them, change their visibility, or manipulate them in a number of ways depending on their characteristics. For example, a visual object can be zoomed and rotated by the user. Moreover, the quality, spatial resolution, and temporal resolution of the individual AVOs can be modified. For example, in a mobile video telephony application, the user can request a higher frame rate and/or spatial resolution for the talking person than those of the background objects.

FIGURE 1. An audiovisual scene.

3.1.2 Video Object Coding

A video object (VO) is an arbitrarily shaped video segment that has a semantic meaning. A 2D snapshot of a video object at a particular time instant is called a video object plane (VOP). A VOP is defined by its texture (luma and chroma values) and its shape. MPEG-4 Part 2 allows object-based access to the video objects, as well as temporal instances of the video objects (i.e., VOPs). To enable access to an arbitrarily shaped object, a separation of the object from the background and the other objects must be performed. This process, known as segmentation, is not standardized in MPEG-4. However, automatic and semi-automatic tools [13] and techniques such as chroma keying [14], although not always effective, can be used for video object segmentation.

As illustrated in Fig. 2, a basic VOP encoder consists mainly of two blocks: a DCT-based motion-compensated hybrid video texture coder, and a shape coder. Similar to MPEG-1 and MPEG-2, MPEG-4 supports intra coded (I-), temporally predicted (P-), and bidirectionally predicted (B-) VOPs, all of which are illustrated in Fig. 3. Except for I-VOPs, motion estimation and compensation are applied. Next, the difference between the motion compensated data and the original data is DCT transformed, quantized, and then variable length coded (VLC). Motion information is also encoded using VLCs. Since the shape of a VOP may not change significantly between consecutive VOPs, predictive coding is used to reduce temporal redundancies. Thus, motion estimation and compensation are also applied to the shape of the VOP. Finally, the motion, texture, and shape information is multiplexed with the headers to form the coded VOP bitstream. At the decoder end, the VOP is reconstructed by combining motion, texture, and shape data decoded from the bitstream.

FIGURE 2. A basic block diagram of an MPEG-4 Part 2 video encoder.

FIGURE 3. Prediction types for a video object plane (VOP).

3.1.2-1 Motion Vector Coding.

Motion vectors (MVs) are predicted using a spatial neighborhood of three MVs and the resulting prediction error is variable length coded. Motion vectors are transmitted only for P-VOPs and B-VOPs. MPEG-4 Part 2 uses a variety of motion compensation techniques, such as the use of unrestricted MVs (motion vectors that are allowed to point outside the coded area of a reference VOP), and the use of four MVs per macroblock. In addition, version 2 of MPEG-4 Part 2 supports global motion compensation and quarter-sample motion vector accuracy.
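A minimal sketch of this prediction step is shown below; it assumes the usual three-candidate template (left, above, and above-right blocks) and omits the boundary rules for unavailable candidates.

```python
def predict_mv(mv_left, mv_above, mv_above_right):
    """Component-wise median of three candidate motion vectors.

    Each argument is an (x, y) tuple in half- or quarter-sample units.
    Simplified sketch: the rules for candidates that are unavailable at
    picture or video-packet boundaries are omitted.
    """
    def median3(a, b, c):
        return sorted((a, b, c))[1]

    px = median3(mv_left[0], mv_above[0], mv_above_right[0])
    py = median3(mv_left[1], mv_above[1], mv_above_right[1])
    return (px, py)

# Only the prediction error is variable length coded and transmitted:
mv_current = (5, -2)
pred = predict_mv((4, -1), (6, -3), (3, -2))
mv_error = (mv_current[0] - pred[0], mv_current[1] - pred[1])
```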

3.1.2-2 Texture Coding.

Intra blocks, as well as motion compensation prediction error blocks, are texture coded. Similar to MPEG-1, MPEG-2 (described in Chapter 6.4), and H.263, DCT-based coding is employed to reduce spatial redundancies. That is, each VOP is divided into macroblocks as illustrated in Fig. 4. Each macroblock consists of a 16 × 16 array of luma samples and two corresponding 8 × 8 arrays of chroma samples. These arrays are partitioned into 8 × 8 blocks for DCT processing. DCT coding is applied to the four 8 × 8 luma and two 8 × 8 chroma blocks of each macroblock. If a macroblock lies on the boundary of an arbitrarily shaped VOP, then the samples that are outside the VOP are padded before DCT coding. As an alternative, a shape-adaptive DCT coder can be used for coding boundary macroblocks of intra VOPs. This generally results in higher compression performance, at the expense of an increased implementation complexity. Macroblocks that are completely inside the VOP are DCT transformed as in MPEG-1/2 and H.263. DCT transformation of the blocks is followed by quantization, zig-zag coefficient scanning, and variable length coding. Adaptive DC/AC prediction methods and alternate scan techniques can be employed for efficient coding of the DCT coefficients of intra blocks.
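The texture path for a single 8 × 8 block can be sketched as follows; a flat quantizer step and the conventional zig-zag scan are assumed here, rather than the standard's quantization matrices, DC/AC prediction, or VLC tables.

```python
import numpy as np
from scipy.fft import dct

def zigzag_indices(n=8):
    """Conventional zig-zag scan order for an n x n block of coefficients."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def code_block(block, qstep):
    """2-D DCT -> uniform quantization -> zig-zag scan of one 8x8 block.

    'qstep' is a single illustrative step size; the standard uses
    quantization matrices and separate intra/inter rules, and run-length
    plus variable length coding would follow the scan.
    """
    coeffs = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
    quantized = np.round(coeffs / qstep).astype(int)
    return [quantized[i, j] for i, j in zigzag_indices(block.shape[0])]

block = np.random.randint(0, 256, (8, 8)).astype(float)
scan = code_block(block, qstep=16)
```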

FIGURE 4. A video object plane enclosed in a rectangular bounding box and divided into macroblocks.

3.1.2-3 Shape Coding.

Besides H.263+ [4], which provides some limited shape coding support via its chroma-keying coding technique, MPEG-4 Part 2 is the only video coding standard that supports shape coding. As opposed to H.263+, MPEG-4 Part 2 adopted a bitmap-based shape coding method because of its ability to clearly distinguish between shape and texture information while retaining high compression performance and reasonable complexity. In bitmap-based shape coding, the shape and transparency of a VOP are defined by a binary alpha plane and a gray-scale alpha plane, respectively. A binary alpha plane indicates whether or not a sample belongs to a VOP. A gray-scale alpha plane indicates the transparency of each sample within a VOP. Transparency of samples can take values from 0 (transparent) to 255 (opaque). If all of the samples in a VOP block are indicated to be opaque or transparent, then no additional transparency information is transmitted for that block.

MPEG-4 Part 2 provides tools for both lossless and lossy coding of binary and gray-scale alpha planes. Furthermore, both intra and inter shape coding are supported. Binary alpha planes are divided into 16 × 16 blocks as illustrated in Fig. 5. The blocks that are inside the VOP are signaled as opaque blocks and the blocks that are outside the VOP are signaled as transparent blocks. The samples in boundary blocks (i.e., blocks that contain samples both inside and outside the VOP) are scanned in a raster scan order and coded using context-based arithmetic coding. Gray-scale alpha planes, which represent transparency information, are divided into 16 × 16 blocks and coded the same way as the texture in the luma blocks.

FIGURE 5. Binary alpha plane.

In intra shape coding using binary alpha planes, a context is computed for each sample using ten neighboring samples (shown in Fig. 6A) and the equation C = Σ_k c_k · 2^k, where k is the sample index and c_k is 0 for transparent samples and 1 for opaque samples. If the context samples fall outside the current block, then samples from neighboring blocks are used to build the context. The computed context is then used to access a table of probabilities, and the selected probability determines the appropriate code space for arithmetic coding. For each boundary block, the arithmetic encoding process is also applied to the transposed version of the block. The representation that results in the fewest coding bits can be selected by the encoder to be conveyed in the bitstream.
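The context computation itself is a simple weighted sum of the template bits, as sketched below; the ten template samples of Fig. 6(a) are assumed to be supplied in the order defined by the standard.

```python
def shape_context(neighbors):
    """Context index C = sum_k c_k * 2**k for binary shape coding.

    'neighbors' is the list of template samples (10 for intra, 9 for inter),
    each 0 (transparent) or 1 (opaque). Which samples form the template is
    defined by the standard (Fig. 6); here they are simply passed in order.
    """
    return sum(c << k for k, c in enumerate(neighbors))

# A 10-sample intra template yields a context in the range 0..1023, which
# indexes the table of probabilities used by the arithmetic coder.
ctx = shape_context([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
```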

FIGURE 6. Template samples that form the context of an arithmetic coder for (a) intra and (b) inter coded shape blocks.

In inter shape coding using binary alpha planes, the shape of the current block is first predicted from the shape of the temporally previous VOP by performing motion estimation and compensation using integer sample accuracy. The shape motion vector is then coded predictively. Next, the difference between the current and the predicted shape block is arithmetically coded. The context for an inter coded shape block is computed using a template of nine samples from both the current and temporally previous VOP shape blocks, as shown in Fig. 6(b).

In both intra and inter shape coding, lossy coding of the binary shape is achieved by either not transmitting the difference between the current and the predicted shape block or subsampling the binary alpha plane by a factor of two or four prior to arithmetic encoding. To reduce the blocky appearance of the decoded shape caused by lossy coding, an upsampling filter is employed during the reconstruction.

3.1.2-4 Sprite Coding.

In MPEG-4 Part 2, sprite coding is used for representation of video objects that are static throughout a video scene or are modified such that their changes can be approximated by warping the original object planes [8,15]. Sprites may typically be used for transmitting the background in video sequences. As shown in the example of Fig. 7, a sprite may consist of a panoramic image of the background, including the samples that are occluded by other video objects. Such a representation can increase coding efficiency, since the background image is coded only once in the beginning of the video segment, and the camera motion, such as panning and zooming, can be represented by only a few global motion parameters.

FIGURE 7. Sprite coding of a video sequence

(courtesy of Dr. Thomas Sikora, Technical University of Berlin, [6]). VO, video object. (See color insert.)

3.1.3 Mesh Object Coding

A mesh is a tessellation (partitioning) of an image into polygonal patches. Mesh representations have been successfully used in computer graphics for efficient modeling and rendering of 3D objects. In order to benefit from functionalities provided by such representations, MPEG-4 Part 2 supports two-dimensional (2D) and 3D mesh representations of natural and synthetic visual objects, and still texture objects, with triangular patches [8, 16]. The vertices of the triangular mesh elements are called node points, and they can be used to track the motion of a video object, as depicted in Fig. 8. Motion compensation is performed by spatially piecewise warping of the texture maps that correspond to the triangular patches. Mesh modeling can efficiently represent continuous motion, resulting in fewer blocking artifacts at low bit rates as compared to block-based modeling. It also enables object-based retrieval of video objects by providing accurate object trajectory information and syntax for vertex-based object shape representation.

FIGURE 8. Mesh representation of a video object with triangular patches

(courtesy of Dr. Murat Tekalp, University of Rochester, [6]).

3.1.4 Model-based Coding

Model-based representation enables very low bit rate video coding applications by providing the syntax for the transmission of the parameters that describe the behavior of a human being, rather than the video frames themselves. MPEG-4 Part 2 supports the coding of two types of models: a face object model, which is a synthetic representation of the human face with 3D polygon meshes that can be animated to have visual manifestations of speech and facial expressions, and a body object model, which is a virtual human body model represented with 3D polygon meshes that can be rendered to simulate body movements [8, 17, 18].

3.1.4-1 Face Animation.

It is required that every MPEG-4 Part 2 decoder that supports face object decoding have a default face model, which can be replaced by downloading a new face model. Either model can be customized to have a different visual appearance by transmitting facial definition parameters (FDPs). FDPs can determine the shape (i.e., head geometry) and texture of the face model.

A face object consists of a collection of nodes, also called feature points, which are used to animate synthetic faces. The animation is controlled by face animation parameters (FAPs) that manipulate the displacements of feature points and the angles of face features and expressions. The standard defines a set of 68 low-level animations, such as head and eye rotations, as well as motion of a total of 82 feature points for the jaw, lips, eye, eyebrow, cheek, tongue, hair, teeth, nose, and ear. These feature points are shown in Fig. 9. High-level expressions, such as joy, sadness, fear, and surprise, as well as mouth movements, are defined in terms of sets of low-level FAPs. For example, the joy expression is defined by relaxed eyebrows and an open mouth with the mouth corners pulled back toward the ears. Fig. 10 illustrates several video scenes that are constructed using FAPs.

FIGURE 9. Feature points used for animation.

FIGURE 10. Examples of face expressions coded with facial animation parameters

(courtesy of Dr. Joern Ostermann, University of Hannover, [19]).

The FAPs are coded by applying quantization followed by arithmetic coding. The quantization is performed by taking into consideration the limited movements of the facial features. Alternatively, DCT coding can be applied to a vector of 16 temporal instances of the FAP. This solution improves compression efficiency at the expense of a higher delay.

3.1.4-2 Body Animation.

Body animation was standardized in version 2 of MPEG-4 Part 2. Similar to the case of a face object, two sets of parameters are defined for a body object: body definition parameters (BDPs), which define the body through its dimensions, surface and texture, and body animation parameters (BAPs), which define the posture and animation of a given body model.

3.1.5 Still Texture Coding

The block diagram of an MPEG-4 Part 2 still texture coder is shown in Fig. 11. As illustrated in this figure, the texture is first decomposed using a 2D separable wavelet transform, employing a Daubechies biorthogonal filter bank [8]. The discrete wavelet transform is performed using either integer or floating point operations. For coding of an arbitrarily shaped texture, a shape-adaptive wavelet transform can be used.

FIGURE 11. Block diagram of the still texture coder.

The DPCM coding method is applied to the coefficient values of the lowest frequency subband. A multi-scale zero-tree coding method [20] is applied to the coefficients of the remaining subbands. Zero-tree modeling is used for encoding the location of non-zero wavelet coefficients by taking advantage of the fact that, if a wavelet coefficient is quantized to zero, then all wavelet coefficients with the same orientation and the same spatial location at finer wavelet scales are also likely to be quantized to zero. Two different zero-tree scanning methods are used to achieve spatial and quality (signal-to-noise ratio [SNR]) scalability. After DPCM coding of the coefficients of the lowest frequency subband, and zero-tree scanning of the remaining subbands, the resulting data is coded using an adaptive arithmetic coder.
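The descendant test at the heart of zero-tree coding can be sketched as follows; the four-child parent-child relation used here is the usual zero-tree convention, and the details of MPEG-4's multi-scale scanning are not reproduced.

```python
def is_zerotree(subbands, level, orient, i, j, threshold):
    """Check whether coefficient (i, j) is the root of a zero tree.

    'subbands[level][orient]' holds the wavelet coefficients of one
    orientation (LH, HL, or HH) at one decomposition level, with level 0
    the coarsest. The parent-child relation used here (four children at
    (2i, 2j) .. (2i+1, 2j+1) in the next finer level) is the common
    zero-tree convention; MPEG-4's multi-scale variant differs in detail.
    """
    if abs(subbands[level][orient][i][j]) >= threshold:
        return False
    if level + 1 == len(subbands):          # finest level: no children
        return True
    return all(is_zerotree(subbands, level + 1, orient,
                           2 * i + di, 2 * j + dj, threshold)
               for di in (0, 1) for dj in (0, 1))
```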

3.1.6 Scalability

In addition to the video coding tools discussed so far, MPEG-4 Part 2 provides scalability tools that allow organization of the bitstream into base and enhancement layers. The enhancement layers are transmitted and decoded depending on the bit rate, display resolution, network throughput, and decoder complexity constraints. Temporal, spatial, quality, complexity, and object-based scalabilities are supported in MPEG-4 Part 2. The first three types of scalability were discussed in Chapter 6.3. Complexity scalability is the scaling of the processing tasks in such a way that the reconstruction quality of the bitstream is adaptable to the processing power of the decoder. Object-based scalability allows the addition or removal of video objects, as well as the prioritization of the objects within a scene. Using this functionality, it is possible to represent the objects of interest with higher spatial and/or temporal resolution, while allocating less bandwidth and computational resources to the objects that are less important. All of these forms of scalability allow prioritized transmission of data, thereby also improving the bitstream’s error resilience.

A special case of MPEG-4 Part 2 scalability support, which allows encoding of a base layer and up to eleven enhancement layers, is known as fine granularity scalability (FGS). In FGS coding, the DCT coefficients in the enhancement layers are coded using bit-plane coding instead of the traditional run-level coding [21]. As a result, the FGS coded streams allow for smoother transitions between different quality levels, thereby adapting better to varying network conditions [6, 17, 21].
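The bit-plane view of the enhancement-layer coefficients can be sketched as follows; sign handling and the actual FGS syntax are omitted.

```python
def bit_planes(coefficients, num_planes=8):
    """Decompose coefficient magnitudes into bit planes (MSB first).

    In FGS-style coding each plane is coded and transmitted separately, so
    the enhancement stream can be truncated after any plane and still be
    decoded at a proportionally lower quality.
    """
    planes = []
    for p in reversed(range(num_planes)):
        planes.append([(abs(c) >> p) & 1 for c in coefficients])
    return planes

# Truncating after the first few planes gives a coarse reconstruction;
# each additional plane refines the coefficient magnitudes.
planes = bit_planes([13, -7, 0, 2, 0, 0, 1, 0], num_planes=4)
```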

3.1.7 Error Resilience

To ensure robust operation over error-prone channels, MPEG-4 Part 2 offers a suite of error-resilience tools that can be divided into three groups: resynchronization, data partitioning, and data recovery [8, 22]. Resynchronization is enabled by the MPEG-4 Part 2 syntax, which supports a video packet structure that contains resynchronization markers and information such as macroblock number and quantizer in the header. All of these are necessary to restart the decoding operation in case an error is encountered. Data partitioning allows the separation between the motion and texture data, along with additional resynchronization markers in the bitstream to improve the ability to localize the errors. This technique provides enhanced concealment capabilities. For example, if texture information is lost, motion information can be used to conceal the errors. Data recovery is supported in MPEG-4 Part 2 by reversible variable-length codes for DCT coefficients and a technique known as new prediction (NEWPRED). Reversible variable length codes for DCT coefficients can be decoded in forward and backward directions. Thus, if part of a bitstream cannot be decoded in the forward direction due to errors, some of the coefficient values can be recovered by decoding the coefficient part of the bitstream in the backward direction. NEWPRED, which is a method intended for real-time encoding applications, makes use of an upstream channel from decoder to encoder, where the encoder dynamically replaces the reference pictures according to the error conditions and feedback received from the decoder [6].

3.1.8 MPEG-4 Part 2 Profiles

Since the MPEG-4 Part 2 syntax is designed to be generic, and includes many tools to enable a variety of video applications, the implementation of an MPEG-4 Part 2 decoder that supports the full syntax is often impractical. Therefore, the standard defines a number of subsets of the syntax, referred to as profiles, each of which targets a specific group of applications. For instance, the simple profile targets low-complexity and low-delay applications such as mobile video communications, whereas the main profile is intended primarily for interactive broadcast and digital versatile disc (DVD) applications. A complete list of the MPEG-4 Part 2 version 1 and version 2 profiles is given in Table 1. The subsequent versions added more profiles to the ones defined in version 1 and version 2, increasing the total number of profiles in MPEG-4 Part 2 to approximately 20. Examples of new profiles include the advanced simple profile targeting more efficient coding of ordinary rectangular video, the simple studio profile targeting studio editing applications and supporting 4:4:4 and 4:2:2 chroma sampling, and the fine granularity scalability profile targeting webcasting and wireless communication applications.

TABLE 1. MPEG-4 Part 2 visual profiles. The index (2) marks profiles that are available in version 2 of MPEG-4 Part 2. The rest of the profiles shown are available in both version 1 and 2 of MPEG-4 Part 2.

Profiles for natural video content:
Simple: error-resilient coding of rectangular video objects
Simple scalable: Simple profile + frame-based temporal and spatial scalability
Core: Simple profile + coding of arbitrarily shaped objects
Main: Core profile + interlaced video coding + transparency coding + sprite coding
N-bit: Core profile + coding of video objects with sample depths between 4 and 12 bits
Advanced real-time simple(2): improved error-resilient coding of rectangular video with low buffering delay
Core scalable(2): Core profile + temporal and spatial scalability of arbitrarily shaped objects
Advanced coding efficiency(2): coding of rectangular and arbitrarily shaped objects with improved coding efficiency

Profiles for synthetic and synthetic/natural/hybrid video content:
Simple face animation: basic coding of simple face animation
Scalable texture: spatially scalable coding of still texture objects
Simple basic animated two-dimensional texture: Simple face animation + spatial and quality scalability + mesh-based representation of still texture objects
Hybrid: coding of arbitrarily shaped objects + temporal scalability + face object coding + mesh coding of animated still texture objects
Advanced scalable texture(2): scalable coding of arbitrarily shaped texture and shape, wavelet tiling, and error resilience
Advanced core(2): Core visual profile + Advanced scalable texture visual profile
Simple face and body animation(2): Simple face animation profile + body animation

3.2 MPEG-4 Part 10: H.264/AVC

MPEG-4 Part 10: H.264/AVC was developed by a partnership project known as the Joint Video Team (JVT) between the ISO, IEC, and ITU-T, and it became an International Standard in 2003. One of the key goals in the development of H.264/AVC was to address the needs of the many different video applications and delivery networks that would be used to carry the coded video data. To facilitate this, the standard is conceptually divided into a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL defines a decoding process that can provide an efficient representation of the video, while the NAL provides appropriate header and system information for each particular network or storage media. The NAL enables network friendliness by mapping VCL data to a variety of transport layers such as RTP/IP, MPEG-2 systems, and H.323. A more detailed description of the NAL concepts and the error resilience properties of H.264/AVC are provided in [23] and [24]. In this chapter, we provide an overview of the H.264/AVC video coding tools that comprise the VCL.

3.2.1 H.264/AVC Video Coding Layer: Technical Overview

Even though H.264/AVC, similar to the preceding standards, defines only the bitstream syntax and video decoding process, here we discuss both the encoding and the decoding process for completeness. H.264/AVC employs a hybrid block-based video compression technique, similar to those defined in earlier video coding standards, which is based on combining inter picture prediction to exploit temporal redundancy and transform-based coding of the prediction errors to exploit the spatial redundancy. A generalized block diagram of an H.264/AVC encoder is provided in Fig. 12. While it is based on the same hybrid coding framework, H.264/AVC features a number of significant components that distinguish it from its predecessors. These features include spatial directional prediction; an advanced motion-compensation model utilizing variable block size prediction, quarter-sample accurate motion compensation, multiple reference picture prediction, and weighted prediction; an in-loop deblocking filter; a small block-size integer transform; and two context-adaptive entropy coding modes.

FIGURE 12. Block diagram of an H.264/AVC encoder.

A number of special features are provided to enable more flexible use or to provide resilience against losses or errors in the video data. The sequence and picture header information is placed into structures known as parameter sets, which can be transmitted in a highly flexible fashion to either help reduce bit rate overhead or add resilience against header data losses. Pictures are composed of slices that can be highly flexible in shape, and each slice of a picture is coded completely independently of the other slices in the same picture to enable enhanced loss/error resilience. Further loss/error resilience is enabled by providing capabilities for data partitioning (DP) of the slice data and allowing redundant picture (RP) slice representations to be sent. Robustness against variable network delay is provided by allowing arbitrary slice order (ASO) in the compressed bitstream. A new type of coded slice, known as a synchronization slice for intra (SI) or inter (SP) coding, is supported; it enables efficient switching between streams or between different parts of the same stream (e.g., to change the server bit rate of pre-encoded streaming video, for trick-mode playback, or for loss/error robustness).

3.2.1-1 Slices and Slice Groups.

As in previous standards, pictures in H.264/AVC are partitioned into macroblocks, which are the fundamental coding unit, each consisting of a 16 × 16 block of luma samples and two corresponding 8 × 8 blocks of chroma samples. Also, each picture can be divided into a number of independently decodable slices, where a slice consists of one or more macroblocks. The partitioning of the picture into slices is much more flexible in H.264/AVC than in previous standards. In previous standards, the shape of a slice was often highly constrained and the macroblocks within the same slice were always consecutive in the raster-scan order of the picture or of a rectangle within the picture. However, in H.264/AVC, the allocation of macroblocks into slices is made completely flexible, through the specification of slice groups and macroblock allocation maps in the picture parameter set. Through this feature, informally known as flexible macroblock ordering (FMO), a single slice may contain all of the data necessary for decoding a number of macroblocks that are scattered throughout the picture, which can be useful for recovery from errors due to packet loss [23]. An example is given in Fig. 13, which illustrates one method of allocating each macroblock in a picture to one of three slice groups. In this example, the bitstream data for every third macroblock in raster scan order is contained within the same slice group. A macroblock allocation map is specified that categorizes each macroblock of the picture into a distinct slice group. Each slice group is partitioned into one or more slices, where each slice consists of an integer number of macroblocks in raster scan order within its slice group. Thus, if a single slice is lost and the remaining slices are successfully reconstructed, several macroblocks that are adjacent to the corrupted macroblocks are available to assist error concealment.

FIGURE 13. Example of slice groups.
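A minimal sketch of the macroblock-to-slice-group map for the Fig. 13 style of allocation is given below; the standard parameterizes such maps in the picture parameter set and defines several other map types as well, so this is only one illustrative case.

```python
def cyclic_mb_to_slice_group(num_mbs, num_slice_groups):
    """Macroblock-to-slice-group map matching the Fig. 13 example:
    every num_slice_groups-th macroblock in raster-scan order falls in
    the same slice group. H.264/AVC signals such maps in the picture
    parameter set and supports several other map types as well."""
    return [mb_addr % num_slice_groups for mb_addr in range(num_mbs)]

# A QCIF picture (11 x 9 = 99 macroblocks) split into three slice groups:
mb_map = cyclic_mb_to_slice_group(99, 3)   # [0, 1, 2, 0, 1, 2, ...]
```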

In contrast with previous standards in which the picture type (I, P, or B) determined the macroblock prediction modes available, it is the slice type in H.264/AVC that determines which prediction modes are available for the macroblocks. For example, I slices contain only intra predicted macroblocks, and P slices can contain inter predicted (motion-compensated) macroblocks in addition to the types allowed in I slices. Slices of different types can be mixed within a single picture.

3.2.1-2 Spatial Directional Intra Prediction.

The compression performance of a block-based video codec depends fundamentally on the effective use of prediction of sample blocks to minimize the residual prediction error. H.264/AVC provides a very powerful and flexible model for the prediction of each block of samples from previously encoded and reconstructed sample values. This includes spatial directional prediction within the same picture (intra prediction), and flexible multiple reference picture motion compensation from a set of previously reconstructed and stored pictures (inter prediction). As stated above, the prediction modes that are available are dependent upon the type of each slice (intra, predictive, or bipredictive, for I, P, and B slices, respectively).

Intra prediction requires data from only within the current picture. Unlike the previous video standards, in which prediction of intra coded blocks was achieved by predicting only the DC value or a single row or column of transform coefficients from the above or left block, H.264/AVC uses spatial directional prediction, in which individual sample values are predicted based on neighboring sample values that have already been decoded and fully reconstructed. Two different modes are supported: 4 × 4 and 16 × 16. In the 4 × 4 intra coding mode, each 4 × 4 luma block within a macroblock can use a different prediction mode. There are nine possible modes: DC and eight directional prediction modes. For example, in the horizontal prediction mode, the prediction is formed by copying the samples immediately to the left of the block across the rows of the block. As illustrated in Fig. 14, diagonal modes operate similarly, with the prediction of each sample based on a weighting of the previously reconstructed samples adjacent to the predicted block.

FIGURE 14. Intra prediction in H.264/AVC.

The 16 × 16 intra prediction mode operates similarly, except that the entire luma macroblock is predicted at once, based on the samples above and to the left of the macroblock. Also, in this mode there are only four modes available for prediction: DC, vertical, horizontal, and planar. The 16 × 16 intra mode is most useful in relatively smooth picture areas.
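As an illustration, the following sketch shows two of the nine 4 × 4 luma modes (horizontal and DC); neighbor-availability rules and the remaining directional modes are omitted, so it should be read as a simplified rendering of the idea rather than the normative process.

```python
import numpy as np

def intra4x4_horizontal(left):
    """Horizontal prediction: each of the 4 rows is filled with the
    reconstructed sample immediately to its left ('left' has 4 samples)."""
    return np.tile(np.asarray(left).reshape(4, 1), (1, 4))

def intra4x4_dc(left, above):
    """DC prediction: every sample is predicted by the rounded mean of the
    left and above neighbors. (When neighbors are unavailable, the standard
    falls back to a subset of them or to a fixed value.)"""
    mean = (sum(left) + sum(above) + 4) >> 3
    return np.full((4, 4), mean, dtype=int)

pred_h  = intra4x4_horizontal([120, 122, 125, 127])
pred_dc = intra4x4_dc([120, 122, 125, 127], [118, 119, 121, 124])
```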

3.2.1-3 Enhanced Motion-Compensation Prediction Model.

Motion-compensated prediction (inter prediction) in H.264/AVC, similar to that in the previous standards, is primarily based on the translational block-based motion model, in which blocks of samples from previously reconstructed reference pictures are used to predict current blocks through transmission of motion vectors. However, the motion-compensation model defined in H.264/AVC [25, 26] is more powerful and flexible than those defined in earlier standards, and provides a much larger number of options in the search for block matches to minimize the residual error. The H.264/AVC motion model includes seven partition sizes for motion compensation, quarter-sample accurate motion vectors, a generalized multiple reference picture buffer, and weighted prediction. For an encoder to take full advantage of the larger number of prediction options, the prediction selection in H.264/AVC is more computationally intense than in some earlier standards that have simpler models, such as MPEG-2.

Previous standards typically allowed motion compensation block sizes of only 16 × 16 or possibly 8 × 8 luma samples. In H.264/AVC, the partitioning of each macroblock into blocks for motion compensation is much more flexible, allowing seven different block sizes, as illustrated in Fig. 15. Each macroblock can be partitioned in one of the four ways illustrated in the top row of the figure. If the 8 × 8 partitioning is chosen, each 8 × 8 block can be further partitioned in one of the four ways shown in the bottom row of the figure.

FIGURE 15. Illustration of macroblock partitioning into blocks of different sizes (dimensions shown are in units of luma samples).

All motion vectors in H.264/AVC are transmitted with quarter-luma sample accuracy, providing improved opportunities for prediction compared to most previous standards, which allowed only half-sample accurate motion compensation. For motion vectors at fractional-sample locations, the sample values are interpolated from the reference picture samples at integer-sample locations. Luma component predictions at half-sample locations are interpolated using a 6-tap finite impulse response (FIR) filter. Predictions at quarter-sample luma locations are computed via a bilinear interpolation of the values for two neighboring integer- and half-sample locations. Chroma fractional-sample location values are computed by linear interpolation between integer-location values.
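A one-dimensional sketch of this interpolation is shown below, using the 6-tap filter coefficients (1, -5, 20, 20, -5, 1)/32 for the half-sample position and a rounded average for the quarter-sample position; clipping to the valid sample range and the full two-dimensional (separable) process are omitted.

```python
def half_sample(samples, x):
    """Half-sample luma value between integer positions x and x+1 using
    the 6-tap filter (1, -5, 20, 20, -5, 1)/32. Clipping to the sample
    range and 2-D separable filtering are omitted in this sketch."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * samples[x - 2 + k] for k, t in enumerate(taps))
    return (acc + 16) >> 5

def quarter_sample(a, b):
    """Quarter-sample value as the rounded average of the two nearest
    integer and/or half-sample values."""
    return (a + b + 1) >> 1

row = [10, 12, 40, 200, 210, 205, 190, 60]
h = half_sample(row, 3)            # between row[3] and row[4]
q = quarter_sample(row[3], h)      # quarter position next to row[3]
```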

H.264/AVC provides great flexibility in terms of which pictures can be used as references to generate motion-compensated predictions for subsequent coded pictures. This is achieved through a flexible multiple reference picture buffer. In earlier standards, the availability of reference pictures was generally fixed and limited to a single temporally previous picture for predicting P pictures, and two pictures — one temporally previous and one temporally subsequent — for predicting B pictures. However, in H.264/AVC, a multiple-reference picture buffer is available and it may contain up to 16 reference frames or 32 reference fields (depending on the profile, level, and picture resolution specified for the bitstream). The assignment and removal of pictures entering and exiting the buffer can be explicitly controlled by the encoder by transmitting buffer control commands as side-information in the slice header, or the buffer can operate on a first-in, first-out basis in decoding order. The motion-compensated predictions for each macroblock can be derived from one or more of the reference pictures within the buffer by including reference picture selection syntax elements in conjunction with the motion vectors. With the generalized multiple reference picture buffer, the temporal constraints that are imposed on reference picture usage are greatly relaxed. For example, as illustrated in Fig. 16, it is possible that a reference picture used in a P slice is located temporally subsequent to the current picture, or that both references for predicting a macroblock in a B slice are located in the same temporal direction (either forward or backward in temporal order). To convey the generality, the terms forward and backward prediction are not used in H.264/AVC. Instead the terms List 0 prediction and List 1 prediction are used, reflecting the organization of pictures in the reference buffer into two (possibly overlapping) lists of pictures without regard to temporal order. In addition, with the generalized multiple reference picture buffer, pictures coded using B slices can be referenced for the prediction of other pictures, if desired by the encoder [25].

FIGURE 16. Multiple reference picture prediction in H.264/AVC.

In P slices, each motion-compensated luma prediction block is derived by quarter-sample interpolation of a prediction from a single block within the set of reference pictures. In B slices, an additional option is provided, allowing each prediction block to be derived from a pair of locations within the set of reference pictures. By default, the final prediction in a bipredictive block is computed using a sample-wise ordinary average of the two interpolated reference blocks. However, H.264/AVC also includes a weighted prediction feature, in which each prediction block can be assigned a relative weighting value rather than using an ordinary average value, and an offset can be added to the prediction. These parameters can either be transmitted explicitly by the encoder, or computed implicitly, based on a time-related relationship between the current picture and the reference picture(s). Weighted prediction can provide substantial coding gains in scenes containing fades or cross fades, which are very difficult to code efficiently with previous video standards that do not include this feature.
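The sketch below illustrates the sample-wise combination; with the default parameters it reduces to the ordinary rounded average, while non-default weights and an offset model the explicit weighted prediction case. The rounding and clipping details of the normative process are simplified.

```python
def bipredict(block0, block1, w0=1, w1=1, log_wd=1, offset=0):
    """Sample-wise bi-prediction.

    With the default parameters this reduces to the rounded average
    (p0 + p1 + 1) >> 1. Explicit weighted prediction supplies w0, w1,
    a weight denominator 2**log_wd, and an additive offset; the exact
    rounding and clipping rules of the standard are simplified here.
    """
    return [((w0 * p0 + w1 * p1 + ((1 << log_wd) >> 1)) >> log_wd) + offset
            for p0, p1 in zip(block0, block1)]

avg  = bipredict([100, 104, 108], [110, 102, 96])                 # plain average
fade = bipredict([100, 104, 108], [110, 102, 96], w0=3, w1=1, log_wd=2)
```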

3.2.1-4 In-Loop Deblocking Filter.

Block-based prediction and transform coding, including quantization of transform coefficients, can lead to visible and subjectively objectionable changes in intensity at coded block boundaries, referred to as blocking artifacts. In previous video codecs, such as MPEG-2, the visibility of these artifacts could optionally be reduced to improve subjective quality by applying a deblocking filter after decoding and prior to display. The goal of such a filter is to reduce the visibility of subjectively annoying artifacts while avoiding excessive smoothing that would result in loss of detail in the image. However, because the process was optional for decoders and was not normatively specified in these standards, the filtering was required to take place outside of the motion-compensation loop, and large blocking artifacts would still exist in reference pictures that would be used for predicting subsequent pictures. This reduced the effectiveness of the motion-compensation process and allowed blocking artifacts to be propagated into the interior of subsequent motion-compensated blocks, causing the subsequent picture predictions to be less effective and making the removal of the artifacts by filtering more challenging.

To improve upon this, H.264/AVC defines a normative deblocking filtering process. This process is performed identically in both the encoder and decoder to maintain an identical set of reference pictures. This filtering leads to both objective and subjective improvements in quality, shown in Fig. 17, due to the improved prediction and the reduction in visible blocking artifacts. Note that the second version of the ITU-T H.263 standard also included an in-loop deblocking filter as an optional feature, but this feature was not supported in the most widely deployed baseline profile of that standard, and the design had problems with inverse-transform rounding error effects.

FIGURE 17. A decoded frame of the sequence Foreman (a) without the in-loop deblocking filtering applied and (b) with the in-loop deblocking filtering (the original sequence Foreman is courtesy of Siemens AG) (see color insert).

The deblocking filter defined in H.264/AVC operates on the 4 × 4 block transform grid. Both luma and chroma samples are filtered. The filter is highly adaptive to remove as many artifacts as possible without excessive smoothing. For the line of samples across each horizontal or vertical block edge (such as that illustrated in Fig. 18), a filtering strength parameter is determined based on the coding parameters on both sides of the edge. When the coding parameters indicate that large artifacts are more likely to be generated (e.g., intra prediction or coding of non-zero transform coefficients), larger strength values are assigned. This results in stronger filtering being applied. Additionally, sample values (i.e., a0–a3 and b0–b3 in Fig. 18) along each line of samples to be (potentially) filtered are checked against several conditions that are based on the quantization step size used on either side of the edge in order to distinguish between discontinuities that are introduced by quantization, and those that are true edges that should not be filtered to avoid loss of detail [26]. For instance, in the example shown in Fig. 18, a significant lack of smoothness can be detected between the sample values on each side of the block edge. When the quantization step size is large, this lack of smoothness is considered an undesirable artifact and it is filtered. When the quantization step size is small, the lack of smoothness is considered to be the result of actual details in the scene being depicted by the video and it is not altered.

FIGURE 18. Example of an edge profile. Notations a0, a1, a2, a3 and b0, b1, b2, b3 stand for sample values on each side of the edge.

3.2.1-5 Transform, Quantization, and Scanning.

As in earlier standards, H.264/AVC uses block transform coding to efficiently represent the prediction residual signal. However, unlike previous standards, the H.264/AVC transform is primarily based on blocks of 4 × 4 samples, instead of 8 × 8 blocks. Furthermore, instead of approximating the floating-point DCT (for which each implementation had to choose its own approximation of the ideal IDCT equations), the transform employed has properties similar to those of the 4 × 4 DCT, but it is completely specified in integer operations. This eliminates the problem of mismatch between the inverse transforms performed in the encoder and decoder, enabling an exact specification of the decoded sample values and significantly reducing the complexity of the IDCT computations. In particular, the IDCT can be computed easily using only 16-bit fixed-point computations (including intermediate values), which would have been very difficult for the IDCT defined in all previous standards. Additionally, the basis functions of the transform are extended by using additional second-stage transforms that are applied to the 2 × 2 chroma DC coefficients of a macroblock, and to the 4 × 4 luma DC coefficients of a macroblock predicted in the 16 × 16 intra mode.
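For reference, a sketch of the forward 4 × 4 core transform is given below; the normalization needed to make it orthonormal is folded into the quantization/scaling stage, which is omitted here.

```python
import numpy as np

# Forward core transform matrix of the H.264/AVC 4x4 integer transform.
# The scaling that makes it orthonormal is absorbed into quantization,
# so only integer arithmetic is needed here.
C_f = np.array([[1,  1,  1,  1],
                [2,  1, -1, -2],
                [1, -1, -1,  1],
                [1, -2,  2, -1]])

def forward_4x4(residual_block):
    """Apply the 4x4 core transform to a block of prediction residuals."""
    x = np.asarray(residual_block, dtype=np.int64)
    return C_f @ x @ C_f.T          # exact in integer arithmetic

coeffs = forward_4x4(np.random.randint(-64, 64, (4, 4)))
```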

Scalar quantization is applied to the transform coefficients [27], but without as wide a dead-zone as used in all other video standards. The 52 quantization parameter values are designed so that the quantization step size increases by approximately 12.2% for each increment of one in the quantization parameter (such that the step size exactly doubles if the quantization parameter is increased by 6). As in earlier standards, each block of quantized coefficients is zig-zag-scanned for ordering the quantized coefficient representations in the entropy coding process. The decoder performs an approximate inversion of the quantization process by multiplying the quantized coefficient values by the step size that was used in their quantization. This inverse quantization process is referred to as scaling in the H.264/AVC standard, since it consists essentially of just a multiplication by a scale factor. In slice types other than SP/SI, the decoder then performs an inverse transform of the scaled coefficients and adds this approximate residual block to the spatial-domain prediction block to form the final picture reconstruction prior to the deblocking filter process.
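The quoted ratios follow directly from a geometric progression in the quantization parameter, as the sketch below illustrates; the absolute step sizes come from a table in the standard and are not reproduced here.

```python
def relative_step_size(qp):
    """Relative quantization step size implied by the rule that the step
    size grows by a factor of 2**(1/6) (about 12.2%) per QP increment and
    therefore doubles every 6 increments. Absolute step sizes come from a
    small table in the standard; only the ratio is illustrated here."""
    return 2.0 ** (qp / 6.0)

ratio = relative_step_size(28) / relative_step_size(22)   # == 2.0
```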

In SP and SI synchronization slices, the decoder performs an additional forward transform on the prediction block and then quantizes the transform-domain sum of the prediction and the scaled residual coefficients. This additional forward transform and quantization operation can be used by the encoder to discard any differences between the details of the prediction values obtained when using somewhat different reference pictures in the prediction process (for example, differences arising from the reference pictures being from an encoding of the same video content at a different bit rate). The reconstructed result in this case is then formed by an inverse transform of the quantized transform-domain sum, rather than the ordinary case where the reconstructed result is the sum of a spatial-domain prediction and a spatial-domain scaled residual.

3.2.1-6 Entropy Coding.

Two methods of entropy coding are specified in H.264/AVC. The simpler variable-length coding method is supported in all profiles. In both cases, many syntax elements except for the quantized transform coefficients are coded using a regular, infinite-extent variable-length codeword set. Here, the same set of Exp-Golomb codewords is used for each syntax element, but the mapping of codewords to decoded values is selected to fit the syntax element.

In the simpler entropy coding method, scans of quantized coefficients are coded using a context-adaptive variable-length coding (CAVLC) scheme. In this method, one of a number of variable-length coding tables is selected for each symbol, depending on the contextual information. This includes statistics from previously coded neighboring blocks, as well as statistics from previously coded data for the current block.

For improved coding efficiency, a more complex context-adaptive binary arithmetic coding (CABAC) method can be used [28]. When CABAC is in use, scans of transform coefficients and other syntax elements of the macroblock level and below (such as reference picture indices and motion vector values) are encoded differently. In this method, each symbol is binarized (i.e., converted to a binary code), and then the value of each bin (bit of the binary code) is arithmetically coded. To adapt the coding to non-stationary symbol statistics, context modeling is used to select one of several probability models for each bin, based on the statistics of previously coded symbols. The use of arithmetic coding allows for a non-integer number of bits to be used to code each symbol, which is highly beneficial in the case of very skewed probability distributions. Probability models are updated following the coding of each bit, and this high degree of adaptivity improves the coding efficiency even more than the ability to use a fractional number of bits for each symbol. Due to its bit-serial nature, the CABAC method entails greater computational complexity than the VLC-based method.
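A toy sketch of the two ingredients, binarization and per-context probability adaptation, is given below; it uses a single context and a simple counting estimator, whereas CABAC assigns different contexts to different bin positions and uses a finite-state, multiplication-free binary arithmetic coding engine.

```python
import math

class AdaptiveBinaryModel:
    """Toy per-context probability model: tracks an estimate of P(bin = 1)
    and adapts it after every coded bin. CABAC itself uses a finite-state
    probability estimator and a multiplication-free arithmetic coder; only
    the idea of context adaptation is shown here."""
    def __init__(self):
        self.ones, self.total = 1, 2      # Laplace-style initial estimate

    def p_one(self):
        return self.ones / self.total

    def update(self, bin_value):
        self.ones += bin_value
        self.total += 1

def unary_binarize(value):
    """Unary binarization: a value N becomes N one-bins followed by a
    terminating zero-bin (one of several binarizations used by CABAC)."""
    return [1] * value + [0]

# Estimated cost in bits of coding the value 3 with one adapting context:
model, bits = AdaptiveBinaryModel(), 0.0
for b in unary_binarize(3):
    p = model.p_one() if b else 1.0 - model.p_one()
    bits -= math.log2(p)
    model.update(b)
```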

3.2.1-7 Frame/Field Adaptive Coding.

H.264/AVC includes tools for efficiently handling the special properties of interlaced video, since the two fields that compose an interlaced frame are captured at different instants in time. In areas of high motion, this can lead to less statistical correlation between adjacent rows within the frame, and greater correlation within individual fields, making field coding (where lines from only a single field compose a macroblock) a more efficient option. In addition to regular frame coding in which lines from both fields are included in each macroblock, H.264/AVC provides two options for special handling of interlaced video: field picture coding and macroblock-adaptive field/frame coding (MB-AFF).

Field coding provides the option of coding pictures containing lines from only a single video field (i.e., a picture composed of only top field lines or only bottom field lines). Depending on the characteristics of the frame, an encoder can choose to code each input picture as two field pictures, or as a complete frame picture. In the case of field coding, an alternate zig-zag coefficient scan pattern is used, and individual reference fields are selected for motion compensated prediction. Thus, frame/field adaptivity can occur at the picture level. Field picture coding is most effective for interlaced input video when there is significant motion throughout the scene, as in the case of camera panning.

On the other hand, in MB-AFF coding, the adaptivity between frame and field coding occurs at a level known as the macroblock pair level. A macroblock pair consists of a region of the frame that has a height of 32 luma samples and width of 16 luma samples and contains two macroblocks. This coding method is most useful when there is significant motion in some parts of an interlaced video frame and little or no motion in other parts. For regions with little or no motion, frame coding is typically used for macroblock pairs. In this mode, each macroblock consists of 16 consecutive luma rows from the frame, thus lines from both fields are mixed in the same macroblock. In moving regions, field macroblock pair coding can be used to separate each macroblock pair into two macroblocks that each include rows from only a single field. Frame and field macroblock pairs are illustrated in Fig. 19. The unshaded rows compose the first (or top) macroblock in each pair, and the shaded rows compose the second (or bottom) macroblock. Figure 19(a) shows a frame macroblock pair, in which samples from both fields are combined in each coded macroblock. Figure 19(b) shows a field macroblock pair, in which all of the samples in each of the two macroblocks are derived from a single field. With MB-AFF coding, the internal macroblock coding remains largely the same as in ordinary frame coding or field coding. However, the spatial relationships that are used for motion vector prediction and other context determination become significantly more complicated in order to handle the low-level switching between field and frame based operation.

FIGURE 19. Illustration of macroblock pairs.

3.2.2 Profiles

Similar to other standards, H.264/AVC defines a set of profiles that each support only a subset of the entire syntax of the standard and are designed to target specific application areas. The baseline profile targets applications not requiring interlace support, and requiring moderate computational complexity, such as videoconferencing. This profile also provides robustness for use on unreliable channels, where some of the video data may be lost or corrupted. The main profile targets applications such as broadcast of standard-definition (SD) and high-definition (HD) video on more reliable channels, and storage on optical media. This profile includes more advanced coding tools, such as CABAC entropy coding and B slices, that can improve coding efficiency over that provided by the baseline profile, at the cost of higher complexity. A third profile, called the extended profile, is intended for use in streaming and in error-prone transmission environments, such as wireless networks. The key features that are supported by each of the profiles of H.264/AVC mentioned above are shown in Table 2.

TABLE 2. H.264/AVC profiles

Supported Functionality            Baseline   Main   Extended
I slices                               X        X        X
P slices                               X        X        X
B slices                                        X        X
SI and SP slices                                          X
In-loop deblocking filter              X        X        X
CAVLC entropy decoding                 X        X        X
CABAC entropy decoding                          X
Weighted prediction                             X        X
Field pictures                                  X        X
MB-AFF                                          X        X
Multiple slice groups                  X                  X
Arbitrary slice order                  X                  X
Redundant pictures                     X                  X
Data partitioning                                         X

An X denotes that the corresponding functionality is supported by the profile.

3.3 MPEG-4 Compression Performance

In this section, we first present experimental results that compare the coding performance of frame-based coding to object-based coding using an MPEG-4 Part 2 codec on some progressive-scan video content. Next, the coding efficiency of H.264/AVC is compared to that of the prior standards.

3.3.1 MPEG-4 Part 2

While MPEG-4 Part 2 also yields significantly improved coding efficiency over the previous coding standards (e.g., 20-30% bit rate savings over MPEG-2 [29]), its main advantage is its object-based representation, which enables many desired functionalities and can yield substantial bit rate savings for some low-complexity video sequences. Here, we present an example that illustrates such a coding efficiency advantage.

The simulations were performed by encoding the color sequence Bream at CIF resolution (352 × 288 luma samples/frame) at 10 frames per second (fps) using a constant quantization parameter of 10. The sequence shows a moderate motion scene with a fish swimming and changing directions. We used the MPEG-4 Part 2 reference codec for encoding. The video sequence was coded (a) in frame-based mode and (b) in object-based mode, both at 10 fps. In the frame-based mode, the codec achieved a 56:1 compression ratio with relatively high reconstruction quality (34.4 dB). If the quantizer step size were larger, it would be possible to achieve up to a 200:1 compression ratio for this sequence, while still keeping the reconstruction quality above 30 dB. In the object-based mode, where the background and foreground (fish) objects are encoded separately, a compression ratio of 80:1 was obtained. Since the background object did not vary with time, the number of bits spent for its representation was very small. Here, it is also possible to employ sprite coding by encoding the background as a static sprite.
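For context, the raw-versus-coded bit rate arithmetic behind these compression ratios can be sketched as follows; the calculation assumes 8-bit 4:2:0 sampling and is purely illustrative (the coded bit rates themselves are not reported above).

```python
# Rough bookkeeping behind the quoted compression ratios, assuming 8-bit
# 4:2:0 CIF video at 10 fps (illustrative only; not taken from the tests).
width, height, fps = 352, 288, 10
raw_bps = width * height * 1.5 * 8 * fps           # ~12.17 Mbit/s uncompressed
frame_based_kbps  = raw_bps / 56 / 1000            # ~217 kbit/s at 56:1
object_based_kbps = raw_bps / 80 / 1000            # ~152 kbit/s at 80:1
```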

The peak signal-to-noise ratio (PSNR) versus rate performance of the frame-based and object-based coders for the video sequence Bream is presented in Fig. 20. As shown in the figure, for this sequence, the PSNR-bit rate tradeoffs of object-based coding are better than those of frame-based coding. This is mainly due to the constant background, which is coded only once in the object-based coding case. However, for scenes with complex and fast varying shapes, since a considerable amount of bits would be spent for shape coding, frame-based coding would achieve better compression levels, but at the cost of a limited object-based access capability.

FIGURE 20. Peak signal-to-noise ratio performance for the Bream video sequence using different profiles of the MPEG-4 Part 2 video coder.

3.3.2 MPEG-4 Part 10: H.264/AVC

Next, the coding performance of H.264/AVC is compared with that of MPEG-2, H.263 and MPEG-4 Part 2. The results were generated using encoders for each standard that were similarly optimized for rate-distortion performance using Lagrangian coder control. The use of the same efficient rate-distortion optimization method, which is described in [29], allows for a fair comparison of encoders that are compliant with the corresponding standards. Here, we provide results comparing the standards in two key application areas: low-latency video communications and entertainment-quality broadband video.

In the video communications test, we compare the rate-distortion performance of an H.263 baseline profile encoder, which is the most widely deployed conformance point for such applications, with an MPEG-4 simple profile encoder and an H.264/AVC baseline profile encoder. The rate-distortion curves generated by encoding the color sequence Foreman at CIF resolution (352 × 288 luma samples/frame) at a rate of 15 fps are shown in Fig. 21. As shown in the figure, the H.264/AVC encoder provides significant improvements in coding efficiency. More specifically, H.264/AVC coding yields bit rate savings of approximately 55% over H.263 baseline profile and approximately 35% over MPEG-4 Part 2 simple profile.

FIGURE 21. Comparison of H.264/AVC baseline profile and the H.263 baseline profile using the sequence Foreman.

A second comparison addresses broadband entertainment-quality applications, where higher resolution content is encoded and larger amounts of latency are tolerated. In this comparison, the H.264/AVC main profile is compared with the widely implemented MPEG-2 main profile. Rate-distortion curves for the interlaced sequence Entertainment (720 × 576) are given in Fig. 22. In this plot, we can see that H.264/AVC can yield similar levels of objective quality at approximately half the bit rate for this sequence. In addition to these objective rate-distortion results, extensive subjective testing conducted for the MPEG verification tests of H.264/AVC has confirmed that similar subjective quality can be achieved with H.264/AVC at approximately half the bit rate of MPEG-2 encoding [30].

FIGURE 22. Comparison of H.264/AVC main profile and MPEG-2 main profile using the sequence Entertainment.

3.4 MPEG-4: Video Applications

Although early industry adoption of MPEG-4 Part 2 was significantly slowed down by initial licensing terms that were considered unattractive for some applications, many companies now commercialize a variety of MPEG-4 Part 2 software and hardware products. By far, the most popular implementations of MPEG-4 Part 2 are compliant with just the simple profile (SP), and most of these implementations enable video playback in mobile and embedded consumer devices (e.g., cellular phones, personal digital assistants [PDAs], digital cameras, digital camcorders, and personal video recorders).

The MPEG-4 Part 2 advanced simple profile (ASP), which permits significantly better video quality than that of SP, has been primarily used in professional video markets, particularly in surveillance cameras and higher quality video coding systems. The availability of two reference software implementations published by the MPEG standards committee and additional sustained efforts to continue providing open source software for MPEG-4 Part 2 may also facilitate the wide adoption of the standard [31].

MPEG-4 Part 10 (H.264/AVC) has rapidly gained support from a wide variety of industry groups worldwide. In the video conferencing industry for example, where video codecs are generally implemented in software, companies such as UB Video have developed H.264/AVC baseline-compliant software codecs that are already being used in many of the popular video conferencing products. Later on, UB Video and other companies such as VideoLocus (now part of LSI Logic) have demonstrated software/hardware products that are compliant with the H.264/AVC main profile. Some of these products are currently being deployed in the broadcast market. Companies such as Apple Computer, Ahead, Polycom, Tandberg, Sony, MainConcept, VideoSoft, KDDI, and others have demonstrated H.264/AVC software products that address a wide market space, from real-time mobile and video conferencing applications to broadcast-quality SD video-coding applications. Also in the broadcast domain, Apple Computer and Modulus Video have demonstrated H.264/AVC real-time software decoding and real-time hardware encoding (respectively) of HD video. Other companies, such as Scientific Atlanta, Tandberg, Envivio, Modulus Video, W&W Communications, Harmonic, and DG2L (to mention a few), have demonstrated hardware H.264/AVC encoders and decoders for various broadcast and streaming applications with sub-SD and SD video. Additionally, the JVT is finalizing a reference software implementation that is publicly available to aid industry deployment and conformance testing efforts.