A technical overview of the VC-1 (WMV9) codec

phongn · Post by **phongn** » 2007-07-23 12:16am

Over on AVS Forum there's a video encoding person who works for Microsoft who gave some interesting information on Microsoft's VC1 codec (used in BD-ROM and HD-DVD). It does get a bit technical, though (I actually wish there was more detail since I have some education in video coding).

AVS Forum wrote:So at long last, here are some of the key reasons that VC-1 differentiates itself from AVC in preserving more of the natural texture/resolution and grain of the original HD content. Everything I state here is factual and can be confirmed by looking at the specs and history of these technologies. And apologies in advance for any typos, grammar errors. I am still pretty jetlagged in Japan.

As the discussion of codec technology at mathematical level quickly runs into “Star Trek language” (my term for fancy stuff that make no sense to someone not schooled in them), I am going to provide some high level context and historical perspective for these design choices first. Hopefully that makes it easier to digest the meat of the post, which follows.

MPEG-4 AVC’s origins come from work done by the ITU (an international standards-setting organization) to create an alternative to the then-leading compression technologies on the internet: namely, RealNetworks’ video codec and Microsoft’s. At the time, MPEG-4 ASP (or Part 2 as it is called) had failed to gain any traction because it was not competitive on performance with these other codecs. And MPEG-2 lacked the efficiency to be useful at internet rates.

The ITU initiative, called H.264 (coming after an earlier standard called H.263) was led by Dr. Gary Sullivan who happens to work on my team. Gary is world renowned in the compression/standards circuits and was honored for his work on H.264/AVC with IEEE fellowship award. He has the temperament and skills to drive these things like few have.

Unlike previous standardization efforts, the ITU put computational complexity at very low priority, instead focusing on best compression efficiency as a top goal almost regardless of implementation cost. This allowed them to take on algorithms which were very compute intensive, but generated compression efficiency gains nevertheless (the so called “CABAC” is an excellent example where it might take 20% of a chip just to perform this one function).

H.264 gained the interest and contributions of some of the top compression experts worldwide. As such, the codec started to show significant gains over the then-best open standard, MPEG-4 Part 2. Not to be upstaged by the ITU, the MPEG group proposed that H.264 become a joint project between the two standards groups, hence the moniker “JVT” that you sometimes see (“J” stands for Joint). Later the standard was renamed, confusingly so, to MPEG-4 AVC (or Part 10), giving more power to the standard (see below).

The joint initiative got a lot of interest around the world. The aging patent portfolio around MPEG-2 motivated some of the largest companies to contribute to AVC, in the hopes of replenishing their revenue stream one day with patents in that format, when MPEG-2 royalty stream starts to run dry. Alas, that proved difficult as a form of gold rush occurred here, with some 150+ organizations contributing to the final standard (almost 10 times more than did with MPEG-2!). I remember Gary telling me how shocked he was as to the number of attendees in some meetings.

We did our share too, as evidenced by Microsoft being one of the members of the MPEG-4 AVC patent pool (we were not part of MPEG-2 but do hold a similar position in MPEG-4 Part 2). And like the other patent holders, we stand to make a few pennies out of AVC one day. Given our involvement both in driving the standard and contributions to the algorithm and business terms/licensing, I hope people realize that my discussions around AVC are not the typical “bashing of your competitor” as we take some ownership and pride in development of AVC (although obviously not nearly as much as VC-1).

Given that 99% of video on the internet at the time of AVC development was CIF resolution (quarter screen/SD) and lower, strong emphasis was put on doing well there. Test clips were used to evaluate the effective performance of the various proposals, with almost all of them at CIF resolution. This masked any issues that might have existed in higher resolution material with these algorithms.

JVT work finished around 2003, generating a truly advanced codec, rivaling anything we or Real Networks had. At SD resolution and lower, AVC is a formidable competitor to VC-1 and whether one wins or not in any comparison, is a very subjective thing. More so in its favor, AVC has the power of MPEG brand (with government mandate in some countries), and lack of direct association with Microsoft, which in many situations puts it ahead of VC-1 before the game even starts. This combination has resulted in many strong design wins for AVC in a number of applications from broadcast to satellite, IPTV, and of course, HD DVD/BD.

Computationally, AVC was about 3X slower than MPEG-2 to decode (and many times slower in encode). So making this a lower priority goal, did indeed take its toll on difficulty of implementation.

Of course, we were not sitting still during the development of AVC. We were hard at work, designing our next generation video codec, while being fully aware of work being done in ITU/MPEG.

Like AVC, we wanted to create a “standard.” And by that, I mean that we wanted to license it to others in the industry to implement in their hardware which meant that once we did that, we could not change it again without breaking compatibility. This meant that the algorithm had to last and stay competitive for a long time (5+ years). The final implementation, known as Windows Media Video 9 (or WMV-9) pushed way beyond our previous revisions of our video work, producing significant compression efficiency. Using MPEG’s own test clips, we were able to show 3X gain in objective PSNR measurements as compared to MPEG-2 and 2X compared to MPEG-4 Part 2 (at internet data rates).

The WMV-9 codec later became known as VC-1, when we opened its specifications, and submitted it to SMPTE organization for standardization. This was a requirement of it being adopted by DVD Forum.

Unlike AVC, we did not want to put computational efficiency at the bottom of the goal list. We wanted to run in millions of portable devices and there, battery power becomes directly proportional to how many MIPS you use. So even if we didn’t care how much the hardware cost, we still wanted to be more efficient.

During the development of VC-1, a sequence of events led to changes to our design which literally put us on the map when it comes to HD encoding. While our mainstream business was internet video with SD resolution and lower, we started to do some prototype encodings at HD resolution to see how well the codec performed there.

As we expected, the “coding gain” (compression efficiency) was quite significant at HD resolutions just the same. Many of the techniques that improve the “quality per bit” are resolution independent, letting us produce similar quality to MPEG-2 but doing so at much lower data rate. Excited, we showed a sample of our HD VC-1 encode to one of the (technical) studio executives. That led to us participating in DVD Forum HD DVD codec shootout which per my earlier note, resulted in us defeating the other technologies and becoming a mandatory standard in HD DVD and later, BD.

The reason we did so well was the result of another unintended development earlier. Since there were no real applications at the time for HD on the internet, we used our new capabilities simply to showcase what the codec could do. You know, kind of how Honda does Formula 1 racing and uses that as bragging rights to sell you an Accord. But someone thought we really wanted to sell them a race car, and next thing we know there was a company that was packaging a PC and a dark chip DLP projector, and chasing art houses and smaller cinemas to switch to digital using this low cost system! The package worked quite well, solving the major cost issue of higher end systems being developed in Hollywood, and art houses started to deploy it rapidly (they could run digital advertisements on it to make money).

So we started to encode a bunch of independent movies and going through the evaluations with the creative community. Unfortunately, we quickly realized that some of the lessons we had learned on the internet were backfiring on us. In subjective test after test for internet applications, we had seen that people would sacrifice resolution to get a softer but artifact-free video (well, as artifact free as one can make internet video). When it came to digital cinema though, one couldn’t soften the picture when the thing is being blown up on such a big screen. And people didn’t want to be told to pick between softness or artifacts. They wanted to have their cake and eat it too with transparent resolution to the source and no artifacts.

Our early tests showed that indeed, we were sacrificing some grain/textures with our algorithm as designed. Worried that if we optimized the codec differently for HD, we would lose the battle on the internet, we made the algorithm adaptive with respect to resolution/data rate. Namely, if you feed VC-1 content above SD resolution, it will behave differently, even though its core algorithm is the same across the board.

The above changes proved very effective and nicely improved our HD performance. We shortly became one of the standard submission formats for Sundance film festival (the only electronic format allowed). I also got to meet Robert Redford and learned that this job has some nice perks. But I had to pay for it dearly, sitting at midnight (the only time we could get time in the Sundance theaters during the festival) trying to prove to them that we had not lost the “film look.” After all, these guys don’t call themselves “film makers” for nothing!

So they demanded to have a shootout against film projection before giving us the green light to use the VC-1 encodes. And of course, they were always finishing their movies a couple of days before the show, giving us little time to encode them and hence the midnight viewings.

Fortunately, in every case, the producer/director agreed that our VC-1 encode outperformed the equiv. film projection, while sitting in the front row of the theater and watching a 30 foot screen. And we went on to watch some really great, and some really not-so-great independent films on our technology.

During one of the Sundance festivals, we also met with Lionsgate folks who were thinking of releasing T2. Impressed with what we had done with VC-1 at Sundance, they agreed to do a dual disc release creating what later became “WMV-HD” format (VC-1 content but using our audio and file format on red laser DVDs). That started a trend and soon, we had some 50+ titles using the same technology, pushing us nicely to keep improving our codec for HD movie encoding. You really get good at this stuff when you try to do 1080p in 10 mbit/sec maximum rate for both audio/video and video files the same size as regular SD DVDs in MPEG-2! WMV-HD also got me my start at AVS Forum, answering questions about WMV-HD. So if you are unhappy about my postings here, you have Lionsgate to blame.

During this journey, Joe Kane, frustrated with the poor quality of D-VHS MPEG-2 on his HD DVE test patterns, got interested in releasing a WMV-HD disc. So we worked with him and learned a few things to fine tune there. His final VC-1 encode was below 12 mbit/sec (constant bit rate no less), yet outperformed his D-VHS equivalent showing that even on pathological test sequences (i.e. stuff that is hard but never occurs in real life), we had a significant efficiency and quality gain over MPEG-2. So Joe became a believer and continues to advocate VC-1 in his demonstrations to this date (going as far as using HD DVD/VC-1 in the Samsung booth at IFA last year instead of BD/MPEG-2!).

Computationally, we also met our goals in that we achieved great compression efficiency, yet our decoder would only take 2X more MIPS than MPEG-2 (compared to 3X on AVC). Encoding was a lot slower than MPEG-2 at higher quality settings, but way better than AVC. We also use less memory for both encode and decode than AVC.

OK, enough with history. Let’s get into the details.

Both AVC and VC-1, like MPEG-2 before them and many other video codecs, are transform based. The screen is divided into blocks which are then independently compressed (think JPEG) and their motion is tracked on screen (so that we don’t have to retransmit the whole block when some part of the image moves but other parts don’t). We call this “motion estimation.”
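As a rough illustration of the motion tracking described above, here is a minimal sketch of integer-pixel block matching using the sum of absolute differences (SAD). The function names, the full-search strategy, and the tiny test frame are all assumptions made for illustration; real AVC and VC-1 encoders use much more elaborate (and sub-pixel) searches.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in the current frame
    and the same-sized block displaced by (dx, dy) in the reference frame."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def best_motion_vector(cur, ref, bx, by, size, search):
    """Exhaustively try every integer offset within +/- search pixels and
    return (dx, dy, cost) for the offset with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that would read outside the reference frame.
            if by + dy < 0 or by + dy + size > height:
                continue
            if bx + dx < 0 or bx + dx + size > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]


# Hypothetical 8x8 reference frame, and a current frame in which the
# whole picture has moved one pixel to the right.
ref = [[y * 16 + x for x in range(8)] for y in range(8)]
cur = [[row[0]] + row[:-1] for row in ref]

if __name__ == "__main__":
    print(best_motion_vector(cur, ref, 2, 2, 4, 2))
```

A cost of zero means the block can be rebuilt from the reference frame with nothing but the motion vector, which is the saving the paragraph above describes.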

There is a lot more sophistication however in both AVC and VC-1 as compared to MPEG-2’s much simpler algorithm (which was heavily constrained by what hardware could do a decade back). We have such features as adaptive (as opposed to fixed) block size, quarter pixel motion estimation (as opposed to half pixel), and more efficient “entropy coding” (lossless component of all video codecs to squeeze the final set of bits as tightly as possible). And much more.

While both AVC and VC-1 are “advanced codecs,” they differ significantly in how they gain their efficiency over MPEG-2:

1. Transform block size differences. MPEG-2 uses fixed 8x8 blocks. AVC changed this to 4x4 blocks (with the transform designed by Microsoft btw). VC-1 on the other hand, supports 8x8, 4x4, 4x8 and 8x4. This gives VC-1 much more flexibility to pick the optimal block size based on picture content. For example, the outline of a person may be better coded using vertical blocks that suck in less of the background grain (letting the block be optimized for the subject or background grain and not both at the same time). And larger blocks can preserve texture better.
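To make the "vertical blocks along an outline" example a bit more concrete, here is a toy sketch that scores the four partitionings of an 8x8 area by how much dissimilar content each sub-block mixes together. Everything here is an assumption for illustration: the within-block variance is only a stand-in for real coding cost, sizes are read as width x height, the tie-break toward fewer sub-blocks is arbitrary, and none of this is how a VC-1 encoder actually makes its decision.

```python
# Candidate transform sizes, read here as (width, height). This reading
# of "4x8" versus "8x4" is an assumption of the sketch.
PARTITIONS = {"8x8": (8, 8), "8x4": (8, 4), "4x8": (4, 8), "4x4": (4, 4)}


def variance(values):
    """Population variance of a list of pixel values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def mixing_cost(block, width, height):
    """Total squared deviation inside each sub-block: a crude proxy for
    how much unrelated content a transform block has to represent."""
    total = 0.0
    for top in range(0, 8, height):
        for left in range(0, 8, width):
            pixels = [
                block[y][x]
                for y in range(top, top + height)
                for x in range(left, left + width)
            ]
            total += variance(pixels) * len(pixels)
    return total


def choose_partition(block):
    """Pick the lowest-cost partition, preferring fewer (larger)
    sub-blocks when costs tie."""
    def key(name):
        width, height = PARTITIONS[name]
        sub_blocks = (8 // width) * (8 // height)
        return (mixing_cost(block, width, height), sub_blocks)
    return min(PARTITIONS, key=key)


# Hypothetical 8x8 area straddling a vertical outline: bright subject on
# the left four columns, darker background on the right four.
outline = [[200] * 4 + [50] * 4 for _ in range(8)]

if __name__ == "__main__":
    print(choose_partition(outline))
```

In this toy case the two tall, narrow sub-blocks each cover only subject or only background, which is the situation the paragraph above describes.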

After losing the picture quality test to VC-1 in DVD Forum (and in some cases, finishing even behind MPEG-2), reality set in with AVC folks just as it had for us sometime back. Surprisingly, they chose to modify the standard and add adaptive block size of VC-1 to AVC, post standardization. I say surprisingly because this addition broke compatibility with the just finished standard which is a rare situation.

The paper proposal from Sand Video started it all and is an interesting read in this respect: http://ftp3.itu.ch/av-arch/jvt-site/200 ... H029r1.doc:

“The faithful reproduction of fine detail, including film grain, is required in high definition (HD) broadcasts, HD-DVD, and Digital Cinema. To meet this requirement, more high frequency information must survive quantization than is typical in lower bit rate applications. This contribution demonstrates that the frequency selectivity and reduction in boundary effects realized by using adaptive block transforms (ABT) helps meet the high standards of the HD community. We present average RD performance improvements of 9.75% on HD film sequences and significant perceptual gains in areas of fine detail and film grain.” (emphasis mine)

And then this:

“Test Sequences: Five HD-scans of major release Hollywood movies (designated Movies 1-5) (footnote 1).
(Footnote 1) Due to licensing arrangements with the Hollywood Studios, Sand Video cannot disclose the names of the movies nor show the results of the processing to non-DVD Forum members.” (emphasis mine)

Yes, you guessed it right. They are talking about the DVD Forum tests that I mentioned. They used the same test clips to show the improved quality using adaptive/larger block size which VC-1 used.

And who is “Sand Video” you ask? They were a start-up that was developing AVC encoders and decoders. The company was subsequently sold to Broadcom and their work became the foundation of the single chip AVC/MPEG-2/VC-1 decoders powering Blu-ray and HD DVD players. But at the time, they were a proponent of AVC and one of the companies participating in the tests, with an AVC encoder.

Since the AVC standard was already cast in concrete prior to this submission, a “High Profile” extension was created with the addition of 8x8 block size. The standard AVC spec was then called “Baseline” or “Main” profile. Despite being a late comer, HP extension was added to both HD DVD and BD specifications.

Note that per my previous note, the HP profile is incompatible with the Baseline Profile which many AVC systems use (i.e. crashes the decoder). So unless you see HP after MPEG-4 AVC, you don’t get adaptive block size. And even if you do see it, you have half the choices that VC-1 has. Per earlier mention, non-square blocks are quite handy in coding the picture, letting us better avoid clustering non-similar pixels.

You may want to check out the pictures in the paper, showing the effectiveness of 8x8 blocks and its performance for preservation of grain.

2. Loop filter differences. “Loop” here refers to a feedback loop which is part of all modern video encoders. To figure out what to send next, the codec subtracts the decoded frame from the source (finding the “error signal”), and then compresses that difference and ships it to the decoder to keep refining the picture to bring it closer to the source (assuming the scene has not changed). Since we feed the output of the encoder back to the input, it creates a feedback “loop” in the classic engineering terminology.
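The feedback loop described above can be sketched in a few lines. This is a bare-bones illustration under stated assumptions: frames are flat lists of numbers, "compression" is nothing more than uniform quantization of the error signal with a made-up step size, and there is no transform, motion compensation, or loop filter. The function names are hypothetical.

```python
def encode_sequence(frames, step):
    """Closed-loop encoder: predict each frame from the encoder's own
    copy of what the decoder will have, then quantize the error signal."""
    reference = [0] * len(frames[0])   # what the decoder currently shows
    stream = []
    for frame in frames:
        error = [s - r for s, r in zip(frame, reference)]
        levels = [round(e / step) for e in error]   # lossy step
        # Feed the decoded result back in, so that the next error signal
        # is measured against what the decoder really has.
        reference = [r + q * step for r, q in zip(reference, levels)]
        stream.append(levels)
    return stream


def decode_sequence(stream, step):
    """Decoder: accumulate the dequantized error signals."""
    reference = [0] * len(stream[0])
    decoded = []
    for levels in stream:
        reference = [r + q * step for r, q in zip(reference, levels)]
        decoded.append(list(reference))
    return decoded


# Hypothetical two-frame clip of three pixels each.
frames = [[10, 20, 30], [12, 22, 29]]
STEP = 4
decoded = decode_sequence(encode_sequence(frames, STEP), STEP)

if __name__ == "__main__":
    print(decoded)
```

Because the encoder subtracts the decoded frame rather than the previous source frame, quantization error does not pile up from frame to frame: each decoded pixel stays within half a quantization step of the source.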

AVC and VC-1 deviate from MPEG-2 in that they can insert a filter in the feedback loop. By taking into account the distortion created in the stream as a result of compression, and filtering it, one can gain significant compression efficiency. This is the key reason neither AVC nor VC-1 degrades as badly as MPEG-2 into a sea of blocks when starved for bits. Think of it as “soft clipping” for all you audiophiles.

The existence of the loop filter also means that increasing the bit rate may not necessarily improve perceived quality because the codec is able to some extent mask compression artifacts. The corollary of this is that the quality curve of these advanced codecs is more non-linear with a longer asymptote at higher rates, as compared to say, MPEG-2. They reach higher quality sooner, and increases in bitrate beyond some point do not gain you as much visually as it might in MPEG-2 (this is a good thing, not bad).

The strength of the loop filter in both codecs is dynamic and based on “heuristics” (educated guesses) by the encoder. They may also be tunable by the operator.

While both codecs sport loop filters, their filter characteristic varies substantially. The difference here is a direct result of our digital-cinema work where we found the loop filter to be very disruptive to film grain and texture. So we optimized the VC-1 filter by reducing its length to one pixel on each side of the block being encoded. Think of this as an eraser with a very sharp point, gently touching up those three pixels so that you can’t tell there is a line through them at the block boundary. Again, note that even this light touch is applied adaptively, i.e. only when needed and by the right amount.

Instead of just one, the AVC filter softens up to 3 pixels on either side of the block edge. Now think of a 4x4 block. If you go into the block two pixels from each side, you will be filtering the pixels twice (once from left, and then one more time from the right)! In contrast, the VC-1 filter only touches those pixels once since we only go one pixel to the left or right. In the case of an 8x8 block, VC-1 never touches the center pixels whereas AVC filters them once.
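The overlap arithmetic for the 4x4 case is easy to check with a small counting sketch. The model is deliberately simplified and is an assumption of this example: it looks at one row of one block, treats the filter at each vertical block edge as modifying a fixed number of pixels (the "reach") on the inside of that edge, and uses the reach figures quoted in the text (1 for VC-1, up to 3 for AVC). It ignores the adaptive on/off decisions both codecs make.

```python
def touch_counts(block_width, reach):
    """For one row of a block, count how many edge filters modify each
    pixel when every block edge alters `reach` pixels on its inner side."""
    counts = [0] * block_width
    for i in range(min(reach, block_width)):
        counts[i] += 1                      # filter on the left edge
        counts[block_width - 1 - i] += 1    # filter on the right edge
    return counts


if __name__ == "__main__":
    print("4-wide block, reach 3:", touch_counts(4, 3))
    print("4-wide block, reach 1:", touch_counts(4, 1))
    print("8-wide block, reach 1:", touch_counts(8, 1))
```

With a reach of 3 on a 4-pixel-wide block, the two middle pixels are modified from both sides, while a reach of 1 leaves the interior of the block alone.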

So think of AVC loop filter as a giant eraser, three times bigger than that of VC-1, attempting to smooth the block boundaries.

If the scene gets too difficult to code and block boundaries become visible according to internal heuristics, AVC essentially filters every pixel on screen (and sometimes twice in case of 4x4 blocks), whereas VC-1 judiciously filters block boundaries only. Assuming same filter strength used in both codecs, VC-1 picture is bound to look sharper, with less “resolution pumping” (picture getting soft when it gets hard to encode and then sharper when not) than AVC.

The same video in MPEG-2 by the way, would show blocking artifacts as you see in live HD sports on TV. And since the efficiency of MPEG-2’s algorithm is lower in general as compared to AVC/VC-1 (e.g. in the entropy coding section of the codec), and its blocks always the larger 8x8, its blocking artifacts are more severe and easier to see. Some people may prefer this to the softer look of AVC though in some cases.

As you can imagine, if the content is easy to encode, the filtering does not kick in at full strength and the picture can look fine in AVC. But as the scene complexity increases, AVC becomes heavy handed, smearing detail. Give it easier (e.g. clean) material, or exceedingly high data rate, and it will do much better as its loop filter gets reduced in strength (hence my earlier comment that BD’s data rate “erases some of the AVC sins”).

Don’t even think of turning off the filter though, as some claim to get around this design limitation. Doing so will take away the distortion mitigation system and you kind of wind up with MPEG-2 style artifacts (we know, we have tested it that way). Put another way, the AVC filter is more like a big on/off switch, without effective gradations, despite its adaptive nature. You are damned if you do, damned if you don’t with this loop filter for HD content.

Now you know why telling us this movie or that movie looks good in AVC, doesn’t impress us as much. We know that some content will suffer less but in many general cases, you are going to run into this loop filter deficiency which works wonderfully for internet video, but not as well at the quality levels we are talking about. And that softening occurs in some scenes and requires knowing what segment is difficult and looking for its effect there.

3. Interpolation filter size. As I mentioned in the introduction, both AVC and VC-1 use quarter pixel resolution to track motion of blocks. This by definition means that you have to “interpolate” the intermediate values between pixels as you decide to move the same block in the decoder frame buffer within subpixel boundaries.

VC-1, roughly speaking, uses bicubic filtering with 4 taps. AVC on the other hand, uses a 6-tap filter. One would think that more of anything is better here but such is not the case when you are compressing HD video. By using more taps, you are using more of the pixels in the frame to find the interpolated one. This has the unintended effect of smearing adjacent pixels into each other with resulting loss of resolution (think of an object being averaged with black background with grain in it). And more taps in digital filters means more “ringing” (some frequencies emphasized more than others), causing edges to not be crisp. The shorter VC-1 filter is also one of the reasons we are much faster than AVC as we don’t need to touch as many pixels.
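The effect of tap count near an edge can be seen with a one-dimensional sketch. The coefficients below are the ones commonly cited for half-sample luma interpolation, (1, -5, 20, 20, -5, 1)/32 for AVC and a bicubic (-1, 9, 9, -1)/16 for VC-1; treat them as assumptions of this example and check the specifications before relying on them. The sketch also skips the rounding and clipping to the valid pixel range that real decoders apply, and only looks at half-sample positions.

```python
SIX_TAP = (1, -5, 20, 20, -5, 1)    # assumed AVC half-sample filter, /32
FOUR_TAP = (-1, 9, 9, -1)           # assumed VC-1 bicubic half-sample, /16


def interpolate_half_pel(samples, left, taps, divisor):
    """Half-sample value between samples[left] and samples[left + 1],
    computed from a window of len(taps) pixels centred on that gap."""
    start = left - len(taps) // 2 + 1
    window = samples[start:start + len(taps)]
    return sum(t * s for t, s in zip(taps, window)) / divisor


# Hypothetical row with a sharp edge between index 5 and index 6.
edge = [0] * 6 + [100] * 6

if __name__ == "__main__":
    for left in (5, 6, 7):
        four = interpolate_half_pel(edge, left, FOUR_TAP, 16)
        six = interpolate_half_pel(edge, left, SIX_TAP, 32)
        print(f"between {left} and {left + 1}: 4-tap {four}, 6-tap {six}")
```

One and a half pixels past the edge, the 4-tap window no longer sees the dark side and returns the flat value, while the 6-tap window still reaches across the edge and has not settled; the longer filter also overshoots more right next to the edge.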

Are you still with me? If so, you are doing good and could do my job one day! I warned you this stuff gets complicated fast.

Hopefully the above gives you some insight into the differences between these two (very good) codecs and why there are some fundamental differences between them when it comes to HD encoding. While both algorithms can run circles around older technologies such as MPEG-2 (and contrary to popular myth, can be defeatured to behave just like MPEG-2), at the end of the day, one is a better fit for the application in mind.

VC-1 was optimized (and redesigned) before becoming final with real world HD applications and content. AVC was modified quickly, post standardization to perform better but not to the same level as VC-1. Smart people worked on both, but were given different design criteria so they came up with different solutions.

Is there a down side to the differences mentioned? Yes, a bit. VC-1 can become blocky sometimes because it attempts to both preserve detail while at the same time not resorting to excessive filtering of the source as AVC does. This impacts us the most at internet rates as compared to AVC (even there, we opt for sharper pictures). But we think for the formats in question, that can be dialed out by hand optimization and better automated analysis (the main focus of our R&D for the last year). And once there, you have a sharper picture for it. Not everyone agrees of course, especially at below SD resolution. But enough people do thankfully to have gotten us 200+ VC-1 titles in HD DVD/BD and great praise for many of the encodings…