Having been involved in some way with video capture and editing since the late ’90s, I was always taught to avoid lossy formats for capture and editing because of the problem of generational loss. Put simply, lossy formats “approximate” the input, and by editing and re-encoding again for the output, you are taking the “approximate of an approximate” and losing significant quality in the process. Taken to the extreme, any visual defects can be amplified through multiple encodes, resulting in noticeable patchiness/blockiness and loss of detail.
As we get to higher definition formats, working in lossless codecs can become somewhat difficult owing to the processing and storage necessary. The temptation is always there to go back to something lossy, and maybe get away with it. For better or for worse, many consumer devices capture solely in lossy formats, so a re-encode is unavoidable for certain types of edits. With naive consumers, multiple re-encodes may end up happening. Internet distribution (e.g. YouTube) often entails the service’s servers performing yet another re-encode at a very restricted quality.
I was interested in seeing what the generational loss behaviour is like for x264 encoding at CRF 20. H.264 remains one of the most widely used codecs today, x264 is a widely used open-source encoder for it, and CRF 20 is considered “acceptable” quality. What happens to the bit-rate, PSNR, SSIM and perceived quality over generations?
The source material chosen corresponds to the “average case” clip used in prior video codec tests. Specifically, the clip is Gfriend – Navillera – H.264 Main@L4.1 (34943kbit/s) 3m32s firstname.lastname@example.org 8-bit 4:2:0. The source material is already H.264 compressed, and a few people may take issue with this. However, it has a good level of quality and represents a common use case – namely transcoding Blu-ray rips, or re-encoding video after editing from compressed capture sources (e.g. mobile phones, action cameras, etc.) – and thus is not entirely unrepresentative of consumer uses.
The encoding and analysis were performed using shell scripting and ffmpeg 2.8.4 built by Zeranoe. This version uses x264 core 148 r2638 7599210 as the encoder. This is considered somewhat dated by now (by a year), but the analysis was started quite a while ago with the build I was already using (due to some bugs in more recent builds), so I stayed with it throughout.
Encodes were done with preset slower, at a CRF of 20. No profile was specified, and the output files report High@L5. SSIM and PSNR measurements were made using the same ffmpeg version. Due to time and practicality limitations, testing was terminated at 500 encode generations.
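As a rough illustration of the procedure described above, the following Python sketch builds the ffmpeg command lines for a chain of re-encodes and for the PSNR/SSIM measurements. It is not the script used for these tests (that was plain shell scripting). The file names, the `.mkv` container, the `-an` flag (dropping audio) and the choice to measure each generation against the original source are all assumptions made for the example.

```python
import subprocess
from typing import List


def build_encode_cmd(src: str, dst: str, crf: int = 20, preset: str = "slower") -> List[str]:
    """Return an ffmpeg command line for one x264 re-encode at a constant rate factor."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", src,
        "-c:v", "libx264",
        "-preset", preset,
        "-crf", str(crf),
        "-an",                   # assumption: drop audio so file size reflects video only
        dst,
    ]


def build_metric_cmd(distorted: str, reference: str, metric: str) -> List[str]:
    """Return an ffmpeg command line that runs the psnr or ssim filter on two inputs."""
    if metric not in ("psnr", "ssim"):
        raise ValueError("metric must be 'psnr' or 'ssim'")
    # The result is printed in ffmpeg's log output; no output file is written.
    return ["ffmpeg", "-i", distorted, "-i", reference,
            "-lavfi", metric, "-f", "null", "-"]


def generation_chain(source: str, generations: int) -> List[List[str]]:
    """Each generation is encoded from the previous generation's output."""
    cmds = []
    prev = source
    for gen in range(1, generations + 1):
        dst = f"gen{gen:03d}.mkv"   # hypothetical naming scheme
        cmds.append(build_encode_cmd(prev, dst))
        prev = dst
    return cmds


def run_chain(source: str, generations: int) -> None:
    """Execute the chain; requires ffmpeg on the PATH."""
    for cmd in generation_chain(source, generations):
        subprocess.run(cmd, check=True)
```

Calling `run_chain("source.mkv", 500)` would reproduce the overall shape of the test, with `source.mkv` standing in for the actual clip. Note that each step feeds the previous output back in as input, which is what makes the losses compound.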
Bitrate vs Generations
Using a linear scale, the bitrate falls dramatically within the first 50 generations, and tails off asymptotically towards about half the bitrate of the first generation encode.
Using a log scale, we can see the detail more clearly in the first few generations. It seems that the second generation has already lost close to 0.8Mbit/s over the first. You can’t create detail out of “nothing”, so this loss of bitrate at a constant rate factor indicates that there has likely been a loss of detail and picture quality.
While the bitrate levels off towards an asymptote in later generations, this does not necessarily imply that quality loss has “halted” in any way. Instead, it could be the case that encoding flaws are generating new detail which keeps the encoder busy and consumes bitrate.
PSNR vs Generations
PSNR is known to be more sensitive to small changes in image quality at the high end, and is often used for fine tuning.
In this case, the PSNR graph seems to corroborate the inferences made from the bitrate graph. The first few generations really do hit the quality the most, and then the PSNR decline tails off (as most of the loss has already occurred). PSNR is less sensitive to bigger changes towards the low-quality end, however.
On a log scale, it seems that the second generation has lost about 1.5dB from the first generation, which is considered noticeable. By the third generation, the total loss is approaching 2dB, and by the fifth generation, it has reached about 3dB, which is considered obvious. This seems to imply that CRF 20, or maybe CRF, or maybe more broadly H.264 is not a good candidate for preserving quality through multiple generations without choosing much higher quality settings or using special lossless modes.
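To put those decibel figures in perspective, here is a small sketch of the standard PSNR definition, PSNR = 10·log10(peak² / MSE), and what a given drop means in terms of mean squared error. The peak value of 255 assumes 8-bit samples, and treating a clip-average PSNR as if it came from a single MSE figure is a simplification.

```python
import math


def psnr_db(mse: float, peak: float = 255.0) -> float:
    """PSNR in dB for a given mean squared error (peak=255 assumes 8-bit video)."""
    return 10.0 * math.log10(peak * peak / mse)


def mse_ratio_for_psnr_drop(drop_db: float) -> float:
    """Factor by which the MSE grows when PSNR falls by drop_db decibels."""
    return 10.0 ** (drop_db / 10.0)


for drop in (1.5, 2.0, 3.0):
    print(f"{drop:.1f} dB drop -> MSE x{mse_ratio_for_psnr_drop(drop):.2f}")
# 1.5 dB drop -> MSE x1.41
# 2.0 dB drop -> MSE x1.58
# 3.0 dB drop -> MSE x2.00
```

In other words, under these assumptions the roughly 3dB loss by the fifth generation corresponds to about twice the squared error of the first-generation encode.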
SSIM vs Generations
SSIM is considered less sensitive towards the high-end of image quality, but more sensitive towards the low-end.
Because of this characteristic, the normalized SSIM figures show less of a “tail off” with higher generations, and instead, more of a sustained decline. This indicates that even though the bitrate may appear stable at the higher generations, the quality is actually reducing in a measurable way.
When plotted on a log scale, the SSIM graph looks unlike the PSNR graphs: because SSIM is less sensitive at the high-quality end, it shows quality falling more rapidly with each tenfold increase in the number of encode generations.
SSIM can be expressed on a decibel scale, which then mirrors the PSNR results more closely. Losses are quite heavy on initial generations.
It takes about three generations to lose 1dB of SSIM, and around ten generations to lose 3dB of SSIM.
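For reference, the decibel form of SSIM is commonly computed as −10·log10(1 − SSIM), which is also the figure ffmpeg’s ssim filter prints in brackets next to the raw value. The sketch below assumes that convention; the SSIM value of 0.99 is purely illustrative and is not taken from these tests.

```python
import math


def ssim_to_db(ssim: float) -> float:
    """Convert a raw SSIM value (0 <= ssim < 1) to decibels."""
    if not 0.0 <= ssim < 1.0:
        raise ValueError("ssim must be in [0, 1)")
    return -10.0 * math.log10(1.0 - ssim)


def db_to_ssim(db: float) -> float:
    """Inverse of ssim_to_db."""
    return 1.0 - 10.0 ** (-db / 10.0)


start_db = ssim_to_db(0.99)             # illustrative starting point, about 20 dB
after_loss = db_to_ssim(start_db - 3)   # raw SSIM after losing 3 dB
print(f"{start_db:.2f} dB -> SSIM {after_loss:.4f} after a 3 dB loss")
```

On this scale, a 3dB loss means the “distance from perfect” (1 − SSIM) has roughly doubled, which is why the dB form makes the early-generation losses easier to see than the raw values do.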
For the visual comparison, frame 54 from the source was used. It is a frame from an introductory scene, not close to any scene cuts, where the whole Gfriend group is visible. Sharp details are seen in the hair and facial features of the members, making it a good comparison frame. As it is known from the previous analysis that the degradation occurs mainly in the initial generations, only the results from generation 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400 and 500 are shown below. Due to space constraints, only Yerin’s smile (left) and SinB’s eyes (right) are shown.
My notes on the above are as follows:
- Generation 1 looks fairly similar to the source, however, when viewing at enlarged scale, some blocking imperfections are already beginning to develop around Yerin’s chin, and slight loss of detail is showing in SinB’s hair. Blocking is also visible in Yerin’s hair to the left.
- Generation 2 starts to affect skin texture, especially on Yerin.
- Generation 3 shows blocking setting in more objectionably in Yerin’s neck area. Yerin’s hair detail is almost all lost. I’d say this is where the visual differences are already quite obvious to a careful viewer.
- Generation 4 has Yerin’s gum detail starting to go missing, and SinB’s hair detail and eyelashes are starting to become very soft. Yerin’s neck now looks like a bit of a watercolour painting as obvious posterization effects are coming in.
- Generation 10 is where detail has become an almost homogeneous mess – the hair lacks detail, as do the teeth, and new blocky artifacts seem to be creeping in at a more rapid rate.
- Generation 100 shows some interesting colour shifts starting to get amplified. While the video does still depict two people, and it’s still recognizable, the quality is not there at all and it is difficult to recognize who is in the images.
- By Generation 500, it is nearly impossible to recognize who is depicted in the images.
Interestingly, when watching the encodes themselves, the frame-to-frame quality variation is greater in the higher generations. This is probably due to encoder mode decision differences. For example, frames after an abrupt scene cut tend to maintain their quality much better throughout the generations, as these are likely to trigger a keyframe (I) in the encoder, which is an explicit frame that has limited quantization error. The frames just before the keyframe are often the worst quality frames, as they are the product of multiple predicted and bidirectionally predicted frames (P, B) which accumulate errors along the way. Those frames which are more often B-frames in every generation therefore degrade more quickly and noticeably than the I-frames do. As a result, the quality does “strobe” a bit: after a keyframe, everything sharpens up, only to quickly become mushy again.
For demonstration only, and not in any way fully accurate as it is being delivered through two generations of recompression, the following video shows the result of 500 generations of encoding at a rate of 5 generations per second.
Not unexpectedly, lossy-to-lossy encoding at even an acceptable level of CRF 20 produces noticeable losses after just a single generation. That being said, all of the indicators suggest that the majority of the quality loss occurs in the first few generations of CRF encoding, which really reinforces the message of reducing the number of lossy encode generations as much as possible within editing workflows, or using artificially high levels of quality for lossy compression if absolutely necessary.
The bitrate also fell off most dramatically in the first few generations, with all forms of degradation slowing down as the number of generations increased. Initially, detail began to be lost, blockiness was introduced and finally, false-colour regions began to bloom in the output after an extreme number of generations (several hundred). This seems to echo the conventional wisdom of “approximating an approximate”, which slowly amplifies errors in a way akin to how the “Chinese whispers” game results in misinformation and imprecision being amplified to chaotic extremes.
If you can avoid re-encoding, I suggest it’s probably best to do so, or at least to keep your encoding generations limited to one. Always re-encode from the best source you have rather than from a lossy version of the original. Similarly, edit from the source rather than from an edited output, even if it is just a small edit, as the quality will be much better.