Audio File Formats FAQ: Compression schemes.

Compression schemes.

Strange though it seems, audio data is remarkably hard to compress effectively. For 8-bit data, a Huffman encoding of the deltas between successive samples is relatively successful. For 16-bit data, companies like Sony, Philips and tons of others have spent millions to develop proprietary schemes.
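
As a rough illustration of the 8-bit case, here is a small Python sketch that delta-encodes samples and then counts how many bits a Huffman code would need. The function names are made up for this example, and it only computes code lengths rather than producing an actual bit stream, so treat it as a sketch of the idea and not as any particular tool's format.

```python
import heapq
from collections import Counter

def delta_encode(samples):
    """Replace each 8-bit sample by its difference from the previous one (mod 256)."""
    prev, out = 0, []
    for s in samples:
        out.append((s - prev) & 0xFF)
        prev = s
    return out

def delta_decode(deltas):
    """Undo delta_encode by accumulating the differences (mod 256)."""
    prev, out = 0, []
    for d in deltas:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return out

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_bits(symbols):
    """Total number of bits needed to code the symbols with their Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

# A slowly rising 8-bit ramp: the raw values are spread over all 256 codes,
# but the deltas are only ever 0 or 1.
samples = [(i // 2) % 256 for i in range(2000)]
raw_bits = huffman_bits(samples)
delta_bits = huffman_bits(delta_encode(samples))
```

Real audio is of course not this well behaved, but the deltas of a smooth waveform still cluster around zero, which is what gives the Huffman code something to work with.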

(Note that silence detection can also be considered a compression scheme.)

ITU-T G.711, u-law and A-law.

u-law (pronounced mu-law -- the u really stands for the Greek letter mu) is an encoding commonly used in North America and Japan for digital telephony. u-law samples are logarithmically encoded in 8 bits, like a tiny floating point number; however, their dynamic range is that of 14-bit linear data. When you encode 16-bit data as u-law and convert it back, you will lose some quality because of the reduced precision.
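
To make the "tiny floating point number" idea concrete, here is a Python sketch of the commonly circulated table-free conversion: one sign bit, a 3-bit exponent and a 4-bit mantissa, with a bias of 0x84 added before encoding and all bits inverted in the stored byte. The function names are my own and are not taken from G.711 or SOX; this is a sketch under those assumptions, not a reference implementation.

```python
BIAS = 0x84    # added before encoding so that small values get a usable exponent
CLIP = 32635   # largest magnitude that can be encoded once the bias is added

def linear_to_ulaw(sample):
    """Encode one signed 16-bit linear sample as an 8-bit u-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the position of the highest set bit; it becomes the exponent.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # u-law bytes are stored with every bit inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(byte):
    """Decode one 8-bit u-law byte back to a signed linear sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude

# Small values survive almost unchanged; large values are quantized coarsely.
round_trip = [(s, ulaw_to_linear(linear_to_ulaw(s))) for s in (0, 100, 1000, 30000)]
```

The low bits that are dropped when the mantissa is extracted are exactly where the quality loss comes from: the louder the sample, the more low-order bits are thrown away.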

There exists another encoding similar to u-law, called A-law, which is used as a European telephony standard.

See the section File Formats for formulas describing u-law and A-law. At 8 kHz this encoding works out to 64 kbits/sec (8 bits per sample times 8000 samples per second).

Source for converting to/from u-law/A-law (written by Jef Poskanzer) is distributed as part of the SOX package mentioned later; it can easily be ripped apart to serve in other applications. The official definition is the ITU-T standard G.711 (formerly CCITT G.711).

CCITT G.721, G.723, and ITU-T G.726.

CCITT defined public standards for compressing voice data in CCITT G.721 (ADPCM at 32 kbits/sec) and G.723 (ADPCM at 24 and 40 kbits/sec). ADPCM stands for Adaptive Differential Pulse Code Modulation and is a common method for compressing audio data. It takes advantage of the fact that you can generally predict the value of the next sound sample based on the previous sound sample. Most ADPCM implementations are a good compromise between fast processing, good compression rates, and good quality decoding.
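
The predict-then-code-the-difference idea can be sketched in a few lines of Python. Note that this is a toy: the predictor is simply the previous reconstructed sample, and the step-size rule (grow by 3/2 on large codes, shrink by 3/4 on small ones) is invented for this example. It is not compatible with G.721, G.723, or any other real ADPCM format.

```python
MIN_STEP, MAX_STEP = 4, 8192   # arbitrary limits chosen for this toy

def _adapt(step, code):
    """Grow the step after large codes and shrink it after small ones."""
    magnitude = abs(code)
    if magnitude >= 5:
        step = step * 3 // 2
    elif magnitude <= 1:
        step = step * 3 // 4
    return max(MIN_STEP, min(MAX_STEP, step))

def _clamp16(value):
    """Keep a reconstructed sample inside the signed 16-bit range."""
    return max(-32768, min(32767, value))

def adpcm_encode(samples):
    """Turn 16-bit samples into 4-bit codes in the range -8..7."""
    predicted, step, codes = 0, 16, []
    for s in samples:
        # Quantize the prediction error to a 4-bit signed code.
        code = max(-8, min(7, (s - predicted) // step))
        codes.append(code)
        # Track exactly what the decoder will reconstruct.
        predicted = _clamp16(predicted + code * step)
        step = _adapt(step, code)
    return codes

def adpcm_decode(codes):
    """Rebuild an approximation of the samples from the 4-bit codes."""
    predicted, step, out = 0, 16, []
    for code in codes:
        predicted = _clamp16(predicted + code * step)
        out.append(predicted)
        step = _adapt(step, code)
    return out
```

The important design point, which real ADPCM coders share, is that the encoder runs the decoder's update rule internally, so both sides stay in step without the step size ever being transmitted.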

Sun Microsystems has placed the source code of a portable implementation of the CCITT ADPCM algorithms (as well as G.711, which defines A-law and u-law) in the public domain (needless to say, their proprietary implementation distributed in binary form with Solaris is better :-). One place to ftp this source code from is ftp://ftp.cwi.nl/pub/audio/ccitt-adpcm.tar.Z.

ITU (which is now the name for CCITT) put out a replacement standard, G.726, that supersedes both G.721 and G.723 and defines the digitization of audio signals at 16, 24, 32 and 40 kbits/sec using ADPCM. These rates are often referred to by the number of bits per sample, which is 2, 3, 4 and 5 bits respectively.
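
The link between the bits per sample and the bit rate is plain arithmetic, assuming the usual 8000 samples/sec telephony rate. A short Python check (the names are just for this example):

```python
SAMPLE_RATE_HZ = 8000  # assumed telephony sampling rate

def adpcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate in bits/sec for a fixed number of bits per sample."""
    return bits_per_sample * sample_rate_hz

# Maps 2, 3, 4 and 5 bits per sample to 16000, 24000, 32000 and 40000 bits/sec.
rates = {bits: adpcm_bit_rate(bits) for bits in (2, 3, 4, 5)}
```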

IMA/DVI ADPCM

IMA/DVI ADPCM is a standard that compresses 16-bit sound data into only 4 bits per sample. It is thought to be faster than Microsoft's ADPCM implementation.

Source for a 32 kbits/sec ADPCM implementation, assumed to be compatible with Intel's DVI audio format, can be ftp'ed from ftp://ftp.cwi.nl/pub/audio/adpcm.shar.

Source to handle IMA/DVI ADPCM formats in .WAV files is included in SOX, mentioned later.

Microsoft ADPCM

Microsoft, as usual, thought it was important to create their own variant of ADPCM for use in their .WAV file format. It also compresses 16-bit sound data into 4-bit data. It should be very similar in quality to IMA's ADPCM.

Source for MS ADPCM used in Microsoft WAVE files can be found in SOX, mentioned later.

LPC-10E

LPC-10E is defined by US DOD Federal Standard 1015. The name stands for Linear Prediction Coder (Enhanced), and it has a 2400 bits/s rate.

Here's a note about LPC and CELP audio codings by Van Jacobson <[email protected]>: Several people used the words "LPC" and "CELP" interchangeably. They are very different. An LPC (Linear Predictive Coding) coder fits speech to a simple, analytic model of the vocal tract, then throws away the speech & ships the parameters of the best-fit model. An LPC decoder uses those parameters to generate synthetic speech that is usually more-or-less similar to the original. The result is intelligible but sounds like a machine is talking.
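
To give a feel for what "fitting a model and shipping the parameters" means, here is a Python sketch of the textbook autocorrelation method with the Levinson-Durbin recursion, which estimates linear prediction coefficients for one frame. This is only the model-fitting core under my own naming; it is not the Federal Standard 1015 algorithm, which also has to handle pitch, voicing, gain and parameter quantization, and it assumes the frame is not all zeros.

```python
def autocorrelation(frame, order):
    """Autocorrelation of the frame for lags 0..order."""
    return [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a[0..order] (a[0] == 1) from autocorrelations.

    The prediction error is e[n] = x[n] + a[1]*x[n-1] + ... + a[order]*x[n-order].
    Returns the coefficients and the remaining prediction error energy.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error                      # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return a, error

# A decaying exponential is predicted exactly by x[n] = 0.9 * x[n-1],
# so the fit should give a[1] close to -0.9 and a[2] close to 0.
frame = [0.9 ** n for n in range(200)]
coeffs, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)
```

A 200-sample frame is reduced here to two numbers; a real LPC coder sends around ten such coefficients per frame plus a few excitation parameters, which is how it gets down to a couple of thousand bits per second.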

CELP

CELP is defined by US DOD Federal Standard 1016. The name stands for Code Excited Linear Prediction, and it has a 4800 bits/s rate. It is important to understand LPC-10E in order to understand CELP.

Van Jacobson <[email protected]> also provided the following information about CELP: A CELP (Code Excited Linear Predictor) coder does the same LPC modeling but then computes the errors between the original speech & the synthetic model and transmits both model parameters and a very compressed representation of the errors (the compressed representation is an index into a 'code book' shared between coders & decoders -- this is why it's called "Code Excited"). A CELP coder does much more work than an LPC coder (usually about an order of magnitude more) but the result is much higher quality speech: The FIPS-1016 CELP we're working on is essentially the same quality as the 32Kb/s ADPCM coder but uses only 4.8Kb/s (the same as the LPC coder).
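
The "index into a code book" step can be sketched as follows in Python. Encoder and decoder build the same codebook (here from a shared random seed, which is purely an assumption of this example), the encoder searches it for the entry that best matches the error signal, and only the index and a gain are transmitted. A real CELP coder runs each candidate through the LPC synthesis filter and a perceptual weighting before comparing; this sketch leaves all of that out.

```python
import random

def make_codebook(size=64, length=8, seed=0):
    """Build a codebook of random vectors; both sides must use the same seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]

def search_codebook(codebook, target):
    """Return (index, gain) of the entry that best matches target in the least-squares sense."""
    best_index, best_gain, best_score = 0, 0.0, -1.0
    for index, vec in enumerate(codebook):
        energy = sum(v * v for v in vec)
        corr = sum(t * v for t, v in zip(target, vec))
        # With the optimal gain corr/energy, the error drops by corr^2/energy.
        score = corr * corr / energy
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain

def reconstruct(codebook, index, gain):
    """Decoder side: rebuild the approximate error signal from index and gain."""
    return [gain * v for v in codebook[index]]

codebook = make_codebook()
target = [2.0 * v for v in codebook[5]]      # an error signal the codebook can match exactly
index, gain = search_codebook(codebook, target)
```

The exhaustive search over every entry, repeated for every short block of samples, is where most of the extra work in a CELP coder goes.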

The Real Audio streaming audio players used CELP for the original Version 2 28.8k audio codec (audio coder/decoder routines), but they have since concentrated on patented, proprietary codecs.

GSM 06.10.

GSM stands for Global System for Mobile Communications. GSM 06.10 is a variant of LPC called RPE-LPC (Regular Pulse Excited - Linear Predictive Coder). It is a European standard, originally for use in encoding speech for satellite distribution to mobile phones. It can be found in use in various telephony products such as voice mail applications.

It compresses 160 13-bit samples into 260 bits (stored as 33 bytes), i.e. 1650 bytes/sec (at 8000 samples/sec). It results in very good compression with good quality output but is computationally expensive.
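
The figures above follow directly from the frame size. A quick Python check of the arithmetic (the variable names are just for this example):

```python
SAMPLE_RATE = 8000      # samples per second
FRAME_SAMPLES = 160     # samples per GSM 06.10 frame
SAMPLE_BITS = 13        # bits per input sample
FRAME_BITS = 260        # bits per compressed frame

frame_bytes = -(-FRAME_BITS // 8)                        # round up to whole bytes: 33
frames_per_sec = SAMPLE_RATE // FRAME_SAMPLES            # 50 frames per second
bytes_per_sec = frames_per_sec * frame_bytes             # 1650 bytes/sec as stored
net_bits_per_sec = frames_per_sec * FRAME_BITS           # 13000 bits/sec before byte padding
ratio = (FRAME_SAMPLES * SAMPLE_BITS) / FRAME_BITS       # 2080 input bits per 260 output bits
```

So each frame covers 20 ms of speech, and the compression ratio against the 13-bit input is 8:1.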

You may read more information about it and a free implementation of it at http://kbs.cs.tu-berlin.de/~jutta/toast.html and grab its source from ftp://tub.cs.tu-berlin.de/pub/tubmik/gsm-1.0.10.tar.gz.

shorten.

Tony Robinson <[email protected]> has written a good, fast lossless compressor for lots of different audio formats (particularly good for WAV and MOD files).

You can obtain the latest version of shorten from http://www.softsound.com/. It has a free license for non-commercial use. Because of its license, though, you don't see support for it in many programs.

Real Audio

Enough people ask which compression schemes Real Audio uses that I've created a section for it. The latest software supports a multitude of different compression schemes by using plug-in codecs (audio coder/decoder routines).

In version 2.0 of the Real Audio player there were two codecs. The first was a 14.4k codec that used a modified version of GSM to compress the data. The second was a 28.8k codec that used CELP, which was described above.

MPEG

MPEG is an audio/video compression standard that has gained wide acceptance across industries. It has become popular to use the audio portion of the standard to store audio files since it provides near CD quality output at relatively low bit rates. It is very computationally intensive, especially during the encoding phase.

There are 3 layers supported, of which the 3rd layer is the most popular:

Layer-1: From 32 kbps to 448 kbps - target bit rate of 192 kbps
Layer-2: From 32 kbps to 384 kbps - target bit rate of 128 kbps
Layer-3: From 32 kbps to 320 kbps - target bit rate of 64 kbps
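
To put those bit rates in terms of storage, here is a small Python calculation of how many bytes one minute of audio takes at each target rate. It assumes that 1 kbps means 1000 bits per second and ignores any file header overhead; the names are just for this example.

```python
def bytes_per_minute(kbps):
    """Bytes needed for 60 seconds of audio at the given bit rate (1 kbps = 1000 bits/sec)."""
    return kbps * 1000 * 60 // 8

# Target bit rates from the list above, in kbps.
targets = {"Layer-1": 192, "Layer-2": 128, "Layer-3": 64}
sizes = {layer: bytes_per_minute(rate) for layer, rate in targets.items()}
```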

Misc.

Apple has an Audio Compression/Expansion scheme called ACE (on the GS) / MACE (on the Macintosh). It's a lossy scheme that attempts to predict where the wave will go on the next sample. There's very little quality change on 8:4 compression, somewhat more for 8:3. It does guarantee exactly 50% or 62.5% compression, though. I believe MACE uses larger ratios/more loss, but I'm unsure of the specific numbers. (Marc Sira)
