Audio Encoding
Introduction to Audio Encoding
An audio encoding refers to the manner in which audio data is stored and transmitted. The documentation below describes how such encodings work.
Audio formats vs encodings
Note that an audio format is not equivalent to an audio encoding. A popular file format like .WAV, for example, defines the format of the header of an audio file, but is not itself an audio encoding. .WAV audio files often, but not always, use a linear PCM encoding; don't assume a .WAV file has any particular encoding until you inspect its header.
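Inspecting the header does not require special tooling. The Python sketch below uses only the standard library to read the `fmt ` chunk of a RIFF/WAVE file and report its format tag, channel count, sample rate and bits per sample; a format tag of 1 with 16 bits per sample indicates 16-bit linear PCM. The names `read_wav_format` and `is_linear16` are hypothetical helpers, not part of any SoapBox Labs API, and the sketch assumes a plain PCM header: files that use the `WAVE_FORMAT_EXTENSIBLE` tag (0xFFFE) store the actual encoding in a sub-format field that it does not decode.

```python
import struct

WAVE_FORMAT_PCM = 1  # 'fmt ' format tag for integer linear PCM


def read_wav_format(data: bytes) -> dict:
    """Return the encoding fields stored in the 'fmt ' chunk of a RIFF/WAVE file."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12  # sub-chunks start after 'RIFF', the 4-byte size and 'WAVE'
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (chunk_size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        if chunk_id == b"fmt ":
            if chunk_size < 16 or pos + 24 > len(data):
                raise ValueError("truncated 'fmt ' chunk")
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                "<HHIIHH", data[pos + 8:pos + 24]
            )
            return {"format_tag": tag, "channels": channels,
                    "sample_rate": rate, "bits_per_sample": bits}
        # Skip any other chunk; odd-sized chunks are followed by one pad byte.
        pos += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no 'fmt ' chunk found")


def is_linear16(info: dict) -> bool:
    """True if the header describes 16-bit integer linear PCM (LINEAR16)."""
    return info["format_tag"] == WAVE_FORMAT_PCM and info["bits_per_sample"] == 16
```

For a hypothetical file `speech.wav`, you would pass in the bytes read from `open("speech.wav", "rb")` and check `is_linear16` on the returned dictionary before relying on the file's encoding.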
Supported audio encodings
The SoapBox Labs Speech Solutions support the following audio encoding:
| Codec | Name | Lossless | Usage Notes |
|---|---|---|---|
| LINEAR16 | Linear PCM | Yes | 16-bit linear pulse-code modulation (PCM) encoding |
Note: The SoapBox Speech solutions support WAV files with LINEAR16-encoded audio.
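If your audio is already available as 16-bit integer samples, Python's standard `wave` module can wrap it in a WAV header of the kind this note describes. This is a minimal sketch that assumes mono audio and a default sample rate of 16,000 Hz (a rate commonly used for speech); `linear16_wav_bytes` is a hypothetical helper name.

```python
import io
import struct
import wave


def linear16_wav_bytes(samples, sample_rate=16000):
    """Wrap mono 16-bit integer samples in a WAV (RIFF) container."""
    for s in samples:
        if not -32768 <= s <= 32767:
            raise ValueError(f"sample {s} does not fit in 16 bits")
    # '<h' packs each sample as a little-endian signed 16-bit integer.
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)  # samples per second
        w.writeframes(pcm)
    return buf.getvalue()
```

To save the result, write the returned bytes to a file opened in binary mode, for example with `open("out.wav", "wb")`.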
Why encode?
Audio is made up of waveforms, consisting of the superposition of waves of different frequencies and amplitudes. To represent these waveforms within digital media, the waveforms need to be sampled at rates that can (at least) represent sounds of the highest frequency you wish to replicate, and the samples need enough bit depth to represent the proper amplitude (loudness and softness) of the waveforms across the sound sample.
A sound processing device's ability to recreate frequencies is known as its frequency response, and its ability to create proper loudness and softness is known as its dynamic range. Together these terms are often referred to as a sound device's fidelity. An encoding, in its simplest form, is a means of reconstructing sound using these two basic principles, while also storing and transporting such data efficiently.
Sampling rates
Sound exists as an analog waveform. A segment of digital audio approximates this analog wave by sampling its amplitude at a rate fast enough to mimic the wave's intrinsic frequencies. A digital audio segment's sample rate specifies the number of samples taken from the source material per second; a higher sample rate increases the ability of digital audio to faithfully represent high frequencies.
As a consequence of the Nyquist-Shannon sampling theorem, you generally need to sample at a rate of at least twice the highest frequency of any sound wave you wish to capture digitally. To represent audio within the range of human hearing (20 Hz to 20,000 Hz), for example, a digital audio format must sample at least 40,000 times per second (which is part of the reason why CD audio uses a sample rate of 44,100 Hz).
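The arithmetic can be checked with a short Python sketch. `min_sample_rate` applies the factor of two, and `alias_frequency` shows what happens below that rate: a tone above half the sample rate is not simply lost, it "folds" back and appears at a lower, incorrect frequency (aliasing). Both function names are illustrative, and the sketch models a single pure tone rather than real audio.

```python
def min_sample_rate(max_frequency_hz):
    """Nyquist rate: sample at (strictly, above) twice the highest frequency."""
    return 2 * max_frequency_hz


def alias_frequency(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling at sample_rate_hz."""
    # Sampling cannot distinguish f from k * sample_rate +/- f, so every tone
    # folds into the band 0 .. sample_rate_hz / 2.
    return abs(tone_hz - sample_rate_hz * round(tone_hz / sample_rate_hz))


print(min_sample_rate(20000))         # 40000
print(alias_frequency(3000, 16000))   # 3000: below 8000 Hz, captured correctly
print(alias_frequency(10000, 16000))  # 6000: above 8000 Hz, aliased
```

With CD audio's 44,100 Hz rate, the folding point is 22,050 Hz, just above the 20,000 Hz limit of human hearing.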
Bit depths
Bit depth affects the dynamic range of a given audio sample. A higher bit depth allows you to represent more precise amplitudes. If you have lots of loud and soft sounds within the same audio sample, you will need more bit depth to represent those sounds correctly.
Higher bit depths also increase the signal-to-noise ratio within audio samples, because finer amplitude steps introduce less quantization noise. CD audio uses a bit depth of 16 bits, DVD-Audio uses up to 24 bits, and most telephony equipment uses 8 bits. (Certain compression techniques can compensate for smaller bit depths, but they tend to be lossy.)
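Each extra bit doubles the number of amplitude levels, so dynamic range grows by about 6 dB per bit. The sketch below computes the ratio between the largest and smallest representable amplitude step, 20 * log10(2^N) dB, and the textbook signal-to-quantization-noise ratio for a full-scale sine wave, 6.02 * N + 1.76 dB. These are idealised figures for uniform (linear) quantization; real equipment, and companded telephony formats, will differ. The function names are illustrative.

```python
import math


def dynamic_range_db(bit_depth):
    """Ratio of the largest to the smallest amplitude step, in decibels."""
    return 20 * math.log10(2 ** bit_depth)


def sine_sqnr_db(bit_depth):
    """Signal-to-quantization-noise ratio of a full-scale sine, uniform quantization."""
    return 6.02 * bit_depth + 1.76


for bits in (8, 16, 24):  # the telephony, CD and DVD-Audio bit depths above
    print(bits, round(dynamic_range_db(bits), 1), round(sine_sqnr_db(bits), 1))
```

This gives roughly 48 dB of dynamic range for 8 bits, 96 dB for 16 bits and 144 dB for 24 bits.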
Uncompressed audio
Most digital audio processing uses these two techniques, sampling rate and bit depth, to store audio data in a straightforward manner. One of the most popular digital audio techniques is known as Pulse Code Modulation (PCM). Audio is sampled at set intervals, and the amplitude of the sampled wave at each point is stored as a digital value with the chosen bit depth.
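The two steps of PCM, sampling at fixed intervals and storing each amplitude as an integer of fixed bit depth, can be sketched in Python as follows. The function name `pcm_encode` and the convention that the input signal is a function of time returning values between -1.0 and 1.0 are assumptions for illustration; the quantizer is symmetric, so for 16 bits it produces codes from -32767 to 32767.

```python
import math


def pcm_encode(signal, sample_rate, duration_s, bit_depth):
    """Sample signal(t) at fixed intervals and quantize each value to bit_depth bits.

    `signal` is assumed to return amplitudes in the range [-1.0, 1.0].
    """
    max_code = 2 ** (bit_depth - 1) - 1            # 32767 for 16 bits, 127 for 8 bits
    codes = []
    for n in range(int(sample_rate * duration_s)):
        t = n / sample_rate                         # time of the n-th sample (s)
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip anything beyond full scale
        codes.append(round(amplitude * max_code))   # evenly spaced (linear) levels
    return codes


def tone(t):
    """A 440 Hz sine wave at half of full scale."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


samples = pcm_encode(tone, 16000, 0.01, 16)  # 10 ms sampled at 16,000 Hz
print(len(samples), samples[:4])
```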
Linear PCM (where "linear" indicates that the amplitude levels are spaced uniformly across the full range) is the standard used within CDs, and within the LINEAR16 encoding of our solutions. Both produce an uncompressed stream of bytes corresponding directly to audio data, and both use a bit depth of 16 bits. CDs use Linear PCM with a sample rate of 44,100 Hz, which is appropriate for reproducing music; a sample rate of 16,000 Hz, which captures frequencies up to 8,000 Hz, is more appropriate for speech.
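Because Linear PCM stores every sample uncompressed, its data rate follows directly from the sample rate, bit depth and channel count. The helper below is a hypothetical illustration of that arithmetic and assumes the bit depth is a whole number of bytes. For stereo CD audio it gives 44,100 × 2 bytes × 2 channels = 176,400 bytes per second (1,411.2 kbit/s), while mono 16-bit speech at 16,000 Hz needs 32,000 bytes per second.

```python
def pcm_bytes_per_second(sample_rate, bit_depth, channels=1):
    """Data rate of uncompressed PCM: samples/s x bytes per sample x channels."""
    if bit_depth % 8:
        raise ValueError("bit depth must be a whole number of bytes")
    return sample_rate * (bit_depth // 8) * channels


cd_rate = pcm_bytes_per_second(44100, 16, channels=2)  # stereo CD audio
speech_rate = pcm_bytes_per_second(16000, 16)          # mono LINEAR16 speech
print(cd_rate, speech_rate, speech_rate * 60)          # 176400 32000 1920000
```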
Linear PCM (LINEAR16) is an example of uncompressed audio in that the digital data is stored exactly as the above standards imply. Reading a one-channel stream of bytes encoded using Linear PCM, for example, you could count off every 16 bits (2 bytes) to get the next amplitude value of the waveform. Almost all devices can manipulate such digital data natively (you can even crop Linear PCM audio files using a text editor), but uncompressed audio is not the most efficient way to transport or store digital audio. For that reason, most audio uses digital compression techniques.
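Counting off 2-byte values is straightforward in code. The Python sketch below decodes a mono LINEAR16 byte stream into integer amplitudes and crops it to a time range by slicing on 2-byte sample boundaries. It assumes raw sample data with no header and little-endian byte order (the order used inside WAV files); for a WAV file, the header would have to be skipped first. The function names are illustrative.

```python
import struct


def decode_linear16(raw: bytes) -> list:
    """Split a mono LINEAR16 byte stream into signed 16-bit amplitude values."""
    if len(raw) % 2:
        raise ValueError("LINEAR16 data must contain whole 2-byte samples")
    # '<h' = little-endian signed 16-bit integer
    return [value for (value,) in struct.iter_unpack("<h", raw)]


def crop_linear16(raw: bytes, sample_rate: int, start_s: float, end_s: float) -> bytes:
    """Keep the samples in [start_s, end_s), cutting only at sample boundaries."""
    start = int(start_s * sample_rate) * 2  # byte offset = sample index x 2 bytes
    end = int(end_s * sample_rate) * 2
    return raw[start:end]
```

For example, `crop_linear16(raw, 16000, 1.0, 2.5)` would keep the 1.5 seconds of audio that start one second into the stream.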
Compressed audio
Audio data, like all data, is often compressed to make it easier to store and to transport. Compression within audio encoding may be either lossless or lossy. Lossless compression can be unpacked to restore the digital data to its original form. Lossy compression necessarily discards some of this information during compression, and is typically parameterised to control how much data the technique is allowed to remove.
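To make the distinction concrete, here is a deliberately simple lossy scheme in Python: it halves the storage needed for 16-bit samples by keeping only their top 8 bits, with `keep_bits` acting as the tolerance parameter. This is a toy assumption for illustration only; real lossy codecs such as MP3 use far more sophisticated, perceptually driven techniques. Note that different inputs can produce the same output, which is why the discarded detail cannot be recovered.

```python
def lossy_encode(samples16, keep_bits=8):
    """Keep only the top `keep_bits` bits of each signed 16-bit sample."""
    shift = 16 - keep_bits
    return [s >> shift for s in samples16]  # arithmetic shift drops the low bits


def lossy_decode(codes, keep_bits=8):
    """Scale the codes back to 16-bit range; the dropped bits return as zeros."""
    shift = 16 - keep_bits
    return [c << shift for c in codes]


original = [1000, 1001, -1000, 12345]
restored = lossy_decode(lossy_encode(original))
print(restored)  # [768, 768, -1024, 12288]
```

With `keep_bits=16` nothing is discarded and the round trip is exact, so the parameter directly controls how much information is lost.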
Lossless compression
Lossless compression reduces the size of digital audio data by exploiting redundancy in it (for example, FLAC predicts each sample from the preceding ones and stores only the prediction errors), but results in no degradation in quality of the original digital sample. With lossless compression, no information is lost when the data is unpacked into its original digital form.
So why do lossless compression techniques sometimes have optimisation parameters? These parameters usually trade file size for processing time. For example, FLAC uses a compression level parameter from 0 (fastest) to 8 (smallest file size). Higher-level FLAC compression won't lose any information in comparison to lower-level compression. Instead, the compression algorithm will just need to expend more computational effort, mainly when encoding the audio.
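Python's standard library has no FLAC encoder, so the sketch below uses the general-purpose `zlib` (DEFLATE) compressor as a stand-in to show the same principle: its `level` argument (0 to 9) changes the compressed size and the work done, but every level decompresses to exactly the original bytes. The test signal is an artificial, highly repetitive tone; compression ratios for real recordings will be far lower, and FLAC, being designed for audio, typically compresses it better than `zlib`. The function name `sizes_by_level` is illustrative.

```python
import math
import struct
import zlib


def sizes_by_level(pcm, levels=(0, 1, 6, 9)):
    """Compressed size of `pcm` at each zlib level, verifying each is lossless."""
    sizes = {}
    for level in levels:
        packed = zlib.compress(pcm, level)
        if zlib.decompress(packed) != pcm:  # DEFLATE is lossless at every level
            raise RuntimeError(f"level {level} did not round-trip")
        sizes[level] = len(packed)
    return sizes


# One second of a 1 kHz tone as mono LINEAR16 at 16,000 Hz.
samples = [round(10000 * math.sin(2 * math.pi * 1000 * n / 16000)) for n in range(16000)]
pcm = struct.pack(f"<{len(samples)}h", *samples)
print(len(pcm), sizes_by_level(pcm))
```

Level 0 only stores the data, so its output is slightly larger than the input; higher levels search harder for repetition.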
SoapBox Labs supports the following lossless encodings: LINEAR16. Technically, LINEAR16 isn't "lossless compression" because no compression is involved in the first place.
Lossy compression
Currently we do not support audio formats that use lossy compression (e.g. MP3).