Feeding Raw 24-Bit Audio to libmp3lame
Feeding raw, packed 24-bit PCM audio directly into the standard
libmp3lame encoder API misaligns every sample the encoder reads.
The default entry point expects 16-bit signed integers (two bytes per
sample), and the alternative entry points take wider integer or
floating-point types, so a 24-bit byte stream (three bytes per sample)
is read with the wrong stride. The mismatch distorts the amplitude,
scrambles the sign of the samples, and increases the sample count by
50%, which stretches the output to 1.5 times its original duration.
The audible result is a wall of loud digital noise.
The Mathematics of the Stride Mismatch
Raw 24-bit PCM audio stores each sample in 3 bytes (24 bits) in a
packed format. By contrast, the default LAME input function
(lame_encode_buffer) expects 16-bit signed integers, which
use 2 bytes (16 bits) per sample.
When a raw byte stream \(B = [b_0, b_1, b_2, b_3, b_4, b_5, \dots]\) is handed to the encoder as if it were 16-bit data, the bytes are grouped into 2-byte pairs and each pair is read as one 16-bit integer.
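On a little-endian system, the original 24-bit samples (\(S\)) are reconstructed from three bytes each:
\[S_0 = b_0 + (b_1 \cdot 256) + (b_2 \cdot 65536)\]
\[S_1 = b_3 + (b_4 \cdot 256) + (b_5 \cdot 65536)\]
These sums give the unsigned 24-bit value. Under two's complement, a result of \(2^{23}\) or more represents the negative number obtained by subtracting \(2^{24}\).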
However, because LAME expects 16-bit inputs, it reads the exact same byte stream in 2-byte steps. The resulting samples (\(L\)) processed by the encoder become:
\[L_0 = b_0 + (b_1 \cdot 256)\]
\[L_1 = b_2 + (b_3 \cdot 256)\]
\[L_2 = b_4 + (b_5 \cdot 256)\]
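The two readings of the same bytes can be reproduced outside the encoder. The following is a minimal Python sketch, not LAME code: the helper names `decode_packed_s24le` and `misread_as_s16le` and the sample values are illustrative assumptions, and the input is assumed to be mono, little-endian, signed PCM.

```python
import struct

def decode_packed_s24le(data: bytes) -> list:
    """Decode packed little-endian signed 24-bit samples (3 bytes each)."""
    usable = len(data) - len(data) % 3  # ignore a trailing partial sample
    return [int.from_bytes(data[i:i + 3], "little", signed=True)
            for i in range(0, usable, 3)]

def misread_as_s16le(data: bytes) -> list:
    """Read the same bytes in 2-byte steps as little-endian signed 16-bit."""
    count = len(data) // 2
    return list(struct.unpack(f"<{count}h", data[:count * 2]))

# Two illustrative 24-bit samples, S_0 = 0x123456 and S_1 = -0x123456.
pcm24 = b"".join(s.to_bytes(3, "little", signed=True)
                 for s in (0x123456, -0x123456))

correct = decode_packed_s24le(pcm24)  # S_0, S_1
misread = misread_as_s16le(pcm24)     # L_0, L_1, L_2

print(pcm24.hex(" "))  # 56 34 12 aa cb ed
print(correct)         # [1193046, -1193046]
print(misread)         # [13398, -21998, -4661]
```

Six bytes yield two samples under the 3-byte reading and three samples under the 2-byte reading. In this run \(L_1\) is negative even though the sample that supplied its low byte, \(S_0\), is positive, because its sign bit comes from \(b_3\).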
The Resulting Waveform Distortion
This byte alignment shift causes three distinct mathematical anomalies in the output audio:
- Byte Mixing (Phase and Sign Corruption): The sample \(L_1\) is constructed from \(b_2\) (the most significant byte of the first 24-bit sample) and \(b_3\) (the least significant byte of the second 24-bit sample). Because \(b_3\) lands in the high byte of \(L_1\), the lowest-order and least predictable bits of one sample set both the sign and the magnitude of the result, while the high-order bits of the other sample are demoted to fine detail. This destroys the original waveform’s amplitude structure and flips signs unpredictably.
- Time and Pitch Scaling (1.5x Stretch): Since 24-bit audio uses 3 bytes per sample and 16-bit audio uses 2 bytes, every 2 samples of 24-bit audio (6 bytes) are interpreted by LAME as 3 samples of 16-bit audio. The number of samples therefore increases by a factor of 1.5 (\(N_{new} = 1.5 \cdot N_{old}\)). Played back at the original sample rate, the output lasts 1.5 times as long as the source, so any periodic structure that survives is slowed down and lowered in pitch to two thirds of its original frequency.
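- White-Noise Generation: Because the byte boundaries are misaligned, most of the correlation between consecutive samples is lost. Two of every three reinterpreted samples (the \(L_0\) and \(L_1\) patterns) take their high byte from the middle or lowest byte of an original sample, and those bytes vary almost randomly from one sample to the next. Only every third sample (the \(L_2\) pattern) still equals the upper 16 bits of an original sample. The resulting signal behaves like high-amplitude broadband noise that overpowers any recognizable remnant of the original audio.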
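The three effects above can be checked numerically on a synthetic signal. This is a minimal sketch under stated assumptions: a mono 1 kHz sine at half of 24-bit full scale, a 44100 Hz sample rate, and little-endian packing. The variable names are illustrative and nothing here calls LAME itself.

```python
import math
import struct

SAMPLE_RATE = 44100  # assumed playback rate in Hz
N = 300              # number of original 24-bit samples (kept even)

# Synthetic 1 kHz sine at half of 24-bit full scale.
amplitude = 0.5 * (2**23 - 1)
original = [round(amplitude * math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE))
            for n in range(N)]
packed = b"".join(s.to_bytes(3, "little", signed=True) for s in original)

# Misread the packed bytes as signed 16-bit little-endian samples.
misread = struct.unpack(f"<{len(packed) // 2}h", packed)

# Sample count and duration at a fixed sample rate both grow by 1.5.
count_ratio = len(misread) / len(original)
duration_ratio = (len(misread) / SAMPLE_RATE) / (len(original) / SAMPLE_RATE)

# Byte mixing: the low byte of each L_{3k+1} is the top byte of S_{2k},
# and its high byte is the lowest byte of S_{2k+1}.
mixed = all(
    (misread[3 * k + 1] & 0xFF) == ((original[2 * k] >> 16) & 0xFF)
    and ((misread[3 * k + 1] >> 8) & 0xFF) == (original[2 * k + 1] & 0xFF)
    for k in range(N // 2)
)

# Only every third misread sample, L_{3k+2}, is a clean 16-bit
# truncation (upper two bytes) of an original sample, S_{2k+1}.
aligned = all(misread[3 * k + 2] == original[2 * k + 1] >> 8
              for k in range(N // 2))

print(count_ratio, duration_ratio, mixed, aligned)
```

Swapping the sine for any other 24-bit signal should leave all four results unchanged, since they depend only on the byte layout and not on the audio content.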