Feeding Raw 24-Bit Audio to libmp3lame

Feeding raw, packed 24-bit PCM audio directly into the standard libmp3lame encoder API misaligns every sample. The default input function expects 16-bit signed integers (two bytes per sample), and the library's other entry points take wider integer or floating-point buffers; none of them reads packed three-byte samples. Interpreting a 24-bit byte stream (three bytes per sample) as 16-bit data therefore produces a stride mismatch. This mismatch combines bytes from adjacent samples, scrambles amplitude and sign, and increases the sample count by 50%, resulting in a time-stretched wall of digital noise.

The Mathematics of the Stride Mismatch

Raw 24-bit PCM audio in packed format stores each sample in 3 bytes (24 bits). By contrast, the default LAME input function (lame_encode_buffer) expects 16-bit signed integers, which occupy 2 bytes (16 bits) per sample.

When you pass a raw byte stream \(B = [b_0, b_1, b_2, b_3, b_4, b_5, \dots]\) to the encoder, LAME groups these bytes into 2-byte pairs to reconstruct 16-bit integers.

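In a little-endian system, the original 24-bit samples (\(S\)) are reconstructed from their bytes as shown here. The sums give the unsigned 24-bit pattern; because the samples are signed two's-complement values, a result of \(2^{23}\) or more represents a negative sample and is obtained by subtracting \(2^{24}\):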

\[S_0 = b_0 + (b_1 \cdot 256) + (b_2 \cdot 65536)\] \[S_1 = b_3 + (b_4 \cdot 256) + (b_5 \cdot 65536)\]
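The reconstruction above can be sketched in a few lines of Python. This is only an illustration, not part of LAME: the helper name `decode_s24le` and the example bytes are made up for this sketch, and it assumes signed, little-endian, packed samples.

```python
def decode_s24le(data: bytes) -> list:
    """Decode packed little-endian signed 24-bit PCM into Python ints."""
    if len(data) % 3 != 0:
        raise ValueError("packed 24-bit PCM must be a multiple of 3 bytes")
    return [
        # b0 + b1*256 + b2*65536, with two's-complement sign handling
        int.from_bytes(data[i:i + 3], "little", signed=True)
        for i in range(0, len(data), 3)
    ]

# b0, b1, b2 of the first sample, then b3, b4, b5 of the second
stream = bytes([0x56, 0x34, 0x12, 0xFE, 0xFF, 0xFF])
print(decode_s24le(stream))  # [1193046, -2]
```

The second sample shows why the sign step matters: the bytes FE FF FF sum to 16777214 as an unsigned pattern, which is -2 once \(2^{24}\) is subtracted.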

However, because LAME expects 16-bit inputs, it reads the exact same byte stream in 2-byte steps. The resulting samples (\(L\)) processed by the encoder become:

\[L_0 = b_0 + (b_1 \cdot 256)\] \[L_1 = b_2 + (b_3 \cdot 256)\] \[L_2 = b_4 + (b_5 \cdot 256)\]
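To make the regrouping concrete, the following sketch reinterprets six bytes holding two 24-bit samples as three signed 16-bit values, the way a 16-bit reader would see them. The function name `misread_as_s16le` and the example bytes are illustrative assumptions; no LAME call is involved.

```python
import struct

def misread_as_s16le(data: bytes) -> list:
    """Reinterpret a byte stream as little-endian signed 16-bit samples."""
    count = len(data) // 2  # a trailing odd byte, if any, is ignored
    return list(struct.unpack("<%dh" % count, data[:count * 2]))

# Two packed 24-bit samples: 0x123456 (bytes 56 34 12) and -2 (bytes FE FF FF)
stream = bytes([0x56, 0x34, 0x12, 0xFE, 0xFF, 0xFF])
print(misread_as_s16le(stream))  # [13398, -494, -1]
```

Here \(L_0 = 13398\) is only the low 16 bits of the first sample, \(L_1 = -494\) mixes bytes from both samples, and \(L_2 = -1\) consists of the upper two bytes of the second sample.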

The Resulting Waveform Distortion

This byte alignment shift causes three distinct mathematical anomalies in the output audio:

Byte Mixing (Amplitude and Sign Corruption): The sample \(L_1\) is constructed from \(b_2\) (the most significant byte of the first 24-bit sample) and \(b_3\) (the least significant byte of the second 24-bit sample). Because \(b_3\) becomes the high byte of \(L_1\), the fastest-changing bits of one sample set the sign and coarse magnitude of the output, while the slowly varying top byte of the other sample is demoted to fine detail. \(L_0\) is corrupted as well: it keeps only the low 16 bits of \(S_0\) and discards the byte that carried the sign.
Time and Pitch Scaling (1.5x Sample Count): Since 24-bit audio uses 3 bytes per sample and 16-bit audio uses 2 bytes, every 2 samples of 24-bit audio (6 bytes) are interpreted by LAME as 3 samples of 16-bit audio. The number of samples increases by a factor of 1.5 (\(N_{new} = 1.5 \cdot N_{old}\)). Played back at the original sample rate, the output therefore lasts 1.5 times as long, so any surviving trace of the original is slowed to two-thirds of its speed and lowered in pitch, not sped up.
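Noise Generation: Because the byte boundaries are misaligned, most of the correlation between consecutive samples is lost. Two of every three output samples take their high byte from a low-order byte of a 24-bit sample, and those bytes vary almost randomly from one sample to the next, so the signal is dominated by high-amplitude, noise-like content. Only every third output sample (\(L_2\), \(L_5\), and so on) happens to hold the upper 16 bits of a real sample, and that remnant is largely masked by the noise.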
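The anomalies listed above can be checked numerically. The sketch below is a rough simulation under stated assumptions rather than a reproduction of what LAME does internally: it synthesizes a 440 Hz test tone at an assumed 48 kHz sample rate, packs it as little-endian 24-bit PCM, and then reads the same bytes back as 16-bit samples. All names and constants are illustrative.

```python
import math
import struct

SAMPLE_RATE = 48000  # assumed sample rate for the test tone
N = 480              # even, so 3 * N bytes split cleanly into 16-bit words

# Half-scale 24-bit sine wave
samples = [
    int(round(0.5 * (2**23 - 1) * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
    for n in range(N)
]

# Pack as little-endian signed 24-bit, then misread as signed 16-bit
packed = b"".join(s.to_bytes(3, "little", signed=True) for s in samples)
misread = struct.unpack("<%dh" % (len(packed) // 2), packed)

# Sample count grows by a factor of 1.5
ratio = len(misread) / len(samples)
print(ratio)  # 1.5

# Every third misread sample equals the upper 16 bits of an odd-indexed original
remnant_ok = all(
    misread[3 * k + 2] == samples[2 * k + 1] >> 8 for k in range(N // 2)
)
print(remnant_ok)  # True
```

The remaining two thirds of the misread samples are the ones built from low-order or mixed bytes, which is where the noise-like content comes from.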