How libmp3lame Calculates MP3 Frame Sizes

This article explains the mathematical formulas and algorithms used by the libmp3lame library (the core engine behind the LAME MP3 encoder) to determine the size of MP3 audio frames. It covers the deterministic algebraic formulas used for Constant Bitrate (CBR) encoding, as well as the psychoacoustic algorithms and quantization loops used to dynamically calculate frame sizes in Variable Bitrate (VBR) encoding.

The Standard MP3 Frame Size Formula

For Constant Bitrate (CBR) streams, libmp3lame determines the size of an MP3 frame using the standard formula that follows from the MPEG-1 and MPEG-2 audio specifications. Because every frame holds a fixed number of samples, and therefore a fixed duration of audio at a given sample rate, the frame size in bytes is directly proportional to the bitrate and inversely proportional to the sample rate.

For MPEG-1 Layer III (which supports sample rates of 32 kHz, 44.1 kHz, and 48 kHz), each frame contains 1,152 samples. The mathematical formula to calculate the frame size in bytes is:

\[\text{Frame Size} = \left\lfloor 144 \times \frac{\text{Bitrate}}{\text{Sample Rate}} \right\rfloor + \text{Padding}\]

For MPEG-2 and MPEG-2.5 Layer III (which support lower sample rates of 8 kHz to 24 kHz), each frame contains 576 samples. The formula adjusts accordingly:

\[\text{Frame Size} = \left\lfloor 72 \times \frac{\text{Bitrate}}{\text{Sample Rate}} \right\rfloor + \text{Padding}\]
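The two formulas can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only, not code from libmp3lame; the function name `frame_size_bytes` and its parameters are chosen here for illustration.

```python
def frame_size_bytes(bitrate_bps, sample_rate_hz, padding=0, mpeg1=True):
    """Layer III frame size in bytes, following the two formulas above.

    Hypothetical helper for illustration; not part of the libmp3lame API.
    """
    coefficient = 144 if mpeg1 else 72  # samples per frame / 8 bits per byte
    # Integer floor division plays the role of the floor brackets.
    return coefficient * bitrate_bps // sample_rate_hz + padding


print(frame_size_bytes(128_000, 44_100))               # 417
print(frame_size_bytes(128_000, 44_100, padding=1))    # 418
print(frame_size_bytes(128_000, 48_000))               # 384
print(frame_size_bytes(64_000, 22_050, mpeg1=False))   # 208
```

At 48 kHz the division is exact, so a 128 kbps frame is always 384 bytes. At 44.1 kHz it is not, which is why frames alternate between 417 and 418 bytes.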

Understanding the Variables

Bitrate: The target bit rate in bits per second (bps).
Sample Rate: The frequency of the audio in Hertz (Hz).
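Padding: Either 0 or 1. In Layer III a padding slot is one byte (8 bits). If 144 × Bitrate / Sample Rate (or 72 × Bitrate / Sample Rate for MPEG-2 and MPEG-2.5) is not an integer, the floor discards a fraction of a byte from every frame, so LAME adds a padding byte to selected frames to keep the average at the target bitrate over time.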
The Coefficients (144 and 72): These constants are the number of samples per frame divided by 8 (the number of bits per byte). For MPEG-1: \(1152 / 8 = 144\). For MPEG-2 and MPEG-2.5: \(576 / 8 = 72\).
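How the padding byte keeps the average on target can be illustrated with a remainder accumulator. The sketch below is one common way to schedule padding and is an assumption made for illustration. It is not a transcription of LAME's internal bookkeeping, and the name `padding_schedule` is hypothetical.

```python
def padding_schedule(bitrate_bps, sample_rate_hz, frames, coefficient=144):
    """Return a list of 0/1 padding flags, one per frame.

    Illustrative accumulator scheme (an assumption, not LAME's exact code):
    the fractional byte lost to the floor is carried forward, and a padding
    byte is emitted whenever a whole byte has accumulated.
    """
    remainder = (coefficient * bitrate_bps) % sample_rate_hz
    accumulated = 0
    schedule = []
    for _ in range(frames):
        accumulated += remainder
        if accumulated >= sample_rate_hz:
            accumulated -= sample_rate_hz
            schedule.append(1)
        else:
            schedule.append(0)
    return schedule


schedule = padding_schedule(128_000, 44_100, 49)
base = 144 * 128_000 // 44_100            # 417 bytes without padding
total = sum(base + p for p in schedule)
print(sum(schedule), total)               # 47 20480
```

For 128 kbps at 44.1 kHz, 47 of every 49 frames carry a padding byte. The 49 frames then total 20,480 bytes, which is exactly 49 × 1152 samples of audio at 128,000 bits per second.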

Dynamic Frame Size Calculation in VBR Mode

When encoding in Variable Bitrate (VBR) mode, libmp3lame cannot rely on the static formula alone, because the bitrate itself is not known in advance. Instead, it uses an iterative process to choose a bitrate, and consequently a frame size, for each individual frame based on the complexity of the audio.

1. The GPSYCHO Psychoacoustic Model

LAME analyzes the input audio signal using its own psychoacoustic model, GPSYCHO (an open-source psychoacoustic and noise-shaping model developed by the LAME project as an alternative to the psychoacoustic model in the ISO MPEG reference encoder).

Fast Fourier Transform (FFT): The encoder applies a 1024-point FFT to the audio signal to convert it from the time domain to the frequency domain.
Masking Thresholds: The algorithm calculates the “Signal-to-Mask Ratio” (SMR). It determines which parts of the audio are audible to the human ear and which parts are masked (rendered inaudible) by louder, adjacent frequencies.
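The two steps above can be made concrete with a toy example. The sketch below is not GPSYCHO: it uses a textbook radix-2 FFT with no window function, and the threshold passed to `smr_db` is an arbitrary number rather than the output of a masking model. All names are hypothetical.

```python
import cmath
import math


def fft(x):
    """Textbook recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def smr_db(signal_energy, mask_threshold):
    """Signal-to-mask ratio in decibels for one frequency band."""
    return 10.0 * math.log10(signal_energy / mask_threshold)


N = 1024              # FFT length mentioned in the text
SAMPLE_RATE = 44_100
TONE_BIN = 32         # test tone placed exactly on an FFT bin

block = [math.sin(2 * math.pi * TONE_BIN * n / N) for n in range(N)]
energy = [abs(c) ** 2 for c in fft(block)[: N // 2]]

peak_bin = max(range(N // 2), key=energy.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N
print(peak_bin, peak_hz)          # 32 1378.125

# With an arbitrary, made-up threshold: a positive SMR means the band
# is audible above the mask, a negative SMR means it is masked.
print(smr_db(100.0, 1.0) > 0)     # True
print(smr_db(0.5, 1.0) > 0)       # False
```

A band with a high SMR needs fine quantization, while a band with a negative SMR can tolerate quantization noise up to the mask without it becoming audible.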

2. The Quantization Loop Algorithm (Bit Allocation)

Once the masking thresholds are established, libmp3lame employs a nested, two-loop search algorithm to determine how many bits (and what size frame) are required to encode the audio without introducing audible distortion.

The Outer Loop (Noise Control): This loop checks whether the quantization noise (the distortion introduced by quantizing the audio) in each scalefactor band exceeds the masking threshold calculated by the psychoacoustic model. Bands with too much noise are amplified through their scalefactors so that they are quantized more finely on the next pass. In VBR mode, if the noise still cannot be brought under the threshold, the encoder allows more bits for the frame.
The Inner Loop (Rate Control): This loop quantizes the Modified Discrete Cosine Transform (MDCT) coefficients of the audio and counts the bits needed to encode them with Huffman coding. It increases the quantizer step size until the coded data fits into the number of bits currently available.
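The interplay of the two loops can be sketched in a heavily simplified form. Everything below is an assumption made for illustration. The quantizer is uniform (the real Layer III quantizer is nonlinear), `count_bits` is a made-up stand-in for Huffman bit counting, each coefficient is treated as its own band, and the stopping rules are reduced to a bare minimum. None of these names exist in libmp3lame.

```python
def count_bits(quantized):
    """Hypothetical bit-cost proxy standing in for Huffman bit counting."""
    return sum(1 + 2 * abs(q).bit_length() for q in quantized)


def inner_loop(coeffs, budget_bits, step=1.0):
    """Rate control: coarsen the step size until the block fits the budget."""
    if budget_bits < len(coeffs):
        raise ValueError("budget too small for this toy bit-cost model")
    while True:
        quantized = [round(c / step) for c in coeffs]
        bits = count_bits(quantized)
        if bits <= budget_bits:
            return quantized, step, bits
        step *= 2 ** 0.25  # Layer III step sizes move in factors of 2^(1/4)


def outer_loop(coeffs, thresholds, budget_bits, max_iters=16):
    """Noise control: amplify bands whose noise exceeds the threshold."""
    scale = [1.0] * len(coeffs)
    for _ in range(max_iters):
        scaled = [c * s for c, s in zip(coeffs, scale)]
        quantized, step, bits = inner_loop(scaled, budget_bits)
        # Noise per band: squared error after undoing step size and scale.
        noise = [(c - q * step / s) ** 2
                 for c, q, s in zip(coeffs, quantized, scale)]
        distorted = [i for i, (n, t) in enumerate(zip(noise, thresholds))
                     if n > t]
        # Stop when every band is clean, or when every band is distorted
        # (amplifying all bands equally would change nothing).
        if not distorted or len(distorted) == len(coeffs):
            break
        for i in distorted:
            scale[i] *= 2 ** 0.5
    return quantized, step, scale, bits, distorted


coeffs = [10.0, -3.0, 0.5, 7.25]
quantized, step, scale, bits, distorted = outer_loop(
    coeffs, thresholds=[1.0] * 4, budget_bits=1000)
print(bits <= 1000, distorted)    # True []
```

With a generous budget and loose thresholds, the first pass already meets every threshold and no band is amplified. Tightening `thresholds` forces extra passes of the outer loop, while tightening `budget_bits` forces the inner loop to coarsen the step size.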

Through this iterative feedback loop, libmp3lame estimates the minimum number of bits required to maintain the user's desired quality level (V0 through V9). Because an MP3 frame header can only signal one of a fixed set of standard bitrates, the encoder then selects a bitrate whose frame is large enough to hold those bits. That bitrate is plugged back into the standard MPEG frame size formula to determine the size of the frame that is physically written.
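The final step, turning a bit requirement into a concrete frame, can be sketched as follows. The bitrate table is the standard MPEG-1 Layer III set. The selection rule (take the smallest bitrate whose frame is large enough) is a simplifying assumption that ignores padding and the bit reservoir, and `pick_bitrate_kbps` is a hypothetical name.

```python
# Standard MPEG-1 Layer III bitrates in kbps.
MPEG1_L3_BITRATES_KBPS = (32, 40, 48, 56, 64, 80, 96, 112,
                          128, 160, 192, 224, 256, 320)


def frame_size_bytes(bitrate_bps, sample_rate_hz, padding=0):
    """MPEG-1 Layer III frame size in bytes."""
    return 144 * bitrate_bps // sample_rate_hz + padding


def pick_bitrate_kbps(needed_bytes, sample_rate_hz):
    """Smallest standard bitrate whose frame holds `needed_bytes`.

    `needed_bytes` is assumed to include the header and side information.
    Falls back to the highest bitrate if nothing is large enough.
    """
    for kbps in MPEG1_L3_BITRATES_KBPS:
        if frame_size_bytes(kbps * 1000, sample_rate_hz) >= needed_bytes:
            return kbps
    return MPEG1_L3_BITRATES_KBPS[-1]


print(pick_bitrate_kbps(100, 44_100))    # 32
print(pick_bitrate_kbps(400, 44_100))    # 128
print(pick_bitrate_kbps(2000, 44_100))   # 320
```

A quiet passage that needs about 100 bytes fits in a 32 kbps frame (104 bytes at 44.1 kHz). A frame that needs 400 bytes does not fit at 112 kbps (365 bytes), so it is written at 128 kbps (417 bytes).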