How libmp3lame Decides Which Frequencies to Discard
This article explains how libmp3lame, the core engine of the LAME MP3
encoder, decides which audio frequencies to discard during compression.
Using a psychoacoustic model, libmp3lame analyzes the incoming audio
signal to identify components that a human listener is unlikely to
perceive, then removes them or encodes them with less precision. This
allows the encoder to reduce file size significantly while maintaining
high perceived audio quality.
The Psychoacoustic Model (GPSYCHO)
At the heart of libmp3lame is its psychoacoustic model, known as
GPSYCHO. This model approximates how the human auditory system processes
sound. Instead of treating all audio frequencies equally, libmp3lame
uses GPSYCHO to analyze the spectrum of each block of audio as it is
encoded and calculate a “masking threshold.” Any frequency component
that falls below this threshold is deemed inaudible and is either
discarded or encoded with very little precision.
Absolute Threshold of Hearing
The baseline for these decisions is the absolute threshold of hearing:
the quietest level at which a sound can be heard at all in silence. The
human ear is most sensitive to mid-range frequencies (roughly 2–5 kHz)
and much less sensitive to very low and very high frequencies,
especially at low volumes. libmp3lame compares the input signal against
a standardized curve of this threshold. Any component that falls below
the curve is discarded because a human listener would not be able to
hear it anyway.
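One widely cited closed-form approximation of such a curve is Terhardt's formula, which gives the threshold in quiet in dB SPL as a function of frequency. The sketch below uses it purely as an illustration. The article does not say which exact curve libmp3lame applies, and the function names `ath_db_spl` and `below_ath` are hypothetical, as is the assumption that levels are already expressed in dB SPL.

```python
import math

def ath_db_spl(freq_hz):
    """Approximate threshold in quiet (dB SPL) using Terhardt's formula."""
    f = freq_hz / 1000.0  # the formula is expressed in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def below_ath(freq_hz, level_db_spl):
    """True if a component is quieter than the threshold in quiet."""
    return level_db_spl < ath_db_spl(freq_hz)

# Print the curve at a few frequencies: it is high at the extremes
# and lowest in the few-kHz region where hearing is most sensitive.
for f in (50, 100, 1000, 3300, 10000, 16000):
    print(f"{f:>6} Hz: {ath_db_spl(f):6.1f} dB SPL")
```

With this curve, a 20 dB SPL component at 16 kHz falls below the threshold, while the same level at 3.3 kHz is well above it.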
Simultaneous Masking (Frequency Masking)
Simultaneous masking occurs when a loud sound drowns out a quieter
sound occurring at the same time. This is a primary tool for
libmp3lame to discard unnecessary frequencies:
- Tone-Masking-Noise: A pure tone (like a flute) will mask noise at nearby frequencies.
- Noise-Masking-Tone: A noisy sound (like a cymbal crash) will mask pure tones close to it in frequency.
The encoder divides the audio signal into critical bands, frequency
ranges that approximate the resolution of the inner ear. If a dominant,
loud component is present in a band, libmp3lame calculates a masking
curve around it that also spreads into neighboring bands. Any quieter
components that fall under this curve are discarded, because the louder
sound prevents a listener from perceiving them.
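The idea of a masking curve around a loud component can be sketched with a toy model. The code below converts frequencies to the Bark (critical-band) scale with the Zwicker and Terhardt approximation, then applies a triangular spreading function. The slopes (25 dB per Bark below the masker, 10 dB per Bark above) and the 10 dB offset are illustrative assumptions, not values taken from libmp3lame, and all function names are hypothetical.

```python
import math

def hz_to_bark(freq_hz):
    """Map frequency to the Bark scale (Zwicker and Terhardt approximation)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def masking_threshold_db(masker_hz, masker_db, probe_hz,
                         offset_db=10.0, lower_slope=25.0, upper_slope=10.0):
    """Level below which a probe is assumed inaudible next to the masker.

    Toy triangular spreading function: the threshold peaks at the masker
    and falls off linearly (in dB per Bark) on either side.
    """
    dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    if dz < 0:
        spread = lower_slope * dz    # probe below the masker: steep falloff
    else:
        spread = -upper_slope * dz   # probe above the masker: shallow falloff
    return masker_db - offset_db + spread

def is_masked(masker_hz, masker_db, probe_hz, probe_db):
    return probe_db < masking_threshold_db(masker_hz, masker_db, probe_hz)

# An 80 dB tone at 1 kHz hides a 40 dB component at 1.1 kHz,
# but not a 40 dB component far away at 8 kHz.
print(is_masked(1000, 80, 1100, 40))
print(is_masked(1000, 80, 8000, 40))
```

The asymmetric slopes reflect the general observation that masking spreads further toward higher frequencies than toward lower ones.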
Temporal Masking (Time-Domain Masking)
Human hearing does not instantly reset after hearing a sound.
libmp3lame exploits this limitation using temporal masking,
which occurs in two ways:
- Forward Masking: After a loud sound stops, the ear remains desensitized for up to 100–200 milliseconds. libmp3lame dynamically discards quieter frequencies that immediately follow a loud transient (like a drum hit).
- Backward Masking: For a short window of about 5–20 milliseconds before a loud sound occurs, quieter sounds can go unnoticed because the louder onset overrides them perceptually. The encoder discards quiet signals immediately preceding a loud transient.
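The two time windows can be illustrated with a deliberately simple model. The sketch below assumes that the masking level decays linearly in dB across each window, using the upper bounds quoted above (20 ms before, 200 ms after) as defaults. The linear decay and the function names are assumptions made for illustration and are not taken from libmp3lame.

```python
def temporal_mask_db(masker_db, dt_ms, pre_ms=20.0, post_ms=200.0):
    """Toy masking level at a time offset from a loud transient.

    dt_ms >= 0: time after the masker ends (forward masking).
    dt_ms <  0: time before the masker starts (backward masking).
    Outside both windows there is no temporal masking at all.
    """
    if 0 <= dt_ms < post_ms:
        return masker_db * (1.0 - dt_ms / post_ms)
    if -pre_ms < dt_ms < 0:
        return masker_db * (1.0 + dt_ms / pre_ms)
    return float("-inf")

def is_temporally_masked(masker_db, dt_ms, probe_db):
    return probe_db < temporal_mask_db(masker_db, dt_ms)

# A 30 dB sound 50 ms after an 80 dB drum hit is hidden,
# but the same sound 300 ms later is not.
print(is_temporally_masked(80, 50, 30))
print(is_temporally_masked(80, 300, 30))
```

The much shorter backward window means a quiet sound has to sit very close to the transient to be hidden by it.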
MDCT and Bit Allocation (Quantization)
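To actually remove the frequencies, libmp3lame converts the audio from
the time domain to the frequency domain. As in any MP3 encoder, this is
done with a polyphase filter bank that splits the signal into 32
subbands, followed by the Modified Discrete Cosine Transform (MDCT)
within each subband.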
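For reference, the MDCT maps a block of 2N time samples to N frequency coefficients. The sketch below is a direct, unoptimized evaluation of the textbook definition, with no window function applied. A production encoder would use windowed, overlapping blocks and a fast algorithm, so this is only meant to show what the transform computes, and the function name `mdct` is hypothetical.

```python
import math

def mdct(block):
    """Direct O(N^2) MDCT: 2N input samples -> N coefficients."""
    two_n = len(block)
    if two_n % 2:
        raise ValueError("block length must be even")
    n_half = two_n // 2
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n, x in enumerate(block):
            acc += x * math.cos(
                math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
        coeffs.append(acc)
    return coeffs

# Feed in a signal that is exactly the K0-th MDCT basis function:
# all of its energy should land in coefficient K0.
N, K0 = 8, 3
block = [math.cos(math.pi / N * (n + 0.5 + N / 2) * (K0 + 0.5))
         for n in range(2 * N)]
spectrum = mdct(block)
print([round(c, 6) for c in spectrum])
```

Because consecutive blocks overlap by half, the transform produces only as many coefficients as there are new samples, which is why it suits compression.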
Once the frequency spectrum is available, the encoder applies the
masking threshold calculated by GPSYCHO. During the quantization phase,
where the spectral coefficients are rounded to a limited set of integer
values, the encoder allocates bits based on necessity. Bands whose
energy exceeds the threshold are quantized just finely enough to keep
the rounding noise below it. If a band’s energy is below the masking
threshold, its coefficients can be quantized to zero, which costs almost
no bits. Quantizing a band to zero effectively discards it from the
final MP3 file, which is how the psychoacoustic analysis translates into
a smaller file.
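The quantization decision can be sketched per band as follows. This toy version uses a uniform quantizer, whose rounding noise power is approximately step²/12, and picks the coarsest step that keeps that noise at the masking threshold. The MP3 format itself uses a non-uniform quantizer and searches for step sizes iteratively under a bit budget, so the code below is a simplification under stated assumptions, and the name `quantize_band` is hypothetical.

```python
import math

def quantize_band(coeffs, threshold_power):
    """Quantize one band against its masking threshold.

    threshold_power is the allowed noise power per coefficient.
    Returns (quantized integers, step size or None if the band was dropped).
    """
    if threshold_power <= 0:
        raise ValueError("threshold_power must be positive")
    band_power = sum(c * c for c in coeffs) / len(coeffs)
    if band_power < threshold_power:
        # The whole band is inaudible: every coefficient becomes zero.
        return [0] * len(coeffs), None
    # Uniform quantizer noise power ~ step^2 / 12, so this is the
    # coarsest step whose noise stays at the threshold.
    step = math.sqrt(12.0 * threshold_power)
    return [round(c / step) for c in coeffs], step

quiet_band = [0.01, -0.02, 0.015]
loud_band = [1.0, -0.8, 0.5, 0.3]

print(quantize_band(quiet_band, 0.01))
print(quantize_band(loud_band, 0.01))
```

A higher threshold yields a larger step and therefore smaller integers, which take fewer bits to store after entropy coding.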