EECS351 STEM SEPARATION
Vocals-Bass Source Separation Methods
In our iterative masking approach to stem separation, our second mask separates the harmonics into vocals and bass. Instead of using machine learning for vocals-bass source separation (VBSS), we take the opportunity to creatively apply digital signal processing tools from class. We present several approaches to VBSS, each of which leverages key observations about the nature of the vocals and bass stems. In general, we found vocals-bass source separation to be a harder problem than harmonics-percussion source separation.
VBSS - Smoothing Filters
Our first approach to VBSS uses smoothing filters to create a binary mask. This approach is based on the observation that the frequency of the vocals tends to fluctuate over time while the frequency of the bass tends to remain the same. In other words, the vocals tend to exhibit vibrato (i.e., the vocals waver over time) while the bass does not. We first apply a moving median filter along the time axis of the harmonics spectrogram. In doing so, we create a harmonics-enhanced copy of the spectrogram that appears smeared across the time axis. This copy is “harmonics-enhanced” because harmonics (i.e., vocals and bass) occupy a narrow range of frequencies for a long time. By smearing the spectrogram across the time axis, these horizontally-oriented harmonics are made more pronounced. We then apply a moving median absolute deviation (MAD) filter along the time axis of the harmonics spectrogram. In doing so, we create a bass-diminished copy of the spectrogram. This copy is “bass-diminished” because the bass tends to occupy the same narrow range of frequencies for each note, forming nearly perfectly horizontal lines in the spectrogram. The MAD along a perfectly horizontal line is zero; therefore, applying the moving MAD filter turns the bass into time-frequency bins with magnitude zero.
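The two smoothing filters above can be sketched in a few lines; the following is a minimal Python/NumPy illustration on a synthetic spectrogram, not the report's actual code. The window length is an assumed value, and the MAD filter is approximated by median-filtering the absolute deviations from the local median.

```python
import numpy as np
from scipy.ndimage import median_filter

# Synthetic magnitude spectrogram: rows = frequency bins, columns = time frames.
rng = np.random.default_rng(0)
S = rng.random((64, 200))
S[10, :] = 5.0  # a flat horizontal line, like a sustained bass note

win = 17  # filter length along the time axis (an assumed value)

# Harmonics-enhanced copy: moving median along time smears the spectrogram
# horizontally, reinforcing sustained tones.
harm_enhanced = median_filter(S, size=(1, win), mode="nearest")

# Bass-diminished copy: moving median absolute deviation (MAD) along time.
# A perfectly flat line has zero deviation, so the bass row vanishes.
dev = np.abs(S - harm_enhanced)
bass_diminished = median_filter(dev, size=(1, win), mode="nearest")
```

Note how the flat "bass" row comes out identically zero in the bass-diminished copy, while fluctuating content survives.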
Figure 1: Spectrogram of the song "Music Delta - Beatles" containing the vocals, bass, and drums stems (top) and the bass-diminished version of the same spectrogram (bottom). Note that in the bass-diminished version of the spectrogram, the flat horizontal lines at low frequency bins are virtually eliminated. The bass-diminished version of the spectrogram is compared against a harmonics-enhanced version of the spectrogram to create a binary mask.
For each time-frequency bin, we then compare the magnitude in the harmonics-enhanced copy of the spectrogram versus the magnitude in the bass-diminished copy of the spectrogram. If the magnitude is significantly greater in the harmonics-enhanced copy of the spectrogram, then the bin is added to the binary mask (i.e., the weight of the bin is set to one). Otherwise, the weight of the bin is set to zero. By multiplying the mask with the harmonics spectrogram, we obtain the estimated spectrogram for the bass. By multiplying the logical inverse of the mask with the harmonics spectrogram, we obtain the estimated spectrogram for the vocals. Taking the inverse short-time Fourier transform of the estimated spectrograms allows us to recover the vocals and bass stems.
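The comparison-and-recovery step can be sketched as follows. This is an illustrative Python/SciPy version under our own assumptions: random placeholder arrays stand in for the real spectrogram copies, and the `margin` factor defining "significantly greater" is invented for the example.

```python
import numpy as np
from scipy.signal import istft

# Placeholder inputs: complex harmonics spectrogram H plus the two
# filtered magnitude copies, all the same shape.
rng = np.random.default_rng(1)
H = rng.random((129, 100)) * np.exp(1j * rng.random((129, 100)))
harm_enhanced = rng.random((129, 100))
bass_diminished = rng.random((129, 100))

margin = 2.0  # assumed factor defining "significantly greater"
bass_mask = harm_enhanced > margin * bass_diminished  # binary mask

S_bass = bass_mask * H        # estimated bass spectrogram
S_vocals = (~bass_mask) * H   # logical inverse recovers the vocals

# Inverse STFT turns the estimated spectrograms back into stems.
_, bass = istft(S_bass, fs=44100)
_, vocals = istft(S_vocals, fs=44100)
```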
VBSS - Vocals Harmonics 1
Our second approach to VBSS uses Gaussian envelopes to create a soft mask. This approach is based on the observation that the vocals are strongly harmonic while the bass is weakly harmonic. With this observation in mind, this approach masks the harmonics in the spectrogram to separate the vocals from the bass. The harmonics in the spectrogram are assumed to belong to the vocals while everything else is assumed to belong to the bass. In the spectrogram, harmonics appear as periodic peaks along the vertical axis, or frequency axis. To determine the periodicity of these peaks, we compute the discrete Fourier transform of the spectrogram along the frequency axis. Specifically, we compute the discrete Fourier transform of each vertical line of the spectrogram individually. In this project, we refer to a vertical line of the spectrogram as a time slice of the spectrogram because each vertical line represents the frequency content of the mixture within a particular window of time.
Figure 2: Time slice of the spectrogram for the song "Music Delta - Beatles". Only frequencies from 0 Hz to 10 kHz are shown, but the whole time slice includes frequencies up to the sampling frequency of 44.1 kHz. The spikes at integer multiples of ~370 Hz correspond to the vocals harmonics. The cluster of peaks below ~370 Hz roughly corresponds to the bass. While the vocals harmonics are very apparent and cover a broad range of frequencies, the bass harmonics are harder to distinguish.
The discrete Fourier transform of a time slice of the spectrogram contains multiple peaks. The tallest peaks are clustered at lower frequencies and correspond to slowly varying trends in the average value of the time slice. One of the smaller peaks corresponds to the fundamental frequency of the harmonics in the spectrogram. This smaller peak is our peak of interest, but it is difficult to extract programmatically. Thus, we choose to detrend the time slice before computing the discrete Fourier transform. In other words, we fit a high order polynomial to the time slice and then subtract said polynomial from the time slice before computing the discrete Fourier transform. Doing so attenuates the peaks clustered at lower frequencies and amplifies the peak of interest that corresponds to the fundamental frequency of the harmonics in the spectrogram. Our peak of interest is now the tallest peak in the discrete Fourier transform and is therefore much easier to extract programmatically.
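The detrend-then-DFT step can be sketched as below. This is a Python/NumPy toy example on a synthetic time slice, not the report's code: the slice length, the harmonic spacing, and the polynomial order of 10 are all assumed values chosen for illustration.

```python
import numpy as np

# Synthetic time slice: harmonics repeating every 64 bins, plus a slow
# trend and an offset that produce tall low-frequency peaks in the DFT.
n = 2048
bins = np.arange(n)
slice_ = np.cos(2 * np.pi * bins / 64.0) + 1.0 + 0.001 * bins

# Detrend: fit a high-order polynomial (order 10 here) and subtract it,
# attenuating the slowly varying trends that dominate the raw DFT.
coeffs = np.polyfit(bins, slice_, deg=10)
detrended = slice_ - np.polyval(coeffs, bins)

# DFT of the detrended slice; the tallest peak now gives the number of
# harmonics across the slice's full frequency range.
spectrum = np.abs(np.fft.rfft(detrended))
spectrum[0] = 0.0  # ignore any residual DC term
n_harmonics = int(np.argmax(spectrum))  # expect 2048 / 64 = 32 here
```

Dividing the sampling frequency by `n_harmonics` then yields the estimated fundamental frequency, as in Figure 3.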
Figure 3: DFT of the time slice shown above after the time slice has been detrended. The units of the horizontal axis can roughly be understood as the number of harmonics in the range from 0 Hz to the sampling frequency of 44.1 kHz. For instance, the peak at 121 indicates the strong presence of a fundamental frequency that has 121 harmonics in the range from 0 Hz to 44.1 kHz. This fundamental frequency is 44.1 kHz / 121 = 364.4628 Hz, which is clearly the fundamental frequency of the vocals harmonics. After detrending, the peak at 121 is now the tallest peak in the DFT, which makes it easy to extract programmatically.
In this fashion, we estimate the fundamental frequency of the harmonics for each time slice of the spectrogram. However, this approach occasionally returns an erroneous fundamental frequency for a particular time slice. To mitigate these errors, we apply a moving median filter to the set of estimated fundamental frequencies to remove outliers. Each outlier is replaced with an interpolated value of the estimated fundamental frequency. We assume that the harmonics in each time slice of the spectrogram are located at integer multiples of the estimated fundamental frequencies. Thus, we place Gaussian envelopes at the estimated locations of the harmonics. This collection of Gaussian envelopes, one set per time slice of the spectrogram, constitutes our soft mask to recover the vocals stem.
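The outlier removal and envelope placement can be sketched as follows. This Python/SciPy example uses made-up numbers throughout: the fundamental-frequency track, the outlier threshold, the spectrogram height, and the envelope width `sigma` are all assumptions, and the median-filtered value stands in for the report's interpolated replacement.

```python
import numpy as np
from scipy.signal import medfilt

# Estimated fundamental (in frequency bins) per time slice, with one
# deliberate outlier at index 3.
f0 = np.array([40.0, 41.0, 40.0, 300.0, 42.0, 41.0, 40.0])

# Outlier removal: where an estimate deviates strongly from the moving
# median, substitute the filtered value (a simple stand-in for interpolation).
smoothed = medfilt(f0, kernel_size=3)
outliers = np.abs(f0 - smoothed) > 20.0  # assumed threshold
f0_clean = np.where(outliers, smoothed, f0)

# Soft mask: Gaussian envelopes at integer multiples of f0 in each slice.
n_bins, sigma = 512, 2.0  # assumed spectrogram height and envelope width
freq = np.arange(n_bins)[:, None]           # shape (bins, 1)
mask = np.zeros((n_bins, len(f0_clean)))
for k in range(1, int(n_bins // f0_clean.min()) + 1):
    mask += np.exp(-((freq - k * f0_clean) ** 2) / (2 * sigma**2))
mask = np.clip(mask, 0.0, 1.0)              # keep soft-mask weights in [0, 1]
```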
Figure 4: The raw estimated fundamental frequencies for each time slice of the spectrogram (top) and the estimated fundamental frequencies for each time slice of the spectrogram after applying some filtering (bottom). Applying a moving median filter removes outliers and eliminates what is sometimes referred to as "salt and pepper noise". Outliers are replaced with interpolated values of the estimated fundamental frequencies. Our use of interpolation is based on the assumption that the vocals change roughly smoothly over time, which we believe is a reasonable assumption based on the real-world behavior of singers.
In this approach to VBSS, the soft mask that recovers the bass stem is created separately from the soft mask that recovers the vocals stem. To create the bass mask, we identify all peaks in every time slice of the spectrogram. For each peak that does not coincide with one of the estimated locations of the harmonics, we place a Gaussian envelope at the location of said peak. This collection of Gaussian envelopes constitutes our soft mask to recover the bass stem. Finally, we denoise the bass mask by applying moving average and moving median filters that remove and replace outliers.
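For a single time slice, the bass-mask construction might look like the sketch below. The synthetic peak positions, peak-height threshold, harmonic-matching tolerance, and envelope width are all assumptions made for this Python/SciPy illustration.

```python
import numpy as np
from scipy.signal import find_peaks

# One synthetic time slice with four spectral peaks.
n_bins = 512
slice_mag = np.zeros(n_bins)
slice_mag[[30, 60, 100, 200]] = [3.0, 2.0, 5.0, 4.0]

f0 = 100.0  # estimated vocals fundamental for this slice (in bins)
harmonics = f0 * np.arange(1, int(n_bins // f0) + 1)

# Keep only the peaks that do NOT coincide with a vocal harmonic.
peaks, _ = find_peaks(slice_mag, height=1.0)
tol, sigma = 3, 2.0  # assumed matching tolerance and envelope width
non_harmonic = [p for p in peaks if np.min(np.abs(harmonics - p)) > tol]

# Gaussian envelope at each remaining peak forms the bass soft mask.
freq = np.arange(n_bins)
bass_mask = np.zeros(n_bins)
for p in non_harmonic:
    bass_mask += np.exp(-((freq - p) ** 2) / (2 * sigma**2))
bass_mask = np.clip(bass_mask, 0.0, 1.0)
```

Here the peaks at bins 100 and 200 fall on the vocal harmonics and are excluded, so only the peaks at bins 30 and 60 contribute to the bass mask.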
VBSS - Vocals Harmonics 2
In our exploration of vocal harmonics, we noticed a phenomenon in the frequency content of the vocals stem: as we analyzed higher harmonics, the actual frequencies of these harmonics deviated from the estimated frequencies. After doing some research on the topic, we found that this is a physical property, known as inharmonicity, that is present in all physical instruments (including voices). This property caused problems for our vocals harmonics masking approach, with some of the higher harmonics not lining up with our mask. To combat this issue, we developed a new approach to VBSS: “Vocals Harmonics 2”. At lower frequencies, this approach is identical to “Vocals Harmonics 1” described above, with the mask lining up with the harmonics of the vocals fundamental frequencies. At higher frequencies, however, the method differs. Since it is difficult to assess the degree of inharmonicity for different singers (and thus predict where the higher harmonics will be), this approach simply passes all frequencies above a cutoff frequency into the vocals stem. In our algorithm, we use a cutoff of 322 Hz because we expect all bass frequencies to fall below it, meaning the high-pass component of our mask will not allow any of the bass stem into the recovered vocals stem.
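The hybrid "Vocals Harmonics 2" mask can be sketched as follows in Python/NumPy. The 322 Hz cutoff comes from the text; the spectrogram geometry, the per-slice fundamental, and the envelope width `sigma` are assumed values for illustration.

```python
import numpy as np

n_bins, n_frames = 1025, 4
fs = 44100
bin_hz = (fs / 2) / (n_bins - 1)      # Hz per frequency bin (one-sided STFT)
cutoff_bin = int(322.0 / bin_hz)      # 322 Hz cutoff from the text

# Gaussian envelopes at the harmonics, exactly as in "Vocals Harmonics 1".
f0_bins = np.full(n_frames, 17.0)     # assumed fundamental per time slice
sigma = 2.0
freq = np.arange(n_bins)[:, None]
vocals_mask = np.zeros((n_bins, n_frames))
for k in range(1, int(n_bins // f0_bins.min()) + 1):
    vocals_mask += np.exp(-((freq - k * f0_bins) ** 2) / (2 * sigma**2))
vocals_mask = np.clip(vocals_mask, 0.0, 1.0)

# Above the cutoff, pass everything into the vocals stem, sidestepping
# the inharmonicity of the higher harmonics.
vocals_mask[cutoff_bin:, :] = 1.0
bass_mask = 1.0 - vocals_mask         # the inverse mask recovers the bass
```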
In this approach, we compute the bass stem in tandem with the vocals stem. We experimented with taking the inverse of the vocals mask versus amplifying all non-vocals peaks as described in “Vocals Harmonics 1”. Ultimately, we found that the inverse of the vocals mask produced the best results. This means that for this procedure, we compute the vocals mask, multiply it by the mixture to extract the recovered vocals stem and then multiply the mask’s inverse by the mixture to extract the recovered bass stem.
This technique generally performed better than “Vocals Harmonics 1” and performed the best of all of our algorithms for some songs. While it often produces good results, we acknowledge that this method is limited and could be improved. Its main weakness arises when the bass stem has notes above the cutoff frequency of 322 Hz: these notes pass directly into the vocals stem and cause significant error. If we were to pursue this method further, we would likely use the characteristics of the bass stem to dynamically adjust the cutoff frequency for a given song. As is, this algorithm could also run into issues when multiple singers sing at the same time, as we only extract vocals of one fundamental frequency for each time slice. This limitation also applies to “Vocals Harmonics 1” and could be addressed by modifying our vocals soft mask algorithm, but it remains a limitation of our current implementation.
VBSS - Bass Harmonics
In our previous approach to VBSS, we showed that we could mask the harmonics in the spectrogram corresponding to the vocals stem. Theoretically, it should also be possible to mask the harmonics in the spectrogram corresponding to the bass stem. After all, the bass is still harmonic, albeit weakly. To this end, we attempted to devise an approach to VBSS that masks the harmonics in the spectrogram corresponding to the bass stem. This approach was very similar to our previous approaches. For each time slice of the spectrogram, we computed the discrete Fourier transform after detrending the time slice. We picked the tallest peak in the discrete Fourier transform to estimate the fundamental frequency of the harmonics for each time slice of the spectrogram. We assumed that the harmonics in each time slice of the spectrogram were located at integer multiples of the estimated fundamental frequencies. We placed Gaussian envelopes at the estimated locations of the harmonics to create a soft mask to recover the bass stem.
However, there are a couple key differences between this approach and our previous approaches. First, for each time slice of the spectrogram, we only computed the discrete Fourier transform for a restricted range of frequencies. In practice, we only computed the discrete Fourier transform for the range of frequencies from 0 Hz to ~270 Hz for each time slice. We chose this range of frequencies because in this range, the harmonics corresponding to the bass stem were likely to dominate the harmonics corresponding to the vocals stem. By restricting our range of frequencies in this fashion, the discrete Fourier transform was more likely to have a strong peak corresponding to the fundamental frequency of the bass stem and less likely to have a strong peak corresponding to the fundamental frequency of the vocals stem. Second, for each time slice of the spectrogram, we detrended the time slice with a lower order polynomial rather than a higher order polynomial. We chose to detrend each time slice with a lower order polynomial because a lower order polynomial better achieved the desired behavior of attenuating unwanted peaks while amplifying our peak of interest corresponding to the fundamental frequency of the bass stem. In contrast, a higher order polynomial tended to attenuate our peak of interest in addition to the unwanted peaks.
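The two differences, restricting the frequency range and lowering the polynomial order, can be sketched as below. The ~270 Hz limit comes from the text; the slice length, harmonic spacing, and the choice of a degree-2 polynomial are assumed values for this Python/NumPy illustration.

```python
import numpy as np

fs, n = 44100, 8192
bin_hz = fs / n
max_bin = int(270.0 / bin_hz)  # restrict the slice to 0 Hz .. ~270 Hz

# Synthetic time slice: harmonics every 10 bins plus a slow trend.
bins = np.arange(n)
full_slice = np.cos(2 * np.pi * bins / 10.0) + 0.0005 * bins
restricted = full_slice[:max_bin]

# Low-order detrend: a degree-2 polynomial removes the slow trend
# without also attenuating the peak of interest.
x = np.arange(max_bin)
detrended = restricted - np.polyval(np.polyfit(x, restricted, deg=2), x)

# DFT of the restricted, detrended slice; its tallest peak counts the
# bass harmonics within the restricted range.
spectrum = np.abs(np.fft.rfft(detrended))
spectrum[0] = 0.0
n_harmonics_in_range = int(np.argmax(spectrum))
```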
Ultimately, this approach to VBSS was not viable. When several harmonics corresponding to the bass stem were localized in the restricted range of frequencies from 0 Hz to ~270 Hz and few harmonics corresponding to the vocals stem were localized in the same range, this approach to VBSS recovered the bass stem with reasonable accuracy. In other words, when the harmonics corresponding to the bass stem dominated the harmonics corresponding to the vocals stem in the restricted range of frequencies, our approach performed well. However, many notes on the bass only had a few harmonics in the range of frequencies from 0 Hz to ~270 Hz. For these notes, the harmonics corresponding to the bass stem no longer dominated the harmonics corresponding to the vocals stem in the restricted range of frequencies, and our approach performed poorly. We considered expanding the restricted range of frequencies beyond ~270 Hz; however, expanding the range of frequencies in this fashion captured new harmonics corresponding to the vocals stem in addition to new harmonics corresponding to the bass stem. Therefore, expanding the restricted range of frequencies did not help the harmonics corresponding to the bass stem dominate the harmonics corresponding to the vocals stem. If anything, performance degraded as the restricted range of frequencies expanded. Due to the infeasibility of this approach to VBSS, quantitative data was not collected.
Figure 5: Spectrogram of the ground truth bass stem for the song "Music Delta - Beatles" (top), the soft mask constructed to recover the bass stem based on the bass harmonics (middle), and the recovered bass stem obtained by applying the mask to the vocals-bass mixture (bottom). The recovered bass stem approximates the ground truth bass stem fairly well when the ground truth bass stem has several harmonics in the range from 0 Hz to ~270 Hz. However, when the ground truth bass stem has few harmonics in the range from 0 Hz to ~270 Hz, the recovered bass stem is significantly distorted and essentially unusable. In our attempts to tune parameters to improve the performance of the approach "Bass Harmonics", these spectrograms represent the best results we could achieve on the song "Music Delta - Beatles". Therefore, we decided not to pursue this approach any further.
VBSS - 2D FFT Peak Picking
Our fourth approach to VBSS uses the 2D FFT of the spectrogram to create a soft mask. This approach is based on the observation that the bass tends to be periodic in time while the vocals tend not to be. In other words, the bass tends to repeat the same riff over and over again while the vocals tend to vary across measures. With this observation in mind, this approach uses the 2D FFT based on the algorithm in [8]. According to [8], periodicity in an image manifests as a set of peaks in the 2D FFT of said image. This principle can be applied to the harmonics spectrogram. Periodicity in the harmonics spectrogram manifests as a set of peaks in the 2D FFT of the harmonics spectrogram. Specifically, periodicity along the horizontal axis, or time axis, of the harmonics spectrogram manifests as a set of vertical ridges in the 2D FFT of the harmonics spectrogram. Between the vocals stem and the bass stem, the bass stem is more likely to exhibit periodicity in time. Therefore, any vertical ridges in the 2D FFT of the harmonics spectrogram are more likely to belong to the bass stem than the vocals stem. Thus, this approach separates the 2D FFT of the harmonics spectrogram into ridges and everything else. The ridges in the 2D FFT of the harmonics spectrogram are assumed to correspond to the bass stem, and everything else is assumed to correspond to the vocals stem.
Figure 6: 2D FFT of the harmonics spectrogram for the song "Music Delta - Hendrix". The vertical ridges in the 2D FFT correspond to periodicity along the time axis in the harmonics spectrogram. We assume that the bass stem is more likely to exhibit temporal periodicity than the vocals stem. Thus, we believe that separating the 2D FFT of the harmonics spectrogram into vertical ridges and everything else should allow us to recover the bass stem and vocals stem separately from a vocals-bass mixture. Specifically, we believe that the ridges in the 2D FFT of the harmonics spectrogram correspond to the bass stem while everything else corresponds to the vocals stem.
Following the algorithm in [8], we compute the 2D FFT of the harmonics spectrogram. We then pick peaks in the 2D FFT by identifying the largest values in the 2D FFT within localized regions known as “neighborhoods”. Each rate-scale bin in the 2D FFT that corresponds to a peak is added to a binary mask (i.e., the weight of the bin is set to one). Otherwise, the weight of the bin is set to zero. We apply this binary mask to the 2D FFT to separate the ridges from everything else. Taking the inverse 2D FFT of the ridges yields a bass-enhanced copy of the spectrogram. This copy is bass-enhanced because the ridges in the 2D FFT are more likely to correspond to the bass stem than the vocals stem. Taking the inverse 2D FFT of everything else yields a vocals-enhanced copy of the spectrogram. This copy is vocals-enhanced because everything else in the 2D FFT is more likely to correspond to the vocals stem than the bass stem.
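The peak-picking step can be sketched with a local maximum filter, which is one straightforward way to realize the "neighborhoods" idea; the neighborhood size and the synthetic spectrogram below are our assumptions, not parameters from [8].

```python
import numpy as np
from scipy.ndimage import maximum_filter

# Synthetic spectrogram with one row that repeats every 4 frames,
# imitating a bass riff that is periodic in time.
rng = np.random.default_rng(2)
S = rng.random((64, 128))
S[8, :] += np.tile([0.0, 0.0, 0.0, 3.0], 32)

F = np.fft.fft2(S)
mag = np.abs(F)

# A rate-scale bin is a peak if it is the largest value within its
# neighborhood (5x5 here, an assumed size).
peaks = mag == maximum_filter(mag, size=(5, 5))

# Peaks (ridges) are assumed to belong to the bass; the remainder of the
# 2D FFT is assumed to belong to the vocals.
bass_enhanced = np.real(np.fft.ifft2(np.where(peaks, F, 0)))
vocals_enhanced = np.real(np.fft.ifft2(np.where(peaks, 0, F)))
```

Because the two pieces partition the 2D FFT, the bass-enhanced and vocals-enhanced copies sum back to the original spectrogram.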
For each time-frequency bin in the harmonics spectrogram, we then compare the magnitude in the bass-enhanced copy of the spectrogram versus the magnitude in the vocals-enhanced copy of the spectrogram. If the magnitude is greater in the bass-enhanced copy of the spectrogram, then the bin is added to the binary mask (i.e., the weight of the bin is set to one). Otherwise, the weight of the bin is set to zero. By multiplying the mask with the harmonics spectrogram, we obtain the estimated spectrogram for the bass. By multiplying the logical inverse of the mask with the harmonics spectrogram, we obtain the estimated spectrogram for the vocals. Taking the inverse short-time Fourier transform of the estimated spectrograms allows us to recover the vocals and bass stems.
VBSS - Energy Thresholding
Our fifth and final approach to VBSS is a low-pass filter. This approach is based on the observation that the bass stem tends to occupy a lower range of frequencies than the vocals stem. In the harmonics spectrogram, there are a finite number of frequency bins. Therefore, an ideal low-pass filter is realizable using a binary mask. First, a threshold frequency, or cutoff frequency, is chosen. For each time-frequency bin in the harmonics spectrogram, we compare the frequency of the bin to the cutoff frequency. If the frequency of the bin is less than the cutoff frequency, then the bin is added to the binary mask (i.e., the weight of the bin is set to one). Otherwise, the weight of the bin is set to zero. By multiplying the mask with the harmonics spectrogram, we obtain the estimated spectrogram for the bass. By multiplying the logical inverse of the mask with the harmonics spectrogram, we obtain the estimated spectrogram for the vocals. Taking the inverse short-time Fourier transform of the estimated spectrograms allows us to recover the vocals and bass stems.
We could have chosen a static cutoff frequency that remains the same regardless of the song under consideration. However, we anticipated that the bass and vocals stems may occupy different ranges of frequencies across different songs. Thus, we decided to dynamically adjust the cutoff frequency depending on the song under consideration. Specifically, we compute the energy distribution of each time slice of the harmonics spectrogram. We then sum all of these energy distributions into a single energy distribution. We choose a cutoff frequency such that the cumulative energy from 0 Hz to our cutoff frequency is 30% of the total energy in the distribution. The value of 30% was chosen through trial and error. In this fashion, we use the energy in the harmonics spectrogram to dynamically adjust the cutoff frequency for our low-pass filter depending on the song under consideration.
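The dynamic cutoff and resulting low-pass binary mask can be sketched as follows; the 30% figure comes from the text, while the spectrogram itself is a synthetic stand-in with energy deliberately concentrated at low frequencies.

```python
import numpy as np

# Synthetic harmonics spectrogram with most energy in the low bins.
rng = np.random.default_rng(3)
H = rng.random((513, 200))
H[:40, :] *= 10.0

# Sum the per-slice energy distributions into a single distribution
# over frequency, then find the bin below which 30% of the energy lies.
energy = np.sum(np.abs(H) ** 2, axis=1)
cum = np.cumsum(energy)
cutoff_bin = int(np.searchsorted(cum, 0.30 * cum[-1]))

# Ideal low-pass filter as a binary mask: bins below the cutoff go to
# the bass; the logical inverse of the mask recovers the vocals.
freq = np.arange(H.shape[0])
bass_mask = (freq < cutoff_bin)[:, None] * np.ones_like(H)
vocals_mask = 1.0 - bass_mask
```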
Figure 7: Energy distribution of the harmonics spectrogram for the song "Music Delta - Hendrix". Most of the energy is concentrated at low frequencies, which we believe corresponds to the bass stem. Therefore, we use the energy distribution to determine a cutoff frequency for a low-pass filter that we hope allows us to separate a vocals-bass mixture into the vocals stem and bass stem. The cutoff frequency, represented by the vertical black line, is chosen such that the cumulative energy from 0 Hz to our cutoff frequency is 30% of the total energy in the distribution.