EECS351 STEM SEPARATION
Vocals-Bass Source Separation Results
We evaluated six approaches to vocals-bass source separation (VBSS). Our first approach, dubbed “Smoothing Filters”, uses smoothing filters to create a binary mask. Our second approach, dubbed “Vocals Harmonics 1”, uses Gaussian envelopes to create a soft mask based on the harmonics of the vocals stem. Our third approach, dubbed “Vocals Harmonics 2”, combines the Gaussian envelopes from “Vocals Harmonics 1” with a static cutoff frequency. Our fourth approach, dubbed “2D FFT Peak Picking”, uses the 2D FFT to create a soft mask based on the temporal periodicity of the bass stem. Our fifth approach, dubbed “Energy Thresholding”, applies a lowpass filter whose dynamic cutoff frequency is determined by the energy distribution of the spectrogram. Our sixth approach, dubbed “Ideal Soft Mask”, uses ideal soft masks, which are unattainable in practice because they require the ground truth stems but which provide an upper bound on the performance that we might expect from our other approaches. The signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR) values achieved by each of these approaches are summarized in the table below. Notable results are discussed after the table.
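To make the notion of an ideal soft mask concrete, here is a minimal sketch. It assumes the common ratio-mask definition, in which each time-frequency bin is shared between the two stems in proportion to their ground truth magnitudes. The function name `ideal_soft_masks`, the list-of-rows spectrogram layout, and the `eps` guard are illustrative assumptions and may differ from the project's actual implementation.

```python
def ideal_soft_masks(vocals_mag, bass_mag, eps=1e-12):
    """Return (vocals_mask, bass_mask) from ground truth magnitude spectrograms.

    Both inputs are lists of rows of non-negative floats with the same shape.
    Assumed definition: mask = |stem| / (|vocals| + |bass|) per bin.
    """
    vocals_mask, bass_mask = [], []
    for v_row, b_row in zip(vocals_mag, bass_mag):
        v_out, b_out = [], []
        for v, b in zip(v_row, b_row):
            total = v + b
            # Bins where both stems are (near) silent are split evenly.
            share = v / total if total > eps else 0.5
            v_out.append(share)
            b_out.append(1.0 - share)
        vocals_mask.append(v_out)
        bass_mask.append(b_out)
    return vocals_mask, bass_mask


# Toy 2x2 example (hypothetical magnitudes, not data from the project).
vocals = [[3.0, 0.0], [1.0, 0.0]]
bass = [[1.0, 2.0], [1.0, 0.0]]
v_mask, b_mask = ideal_soft_masks(vocals, bass)
```

Multiplying the mixture spectrogram by each mask yields the stem estimates. Because the masks are computed from the ground truth stems themselves, they serve only as a reference point, not as a usable separation method.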
Table 1: Comparison of our different approaches to VBSS for six different songs: "Music Delta - Beatles", "Music Delta - Hendrix", "Music Delta - Rock", "BKS - Bulldozer", "Carlos Gonzalez - A Place For Us", and "Motor Tapes - Shore". The approaches "Vocals Harmonics 2" and "Energy Thresholding" appear to achieve the best SDR values overall. We believe that a combination of “Vocals Harmonics 2” with a dynamic cutoff frequency similar to “Energy Thresholding” has the best chance at achieving SDR, SIR, and SAR values that are comparable to those obtained by the ideal soft mask. Recovering the bass stem for the song "Carlos Gonzalez - A Place For Us" proves to be uniquely challenging as evidenced by the low SDR values achieved, even by the approach "Ideal Soft Mask".
Our smoothing filters approach to VBSS is generally our worst-performing approach. Because it is unsophisticated, it executes quickly but offers little else, which makes it a useful baseline against which we can compare our other approaches. We believe that the smoothing filters approach performs poorly because it is based on a shaky assumption. Namely, this approach assumes that the vocals tend to exhibit vibrato (i.e., the vocals tend to waver over time) while the bass does not. However, the vocals do not exhibit vibrato all the time and may actually appear quite flat in the spectrogram, such as when a singer sustains a note for an extended period of time without wavering. The shortcomings of the smoothing filters approach are most apparent in the song “Motor Tapes - Shore”, where the vocals exhibit very little vibrato. As a result, many harmonics corresponding to the vocals stem appear in the recovered bass stem, which leads to poor SDR, SIR, and SAR values. Across all songs, the smoothing filters approach tends to produce the worst SIR values for the vocals stem and the worst SAR values for the bass stem relative to our other approaches.
Figure 1: The ground truth vocals and bass stems for the song "Motor Tapes - Shore" compared to the recovered vocals and bass stems using the approach "Smoothing Filters". Note that the vocals harmonics appear especially flat in the spectrogram for the song "Motor Tapes - Shore" because the singer sustains several notes for extended periods of time. As a result, some of the vocals harmonics end up in the recovered bass stem instead of the recovered vocals stem. This contamination causes the SDR values for the recovered vocals stem and bass stem to suffer.
Our vocals harmonics approaches to VBSS generally perform well. We believe that these approaches perform well because they are based on a well-founded assumption. Namely, these approaches are based on the assumption that the vocals are strongly harmonic while the bass is weakly harmonic, which is almost always true. Our “Vocals Harmonics 2” approach performs better across the board relative to our “Vocals Harmonics 1” approach with few exceptions. We attribute the success of “Vocals Harmonics 2” to how this approach handles inharmonicity in the vocals. Due to the inharmonicity of the vocals, the soft mask in the “Vocals Harmonics 1” approach becomes misaligned with the actual harmonics corresponding to the vocals stem at higher frequencies. As a result, the “Vocals Harmonics 1” approach struggles to include these high-frequency harmonics in the recovered vocals stem. On the other hand, “Vocals Harmonics 2” includes all time-frequency bins above a certain cutoff frequency in the recovered vocals stem. This approach guarantees that high-frequency harmonics corresponding to the vocals stem are included in the recovered vocals stem. The recovered bass stem is largely unaffected because most of the bass stem lies below the cutoff frequency. The inclusion of these high-frequency harmonics corresponding to the vocals stem makes a difference; the performance gap between “Vocals Harmonics 2” and “Vocals Harmonics 1” is most pronounced in the SDR values for the vocals stem.
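The difference between the two vocals harmonics approaches can be sketched for a single spectrogram frame. This is an illustration under stated assumptions, not the project's implementation: the vocal pitch `f0_hz` is taken as already estimated, the Gaussian width `sigma_hz` and the static `cutoff_hz` are hypothetical parameters, and the function name is made up for this example.

```python
import math


def harmonic_vocals_mask(freqs_hz, f0_hz, sigma_hz=20.0, cutoff_hz=None):
    """Soft vocals mask for one frame.

    freqs_hz:  centre frequency of each bin.
    f0_hz:     estimated vocal fundamental for this frame (assumed given).
    cutoff_hz: None mimics "Vocals Harmonics 1"; a number mimics
               "Vocals Harmonics 2" by giving every bin at or above the
               cutoff entirely to the vocals.
    """
    mask = []
    for f in freqs_hz:
        if cutoff_hz is not None and f >= cutoff_hz:
            mask.append(1.0)
            continue
        # With equal-width Gaussians, the largest envelope at f is the one
        # centred on the nearest harmonic k * f0 (k >= 1).
        k = max(1, round(f / f0_hz))
        dist = f - k * f0_hz
        mask.append(math.exp(-(dist ** 2) / (2.0 * sigma_hz ** 2)))
    return mask


# A high partial that has drifted 50 Hz from the ideal 25th harmonic of 200 Hz.
drifted = [5050.0]
without_cutoff = harmonic_vocals_mask(drifted, f0_hz=200.0)
with_cutoff = harmonic_vocals_mask(drifted, f0_hz=200.0, cutoff_hz=4000.0)
```

In this toy case the drifted partial receives a small mask value without the cutoff and a value of 1.0 with it, which mirrors the misalignment at high frequencies described above. A matching bass mask could be taken as one minus the vocals mask.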
The biggest performance gap between “Vocals Harmonics 2” and “Vocals Harmonics 1” occurs for the song “Carlos Gonzalez - A Place For Us”. The vocals for this song are challenging to recover because two voices are singing in unison, which creates two sets of harmonics corresponding to the vocals stem. “Vocals Harmonics 1” is only designed to detect and extract a single set of harmonics, so this approach struggles and produces a low SDR value for the vocals stem. On the other hand, “Vocals Harmonics 2” includes all time-frequency bins above a certain static cutoff frequency in the recovered vocals stem. For this song, the cutoff frequency happens to be very favorably placed: the second set of harmonics only becomes especially prevalent above the cutoff frequency, and most of the bass stem is localized below it. In large part due to the favorable location of the cutoff frequency, “Vocals Harmonics 2” ends up being the best approach for the song “Carlos Gonzalez - A Place For Us” by a significant margin. Overall, we consider “Vocals Harmonics 2” to be a contender for our best approach alongside “Energy Thresholding”.
Figure 2: The ground truth vocals and bass stems for the song "Carlos Gonzalez - A Place For Us" compared to the recovered vocals and bass stems using the approaches "Vocals Harmonics 1" (top) and "Vocals Harmonics 2" (bottom). Note that "Vocals Harmonics 1" struggles to recover the vocals stem due to the two voices singing in unison. On the other hand, "Vocals Harmonics 2" uses a cutoff frequency that neatly separates the two voices singing in unison from the bass. For this reason, "Vocals Harmonics 2" achieves significantly higher SDR values for the recovered vocals and bass stems relative to "Vocals Harmonics 1".
“2D FFT Peak Picking” performs reasonably well but is overshadowed by our other approaches. “2D FFT Peak Picking” rivals “Vocals Harmonics 1” in performance, typically achieving slightly higher SDR values for the vocals stem but slightly lower SDR values for the bass stem. However, “Vocals Harmonics 2” achieves higher SDR values than “2D FFT Peak Picking” for both the vocals stem and the bass stem across all songs. The best results for “2D FFT Peak Picking” come from the song “Music Delta - Hendrix”, which contains a short and simple bass riff that repeats multiple times while the vocals do not repeat. In other words, “2D FFT Peak Picking” works best when the bass is largely periodic while the vocals are largely aperiodic. Given that the “2D FFT Peak Picking” approach is designed to exploit the strong periodicity of the bass stem relative to the weak periodicity of the vocals stem, we do not find these results surprising.
Figure 3: The ground truth vocals and bass stems for the song "Music Delta - Hendrix" compared to the recovered vocals and bass stems using the approach "2D FFT Peak Picking". "2D FFT Peak Picking" does a decent job of separating the vocals stem from the bass stem and minimizing contamination but also leaves behind horizontal streaks in the recovered spectrograms that introduce distortion. For this reason, "2D FFT Peak Picking" achieves higher SIR values than SAR values for the recovered vocals and bass stems. This distortion is especially audible in the audio sample for the recovered bass stem provided below.
The most notable artifact of “2D FFT Peak Picking” is noise in the recovered bass stem. Across all songs, the recovered bass stem seems to include a high-pitched, distorted version of the vocals stem. We were surprised that “2D FFT Peak Picking” achieved such high SDR values given the consistent presence of this distinct noise. We believe this noise comes from the masking of vertical ridges in the 2D FFT of the spectrogram. Masking vertical ridges in the 2D FFT appears to translate into horizontal streaks in the masks that are applied to the spectrogram itself. These horizontal streaks at high frequencies appear to contribute to the distinct noise observed in the recovered bass stem. As a result, we believe that the qualitative performance of this approach is rather poor when taking human perception into account.
We were surprised by how well “Energy Thresholding” performed. This approach and “Vocals Harmonics 2” are the two contenders for our best approach. In hindsight, the result is less surprising: this approach rests on what is quite possibly our most well-founded assumption, which is that the bass tends to occupy a lower range of frequencies than the vocals. In some cases, “Energy Thresholding” achieves SDR values that are roughly 10 dB higher than the SDR values that “Vocals Harmonics 2” achieves, which corresponds to a tenfold improvement in the signal-to-distortion power ratio. Although we experimentally fixed the value of the energy threshold at 30% for the purposes of this project, we observed that this value is not optimal for all songs. For example, reducing the value of the energy threshold to 20% for the song “Motor Tapes - Shore” results in SDR values of 16.9372 dB and 11.5977 dB for the vocals stem and bass stem, respectively. Thus, we believe that an approach that dynamically tunes the value of the energy threshold according to the song under consideration merits further exploration. Unfortunately, we did not have time to explore this possibility for this project.
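One plausible reading of the energy threshold can be sketched as follows. The sketch assumes that, for each spectrogram frame, the cutoff is the lowest frequency bin at which the cumulative energy (counted from the lowest bin upward) reaches the threshold fraction of the frame's total energy. That interpretation, along with the function names `energy_cutoff_bin` and `split_frame`, is an assumption made for illustration; the project's actual rule may differ.

```python
def energy_cutoff_bin(frame_mag, threshold=0.30):
    """Index of the first bin where cumulative energy reaches `threshold`
    of the frame's total energy. Bins are ordered from low to high frequency."""
    energies = [m * m for m in frame_mag]
    total = sum(energies)
    if total == 0.0:
        return 0  # Silent frame: nothing to split.
    running = 0.0
    for i, e in enumerate(energies):
        running += e
        if running >= threshold * total:
            return i
    return len(energies) - 1


def split_frame(frame_mag, threshold=0.30):
    """Assign bins up to and including the cutoff to the bass, the rest to the vocals."""
    cutoff = energy_cutoff_bin(frame_mag, threshold)
    bass = [m if i <= cutoff else 0.0 for i, m in enumerate(frame_mag)]
    vocals = [m if i > cutoff else 0.0 for i, m in enumerate(frame_mag)]
    return bass, vocals


# Toy frame with hypothetical magnitudes (energies 1, 4, 1, 4; total 10).
frame = [1.0, 2.0, 1.0, 2.0]
bass, vocals = split_frame(frame, threshold=0.30)
```

Under this reading, lowering the threshold moves the cutoff toward lower frequencies, so more of the spectrum is assigned to the vocals. A per-song or per-frame choice of `threshold` would be one way to pursue the dynamic tuning suggested above.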
Figure 4: The ground truth vocals and bass stems for the song "Music Delta - Hendrix" compared to the recovered vocals and bass stems using the approach "Energy Thresholding". Despite its simplicity, "Energy Thresholding" performs remarkably well. For the song "Music Delta - Hendrix", "Energy Thresholding" achieves SDR values for the recovered vocals and bass stems that are roughly 10 dB higher than those of the next best approach, which corresponds to a tenfold improvement in the signal-to-distortion power ratio. Future work would explore this approach further, potentially by setting the value of the energy threshold dynamically rather than fixing it at 30%.
All approaches remain well short of the SDR, SIR, and SAR values achieved by the ideal soft mask. We believe that a combination of “Vocals Harmonics 2” with a dynamic cutoff frequency similar to “Energy Thresholding” has the best chance of achieving SDR, SIR, and SAR values that are comparable to those obtained by the ideal soft mask.
We find it notable that recovering the bass stem in the song “Carlos Gonzalez - A Place For Us” proves to be especially challenging for all of our approaches. We believe that this song poses a unique set of challenges that make VBSS difficult: multiple singers, a bass stem with strong harmonics that approach 10 kHz, a bass stem with weak temporal periodicity, and a bass stem that consists of many short notes. These properties of the vocals stem and bass stem defy the assumptions that underpin our approaches, hence the difficulty. Similarly, many of our approaches struggle to recover the vocals stem in the song “BKS - Bulldozer”. We believe that this difficulty arises because the vocals stem is smeared along the frequency axis rather than being concentrated in a set of harmonics that can be easily extracted. In general, our approaches tend to achieve higher SDR values for the vocals stem than for the bass stem. We believe that this behavior arises because the vocals stem has been easier to characterize with assumptions than the bass stem.