EECS351 STEM SEPARATION
Vocals-Bass-Drums Source Separation Results
To evaluate vocals-bass-drums source separation (VBDSS), we started from the harmonics and percussion stems recovered by our “Mask Inference 2” approach to HPSS, which we judged to be our best approach to HPSS. We then applied all five of our approaches to VBSS to the recovered harmonics stem to obtain the vocals and bass stems. As before, we also considered the “Ideal Soft Mask” approach, which uses soft masks that are unattainable in practice but that provide an upper bound on the performance we might expect from our other approaches. The SDR, SIR, and SAR values achieved by each approach are summarized in the table below, and notable results are discussed after the table.
Table 1: Comparison of our different approaches to VBDSS for six different songs: "Music Delta - Beatles", "Music Delta - Hendrix", "Music Delta - Rock", "BKS - Bulldozer", "Carlos Gonzalez - A Place For Us", and "Motor Tapes - Shore". The approach "Vocals Harmonics 2" is not as dominant as it was for VBSS, but "Energy Thresholding" remains a strong contender for our best approach. We still believe that a combination of “Vocals Harmonics 2” with a dynamic cutoff frequency similar to “Energy Thresholding” has the best chance at achieving SDR, SIR, and SAR values that are comparable to those obtained by the ideal soft mask. Once again, recovering the bass stem for the song "Carlos Gonzalez - A Place For Us" proves to be uniquely challenging as evidenced by the low SDR values achieved, even by the approach "Ideal Soft Mask".
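As a point of reference for the “Ideal Soft Mask” results, the following is a minimal sketch of how such a mask could be computed when the true stems are available. It assumes a magnitude-ratio definition of the soft mask and assumes that source magnitudes add in the mixture; the function name and toy values are hypothetical, and the definition actually used for the table may differ.

```python
import numpy as np

def ideal_soft_masks(source_mags, eps=1e-12):
    """Magnitude-ratio masks, one per source (assumed definition).

    source_mags: list of magnitude spectrograms with identical shape
    (n_freq, n_frames), taken from the TRUE stems. This is why the mask
    is unattainable in practice: the true stems are normally unknown.
    """
    stacked = np.stack(source_mags)      # (n_sources, n_freq, n_frames)
    total = stacked.sum(axis=0) + eps    # eps avoids division by zero
    return stacked / total               # masks sum to ~1 in every bin

# Toy magnitudes (hypothetical values): 2 frequency bins, 2 frames.
vocals = np.array([[0.0, 1.0], [3.0, 1.0]])
bass = np.array([[2.0, 1.0], [1.0, 0.0]])

masks = ideal_soft_masks([vocals, bass])
mixture_mag = vocals + bass              # assumes magnitudes add
vocals_est = masks[0] * mixture_mag      # masked mixture recovers vocals
```

Under these assumptions the masked mixture reproduces each stem's magnitude almost exactly, which is the sense in which the ideal mask acts as an upper bound for the other approaches.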
In general, the trends in performance across our approaches to VBDSS agree with the trends in performance across our approaches to VBSS. For an in-depth evaluation of our approaches to VBSS, we refer the reader to our previous discussion. In this section, we only make note of when these trends in performance differ.
In “Smoothing Filters”, we compare, for each time-frequency bin, the magnitude in the harmonics-enhanced copy of the spectrogram against the magnitude in the bass-diminished copy of the spectrogram. A threshold determines whether the bin is added to the binary mask. When applying “Smoothing Filters” to VBSS, this threshold is set to 90%. When applying “Smoothing Filters” to VBDSS, we found that the optimal threshold is instead 80%. We believe that the optimal threshold changes because of interference from the drums stem in the recovered harmonics stem. In general, we expect such interference to shift the optimal parameter values of any approach; however, “Smoothing Filters” is the only approach for which we directly observed this shift.
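The thresholded comparison described above can be sketched as follows. The exact rule is not stated here, so this sketch assumes that a bin enters the binary mask when the harmonics-enhanced magnitude accounts for at least the threshold fraction of the combined magnitude of the two copies. The function name and the toy values are hypothetical.

```python
import numpy as np

def threshold_binary_mask(enhanced_mag, diminished_mag, threshold, eps=1e-12):
    """Binary mask from two filtered copies of a spectrogram.

    Assumed rule: keep a bin when the enhanced copy holds at least
    `threshold` (a fraction in [0, 1]) of the combined magnitude.
    """
    share = enhanced_mag / (enhanced_mag + diminished_mag + eps)
    return share >= threshold

# Toy magnitudes (hypothetical values): 2 frequency bins, 2 frames.
enhanced = np.array([[9.5, 8.5], [1.0, 17.0]])
diminished = np.array([[0.5, 1.5], [4.0, 3.0]])

mask_90 = threshold_binary_mask(enhanced, diminished, 0.90)  # VBSS setting
mask_80 = threshold_binary_mask(enhanced, diminished, 0.80)  # VBDSS setting
```

In this toy case, lowering the threshold from 90% to 80% admits the two bins whose share is 0.85, which shows how a small change in the threshold can change the mask membership.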
“Vocals Harmonics 2” still achieves some of the highest SDR values for the vocals stem but no longer achieves consistently high SDR values for the bass stem. On several occasions, “Vocals Harmonics 1” outperforms “Vocals Harmonics 2” in terms of SDR values for the bass stem. Even “2D FFT Peak Picking” outperforms “Vocals Harmonics 2” in terms of SDR values for the bass stem for the song “BKS - Bulldozer”. We theorize that “Vocals Harmonics 2” struggles to recover the bass stem in the presence of interference from the drums stem because the drums stem frequently has concentrated energy at low frequencies. This concentrated energy is captured with the recovered bass stem in “Vocals Harmonics 2” due to the use of a cutoff frequency, which we believe adversely affects the associated SDR, SIR, and SAR values.
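The hypothesis above can be illustrated with a toy sketch. It assumes that the cutoff is a fixed frequency below which every bin is assigned to the bass stem, and that magnitudes add in the mixture. The 250 Hz cutoff, the frequency grid, and all magnitudes are hypothetical values chosen for illustration, not parameters of “Vocals Harmonics 2”.

```python
import numpy as np

def fixed_cutoff_bass_mask(freqs_hz, n_frames, cutoff_hz):
    """Binary bass mask: every bin below a fixed cutoff, in every frame."""
    below = freqs_hz < cutoff_hz
    return np.repeat(below[:, None], n_frames, axis=1)

freqs_hz = np.array([50.0, 100.0, 200.0, 400.0, 800.0])

# Toy magnitudes, shape (5 bins, 2 frames). Frame 0 contains a kick-drum
# hit that leaked into the recovered harmonics stem; frame 1 is clean.
bass_mag = np.array([[4.0, 4.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
drum_leak = np.array([[3.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])

mask = fixed_cutoff_bass_mask(freqs_hz, n_frames=2, cutoff_hz=250.0)
recovered_bass = mask * (bass_mag + drum_leak)   # assumes magnitudes add

# Energy inside the mask that belongs to the bass versus the drum leakage.
bass_energy = np.sum((mask * bass_mag) ** 2)
leak_energy = np.sum((mask * drum_leak) ** 2)
```

Because the low-frequency part of the drum leakage falls below the cutoff, it passes through the mask together with the bass, which is consistent with the lower bass SDR values observed.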
Lastly, we observe that “Energy Thresholding”, although not as dominant as before, remains the strongest contender for our best approach. We believe that “Energy Thresholding” is resilient to interference from the drums stem because this interference adds energy to all frequency bins more or less equally; therefore, the position of the dynamic cutoff frequency is minimally affected. Even when “Energy Thresholding” appears to underperform, such as in the SDR value for the bass stem for the song “Motor Tapes - Shore”, we once again find that adjusting the energy threshold can improve performance. For example, reducing the energy threshold to 20% for the song “Motor Tapes - Shore” results in SDR values of 13.6887 and 4.9949 for the vocals stem and bass stem, respectively. Thus, we find further evidence that dynamically adjusting the energy threshold merits exploration.
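A dynamic cutoff of this kind can be sketched as follows. The sketch assumes that, in each frame, the cutoff is the lowest frequency bin at which the energy accumulated from the lowest frequency upward reaches the threshold fraction of that frame's total energy. This rule, the function name, and the toy values are assumptions made for illustration.

```python
import numpy as np

def dynamic_cutoff_bins(power_spec, energy_threshold):
    """Per-frame cutoff bin index from cumulative energy (assumed rule).

    power_spec: array of shape (n_freq, n_frames), lowest frequency first.
    energy_threshold: fraction in (0, 1] of each frame's total energy.
    Returns the first bin per frame whose cumulative energy fraction
    reaches the threshold (0 for a silent frame).
    """
    cumulative = np.cumsum(power_spec, axis=0)
    total = np.maximum(cumulative[-1, :], 1e-12)   # avoid division by zero
    fraction = cumulative / total
    return np.argmax(fraction >= energy_threshold, axis=0)

# Toy power spectrogram (hypothetical values): 5 bins, 2 frames.
# Frame 1 equals frame 0 plus one unit of energy in every bin, mimicking
# interference that is spread roughly equally over frequency.
power_spec = np.array([
    [10.0, 11.0],
    [6.0, 7.0],
    [2.0, 3.0],
    [1.0, 2.0],
    [1.0, 2.0],
])

cutoff_60 = dynamic_cutoff_bins(power_spec, 0.60)
cutoff_20 = dynamic_cutoff_bins(power_spec, 0.20)
```

In this toy case the flat interference leaves the cutoff bin unchanged at the 60% threshold, while lowering the threshold to 20% moves the cutoff to a lower bin in both frames.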
Figure 1: Spectrogram and audio samples for Music Delta - Beatles
Figure 2: Spectrogram and audio samples for Music Delta - Hendrix
Figure 3: Spectrogram and audio samples for Music Delta - Rock
Figure 4: Spectrogram and audio samples for BKS - Bulldozer
Figure 5: Spectrogram and audio samples for Carlos Gonzalez - A Place For Us
Figure 6: Spectrogram and audio samples for Motor Tapes - Shore