Harmonic-Percussive Source Separation Results

In the evaluation of our approaches to HPSS, the first step was to decide on a single neural network architecture with which to proceed. We wished to determine whether the fully connected neural network, U-Net architecture, or mask inference architecture showed the greatest promise. To this end, we developed a toy example. We created a spectrogram of the mixture of the vocals, bass, and drums stems for the song “Music Delta - Beatles”. We trained and validated all three architectures on overlapping chunks of this spectrogram. We then tested all three architectures on the song “Music Delta - Beatles”. Each architecture recovered a harmonics stem and a percussion stem. We compared these recovered stems to the ground truth stems and computed the relevant SDR, SIR, and SAR values. We then evaluated the performance of each neural network relative to one another and relative to the ideal soft mask. Because the training set and test set were the same in this toy example, we expected all three architectures to perform very well.
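To make the three metrics concrete, the sketch below implements a simplified, instantaneous-gain variant of the BSS Eval decomposition in NumPy. It is an illustration under stated assumptions, not necessarily the evaluation code behind the tables: the function name `bss_eval_instantaneous` and the toy sine-wave "stems" are hypothetical, and the full BSS Eval procedure allows time-invariant filters rather than a single gain per source.

```python
import numpy as np

def bss_eval_instantaneous(references, estimate, target_index, eps=1e-12):
    """Simplified SDR/SIR/SAR (in dB) for one estimated stem.

    references: array-like of shape (n_sources, n_samples), ground truth stems
    estimate: array-like of shape (n_samples,), the recovered stem
    target_index: row of `references` that `estimate` is meant to recover
    """
    refs = np.asarray(references, dtype=float)
    est = np.asarray(estimate, dtype=float)
    target = refs[target_index]

    # Part of the estimate explained by the target stem alone.
    s_target = (est @ target) / (target @ target) * target

    # Part of the estimate explained by all ground truth stems together.
    coeffs, *_ = np.linalg.lstsq(refs.T, est, rcond=None)
    projection = refs.T @ coeffs

    e_interf = projection - s_target  # leakage from the other stems
    e_artif = est - projection        # everything no stem can explain

    def ratio_db(signal, noise):
        return 10 * np.log10((np.sum(signal**2) + eps) / (np.sum(noise**2) + eps))

    sdr = ratio_db(s_target, e_interf + e_artif)
    sir = ratio_db(s_target, e_interf)
    sar = ratio_db(s_target + e_interf, e_artif)
    return sdr, sir, sar

# Toy stems (hypothetical): two orthogonal sine waves of equal energy.
n = 1000
t = np.arange(n) / n
harmonics = np.sin(2 * np.pi * 5 * t)
percussion = np.sin(2 * np.pi * 50 * t)

# A "recovered" harmonics stem that leaks 10% of the percussion amplitude.
estimate = harmonics + 0.1 * percussion
sdr, sir, sar = bss_eval_instantaneous([harmonics, percussion], estimate, target_index=0)
print(f"SDR={sdr:.2f} dB, SIR={sir:.2f} dB, SAR={sar:.2f} dB")
```

In this toy case the only error is leakage from the other stem, so the SDR and SIR are both about 20 dB while the SAR is very large. The example also shows why a low SIR drags the SDR down: interference counts as part of the total distortion.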

Table 1: Comparison of different neural network architectures on the song "Music Delta - Beatles". All architectures were trained and validated on overlapping chunks of the spectrogram for the song "Music Delta - Beatles". All architectures were then tested on the song "Music Delta - Beatles" and tasked with recovering a harmonics stem and a percussion stem. Because the training set and testing set were the same in this toy example, all three architectures performed very well. The U-Net architecture and the mask inference architecture achieved near-ideal SDR, SIR, and SAR values. The mask inference architecture achieved the best SDR values for both recovered stems.

The mask inference architecture achieved the highest SDR values for both the harmonics and percussion stems. The U-Net architecture achieved comparable SDR values, while the fully connected network achieved the lowest SDR values by a significant margin. Both the U-Net architecture and the mask inference architecture achieved near-ideal performance. Ultimately, we decided to proceed with only the mask inference architecture because it slightly edged out the U-Net architecture in terms of performance even though the U-Net architecture had been trained on six times more data. Given more time and compute resources, we would have given the U-Net architecture further consideration.

Having decided on a single neural network architecture with which to proceed, the second step was to compare our four approaches to HPSS. Our first approach, dubbed “Smoothing Filters”, uses smoothing filters to create a binary mask. Our second approach, dubbed “Mask Inference 1”, uses the mask inference architecture trained only on overlapping chunks of the spectrogram of the song “Music Delta - Beatles”. In other words, our second approach uses the trained network from the toy example described above. Our third approach, dubbed “Mask Inference 2”, uses the mask inference architecture trained on non-overlapping chunks of the spectrograms of all thirteen songs by “Music Delta”. Our fourth approach, dubbed “Ideal Soft Mask”, uses ideal soft masks, which are unattainable in practice but provide an upper bound on the performance that we might expect from our other approaches. The SDR, SIR, and SAR values achieved by each of these approaches are summarized in the table below. Notable results are discussed following the table.
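The difference between overlapping and non-overlapping chunks can be sketched as follows. This is a minimal illustration, assuming the spectrogram is stored as a (frequency bins, time frames) array. The function name `chunk_spectrogram` and the chunk and hop sizes are hypothetical and are not the values used in our experiments.

```python
import numpy as np

def chunk_spectrogram(spec, chunk_frames, hop_frames):
    """Cut a (n_freq, n_frames) spectrogram into fixed-width chunks along time.

    hop_frames < chunk_frames gives overlapping chunks;
    hop_frames == chunk_frames gives non-overlapping chunks.
    Trailing frames that do not fill a whole chunk are dropped.
    """
    n_frames = spec.shape[1]
    if n_frames < chunk_frames:
        raise ValueError("spectrogram is shorter than one chunk")
    starts = range(0, n_frames - chunk_frames + 1, hop_frames)
    return np.stack([spec[:, s:s + chunk_frames] for s in starts])

# Toy "spectrogram" (hypothetical sizes): 4 frequency bins, 20 time frames.
spec = np.arange(4 * 20, dtype=float).reshape(4, 20)

overlapping = chunk_spectrogram(spec, chunk_frames=8, hop_frames=4)
non_overlapping = chunk_spectrogram(spec, chunk_frames=8, hop_frames=8)

print(overlapping.shape)      # (4, 4, 8): four chunks, each sharing 4 frames with its neighbor
print(non_overlapping.shape)  # (2, 4, 8): two disjoint chunks
```

A smaller hop yields more training examples from the same audio, but neighboring examples are highly correlated, so overlap increases the number of examples more than it increases the variety of the data.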

Table 2: Comparison of our different approaches to HPSS for six different songs: "Music Delta - Beatles", "Music Delta - Hendrix", "Music Delta - Rock", "BKS - Bulldozer", "Carlos Gonzalez - A Place For Us", and "Motor Tapes - Shore". The approach "Mask Inference 2" appears to achieve the best SDR values overall. On a few occasions, the approach "Smoothing Filters" appears to achieve comparable performance relative to the approach "Mask Inference 2". However, we are confident that training the mask inference architecture on more data would allow it to consistently outperform "Smoothing Filters" across an even larger number of songs.

Our smoothing filters approach to HPSS is a fairly unsophisticated approach. Many resources online suggest a similar algorithm for HPSS that involves applying moving median filters or moving average filters along the time and frequency axes of a spectrogram to enhance the harmonics and percussion respectively, allowing the creation of binary or soft masks. Due to its lack of sophistication, the smoothing filters approach serves as an excellent baseline against which we can compare our mask inference approaches. Given the time and compute resources spent training the mask inference architecture, which is a neural network with millions of learnable parameters, we should sincerely hope that the mask inference approaches outperform the smoothing filters approach.
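The general idea behind median-filter HPSS can be sketched in a few lines of NumPy. This is a minimal illustration of the commonly described algorithm, not our exact implementation: the helper names, the filter length of 9, and the tie-breaking rule are assumptions, and the spectrogram is assumed to be a (frequency bins, time frames) magnitude array.

```python
import numpy as np

def median_filter_along_axis(spec, size, axis):
    """Moving median of odd length `size` along one axis, with reflected edges."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    pad = size // 2
    pad_width = [(0, 0)] * spec.ndim
    pad_width[axis] = (pad, pad)
    padded = np.pad(spec, pad_width, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)

def hpss_binary_masks(spec, size=9):
    """Return (harmonic_mask, percussive_mask) as boolean arrays."""
    # Smoothing along time keeps horizontal ridges (sustained harmonics).
    harm_enhanced = median_filter_along_axis(spec, size, axis=1)
    # Smoothing along frequency keeps vertical ridges (percussive onsets).
    perc_enhanced = median_filter_along_axis(spec, size, axis=0)
    # Assumption: ties, including silent bins, are assigned to the harmonics.
    harm_mask = harm_enhanced >= perc_enhanced
    return harm_mask, ~harm_mask

# Toy spectrogram: one sustained tone (row 10) and one drum hit (column 20).
spec = np.zeros((33, 41))
spec[10, :] = 1.0
spec[:, 20] = 1.0

harm_mask, perc_mask = hpss_binary_masks(spec)
harm_est = harm_mask * spec
perc_est = perc_mask * spec
```

In this toy case the horizontal line is assigned to the harmonics estimate and the vertical line to the percussion estimate, except at the single bin where they cross. The approach has only a handful of parameters to tune, such as the filter length, which is why it makes a useful baseline.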

However, we observe that the mask inference approaches are not quite as dominant as we might have hoped. “Mask Inference 1” is the best-performing approach to HPSS for the song “Music Delta - Beatles”. This result is not surprising. In “Mask Inference 1”, the mask inference architecture was trained exclusively and extensively on the song “Music Delta - Beatles”. When we test the mask inference architecture on the same data used to train it, we expect near-ideal output. Our observations match our expectations: “Mask Inference 1” achieves near-ideal SDR, SIR, and SAR values on this song. However, “Mask Inference 1” performs poorly on the other songs. On those songs, it performs significantly worse than “Mask Inference 2” across the board, and it frequently underperforms “Smoothing Filters”. We conclude that “Mask Inference 1” struggles because the mask inference architecture has overfit its training data. In other words, the mask inference architecture performs extremely well on its training data but fails to generalize to songs outside of its training set.

“Mask Inference 2” performs worse than “Mask Inference 1” on the song “Music Delta - Beatles”, but we believe that this result is actually promising. The fact that “Mask Inference 2” performs worse than “Mask Inference 1” on the song “Music Delta - Beatles”, which is in the training set for both approaches, suggests that the mask inference architecture in “Mask Inference 2” has not overfit its training data. “Mask Inference 2” is the best-performing approach to HPSS for the songs “Music Delta - Hendrix” and “Music Delta - Rock”. Again, this result is not surprising because both of these songs are in the training set for the mask inference architecture in “Mask Inference 2”. Neither of these songs was in the training set for the mask inference architecture in “Mask Inference 1”.

Importantly, “Mask Inference 2” is the best-performing approach to HPSS, or is comparable to the best-performing approach, for the songs “BKS - Bulldozer”, “Carlos Gonzalez - A Place For Us”, and “Motor Tapes - Shore”. For these three songs, which are outside of the training set for both mask inference approaches, “Mask Inference 2” significantly outperforms “Mask Inference 1”. This result shows that the learned approach can generalize: simply increasing the size and variety of the training set improved the performance of the mask inference architecture on songs outside of the training set. Although “Mask Inference 2” is occasionally outperformed by “Smoothing Filters”, most notably in the SDR value for the percussion stem for the song “Motor Tapes - Shore”, our results suggest that “Mask Inference 2” could surpass “Smoothing Filters” with additional training on an expanded training set. On the other hand, we do not expect “Smoothing Filters” to improve dramatically through parameter tuning alone; a dramatic improvement would require a major modification to the algorithm.

Figure 1: The ground truth harmonics and percussion stems for the song "Motor Tapes - Shore" compared to the recovered harmonics and percussion stems using the approach "Mask Inference 2". The recovered harmonics stem retains faint traces of the percussion stem but is otherwise very high-quality. "Mask Inference 2" achieves an SDR value of 11.8886 for the recovered harmonics stem, which is about 3.6 below the ideal value of 15.4958. On the other hand, the recovered percussion stem retains significant traces of the harmonics stem, as evidenced by the strong horizontal lines in the spectrogram. This contamination lowers the SIR value to 0.0058 for the recovered percussion stem, which in turn drags down the SDR value. We theorize that "Mask Inference 2" struggles with the percussion stem for the song "Motor Tapes - Shore" because the percussion is quite broad in time rather than forming distinct vertical lines or delta functions in the spectrogram.

All approaches still fall short of the SDR, SIR, and SAR values achieved by the ideal soft mask. We believe that a mask inference approach or a similar machine learning approach has the best chance of achieving SDR, SIR, and SAR values comparable to those obtained by the ideal soft mask. That being said, we are particularly interested in the fact that the ideal soft mask produces unusually low SDR values for the percussion stem for “Music Delta - Rock”, “Carlos Gonzalez - A Place For Us”, and “Motor Tapes - Shore”. Upon further investigation, we attribute these low SDR values to the fact that the percussion appears weakly smeared along the time axis in the spectrograms for these songs. In other words, the percussion forms broad peaks in the spectrogram rather than neat delta functions. Because the drums are smeared, we believe that more time-frequency bins are occupied by both the harmonics and the percussion, which makes separating the two stems difficult. In general, we expect that the harmonics stem occupies many of these shared bins strongly while the percussion stem occupies them only weakly, which would explain why only the SDR values for the percussion stem suffer.
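The argument about shared time-frequency bins can be checked on a toy example. The sketch below assumes an amplitude-ratio definition of the ideal soft mask, which is one common choice and not necessarily the exact definition used in our experiments. It also uses a plain signal-to-error ratio on the STFT coefficients in place of the full SDR computation. All names and sizes are hypothetical.

```python
import numpy as np

def ideal_soft_masks(harm_stft, perc_stft, eps=1e-10):
    """Amplitude-ratio soft masks computed from the ground truth stems."""
    harm_mag = np.abs(harm_stft)
    perc_mag = np.abs(perc_stft)
    total = harm_mag + perc_mag + eps
    return harm_mag / total, perc_mag / total

def signal_to_error_db(reference, estimate):
    """Simple signal-to-error ratio in dB (a stand-in for SDR)."""
    error = reference - estimate
    return 10 * np.log10(np.sum(np.abs(reference) ** 2) / np.sum(np.abs(error) ** 2))

rng = np.random.default_rng(0)
shape = (64, 32)  # (frequency bins, time frames), hypothetical

def random_phase(shape):
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, shape))

# Worst case for separation: both stems occupy every bin, with the
# harmonics strong (magnitude 1.0) and the percussion weak (magnitude 0.1).
harm = 1.0 * random_phase(shape)
perc = 0.1 * random_phase(shape)
mix = harm + perc

mask_h, mask_p = ideal_soft_masks(harm, perc)
sdr_h = signal_to_error_db(harm, mask_h * mix)
sdr_p = signal_to_error_db(perc, mask_p * mix)
print(f"harmonics: {sdr_h:.1f} dB, percussion: {sdr_p:.1f} dB")
```

Because the two masks sum to one, the two estimates carry the same error energy. That error is small relative to the strong harmonics but large relative to the weak percussion, so the percussion score is lower by the energy ratio between the stems, which is 20 dB in this toy case. This is consistent with the percussion stem alone suffering when the stems overlap.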