
Challenges

We encountered several challenges while developing our source separation algorithms. The most notable or interesting ones are documented on this page. Other challenges, including those related to non-negative matrix factorization, Wiener filtering, and phase estimation of the recovered stems, were deemed outside the scope of this project and are not documented here.

How to Train Your Neural Network

As it turns out, there are several practical considerations for training a neural network. Each of our neural networks receives as input a chunk of the spectrogram for a given song and returns as output a soft mask for that chunk to separate the harmonics from the percussion. One of the parameters that we can control in training is the resolution of the spectrogram, i.e., the number of time-frequency bins. Ideally, we would like to use a spectrogram with as high a resolution as possible (as many windows in time and as many frequency bins as possible). A higher resolution would theoretically allow us to recover higher-quality stems using binary or soft masking because we would be able to partition the energy in the spectrogram with more fine-grained control. However, training our neural networks on higher-resolution spectrograms would be more time- and compute-intensive: increasing the resolution of the spectrogram increases the size of our training set and the number of learnable parameters in our neural networks, so training takes longer and requires more memory. Thus, there is a tradeoff between the quality of the recovered stems and the compute resources required to train and test our neural networks. Ultimately, we settled on a resolution that allowed us to reasonably train and test our neural networks on the Great Lakes computing cluster while still yielding recovered stems of acceptable quality. If we had more time, we would have liked to further explore the ramifications of changing the resolution of the spectrogram and systematically determine the point beyond which increasing the resolution yields diminishing returns on the quality of the recovered stems.
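
To make this tradeoff concrete, the sketch below shows how the analysis parameters passed to Matlab's spectrogram function set the number of time-frequency bins. The file name, window length, hop length, and FFT length are illustrative placeholders, not the values used by our networks.

% Minimal sketch of how the STFT parameters set the spectrogram resolution.
% All names and values here are illustrative assumptions.
[x, fs] = audioread('song.wav');        % placeholder file name for a mixture
x = mean(x, 2);                         % collapse to mono if the file is stereo
winLen = 2048;                          % samples per analysis window (time resolution)
hopLen = winLen / 4;                    % hop between windows (75% overlap)
nfft   = 2048;                          % FFT length (frequency resolution)
S = spectrogram(x, hann(winLen), winLen - hopLen, nfft);
size(S)                                 % nfft/2 + 1 frequency bins by roughly length(x)/hopLen time frames
% Doubling winLen and nfft doubles the number of frequency bins, and with it
% the size of every training example and of the network's input layer.
% (spectrogram returns a one-sided output here for real input; our networks
% operate on the 1024-bin layout described further below.)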

 

All of our neural networks leverage the fact that the mixture of the desired stems for a given song is a real-valued audio signal. As a result, the spectrogram for a given song exhibits symmetry along the frequency axis: the magnitude of the spectrogram exhibits even symmetry, and the phase exhibits odd symmetry. Thus, half of the spectrogram can be discarded without any loss of information, and the discarded half can be recovered at any time by taking the complex conjugate of the remaining half. With this observation in mind, we chose to train and validate our neural networks on chunks of only half of the spectrogram for a given song, which reduces the size of our training set and the number of learnable parameters in our neural networks.
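
The snippet below is a minimal illustration of this property, using Matlab's default fft ordering (DC component in bin 1, Nyquist bin in bin N/2 + 1); our spectrograms store the bins in a shifted order, described next, but the symmetry is the same.

% Minimal illustration: the spectrum of a real-valued frame is conjugate-symmetric.
x = randn(1024, 1);                                 % any real-valued frame of audio
X = fft(x);                                         % 1024 complex frequency bins, DC in bin 1
k = (2:512).';                                      % bins above DC and below the Nyquist bin
pairErr = max(abs(X(1024 - k + 2) - conj(X(k))))    % paired bins are complex conjugates
% pairErr sits at floating-point round-off, so bins 514-1024 carry no new
% information and can later be rebuilt as conj(X(512:-1:2)).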

 

However, discarding half of the spectrogram must be done with care. Our neural networks operate on spectrograms that have 1024 frequency bins. It is incorrect to discard frequency bins 513 - 1024, and it is equally incorrect to discard frequency bins 1 - 512. In the spectrograms that are fed into our neural networks, frequency bin 512 is the DC component corresponding to a frequency of 0 Hz. Consequently, frequency bin 512 is the point of reflection about which the magnitude of the spectrogram exhibits even symmetry and the phase exhibits odd symmetry; frequency bin 1024 is the other such point of reflection. It is frequency bins 1 - 511 and 513 - 1023 that are symmetric: if one of these batches of frequency bins is known, the other can be recovered by taking the complex conjugate. As a result, either bins 1 - 511 or bins 513 - 1023 can be discarded. In general, we opt to discard bins 1 - 511.

 

For the fully connected neural network and the mask inference architecture, discarding bins 1 - 511 and using bins 512 - 1024 occurs without issue. However, the structure of the U-Net architecture requires that the number of frequency bins be a power of two; in other words, the U-Net must be trained on exactly 512 of the remaining 513 frequency bins. In our case, we decided to train and test the U-Net architecture on frequency bins 512 - 1023. By symmetry, we are able to recover frequency bins 1 - 511, which means that the only frequency bin we cannot recover is frequency bin 1024. Fortunately, frequency bin 1024 is real-valued and typically has a magnitude on the order of 10^-4, so we can reasonably approximate its value as zero. Alternatively, we could have trained the U-Net architecture on frequency bins 513 - 1024 and recovered frequency bins 1 - 511 by symmetry, in which case the only frequency bin we could not recover would have been frequency bin 512. However, frequency bin 512 is the DC component and has a significant magnitude that cannot be approximated as zero. For this reason, we trained and tested the U-Net architecture on frequency bins 512 - 1023.
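
A minimal sketch of the corresponding reconstruction step is shown below, assuming one complex-valued chunk whose 512 rows are frequency bins 512 - 1023; the variable names and the random stand-in data are illustrative only.

% Hedged sketch: rebuilding all 1024 frequency bins from the 512 bins
% (512 - 1023) that the U-Net operates on.
S_half = randn(512, 10) + 1i*randn(512, 10);     % stand-in chunk: rows = bins 512-1023, columns = time frames
S_full = zeros(1024, size(S_half, 2));
S_full(512:1023, :) = S_half;                    % bins 512-1023: used directly
S_full(1:511, :)    = conj(S_half(512:-1:2, :)); % bins 1-511: conjugate mirror of bins 513-1023
S_full(1024, :)     = 0;                         % bin 1024: magnitude ~10^-4, approximated as zero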

 

The final practical consideration for training our neural networks was our access to the Great Lakes computing cluster. Due to the time and memory requirements of training and testing our neural networks, we were unable to do so locally (i.e., on our own laptops and computers); instead, we trained and tested our neural networks on the Great Lakes computing cluster. To run Matlab on the cluster, it is necessary to submit a job, which requests a certain number of hours, number of cores, and amount of memory to be allocated to the Matlab instance. Once a job is submitted, it sits in a queue and waits to be granted; only once it is granted are we able to launch Matlab. The amount of time a job spends in the queue depends on the compute resources requested: requesting more compute resources typically results in longer wait times.

 

Our ability to train and test our neural networks was therefore constrained: if training or testing required too much time or memory, our job risked never being granted. This constraint became most apparent when training our mask inference architecture on all thirteen songs by “Music Delta”. Thankfully, we were able to run Matlab on the Great Lakes computing cluster once for 24 hours with 500 GB of memory, which was just enough time and memory to train our mask inference architecture as desired. Unfortunately, we were unable to reserve enough compute resources to train it again with different hyperparameters; in fact, in the last weeks of the semester, we had trouble getting any jobs granted at all. We theorize that the Great Lakes computing cluster was experiencing heavier-than-normal use at the end of the semester, resulting in greater competition for limited compute resources. As a result, we were unable to explore the effects of various hyperparameters (e.g., number of hidden units, sequence length, number of layers, etc.) on the performance of our mask inference architecture for HPSS. If we had more time, we would have liked to explore hyperparameter variation further.

Objective Measures for Evaluating Performance

We computed the objective measures used in our project (SDR, SAR, SIR), but we observed that these measures did not always correlate directly with good recovered audio. For example, a recovered stem may achieve a high SDR yet have unacceptable levels of interference from another stem or unacceptable distortion introduced by the source separation algorithm. In practice, we used a combination of the quantitative metrics and qualitative analysis of the audio to judge the accuracy of our stems; in other words, we made sure to listen to the recovered stems to assess the effectiveness of our source separation algorithms rather than blindly accepting the SDR, SAR, and SIR values. To further remedy this problem, we could have used an additional set of metrics to analyze our stems. The authors of the paper that proposed SDR, SAR, and SIR also released a separate Matlab toolbox, PEASS, that similarly computes objective, quantitative measures for evaluating source separation algorithms; however, the measures in PEASS are purportedly better correlated with human perception than SDR, SAR, and SIR. Over the course of this project, our decision to stick with SDR, SAR, and SIR was motivated by the fact that most papers report these numbers and that they are easy to compute. With more time, we would have liked to explore PEASS and see how each set of objective measures compares with our subjective perception.
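
For a sense of the kind of quantity involved, the sketch below computes a simplified signal-to-distortion style ratio between a reference stem and a recovered stem. This is a hypothetical helper, not the BSS Eval routine we actually used: the real SDR, SIR, and SAR decompose the error into separate interference, noise, and artifact terms, which this simplified version lumps together.

% Hypothetical helper (save as simpleSDR.m); not the BSS Eval implementation.
% Computes a simplified signal-to-distortion ratio in dB.
function sdr = simpleSDR(ref, est)
    ref = ref(:);                                  % reference stem as a column vector
    est = est(:);                                  % recovered stem, same length as ref
    err = ref - est;                               % interference, artifacts, and noise lumped together
    sdr = 10 * log10(sum(ref.^2) / sum(err.^2));   % higher means the estimate is closer to the reference
end

A higher value means the recovered stem is closer to the reference, but a single lumped number like this cannot distinguish interference from artifacts, which is exactly why we also listened to the stems.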

