Masking

Our stem separation methods center on masking. Masking generates a matrix (called a mask) that assigns a weight to each time-frequency bin of the short-time Fourier transform (STFT) of a mixture of stems. The mask is then multiplied element-wise with the spectrogram of the mixture to extract a desired stem. In our approach, we attempt to generate masks that match the distinguishing characteristics of each stem so that we can extract each audio signal. An example of this process is shown below, where a vocals soft mask (one whose weights vary continuously rather than being restricted to 0 or 1) separates the vocals from the vocals-bass mixture of the song "Music Delta - Beatles".

Figure 1: Vocals-Bass Mixture, Vocals Soft Mask, and Output Spectrograms
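The masking step itself can be sketched in a few lines of NumPy. This is a minimal illustration, not our actual implementation: the array mixture_stft stands in for a real STFT, the mask values are made up, and using the complementary mask (1 - mask) for the other stem is an assumption made here for illustration.

```python
import numpy as np

def apply_mask(mixture_stft, mask):
    """Weight every time-frequency bin of a complex STFT by the matching mask entry."""
    if mixture_stft.shape != mask.shape:
        raise ValueError("mask must have one weight per time-frequency bin")
    return mask * mixture_stft  # element-wise product, not a matrix product

# Toy stand-in for the STFT of a vocals-bass mixture (rows: frequency, columns: time).
rng = np.random.default_rng(0)
n_freq, n_frames = 5, 4
mixture_stft = rng.standard_normal((n_freq, n_frames)) \
    + 1j * rng.standard_normal((n_freq, n_frames))

# Made-up soft mask: small weights in the low bins, large weights in the high bins.
vocals_mask = np.full((n_freq, n_frames), 0.1)
vocals_mask[3:, :] = 0.9

vocals_est = apply_mask(mixture_stft, vocals_mask)
# Assumption: the remaining energy is assigned to the other stem.
bass_est = apply_mask(mixture_stft, 1.0 - vocals_mask)
```

Because the two masks sum to one in every bin, the two estimates add back up to the mixture. An inverse STFT of each estimate would then give the time-domain stem.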

We observed several characteristics of each stem and tried to leverage them in our masking approaches. Vocals appear as horizontal structures in the spectrogram, waver over time, are strongly harmonic, occupy higher frequencies, and are largely aperiodic in time. Bass also appears as horizontal structures, but is flat over time, weakly harmonic, occupies lower frequencies, and is largely periodic in time. Drums appear as vertical, pronounced structures. For example, the figure below shows a time slice of the spectrogram of the vocals-bass mixture together with our vocals mask. Here we leverage the strongly harmonic nature of the vocals to generate a soft mask that peaks at the harmonics of the mixture that we expect to belong to the vocals stem.

Figure 2: Vocals Harmonics Mask Time Slice and Vocals-Bass Spectrogram Time Slice
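One way to picture a harmonics-based soft mask for a single time slice is as a comb of smooth peaks at integer multiples of a fundamental frequency. The sketch below is illustrative only: the function name harmonic_soft_mask, the Gaussian peak shape, and the parameters f0_hz, n_harmonics, and width_hz are all assumptions, and the fundamental is simply given rather than estimated from the mixture.

```python
import numpy as np

def harmonic_soft_mask(freqs_hz, f0_hz, n_harmonics=10, width_hz=30.0):
    """Soft mask for one time slice that peaks at the harmonics k * f0_hz.

    Each harmonic contributes a Gaussian bump (an assumed shape); the mask
    takes the largest bump in each frequency bin, so values stay in [0, 1].
    """
    mask = np.zeros_like(freqs_hz, dtype=float)
    for k in range(1, n_harmonics + 1):
        bump = np.exp(-0.5 * ((freqs_hz - k * f0_hz) / width_hz) ** 2)
        mask = np.maximum(mask, bump)
    return mask

# Hypothetical frequency axis (5 Hz spacing) and fundamental for one slice.
freqs = np.linspace(0.0, 4000.0, 801)
slice_mask = harmonic_soft_mask(freqs, f0_hz=220.0, n_harmonics=5)
```

Multiplying slice_mask by the corresponding column of the mixture STFT keeps the bins near the assumed vocal harmonics and attenuates the bins between them.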

While we could in principle generate three different masks and extract all of the stems at once, we noticed that the bass and vocals resemble each other while the drums are very different, so we use an iterative masking approach instead. The first mask separates the drums from the vocals and bass, a process referred to as harmonic-percussive source separation (HPSS). The second mask then separates the bass from the vocals, a process referred to as vocals-bass source separation (VBSS). We have generally found this approach to be successful, and we use it as the basis for the algorithms discussed in this report. All of our work and results center on different methods of generating the HPSS and VBSS masks.
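The two-stage structure can be sketched as follows. The mask generators here are stand-ins rather than the methods developed in this report: the HPSS mask uses median filtering along time and frequency, a commonly used technique that exploits horizontal versus vertical structure, and the VBSS mask is a simple frequency crossover. All function names, the filter size, and the crossover parameters are assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_1d(x, size, axis):
    """Running median of odd length `size` along `axis`, with edge padding."""
    pad = size // 2
    pad_width = [(0, 0)] * x.ndim
    pad_width[axis] = (pad, pad)
    padded = np.pad(x, pad_width, mode="edge")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)

def percussive_soft_mask(magnitude, size=9, eps=1e-10):
    """Stand-in HPSS mask: near 1 on vertical structures, near 0 on horizontal ones."""
    harmonic = median_filter_1d(magnitude, size, axis=1)    # smooth along time
    percussive = median_filter_1d(magnitude, size, axis=0)  # smooth along frequency
    return percussive**2 / (harmonic**2 + percussive**2 + eps)

def vocals_soft_mask(freqs_hz, crossover_hz=250.0, slope_hz=50.0):
    """Stand-in VBSS mask: a sigmoid that favors bins above an assumed crossover."""
    weights = 1.0 / (1.0 + np.exp(-(freqs_hz - crossover_hz) / slope_hz))
    return weights[:, None]  # column vector, broadcast over all frames

def separate(mixture_stft, freqs_hz):
    """Stage 1 (HPSS) removes the drums; stage 2 (VBSS) splits what remains."""
    drums_mask = percussive_soft_mask(np.abs(mixture_stft))
    drums = drums_mask * mixture_stft
    vocals_bass = (1.0 - drums_mask) * mixture_stft
    vocals_mask = vocals_soft_mask(freqs_hz)
    vocals = vocals_mask * vocals_bass
    bass = (1.0 - vocals_mask) * vocals_bass
    return drums, bass, vocals

# Toy mixture: one horizontal line (a sustained tone) and one vertical line (a hit).
n_freq, n_frames = 64, 64
magnitude = np.zeros((n_freq, n_frames))
magnitude[10, :] = 1.0
magnitude[:, 30] = 1.0
mixture = magnitude.astype(complex)
freqs = np.linspace(0.0, 4000.0, n_freq)

drums, bass, vocals = separate(mixture, freqs)
```

In this toy case the vertical line ends up in the drums estimate and the horizontal line passes on to the second stage. Because each stage uses a mask and its complement, the three estimates sum back to the mixture.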