EECS351 STEM SEPARATION
Harmonic-Percussive Source Separation Methods
Our approach to stem separation focuses on time-frequency masking. In time-frequency masking, we consider the spectrogram of the mixture of the desired stems. This spectrogram, obtained by taking the short-time Fourier transform, consists of many time-frequency bins. By applying a set of weights to these bins, we can extract a portion of the mixture. Ideally, an appropriately chosen set of weights allows us to recover a particular stem. This set of weights is referred to as a mask. In a binary mask, the weight applied to each bin is either zero or one. In a soft mask, the weight applied to each bin is a real number, typically between zero and one. Our approach to stem separation uses iterative masking: multiple masks are applied in succession to recover the desired stems. Our first mask separates the mixture into harmonics and percussion, where harmonics refers to the vocals and bass and percussion refers to the drums. Our second mask further separates the harmonics into vocals and bass.
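As a minimal sketch of masking in MATLAB (using the Signal Processing Toolbox), the snippet below applies a binary mask and a soft mask in succession; the placeholder signal, the window length, and the mask definitions are purely illustrative, not our actual parameters:

```matlab
fs  = 44100;
mixture = randn(fs, 1);                      % placeholder for a real mixture
win = hann(2048, 'periodic');

[S, f, t] = stft(mixture, fs, 'Window', win, ...
    'OverlapLength', 1024, 'FFTLength', 2048);

% A binary mask assigns each bin a weight of zero or one.
binaryMask = abs(S) > median(abs(S(:)));     % e.g., keep high-energy bins

% A soft mask assigns each bin a weight between zero and one.
softMask = abs(S) ./ (abs(S) + median(abs(S(:))));

% Applying a mask extracts a portion of the mixture; applying masks in
% succession (iterative masking) isolates a particular stem.
stemSpec = S .* binaryMask .* softMask;
stemEst  = istft(stemSpec, fs, 'Window', win, ...
    'OverlapLength', 1024, 'FFTLength', 2048);
```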
HPSS - Smoothing Filters
Our first approach to harmonic-percussive source separation (HPSS) uses smoothing filters to create a binary mask based on [3]. We first apply a moving median filter along the time axis of the baseline spectrogram. In doing so, we create a harmonics-enhanced copy of the spectrogram that appears smeared across the time axis. This copy is “harmonics-enhanced” because harmonics (i.e., vocals and bass) occupy a narrow range of frequencies for a long time. By smearing the spectrogram across the time axis, these horizontally-oriented harmonics are made more pronounced. We then apply a moving median filter along the frequency axis of the baseline spectrogram. In doing so, we create a percussion-enhanced copy of the spectrogram that appears smeared across the frequency axis. This copy is “percussion-enhanced” because percussion (i.e., drums) occupies a wide range of frequencies for a short time. By smearing the spectrogram across the frequency axis, this vertically-oriented percussion is made more pronounced.
Figure 1: Spectrogram of the song "Music Delta - Beatles" containing the vocals, bass, and drums stems (top) and the harmonics-enhanced version of the same spectrogram (bottom). Note that the harmonics-enhanced version of the spectrogram appears smeared along the time axis, which has the effect of making the harmonics (i.e., the vocals and bass) more pronounced while diminishing the percussion (i.e., the drums). The harmonics-enhanced version of the spectrogram is compared against a percussion-enhanced version of the spectrogram to create a binary mask.
For each time-frequency bin, we then compare the magnitudes in the harmonics-enhanced and percussion-enhanced copies of the spectrogram. If the magnitude is greater in the harmonics-enhanced copy, then the bin is added to the binary mask (i.e., the weight of the bin is set to one). Otherwise, the weight of the bin is set to zero. By multiplying the mask with the baseline spectrogram, we obtain the estimated spectrogram for the harmonics. By multiplying the logical inverse of the mask with the baseline spectrogram, we obtain the estimated spectrogram for the percussion. Taking the inverse short-time Fourier transform of the estimated spectrograms allows us to recover the harmonics and percussion stems.
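The whole pipeline fits in a few lines of MATLAB. This sketch reuses S, fs, and win from the earlier snippet; the filter length of 17 bins is an illustrative choice, not a tuned value from [3]:

```matlab
magS = abs(S);                      % baseline magnitude spectrogram

% Moving median filter along the time axis (dimension 2) smears the
% spectrogram horizontally: the harmonics-enhanced copy.
H = movmedian(magS, 17, 2);

% Moving median filter along the frequency axis (dimension 1) smears the
% spectrogram vertically: the percussion-enhanced copy.
P = movmedian(magS, 17, 1);

% Binary mask: one wherever the harmonics-enhanced magnitude is greater.
harmonicMask = H > P;

% Multiply the mask (and its logical inverse) with the baseline spectrogram.
harmonicSpec   = S .* harmonicMask;
percussiveSpec = S .* ~harmonicMask;

% Inverse STFT recovers the harmonics and percussion stems.
harmonicEst   = istft(harmonicSpec, fs, 'Window', win, ...
    'OverlapLength', 1024, 'FFTLength', 2048);
percussiveEst = istft(percussiveSpec, fs, 'Window', win, ...
    'OverlapLength', 1024, 'FFTLength', 2048);
```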
Machine Learning
Our subsequent approaches to HPSS use machine learning to create a soft mask. Machine learning is a well-established approach to stem separation. In our machine learning approaches, a neural network is trained to estimate the ideal soft mask. Specifically, the neural network receives as input a chunk of the spectrogram of the mixture of the desired stems and returns as output a soft mask that separates the harmonics from the percussion within that chunk. The neural network is trained and validated on sets of spectrogram chunks and the known ideal soft masks for those chunks. In training, the neural network optimizes its learnable parameters (i.e., weights) to minimize the half-mean-squared-error loss between the output soft mask and the ideal soft mask.
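The text above does not pin down how the ideal soft mask is computed; one common choice (an assumption here, since the isolated stems are available during training) is the ratio of the harmonic magnitude to the total magnitude in each bin. The placeholder matrices below stand in for the isolated-stem magnitudes and the network's output:

```matlab
% Placeholders for the isolated harmonic and percussive magnitudes and
% for the mask produced by the network.
Hmag = rand(1024, 128);  Pmag = rand(1024, 128);  outputMask = rand(1024, 128);

% One common choice of ideal soft mask (an assumption; other definitions
% exist): the harmonic share of the total magnitude in each bin.
idealMask = Hmag ./ (Hmag + Pmag + eps);

% Half-mean-squared-error loss between the output mask and the ideal mask.
loss = 0.5 * mean((outputMask - idealMask).^2, 'all');
```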
We observe that the mixture of the desired stems is a real-valued audio signal. As a result, the spectrogram of the mixture exhibits symmetry along the frequency axis: the magnitude exhibits even symmetry, and the phase exhibits odd symmetry. Consequently, half of the spectrogram can be discarded without any loss of information. We can recover the discarded half at any time by conjugating and mirroring the remaining half of the spectrogram. With this observation in mind, we choose to train and validate our neural networks on chunks of only half of the spectrogram of the mixture of the desired stems. Using half of the spectrogram instead of the whole spectrogram reduces the number of learnable parameters in our neural network and reduces the size of our training and validation sets. Therefore, using half of the spectrogram reduces the compute resources needed to train our neural network at no cost, since no information is lost.
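The conjugate-symmetry argument is easy to verify for a single FFT frame; the same relation holds column-by-column across the spectrogram. For an even FFT length N, bins 1 through N/2+1 (DC through Nyquist) determine the rest:

```matlab
N = 2048;
frame = randn(N, 1);                 % one real-valued window of audio
X     = fft(frame);                  % full spectrum, N bins

Xhalf = X(1:N/2+1);                  % keep DC through Nyquist only

% Recover the discarded half by conjugating and mirroring the kept bins
% (bins 2..N/2; DC and Nyquist are their own mirror images).
Xrec = [Xhalf; conj(Xhalf(end-1:-1:2))];

max(abs(Xrec - X))                   % ~0, up to floating-point error
```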
Lastly, instead of using the spectrogram directly as input to the neural network, we use the log-magnitude representation of the spectrogram. The log-magnitude representation results in a smoother and broader distribution of input values, which is preferable for a neural network. Additionally, the log-magnitude representation more closely resembles human perception of sound, which is logarithmic rather than linear.
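For a complex spectrogram chunk S (as in the earlier sketches), the conversion is a one-liner; the additive eps, which avoids taking the log of zero, is a common convention rather than something specified above:

```matlab
% Log-magnitude representation of the spectrogram chunk.
logMag = log(abs(S) + eps);
```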
HPSS - Fully Connected Neural Network
Our first neural network architecture is a fully connected neural network based on [4]. In a fully connected neural network, each node in a given layer is connected to every node in the next layer. In [4], this architecture is used for speech separation and produces good results. However, we were not sure that this architecture would generalize to the more complex problem of stem separation. The speech signals in [4] are sampled at only 4 kHz, while the songs in the musdb18 dataset are sampled at 44.1 kHz. Due to the lower sampling frequency, [4] is able to train its neural network on spectrograms with only 128 frequency bins. In this case, each frequency bin corresponds to a frequency range of ~30 Hz. If we trained our neural network on spectrograms with only 128 frequency bins, each frequency bin would correspond to a frequency range of ~350 Hz, which is far too coarse for music. Thus, we decided to train our neural network on spectrograms with 1024 frequency bins. In our case, each frequency bin corresponds to a frequency range of ~40 Hz, which is much more acceptable. However, our neural network is much larger as a result. We also added an additional fully connected layer relative to [4]. This architecture primarily served as a feasibility study to see if machine learning could reasonably be applied to stem separation.
Figure 2: Example of a fully connected layer within a neural network. Note that a fully connected layer can transform an input layer of any size into an output layer of any size. Therefore, a fully connected layer can be useful for reshaping data. A fully connected layer also introduces a large number of learnable parameters. [7]
Figure 3: Detailed analysis of the fully connected neural network. Fully connected layers are combined with batch normalization layers and dropout layers to create the neural network. A sigmoid activation layer is used to create the soft mask. Due to the use of fully connected layers, the number of learnable parameters is a whopping 421.1 million, nearly two orders of magnitude greater than the number of learnable parameters in our other neural network architectures.
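For reference, a fully connected mask-estimation stack of the kind shown in Figure 3 might be assembled in MATLAB's Deep Learning Toolbox as follows. The layer sizes, activation function, chunk width, and dropout probability here are assumptions for illustration, not our exact configuration:

```matlab
numBins   = 1024;                    % frequency bins per frame
numFrames = 128;                     % frames per spectrogram chunk (assumed)
inputSize = numBins * numFrames;

layers = [
    featureInputLayer(inputSize)     % flattened log-magnitude chunk
    fullyConnectedLayer(2048)
    batchNormalizationLayer
    reluLayer                        % activation choice is an assumption
    dropoutLayer(0.5)
    fullyConnectedLayer(2048)
    batchNormalizationLayer
    reluLayer
    dropoutLayer(0.5)
    fullyConnectedLayer(inputSize)   % one weight per time-frequency bin
    sigmoidLayer                     % soft mask values in [0, 1]
];
```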
HPSS - U-Net Convolutional Neural Network
Our second neural network architecture is a convolutional neural network known as a U-Net. The U-Net architecture was originally proposed in [5] for use in biomedical imaging to classify individual pixels. In [6], the U-Net architecture is adapted for use in stem separation. Specifically, [6] uses the U-Net architecture to separate the vocals from the accompaniment. The first half of the U-Net architecture consists of a series of encoding layers that form a contracting network. This contracting network receives an image as input and creates a compressed and deep representation of the input image. This representation is “compressed” because the height and width of the image shrink as you progress further into the contracting network. This representation is “deep” because the number of feature channels increases as you progress further into the contracting network. In this fashion, the purpose of the contracting network is to extract high-level information from the input image and forward this information to subsequent layers. The second half of the U-Net architecture consists of a series of decoding layers that form an expansive network. This expansive network receives the compressed and deep representation of the input image from the contracting network and returns an image as output. This output image has the same height and width as the input image and only a single feature channel. In this fashion, the purpose of the expansive network is to use the high-level information from the contracting network to produce a high-resolution output image. In addition to the contracting network and expansive network, skip connections connect layers at the same depth. Skip connections allow low-level information to bypass deeper layers in the U-Net architecture, which is necessary to preserve fine details in the output image. In stem separation, these fine details may drastically influence human perception.
Figure 4: The U-Net architecture for separating vocals from accompaniment. Note that the dimensions of the input spectrogram chunks must be powers of two. The contracting network creates a compressed and deep representation of the input spectrogram chunk, and then the expansive network transforms this representation into a soft mask. The exact nature of the skip connections (i.e., the concatenation operations) is not specified in the paper. [6]
Matlab provides an implementation of the U-Net architecture according to [5]. However, we modified this implementation according to the specifications in [6] (e.g., stride, kernel size, dropout, etc.) with a few notable exceptions. [6] downsamples the input audio to 8192 Hz to speed up processing. We do not downsample the input audio from 44.1 kHz for fear of introducing distortion into the recovered stems that may adversely affect the SAR. [6] uses a hop length of 768 samples when computing the short-time Fourier transform. We use a hop length of a single sample to increase the time resolution of our spectrograms. [6] computes the loss between a target spectrogram and the output mask applied to the input spectrogram. We compute the loss between a target mask and the output mask, which reduces the complexity of our neural network. Lastly, the exact nature of the skip connections is unclear in [6]. We use depth concatenation layers to implement the skip connections to the best of our ability in a manner that seems reasonably efficient.
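To illustrate how a skip connection can be implemented with a depth concatenation layer, here is a minimal one-level encoder/decoder in MATLAB's Deep Learning Toolbox; the input size, kernel sizes, strides, and channel counts are placeholders rather than the values from [6]:

```matlab
layers = [
    imageInputLayer([512 128 1], 'Name', 'in', 'Normalization', 'none')
    convolution2dLayer(5, 16, 'Stride', 2, 'Padding', 'same', 'Name', 'enc_conv')
    batchNormalizationLayer('Name', 'enc_bn')
    leakyReluLayer(0.2, 'Name', 'enc_act')       % contracting network
    transposedConv2dLayer(5, 16, 'Stride', 2, 'Cropping', 'same', 'Name', 'dec_tconv')
    reluLayer('Name', 'dec_act')                 % expansive network
    depthConcatenationLayer(2, 'Name', 'skip')   % skip connection joins here
    convolution2dLayer(1, 1, 'Name', 'out_conv') % collapse to one channel
    sigmoidLayer('Name', 'mask')                 % soft mask output
];

lgraph = layerGraph(layers);

% Route the input around the deeper layers so low-level detail is
% concatenated with the decoder output along the channel dimension.
lgraph = connectLayers(lgraph, 'in', 'skip/in2');
```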
Figure 5: The implementation of the U-Net architecture provided by Matlab (top) versus our own implementation of the U-Net architecture according to [6] (bottom). Unlike the default implementation of the U-Net architecture provided by Matlab, our contracting network does not make use of pooling layers. Our contracting network uses fewer convolutional layers and more batch normalization layers. Our U-Net architecture does not include a bridge between the contracting network and the expansive network. Our expansive network uses fewer convolutional layers and simplifies the implementation of the skip connections.
HPSS - Mask Inference Architecture
Our last neural network architecture is a recurrent neural network known as the mask inference architecture. This architecture consists of a stack of bidirectional long short-term memory (BLSTM) layers, which allow the neural network to learn dependencies between data at different points in time. Specifically, the neural network treats each chunk of the spectrogram of the mixture of the desired stems as a sequence of vectors. Each vector represents the frequency content of the mixture within a particular window of time. The BLSTM layers allow the neural network to look forward and backward in time to predict features of the output soft mask based on the input sequence. Our implementation of the mask inference architecture is based on the specifications in [7] (e.g., number of layers, hidden size, dropout, etc.). Unlike [7], which computes the loss between a target spectrogram and the output mask applied to the input spectrogram, we compute the loss between a target mask and the output mask for the same reasons as described above for the U-Net architecture.
Figure 6: The mask inference architecture as presented in [7]. A stack of bidirectional long short-term memory (BLSTM) layers allows the neural network to learn dependencies between data at different points in time. A fully connected layer reshapes the data into the appropriate dimensions for the soft mask. A sigmoid activation layer is used to create the soft mask.
Figure 7: Detailed analysis of the recurrent neural network known as the mask inference architecture. BLSTM layers are combined with dropout layers to create the neural network. A sigmoid activation layer is used to create the soft mask. The number of learnable parameters is a modest 8.7 million, which is on par with the number of learnable parameters in our implementation of the U-Net architecture and nearly two orders of magnitude less than the number for the fully connected neural network.
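A mask inference stack of the kind shown in Figures 6 and 7 might look as follows in MATLAB's Deep Learning Toolbox; the hidden size, layer count, and dropout probability are assumptions for illustration, not necessarily the specifications from [7]:

```matlab
numBins    = 1024;   % frequency bins per time step (one vector per STFT frame)
hiddenSize = 600;    % BLSTM hidden units (assumed)

layers = [
    sequenceInputLayer(numBins)                   % spectrogram chunk as a sequence
    bilstmLayer(hiddenSize, 'OutputMode', 'sequence')
    dropoutLayer(0.3)
    bilstmLayer(hiddenSize, 'OutputMode', 'sequence')
    dropoutLayer(0.3)
    fullyConnectedLayer(numBins)                  % reshape to the mask dimensions
    sigmoidLayer                                  % soft mask values in [0, 1]
];
```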