EECS351 STEM SEPARATION
Revisiting ICA
FastICA was an approach we explored early in the project but chose to abandon. It quickly became clear that although ICA works well when there are at least as many distinct mixtures as sources, a stereo music file provides only two mixtures (the left and right channels), so it is not possible to separate more than two instruments from a standard mix of the stems. With three or more stems the problem is underdetermined: there are more unknown sources than observed mixtures. The results of HPSS via the machine learning algorithm opened a new opportunity. By manipulating the harmonic and percussive separations, we could generate additional mixtures so that the ICA problem would no longer be underdetermined.
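The counting argument behind "underdetermined" can be illustrated with a minimal sketch. The mixing matrices and sample values below are made up for illustration and are not taken from our data. With two sources and two mixtures the mixing matrix can be inverted, but with three sources and two mixtures, two different sets of source values produce exactly the same mixtures, so no unmixing matrix can tell them apart.

```python
def mix(matrix, sources):
    """Apply a mixing matrix (list of rows) to one vector of source samples."""
    return [sum(w * s for w, s in zip(row, sources)) for row in matrix]

def invert_2x2(matrix):
    """Invert a 2x2 matrix, raising if it is singular."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def cross(u, v):
    """Cross product of two 3-vectors; the result is orthogonal to both."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

# Determined case: 2 sources, 2 mixtures (hypothetical values).
A2 = [[0.8, 0.3], [0.2, 0.7]]
s2 = [1.0, -2.0]
x2 = mix(A2, s2)
recovered = mix(invert_2x2(A2), x2)  # matches s2 up to rounding

# Underdetermined case: 3 sources, 2 mixtures (hypothetical values).
A3 = [[0.8, 0.3, 0.5], [0.2, 0.7, 0.5]]
s3 = [1.0, -2.0, 0.5]
null_dir = cross(A3[0], A3[1])            # A3 maps this direction to zero
s3_alt = [s + n for s, n in zip(s3, null_dir)]
x3 = mix(A3, s3)
x3_alt = mix(A3, s3_alt)                  # same mixtures, different sources
```

Here `x3` and `x3_alt` agree even though `s3` and `s3_alt` do not, which is the ambiguity that additional independent mixtures are meant to remove.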
The goal of ICA is to transform the mixtures into a representation whose components are as statistically independent as possible. In our analysis, we used a fastICA implementation provided by Brian Moore in the “PCA and ICA Package” add-on for MATLAB. Moore implemented three slightly different algorithms.
Two of the algorithms (fastICA negentropy and kurtosis) differ only in their measure of nongaussianity. One uses a simple estimate of negentropy, a normalized form of differential entropy that is zero for a Gaussian variable and positive otherwise. A single independent component can then be extracted by finding the projection direction that maximizes negentropy. The other algorithm uses kurtosis as its measure of nongaussianity. Both algorithms are iterative.
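As a rough illustration of the two nongaussianity measures, the sketch below computes excess kurtosis and a log-cosh negentropy proxy on synthetic samples. The log-cosh contrast is a common choice in the fastICA literature, but it is an assumption here and not necessarily the exact estimate used in Moore's package. The function names and the sample distributions are hypothetical. In theory, excess kurtosis is 0 for a Gaussian, 3 for a Laplacian, and -1.2 for a uniform distribution.

```python
import math
import random

def standardize(x):
    """Shift and scale samples to zero mean and unit variance."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / std for v in x]

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    z = standardize(x)
    return sum(v ** 4 for v in z) / len(z) - 3.0

def gaussian_logcosh_mean(half_width=8.0, steps=4000):
    """E[log cosh(nu)] for a standard normal nu, via the trapezoid rule."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        u = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        total += weight * math.log(math.cosh(u)) * density
    return total * h

def negentropy_proxy(x):
    """(E[G(y)] - E[G(nu)])^2 with G = log cosh; an assumed contrast."""
    z = standardize(x)
    mean_g = sum(math.log(math.cosh(v)) for v in z) / len(z)
    return (mean_g - gaussian_logcosh_mean()) ** 2

rng = random.Random(0)
n = 20000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
laplacian = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]
uniform = [rng.uniform(-1.0, 1.0) for _ in range(n)]

for name, data in (("gaussian", gaussian), ("laplacian", laplacian), ("uniform", uniform)):
    print(name, round(excess_kurtosis(data), 3), negentropy_proxy(data))
```

Both measures should come out near zero for the Gaussian samples and clearly nonzero for the others, which is the property the iterative algorithms exploit when searching for a projection direction.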
The third algorithm is Kurtosis Maximization ICA (kICA). It differs in that it whitens the input data to a higher order and uses fourth-order statistics to estimate the directions of maximum kurtosis. The code for this max-kurtosis method is only a few lines, is not iterative, and often scored highest in our evaluations of its source estimates. All three algorithms use eigenvectors of Gaussian projections to estimate the directions along which to decompose the mixes.
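For readers unfamiliar with whitening, here is a minimal sketch of ordinary second-order whitening for two mixtures, using the closed-form eigendecomposition of a 2x2 covariance matrix. This is the standard preprocessing step, not the higher-order whitening used by kICA, and the function names, mixing weights, and synthetic sources are all hypothetical.

```python
import math
import random

def center(x):
    """Remove the mean from a signal."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def covariance(x, y):
    """Sample covariance of two signals that are already centered."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

def whiten_two(x1, x2):
    """Rotate and scale two mixtures so their covariance is the identity."""
    x1, x2 = center(x1), center(x2)
    a, b, c = covariance(x1, x1), covariance(x1, x2), covariance(x2, x2)
    # Eigenvalues and rotation angle of the symmetric matrix [[a, b], [b, c]].
    half_gap = math.hypot((a - c) / 2.0, b)
    lam1 = (a + c) / 2.0 + half_gap
    lam2 = (a + c) / 2.0 - half_gap
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    ct, st = math.cos(theta), math.sin(theta)
    # Project onto each eigenvector, then divide by the standard deviation.
    z1 = [(ct * p + st * q) / math.sqrt(lam1) for p, q in zip(x1, x2)]
    z2 = [(-st * p + ct * q) / math.sqrt(lam2) for p, q in zip(x1, x2)]
    return z1, z2

rng = random.Random(1)
s1 = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
s2 = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
mix1 = [0.8 * p + 0.3 * q for p, q in zip(s1, s2)]  # hypothetical weights
mix2 = [0.2 * p + 0.7 * q for p, q in zip(s1, s2)]
z1, z2 = whiten_two(mix1, mix2)
```

After whitening, the two outputs are uncorrelated with unit variance, so the remaining unmixing step reduces to finding a rotation.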
To establish a performance baseline, we used the Music Delta - Beatles stem data to generate independent left-channel-only, right-channel-only, and mono stem mixes via a random mixing matrix. Although kICA had good evaluation metrics, both fastICA variants had poor SDR, SIR, and SAR evaluations when estimating the vocal and bass stems.
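The baseline setup can be sketched as follows, assuming two synthetic tones stand in for the vocal and bass stems. The SDR function here is a simplified energy ratio, not the full BSS Eval decomposition that also produces SIR and SAR, and the sample rate, seed, and weight range are hypothetical.

```python
import math
import random

def random_mixing_matrix(n_mixtures, n_sources, rng):
    """Draw positive mixing weights; the range 0.1 to 1.0 is an assumption."""
    return [[rng.uniform(0.1, 1.0) for _ in range(n_sources)]
            for _ in range(n_mixtures)]

def mix_sources(matrix, sources):
    """Form each mixture as a weighted sum of the equal-length sources."""
    n = len(sources[0])
    return [[sum(w * src[t] for w, src in zip(row, sources)) for t in range(n)]
            for row in matrix]

def simple_sdr_db(reference, estimate):
    """Simplified SDR: reference energy over error energy, in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / error)

fs = 8000  # hypothetical sample rate for the toy signals
vocal = [math.sin(2.0 * math.pi * 440.0 * t / fs) for t in range(fs)]
bass = [math.sin(2.0 * math.pi * 55.0 * t / fs) for t in range(fs)]

rng = random.Random(351)
A = random_mixing_matrix(2, 2, rng)
mixtures = mix_sources(A, [vocal, bass])

# Score an unprocessed mixture and a slightly scaled copy against the vocal.
sdr_of_mixture = simple_sdr_db(vocal, mixtures[0])
sdr_of_scaled = simple_sdr_db(vocal, [0.9 * v for v in vocal])
```

A useful sanity check is to score the unprocessed mixture itself, since any separation method should beat that number to be considered an improvement.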
Because the evaluations were nearly identical whether we used a single channel at a time or a mono mix, we determined there should be no qualitative difference between working with a single channel and working with a mono mixture. We then used the machine learning DNN harmonic and percussive separations of the left channel to produce two additional independent mixtures to accompany the standard unweighted mixture. This was another preliminary baseline, as this attempt was not likely to have good results: the harmonic separation offers no independent control over the bass and vocals.
A side effect of our ML algorithm was that the estimated harmonic and percussive stems were 95,409 samples shorter than the original data. This truncation within the neural net was likely due to windowing. Accidentally feeding mixtures of two different lengths into the ICA algorithm produced an echo that was clearly audible in the vocals. Evaluations were then made three ways: by truncating the ground truth mix to the length of the randomly mixed harmonic and percussive separations, by tuning that randomly produced matrix, and lastly by extending the tuned harmonic and percussive mixtures to the length of the original ground truth mixtures. To the metrics' discredit, SDR, SIR, and SAR rated the mismatched-length vectors that exhibited an echo in the vocals markedly better than all but the extended HPSS mixtures. This speaks to the discrepancy between the SDR, SIR, and SAR evaluations and what sounds good to the human ear. When the ground truth mixes were truncated to the length of the harmonic and percussive stems, the estimated stems also exhibited a significant reduction in magnitude compared with both their ground truth counterparts and the estimated stems that did not involve truncating the ground truth mix.
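The two length-matching strategies can be sketched as below. Zero-padding at the end of the signal is an assumption, since where the network drops its samples is not established here, and the helper names and toy lengths are hypothetical.

```python
def truncate_to_shortest(signals):
    """Cut every signal down to the length of the shortest one."""
    n = min(len(s) for s in signals)
    return [list(s[:n]) for s in signals]

def zero_pad_to(signal, length):
    """Extend a signal with trailing zeros up to the requested length."""
    if len(signal) > length:
        raise ValueError("signal is already longer than the target length")
    return list(signal) + [0.0] * (length - len(signal))

# Toy stand-ins: a 10-sample ground truth and a 7-sample network output.
ground_truth = [0.1 * i for i in range(10)]
network_output = ground_truth[:7]

truncated_truth, truncated_output = truncate_to_shortest([ground_truth, network_output])
padded_output = zero_pad_to(network_output, len(ground_truth))
```

Either way, every mixture handed to ICA and every reference handed to the evaluation should have the same number of samples, which avoids the accidental mismatch described above.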
It’s possible that ICA could be conducted successfully using only the unaltered left and right channels under the right conditions, e.g. stereo panning of instruments, the use of multiple microphones positioned at different distances from one or more sources, or a pronounced stereo image. Because the masking done by the NN is a non-linear process, it seemed worthwhile to explore whether the idiosyncratic differences between the left and right channels of the stems could produce enough independence between the left and right HPSS results for ICA to decompose the mixes with any success.
Starting with the ground truth baseline, ICA estimations were made using the left and right channels of the vocal and bass mix as the two individual mixtures. The results were rather poor across all the metrics for each ICA algorithm. Qualitatively, each algorithm produced essentially the same estimations. One stem sounded virtually identical to the vocal bass mixture without any source extracted, and the other stem sounded garbled, heavily distorted, thin, and attenuated. Based on DFTs previously taken of these data, we had anticipated that this input would be underdetermined, because the left and right channels of the vocals differ only minimally and the stereo channels of the bass are identical.
To get a further quantitative baseline for separation via stereo mix, ICA estimations were next made using the stereo channels of the vocal and drum ideal ground truth mix. Because drum sets are commonly recorded with multiple mics, and because we had previously confirmed the independence of the channels by taking the DFT of the drum stem, these evaluations were anticipated to be an improvement over the stereo estimations of the vocal bass mix. These estimations were again approximately the same across the different ICA algorithms. One stem sounded qualitatively close to the mix, and the other extracted and isolated a large amount of the drums. Although this extracted drum stem would not be useful in any remixing application, the improvement over the previous results offers some confirmation of our intuition that this method could be more successful with more independence within a stereo mixture.
When using the left and right harmonic separations of the RNN as input mixtures, the results did include more variance in quality and in evaluations than the ideal cases, but almost all the estimations sounded highly distorted and attenuated. Given the variance in the estimations, we can conclude that the left and right harmonic separations do exhibit more independence than the ideal case. Although the channels might have some independence, that alone is not enough to recover good estimations. Henceforth, all referenced mixtures contain the left channel only.
The next method we tried was applying highpass and lowpass filters to the vocal bass mixtures, then randomly remixing the filtered mixes to generate input mixtures for ICA. The filters used were the MATLAB functions lowpass() and highpass(), which are general-purpose filtering functions. The filtering method is something that could be expanded upon and improved to make a more robust BSS algorithm.
The initial parameters of the lowpass filter were a cutoff frequency of 300 Hz, a steepness of 0.99, and a stopband attenuation of 90 dB. The highpass filter parameters were a cutoff frequency of 280 Hz, a steepness of 0.99, and a stopband attenuation of 200 dB. Both filters were set to ‘fir’ rather than ‘iir’ to keep the phase intact. Using the random mixing matrix, the max-kurtosis and negentropy algorithms scored approximately 10 dB and 9 dB SDR on the vocals, respectively, and roughly 3 dB SDR on the bass estimations. Although these numbers are a large improvement over previous methods, when we evaluated the filtered mixes themselves before ICA, they proved to be better estimations than the ICA estimations by all of SDR, SIR, and SAR.
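To show why an FIR design keeps the phase intact, here is a minimal windowed-sinc sketch of a 300 Hz lowpass and a 280 Hz highpass. This is not the design that MATLAB's lowpass() and highpass() perform from the steepness and stopband attenuation settings. The 44.1 kHz sample rate, the 1001-tap length, and the Hamming window are all illustrative assumptions.

```python
import math

def lowpass_taps(cutoff_hz, fs_hz, num_taps):
    """Hamming-windowed sinc lowpass with unity gain at DC (odd length)."""
    if num_taps % 2 == 0:
        raise ValueError("use an odd number of taps for a symmetric filter")
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def highpass_taps(cutoff_hz, fs_hz, num_taps):
    """Spectral inversion: a unit impulse minus the matching lowpass."""
    taps = [-t for t in lowpass_taps(cutoff_hz, fs_hz, num_taps)]
    taps[(num_taps - 1) // 2] += 1.0
    return taps

def gain(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = -sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)

FS = 44100.0  # assumed sample rate
lp = lowpass_taps(300.0, FS, 1001)
hp = highpass_taps(280.0, FS, 1001)

# Symmetric taps mean linear phase: every frequency is delayed by the
# same (num_taps - 1) / 2 samples, so the waveform shape is preserved.
is_symmetric = all(abs(a - b) < 1e-12 for a, b in zip(lp, reversed(lp)))
```

Because both filters share the same length, their outputs are delayed by the same amount and stay time-aligned when they are remixed as ICA inputs.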
The matrix was then tuned by hand in an attempt to approach the SDR, SIR, and SAR evaluations of the filtered mixes.
Next, using the hand-tuned mixtures from the previous step, ICA was run on one hand-tuned filtered remixture and one ground truth vocal bass mixture. This was done twice, once for each hand-tuned filtered mixture. The evaluations of these estimations were similar to those of the previous step. In many cases a rather low SIR corresponded directly to a lack of separation in the estimations. We then used SDR, SIR, and SAR to evaluate the estimations relative to the input mixture. High numbers for all three metrics correspond to the preservation of an input stem in the estimation.
Last, by adjusting the lowpass filter steepness to 0.6, the highpass filter cutoff frequency to 270 Hz, and the highpass steepness to 0.69, we were able to increase the evaluations of the bass estimations above 10 dB for all metrics.
To apply the filtering method to the HPSS harmonic separation, the lowpass and highpass were tuned by hand and estimation. The values in red indicate stems with the highest evaluations we were able to obtain for each filter, which were subsequently saved for later use.
Using the filtered harmonic stems to gain some independence between the vocals and bass, we then finally ran the fastICA algorithms using the HPSS harmonic LPF, harmonic HPF, and percussive separations to generate three independent mixtures. Although these estimations are not the best we've obtained, we believe that with better filtering methods, ICA could be expanded into a larger, viable algorithm.