SDR, SIR, SAR

Early in this project, we identified a set of quantitative metrics for evaluating the performance of our source separation algorithms. Although we could listen to our recovered stems and form a subjective, qualitative evaluation of each approach, this process would be time-consuming and prone to human error. Objective, quantitative measures let us evaluate our approaches more quickly and more consistently. The metrics we chose, introduced by Vincent et al. in [2], are the source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifacts ratio (SAR). They are the most widely used objective measures in the source separation literature.

For a given ground-truth stem s-j, Vincent et al. assume that the source separation algorithm recovers an estimated stem ŝ-j, and that this estimated stem can be decomposed into the sum of four components as shown below:

Figure 1: Equation decomposing the estimated, or recovered, stem into the sum of four components: a modified version of the ground truth stem, an error term due to interference, an error term due to noise, and an error term due to artifacts. [2]
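
Written out in LaTeX, the decomposition described in the caption above reads as follows. This is a reconstruction from the descriptions in this section, not a copy of the original figure; the subscripted symbols correspond to ŝ-j, s-target, e-interf, e-noise, and e-artif in the text.

```latex
\hat{s}_j = s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}
```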

The first component s-target represents a modified version of s-j that has undergone some allowed distortion (e.g., a time-invariant gain). The second component e-interf represents the error due to the presence of unwanted stems (e.g., unwanted drums in the estimated vocals stem). The third component e-noise represents the error due to noise. Finally, the fourth component e-artif represents the error due to artifacts of the source separation algorithm.

SDR, SIR, and SAR are each computed as an energy ratio between subsets of these four components. They are reported in decibels, where larger values indicate better performance. Specifically, SDR is the ratio of the energy of the target stem to the energy of the sum of all three error terms. SIR is the ratio of the energy of the target stem to the energy of the interference error term. Lastly, SAR is the ratio of the energy of the sum of the target stem, interference error term, and noise error term to the energy of the artifact error term. The exact equations for SDR, SIR, and SAR are shown below:

Figure 2: Equations for computing the SDR, SIR, and SAR associated with a recovered stem. If we recover multiple stems (e.g., we recover a vocals stem, a bass stem, and a drums stem), then each recovered stem has its own SDR, SIR, and SAR values. [2]
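
Written out in LaTeX, the three ratios described above take the following form. This is a reconstruction from the verbal definitions in this section, not a copy of the original figure, and it assumes that "energy" means the squared Euclidean norm of the signal.

```latex
\begin{aligned}
\mathrm{SDR} &= 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}} \rVert^2} \\[4pt]
\mathrm{SIR} &= 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{interf}} \rVert^2} \\[4pt]
\mathrm{SAR} &= 10 \log_{10} \frac{\lVert s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} \rVert^2}{\lVert e_{\text{artif}} \rVert^2}
\end{aligned}
```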

Based on these definitions, we can ascertain the qualitative meaning of each metric. Suppose we determine the SDR, SIR, and SAR for the estimated vocals stem. A large SIR means that the estimated vocals stem contains very little of the drums, bass, or other stems, while a small SIR means that the estimated vocals stem is dominated by them. In other words, SIR indicates how much the estimated stems bleed into one another. A large SAR means that the estimated vocals stem has little distortion introduced by the source separation algorithm, while a small SAR means that it has a lot. For example, a binary mask that produces a crackly estimated vocals stem would result in a small SAR. Note that SIR and SAR can vary independently of each other: an estimated stem might have a large SIR but a small SAR, or vice versa.
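
To make the bleed-versus-artifacts distinction concrete, here is a minimal Python sketch that computes the three ratios from a known decomposition. Everything in it is an assumption for illustration: the helper names (`energy`, `mix`, `ratio_db`, `sdr_sir_sar`, `tone`) are hypothetical, and the four components are synthetic sine tones rather than real audio. It is not the evaluation code used in this project.

```python
import math

def energy(x):
    """Sum of squared samples of a signal given as a list of floats."""
    return sum(v * v for v in x)

def mix(*signals):
    """Sample-wise sum of equal-length signals."""
    return [sum(samples) for samples in zip(*signals)]

def ratio_db(numerator, denominator):
    """Energy ratio of two signals, in decibels."""
    return 10.0 * math.log10(energy(numerator) / energy(denominator))

def sdr_sir_sar(s_target, e_interf, e_noise, e_artif):
    """SDR, SIR, SAR (dB) from the four components of one estimated stem."""
    sdr = ratio_db(s_target, mix(e_interf, e_noise, e_artif))
    sir = ratio_db(s_target, e_interf)
    sar = ratio_db(mix(s_target, e_interf, e_noise), e_artif)
    return sdr, sir, sar

def tone(cycles, amplitude, n=1000):
    """Hypothetical test signal: a sine with a whole number of cycles."""
    return [amplitude * math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

# Toy target component (an assumption, not real audio).
vocals = tone(5, 1.0)

# Case A: strong bleed from another stem, almost no artifacts.
bleed = sdr_sir_sar(vocals, tone(9, 0.5), tone(13, 0.01), tone(17, 0.01))

# Case B: almost no bleed, strong artifacts (e.g., a crackly binary mask).
crackly = sdr_sir_sar(vocals, tone(9, 0.01), tone(13, 0.01), tone(17, 0.5))

for name, (sdr, sir, sar) in [("bleed", bleed), ("crackly", crackly)]:
    print(f"{name}: SDR={sdr:.2f} dB  SIR={sir:.2f} dB  SAR={sar:.2f} dB")
```

In the first case the SIR is low and the SAR is high; in the second case the pattern is reversed, even though the SDR is nearly the same in both.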

SDR acts as a general measure of performance. If a paper only reports a single number for each stem, then that number is likely to be the SDR. In our experiments, the SDR for a particular estimated stem has always been smaller than both the SIR and the SAR for that stem.
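
One way to see why the SDR would sit below both other metrics is the short argument below. It rests on an assumption that is not stated in this section: that the four components are mutually orthogonal, which is what we would expect when they are built from orthogonal projections. Under that assumption, the energies of sums split into sums of energies.

```latex
\begin{aligned}
\lVert e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}} \rVert^2
  &= \lVert e_{\text{interf}} \rVert^2 + \lVert e_{\text{noise}} \rVert^2 + \lVert e_{\text{artif}} \rVert^2
  \;\ge\; \lVert e_{\text{interf}} \rVert^2
  &&\Rightarrow\quad \mathrm{SDR} \le \mathrm{SIR}, \\[4pt]
\frac{\lVert s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} \rVert^2}{\lVert e_{\text{artif}} \rVert^2}
  &\ge \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{artif}} \rVert^2}
  \;\ge\; \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}} \rVert^2}
  &&\Rightarrow\quad \mathrm{SDR} \le \mathrm{SAR}.
\end{aligned}
```

Both inequalities are strict whenever more than one error term is nonzero, which is the usual situation with real recordings.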

In the original paper, Vincent et al. describe how to calculate the SDR, SIR, and SAR for an estimated stem using a set of orthogonal projectors. Thankfully, the authors also provide a GitLab repository with a MATLAB function, bss_eval_sources.m, that computes the SDR, SIR, and SAR for us. The function takes only two inputs: the first is an array of the estimated stems, and the second is an array of the ground-truth stems. It then returns the SDR, SIR, and SAR for each estimated stem. Whenever we report SDR, SIR, and SAR in this project, we report the values returned by this function.
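
For intuition about the orthogonal projectors mentioned above, here is a simplified Python sketch of the idea. It is not a port of bss_eval_sources.m and may differ from it in important ways: it assumes the only allowed distortion is a time-invariant gain, it omits the noise term entirely, and all function names (`project_onto_span`, `decompose`, `evaluate`, `tone`) and test signals are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def project_onto_span(x, basis):
    """Orthogonal projection of x onto span(basis), via Gram-Schmidt."""
    ortho = []
    for b in basis:
        r = list(b)
        for q in ortho:
            c = dot(r, q) / dot(q, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        if dot(r, r) > 1e-12:  # skip (near-)linearly dependent signals
            ortho.append(r)
    proj = [0.0] * len(x)
    for q in ortho:
        c = dot(x, q) / dot(q, q)
        proj = [p + c * qi for p, qi in zip(proj, q)]
    return proj

def decompose(estimate, sources, j):
    """Split an estimate of source j into target, interference, artifacts."""
    s_target = project_onto_span(estimate, [sources[j]])
    in_source_span = project_onto_span(estimate, sources)
    e_interf = sub(in_source_span, s_target)  # part explained by other stems
    e_artif = sub(estimate, in_source_span)   # part explained by no stem
    return s_target, e_interf, e_artif

def db(num, den):
    # Raises ZeroDivisionError if the denominator signal has zero energy.
    return 10.0 * math.log10(dot(num, num) / dot(den, den))

def evaluate(estimate, sources, j):
    """SDR, SIR, SAR in dB for one estimated stem (noise term omitted)."""
    s_target, e_interf, e_artif = decompose(estimate, sources, j)
    sdr = db(s_target, add(e_interf, e_artif))
    sir = db(s_target, e_interf)
    sar = db(add(s_target, e_interf), e_artif)
    return sdr, sir, sar

def tone(cycles, n=1000):
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

# Toy ground-truth stems and an estimate of the vocals (assumptions).
vocals, drums, click = tone(3), tone(7), tone(11)
estimate = [0.8 * v + 0.1 * d + 0.05 * c for v, d, c in zip(vocals, drums, click)]

sdr, sir, sar = evaluate(estimate, [vocals, drums], j=0)
print(f"SDR={sdr:.2f} dB  SIR={sir:.2f} dB  SAR={sar:.2f} dB")
```

Note that the 0.8 gain on the vocals does not count as an error in this sketch: it is absorbed into the target component, which mirrors the "allowed distortion" idea described earlier. Only the drum bleed and the unrelated tone lower the scores.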