DOSE: Drum One-Shot Extraction

${Suntae Hwang}^1, {Seonghyeon Kang}^2, {Kyungsu Kim}^3, {Semin Ahn}^4, {Kyogu Lee}^5$

Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction from music mixtures, a task designed to extract drum one-shots directly from reference mixtures. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with corresponding drum one-shot samples. Our proposed model, Drum One-Shot Extractor (DOSE), leverages neural audio codec language models for end-to-end extraction, bypassing traditional source separation steps. Additionally, we introduce a novel onset loss function, which emphasizes accurate prediction of the initial transient of drum one-shots, crucial for capturing timbral characteristics. We compare this approach against a source separation-based extraction method as a baseline. The results, evaluated using Fréchet Audio Distance (FAD) and Mel-Spectrogram Similarity (MSS), demonstrate that DOSE, enhanced with onset loss, outperforms the baseline, providing more accurate and higher-quality drum one-shots from music mixtures.

Fig. 1. Illustration of our approach. Given an audio mixture as input, each Drum One-Shot Extractor (DOSE) model extracts one-shot audio samples for kick, snare, and hi-hat drums.

Fig. 2. Proposed method. The input audio mixture is encoded into a sequence of discrete tokens using a frozen DAC encoder, and the tokens are then fed into a decoder-only Transformer. The Transformer is trained to autoregressively predict the ground-truth drum one-shot tokens by minimizing two losses: onset loss and full-length loss. Finally, the predicted token sequence is decoded into drum one-shot audio using the DAC decoder.
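
The two training losses named in the Fig. 2 caption can be illustrated with a minimal sketch. The exact formulation is not given here, so the following rests on assumptions: the onset loss is taken to be a cross-entropy restricted to the first `onset_frames` token positions, the full-length loss is a cross-entropy over all positions, and the two are summed with a hypothetical `onset_weight`. For simplicity it also treats the codec output as a single token stream, whereas a real DAC produces several codebooks per frame. All function and parameter names are illustrative.

```python
import numpy as np


def token_cross_entropy(logits, targets):
    """Mean cross-entropy over positions. logits: (T, V), targets: (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def combined_loss(logits, targets, onset_frames=8, onset_weight=1.0):
    """Full-length loss plus an extra term on the first `onset_frames` tokens.

    `onset_frames` and `onset_weight` are assumed hyperparameters,
    not values taken from the paper.
    """
    full = token_cross_entropy(logits, targets)
    onset = token_cross_entropy(logits[:onset_frames], targets[:onset_frames])
    return full + onset_weight * onset, full, onset


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(16, 1024))      # 16 token positions, 1024-entry codebook
    targets = rng.integers(0, 1024, size=16)  # ground-truth one-shot tokens
    total, full, onset = combined_loss(logits, targets)
    print(f"total={total:.3f} full={full:.3f} onset={onset:.3f}")
```

Under this reading, errors in the first few tokens are counted twice (once in each term), which is one simple way to make the model prioritize the initial transient.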

Fig. 3. Dataset generation process. First, kick, snare, and hi-hat loops are synthesized from one-shot drum audio samples using randomly generated MIDI notes. Next, optional bass, piano, guitar, and vocal loops are selected. The drum loops and other musical loops are then processed through independent mixing chains, which apply gain, EQ, compression, panning, limiting, delay, and reverb effects. Finally, all tracks are combined and passed through a mastering chain consisting of EQ and limiter effects.
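
The first and last steps of the pipeline in Fig. 3 can be sketched as follows, under simplifying assumptions. A random sixteenth-note grid stands in for the randomly generated MIDI notes, a per-track gain stands in for the full mixing chain (EQ, compression, panning, delay, and reverb are omitted), and peak normalization stands in for the mastering limiter. The names `render_loop` and `mix_tracks`, and all default parameter values, are hypothetical.

```python
import numpy as np


def render_loop(one_shot, sr=44100, bpm=120, steps=16, hit_prob=0.4, rng=None):
    """Place copies of `one_shot` at randomly chosen sixteenth-note steps."""
    if rng is None:
        rng = np.random.default_rng()
    step_len = int(sr * 60 / bpm / 4)       # sixteenth-note length in samples
    hits = rng.random(steps) < hit_prob     # random on/off pattern per step
    loop = np.zeros(steps * step_len + len(one_shot))
    for step in np.flatnonzero(hits):
        start = step * step_len
        loop[start:start + len(one_shot)] += one_shot
    return loop, hits


def mix_tracks(tracks, gains):
    """Sum gain-scaled tracks; scale down only if the peak exceeds 1.0."""
    n = max(len(t) for t in tracks)
    mix = np.zeros(n)
    for track, gain in zip(tracks, gains):
        mix[:len(track)] += gain * track
    peak = np.abs(mix).max()
    return mix / peak if peak > 1.0 else mix


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(2000) / 44100
    kick = np.sin(2 * np.pi * 60 * t) * np.exp(-t * 40)  # toy decaying sine "kick"
    hat = rng.normal(size=500) * np.exp(-np.arange(500) / 80)  # toy noise "hi-hat"
    kick_loop, _ = render_loop(kick, rng=rng)
    hat_loop, _ = render_loop(hat, hit_prob=0.7, rng=rng)
    mixture = mix_tracks([kick_loop, hat_loop], gains=[1.0, 0.5])
    print(mixture.shape, float(np.abs(mixture).max()))
```

Each rendered loop keeps a reference to the one-shot that produced it, which is what allows a mixture to be paired with its ground-truth kick, snare, and hi-hat samples.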

Experiments with RMOD

ex1)
ex2)
ex3)

Experiments with RMOD (drum only)

ex1)
ex2)
ex3)

Experiments with GrooveMIDI dataset

ex1)
ex2)