A detailed document describing how to train an encoder-only transformer on drum piano-rolls

Transformer Architecture for Generating Short Drum Loops Given a Performed Monotonic Groove

Originally Submitted to International Conference on New Interfaces for Musical Expression • NIME 2022

URL: https://nime.pubpub.org/pub/fvl5su34

License: Creative Commons Attribution 4.0 International License (CC-BY 4.0)

Abstract

In the last few years, variations of transformer neural networks have proved quite effective in many symbolic music generation tasks. This paper investigates and demonstrates the effectiveness of transformer architectures paired with direct representation of musical events (i.e. representation without tokenization) for generating drum performance loops given a performed monotonic groove (containing velocity and timing dynamics). We demonstrate that early investigations into this topic provide promising results. Moreover, we provide a detailed explanation of the training process involved, while also discussing a number of shortcomings to be addressed in the future iterations of this work. Lastly, we provide a discussion on the possibilities and the limitations of the proposed model for real-time use cases.

Author Keywords

Transformer Neural Networks, Music Generation, Drum Generation, Groove, Rhythm

CCS Concepts

•Applied computing → Sound and music computing; Performing arts; •Computing Methodologies → Machine Learning; •Computing Methodologies → Artificial Intelligence;

Introduction

There has been a significant surge in the use of deep learning for music generation tasks. These models commonly need to be trained on a large corpus of data so as to generalize the musical characteristics of the samples in the dataset. In other words, the aim of these models is to (1) learn the underlying patterns in the data, and (2) use the learned patterns to generate new content. While not creative on their own, these models can be used in many creative contexts. For instance, they can assist non-specialized composers and producers, generate creative ideas or speed up their generation, or complete partial ideas.

Since 2017 [1], variations of transformer architectures have proven very promising in modeling and generating sequential data. The success of these models has been most prominent in the Natural Language Processing (NLP) domain [2][3][4], and their application in music generation has been strongly inspired by, and adapted from, language generation tasks.

In language generation, the learning process leads to an embedding space that is organized in a semantically meaningful manner. In most NLP tasks, however, these models are extremely costly to train, and their success depends on lengthy training on massive datasets with large vocabularies of tokens. These tokens are associated with high-dimensional embedding vectors, the contents of which are learned during the training process. Music generation tasks, which typically adopt these NLP architectures, also require a temporal sequence of tokenized events where each token represents a discrete musical event [5][6][7][8][9][10][11]. However, unlike NLP tasks, which typically deal with vocabularies of tens of thousands of tokens (e.g. 50,257 tokens in GPT-2 [3]), symbolic music generation tasks deal with much smaller vocabularies (typically hundreds of tokens [10]). Moreover, whereas in language, words generally serve limited semantic functions, the ‘semantic’ function of musical events is highly dependent on the context in which these events are used. We hence speculated that tokenization of events may not be strictly necessary and can be avoided for certain music generation tasks.

Many works on using transformers for symbolic music generation have so far focused on creating stylistically consistent [7][8][10] and/or ‘real’ (or rather ‘expert’) [5][6][9] sounding content. As of today, most of these works still focus on offline, non-real-time content generation, which clearly limits the scope of their use. Although a number of works have indeed addressed drum generation in real-time, they use other generative techniques, and there is still a lack of literature on real-time drum generation using transformers. To name a few, in [12], Gómez-Marín et al. present a system that generates drum patterns by navigating a similarity space constructed by incorporating multiple rhythm similarity measures. Similarly, in [13], Vogl et al. demonstrate a generative drum machine with a GAN-based generation engine, the generations of which are user-controllable through a number of parameters such as genre, complexity, and loudness. Finally, McCormack et al. presented a TCN-based [14] AI drummer that responds in real-time to a human instrumentalist's improvisation [15].

Given our interests in drum generation and inspired by the success of transformers in many symbolic music generation tasks, we have started on a new line of research focused on drum generation using transformers. Our ultimate goal is to develop a set of transformer-based drum generation systems that are deployable in real-time settings. The current work presented here is an initial attempt at this topic. More specifically, in this work, we investigate (1) the effectiveness of transformer models in generating short drum loops given a performed monotonic groove (i.e. a single voice groove that contains velocity and timing information, see Image 1), and (2) whether discretization of events (i.e. tokenization) can be replaced by a more direct method of representing the drum events.

**Image 1**
Monotonic groove performance (left piano roll) to full drum beat (right piano roll) conversion

Related Work

In this section, we briefly discuss the related works most relevant to this study.

Drum Generation using Transformers

In 2019, Huang et al. introduced the Music Transformer [5], a transformer model similar to the vanilla transformer [1] but with a modified attention mechanism, capable of generating piano performances with long-term structure. Following this work, Choi et al. presented a more controllable variation of the Music Transformer in which the piano generations can be controlled via a given performance specifying the style of the desired generations [8]. In these models, the input/output sequences are represented as a sequence of chronological events (such as note-on/off, discretized time shifts, and velocity levels). A number of more recent works on single-instrument score/performance generation ([9][16][17][18][10]) have also experimented with alternative methods of tokenizing events as well as other transformer variations such as BERT [19], GPT-2 [3] and Transformer-XL [20].

Another series of works has focused on multi-instrument score/performance generation. While these models are not designed explicitly for drum generation, they are capable of generating drum/percussion scores accompanying other instruments. The earliest of these works is MuseNet [21], in which a GPT-2 [3] transformer has been trained on multi-track scores from various artists and styles. Following this work, Donahue et al. presented the LakhNES model in [7]. LakhNES is a Transformer-XL model pre-trained on the Lakh MIDI Dataset [22] and then trained on the NES-MDB dataset [23]. In this work, the authors show that by using the proposed pre-training regime, the model is capable of generating a multi-instrument score from scratch or in a guided context. Finally, in [24], Ens and Pasquier explore the effect of various sequence representations on the controllability of multi-track generations using their proposed model: Multi-track Music Machine (MMM), an architecture based on GPT-2 [3].

Finally, to the best of our knowledge, [25] is the only work so far focused explicitly on generating drum patterns using transformers. In this work, Nuttall et al. propose that a Transformer-XL architecture using a small vocabulary of tokens (composed of 36 pitch-velocity pair tokens and 5 time-difference tokens) is capable of generating long consistent drum performances either from scratch or given a priming pattern.

Groove to Drum Performance

In [26], Gillick et al. propose the GrooVAE Tap2Drum model, a sequence-to-sequence variational auto-encoder network that converts a tapped sequence into a multi-voice human-like performance on a drum kit. This system not only generates a score but also generates velocity levels and unquantized timing of events. In this system, however, the input sequence is a ‘tapped’ pattern that does not contain any velocity levels. Lacking tapped patterns associated with the performances available in the training set, the authors propose a method to extract pseudo-tapped sequences by squeezing all drum events into a single voice while ignoring velocity levels. Consequently, the proposed model is trained on pairs of tapped and performance sequences. Many of the methodologies used in our work have been inspired by the GrooVAE model.

Methodology

In this section, we provide a detailed explanation of the methodology used for carrying out the objectives of this work.

Dataset

The model is trained using Magenta’s Groove MIDI Dataset (GMD) [26], a dataset containing roughly 13.6 hours of drum performances in the format of beats and fills, classified by genre and mostly in 4/4 time signature. Our experiments were focused on 2-bar beat loops performed in a 4/4 time signature resulting in the distribution shown in Image 2.

For this work, the experiments are conducted using the same partitions as provided in GMD (see Table 1).

Number of 2-Bar Loops in 4/4 within the Selected Subset of GMD Dataset

Table 1
Split	2-measure beats in 4/4
Train (∼80%)	16195
Test (∼10%)	2054
Validation (∼10%)	2021
Total	20270

The drum performances are recorded in MIDI format on a Roland TD-11 electronic drum kit containing 22 different MIDI pitches (see Table 2). This is a large vocabulary with strong imbalances in terms of instrument hit occurrence, which is common among drum datasets [27], since some instruments, due to their ‘role’ in the drum kit, are played more often than others (e.g., snares vs. crash). The authors of the dataset propose to reduce the performances to a 9-voice vocabulary using the mapping shown in Table 2 [26]. For this work, we used the same mapping to reduce the number of voices.

Voice Mapping and Corresponding Percentage of Hits for Each Instrument. (Mapping and data extracted from [26])

Table 2
MIDI Pitch	Roland TD-11 Vocabulary	Mapped 9-Voice Vocabulary (MIDI Pitch)	Hit %
36	Kick	Kick (36)	19.6 %
38, 40, 37	Snare Head, Rim, X-Stick	Snare (38)	30 %
45, 43, 58	Tom 2, Tom 3 Head & Rim	Low Tom (45)	3.6 %
48, 47	Tom 1, Tom 2 Rim	Mid Tom (48)	3.2 %
50	Tom 1 Rim	High Tom (50)	0.3 %
46, 26	OH (Bow, Edge)	Open Hi-Hat (46)	3.2 %
42, 22, 44	CH (Bow, Edge, Pedal)	Closed Hi-Hat (42)	26.5 %
49, 55, 57, 52	Crash 1 and 2 (Bow, Edge)	Crash (49)	2.0 %
51, 59, 53	Ride (Bow, Edge, Bell)	Ride (51)	11.5 %
Total Number of Hits			448783

Data Representation

When dealing with transformer architectures, the input/output space of possibilities is commonly quantized and then tokenized so as to learn a meaningful representation space in which each of the input/output tokens is embedded. For this work, however, we purposefully replaced the tokenized representation of events with a direct representation. To this end, we used the same representation as proposed by Gillick et al. [26]. In this representation (from now on called HVO, denoting hits, velocities, and offsets), the input and output sequences are directly represented by three stacked T × M matrices, where T corresponds to the number of time-steps, in this case, 32 (2 bars with 16 sub-divisions each), and M corresponds to the number of instruments, in this case, 9. The three matrices are defined as follows:

Hits: Binary-valued matrix that indicates the presence (1) or absence (0) of a drum hit.
Velocity: Continuous-valued matrix of velocity levels in the range [0, 1].
Offsets: Continuous-valued matrix of offset deviations from the nearest 16th note grid line, in the range [−0.5, 0.5], where ±0.5 implies mid-way between a grid line and the following/preceding grid line.

This results in an HVO matrix of dimension 32 × 27. An example of an HVO representation (with 4 voices and 4 timesteps) is shown in Image 3.
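As a rough illustration of the HVO layout described above, the following NumPy sketch stacks three 32 × 9 matrices into a single 32 × 27 array. The function name `make_hvo`, the column order (hits, then velocities, then offsets), and the zeroing of velocities and offsets where no hit is present are assumptions made for this example, not details confirmed by the text.

```python
import numpy as np

T_STEPS, N_VOICES = 32, 9  # 2 bars x 16 sub-divisions, 9 mapped voices


def make_hvo(hits, velocities, offsets):
    """Stack hits, velocities and offsets (each T x M) into a T x 3M array."""
    hits = np.asarray(hits, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    assert hits.shape == velocities.shape == offsets.shape
    assert set(np.unique(hits)) <= {0.0, 1.0}, "hits must be binary"
    assert velocities.min() >= 0.0 and velocities.max() <= 1.0
    assert offsets.min() >= -0.5 and offsets.max() <= 0.5
    # Assumption: velocity and offset are zeroed wherever there is no hit.
    return np.concatenate([hits, hits * velocities, hits * offsets], axis=1)


# Hypothetical loop: a kick (voice 0) on every beat, softer and slightly late on beat 2.
hits = np.zeros((T_STEPS, N_VOICES))
vels = np.zeros_like(hits)
offs = np.zeros_like(hits)
hits[::4, 0] = 1.0
vels[::4, 0] = 0.8
vels[4, 0], offs[4, 0] = 0.5, 0.1

hvo = make_hvo(hits, vels, offs)
print(hvo.shape)  # (32, 27)
```

With this layout, columns 0 to 8 hold the hits, columns 9 to 17 the velocities, and columns 18 to 26 the offsets.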

**Image 3**
An example of HVO derivation from the piano rolls

For the input to our system, we extract a monotonic groove from the GMD performances by squeezing all events at any given time step to a single voice. This approach is similar to GrooVAE with the exception that, in our work, the velocity information is not disregarded (see Image 4).

**Image 4**
Extracting a monotonic groove sequence from a drum performance
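The extraction step can be sketched as follows. The text only states that all events at a time step are squeezed into a single voice while velocity information is retained; how simultaneous events are merged is not specified here, so this example assumes that the loudest event at each step supplies the velocity and offset. The function name `monotonic_groove` is hypothetical.

```python
import numpy as np


def monotonic_groove(hits, velocities, offsets):
    """Collapse a T x M performance into a single-voice groove (three length-T vectors)."""
    hits = np.asarray(hits, dtype=float)
    velocities = np.asarray(velocities, dtype=float) * hits  # ignore silent cells
    offsets = np.asarray(offsets, dtype=float)
    steps = np.arange(hits.shape[0])
    loudest = velocities.argmax(axis=1)        # assumed rule: keep the loudest voice per step
    g_hits = hits.max(axis=1)                  # 1 if any voice is hit at that step
    g_vels = velocities[steps, loudest]
    g_offs = offsets[steps, loudest] * g_hits  # no offset where nothing is played
    return g_hits, g_vels, g_offs


# Hypothetical 4-step, 3-voice excerpt.
ex_hits = [[1, 1, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1]]
ex_vels = [[0.9, 0.4, 0.0], [0.0, 0.0, 0.0], [0.0, 0.6, 0.0], [0.3, 0.0, 0.7]]
ex_offs = [[0.05, -0.1, 0.0], [0.0, 0.0, 0.0], [0.0, 0.2, 0.0], [-0.2, 0.0, 0.1]]
g_hits, g_vels, g_offs = monotonic_groove(ex_hits, ex_vels, ex_offs)
```

Other merging rules (for example, averaging the offsets of simultaneous events) would be equally compatible with the description above.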

Architecture

An overview of the Transformer architecture used in this work is shown in Image 5.

As shown in Image 5, a monotonic groove with velocity and timing information (extracted from the target drum performance) is used as the input to the system. After passing through a linear layer, we apply positional encoding similar to the vanilla transformer [1]. Afterward, the data passes through a Transformer encoder, which comprises several layers, each of them composed of a multi-head attention module and a feedforward layer.
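The input stage can be sketched in NumPy as below, assuming the fixed sinusoidal encoding of the vanilla transformer [1] is the one meant by "similar to the vanilla transformer". The projection weights `w_in` and `b_in` are random placeholders, and d_model = 128 is just one of the values that appear in the hyper-parameter tables.

```python
import numpy as np


def sinusoidal_positional_encoding(n_steps, d_model):
    """Return the (n_steps, d_model) sinusoidal encoding of the vanilla transformer."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pos = np.arange(n_steps)[:, None]            # (n_steps, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_steps, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe


rng = np.random.default_rng(0)
groove = rng.random((32, 27))                    # placeholder input in HVO layout
w_in = rng.normal(scale=0.1, size=(27, 128))     # hypothetical input projection weights
b_in = np.zeros(128)
x = groove @ w_in + b_in                         # linear layer: 27 -> d_model
x = x + sinusoidal_positional_encoding(32, 128)  # add position information
```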

The multi-head attention module runs several attention mechanisms (n_heads) in parallel; the attention values are computed independently for each head, and the resulting values are then concatenated.
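For readers less familiar with the mechanism, a minimal NumPy sketch of multi-head self-attention follows. It uses the standard scaled dot-product formulation of [1]; the weight matrices are random placeholders, biases are omitted, and the function name is hypothetical, so it should be read as an illustration rather than the implementation used in this work.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (T, d_model); w_*: (d_model, d_model). Returns (output, attention weights)."""
    t, d_model = x.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads

    def split(m):  # (T, d_model) -> (n_heads, T, d_head)
        return m.reshape(t, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, T, T)
    weights = softmax(scores, axis=-1)                     # one attention map per head
    heads = weights @ v                                    # (n_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(t, d_model)  # concatenate the heads
    return concat @ w_o, weights


rng = np.random.default_rng(0)
x_in = rng.normal(size=(32, 128))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(128, 128)) for _ in range(4))
attended, attn = multi_head_self_attention(x_in, w_q, w_k, w_v, w_o, n_heads=4)
```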

Finally, following the $N$ attention layers, a linear layer maps the final attended tensor of shape 32 × d_model into the same shape as the input tensor (32 × 27). The resulting tensor is then split into three tensors of shape Batch Size × 32 × 9. As shown in Image 5, either sigmoid or hyperbolic tangent (tanh) activations are then applied to predict the hits, velocities, and micro-timings (offsets). Similar to [26], a binary cross-entropy loss (BCE) is used for the hits, and Mean Square Error losses (MSE) are used for velocities and micro-timings.
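The output stage might look like the sketch below. The assignment of activations to outputs is given only in Image 5, so the choice made here (sigmoid for hits and velocities, a tanh scaled by 0.5 for offsets) is an assumption inferred from the value ranges of the HVO representation. The weights `w_out` and `b_out` are random placeholders.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def output_head(encoded, w_out, b_out):
    """Map (batch, 32, d_model) encoder output to hits, velocities and offsets."""
    logits = encoded @ w_out + b_out         # (batch, 32, 27)
    h, v, o = np.split(logits, 3, axis=-1)   # three (batch, 32, 9) tensors
    hits = sigmoid(h)                        # hit probabilities in (0, 1)
    velocities = sigmoid(v)                  # assumed: sigmoid keeps velocities in (0, 1)
    offsets = 0.5 * np.tanh(o)               # assumed: scaled tanh keeps offsets in (-0.5, 0.5)
    return hits, velocities, offsets


rng = np.random.default_rng(0)
encoded = rng.normal(size=(16, 32, 128))        # placeholder encoder output, batch of 16
w_out = rng.normal(scale=0.05, size=(128, 27))  # hypothetical output projection
b_out = np.zeros(27)
pred_h, pred_v, pred_o = output_head(encoded, w_out, b_out)
```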

In the ground truth drum scores, most entries in the HVO matrix are zero (as there are no corresponding events). In order to help the model give more importance to activated voices (i.e. where the expected hits are 1), we applied a penalty or multiplying factor (in the range of 0 to 1) to the hit, velocity, and offset loss values where the expected hits are 0 (see next section for justification of this modification to the loss calculation).

The loss calculations are detailed in Equations (1) to (5).

$$\text{total loss} = \text{hit loss} + \text{velocity loss} + \text{offset loss} \tag{1}$$

where the losses are defined as:

$$\text{hit loss} = \sum_{i,j}{P \odot BCE(H_{predicted}, H_{expected})_{i,j}} \tag{2}$$

$$\text{velocity loss} = \sum_{i,j}{P \odot MSE(V_{predicted}, V_{expected})_{i,j}} \tag{3}$$

$$\text{offset loss} = \sum_{i,j}{P \odot MSE(O_{predicted}, O_{expected})_{i,j}} \tag{4}$$

$$\text{where } P = H_{expected} + p \cdot (J - H_{expected}), \quad 0 \le p \le 1, \quad J = \begin{pmatrix} 1 & 1 & \dots & 1\\ 1 & 1 & \dots & 1\\ \vdots & \vdots & \ddots & \vdots\\ 1 & 1 & \dots & 1 \end{pmatrix} \tag{5}$$

That is, $P$ equals 1 wherever a hit is expected and $p$ everywhere else.
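A NumPy sketch of this weighted loss is given below. It reads the penalty matrix as 1 where a hit is expected and $p$ elsewhere, which is the behaviour described in the text and used in the worked example of the Training section. BCE and MSE are taken element-wise and summed, following the equations as written. The function names, the dictionary layout of the arguments, and the clipping constant `eps` are assumptions for illustration.

```python
import numpy as np


def penalty_mask(h_expected, p):
    """1 where a hit is expected, p (0 <= p <= 1) elsewhere."""
    h_expected = np.asarray(h_expected, dtype=float)
    return h_expected + p * (1.0 - h_expected)


def bce(pred, target, eps=1e-7):
    """Element-wise binary cross-entropy; predictions are clipped to avoid log(0)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))


def mse(pred, target):
    """Element-wise squared error."""
    return (pred - target) ** 2


def total_loss(pred, target, p):
    """pred/target: dicts with 'h', 'v', 'o' arrays of identical shape."""
    mask = penalty_mask(target["h"], p)
    hit_loss = np.sum(mask * bce(pred["h"], target["h"]))
    velocity_loss = np.sum(mask * mse(pred["v"], target["v"]))
    offset_loss = np.sum(mask * mse(pred["o"], target["o"]))
    return hit_loss + velocity_loss + offset_loss
```

Setting `p = 1` turns the mask into a matrix of ones and recovers the unweighted loss.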

Training

We used Weights & Biases (W&B) [28] in the training pipeline of our experiments. The results of our training, our evaluations as well as all our training conditions are publicly available to ensure the accuracy and reproducibility of the results presented in this paper.

To tune the hyper-parameters, we took advantage of the random sweep tool provided by W&B. In our initial experiment, we assumed the same loss function as used in [26] (i.e. we did not apply a loss multiplier to hits; $P$ in Equation (5) was assumed to be a matrix of ones). This initial experiment immediately showed that while the models were somewhat able to generalize velocities and hits, they greatly struggled to generalize offsets (see Image 6 and the description of the plot that accompanies it).

**Image 6**

Velocity heatmaps of Kick for run northern-sweep-26 for three genres, comparing ground truths (above) and predictions (below).

An interactive version with all voices and styles can be found here.

Description of the plot:
x-axis: timing of hits
y-axis: velocity of hits
Scatter plot: all the predicted hits compiled into a single plot
Underlying heatmap plot: the probability of hits calculated from the scatter plot

We believe that the lack of offset variability might be related to the proportions of the offset loss in the total loss calculation (see Image 7).

**Image 7**
Individual training losses over 100 epochs of run northern-sweep-26

Multiple experiments were run in an effort to fix this issue. Of all the experiments attempted, the one we found to be most promising involved adding a penalty factor or loss multiplier ($p$ in Equation (5)). This factor multiplies the hit, velocity, and offset losses where the ground truth hits are 0, giving less importance to the places where there are no expected hits. Equation (6) shows an example of this calculation (reduced to 4 voices and 3 time-steps).

$$BCE(H_{predicted}, H_{expected}) = \begin{pmatrix} 0.52 & 0.10 & 0.27 & 0.32\\ 0.74 & 0.52 & 0.14 & 0.32\\ 0.64 & 0.60 & 0.30 & 0.34 \end{pmatrix}$$

$$\text{for } p = 0.5 \text{ and } H_{expected} = \begin{pmatrix} 1 & 0 & 1 & 0\\ 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 \end{pmatrix} \implies P = \begin{pmatrix} 1 & 0.5 & 1 & 0.5\\ 0.5 & 0.5 & 0.5 & 0.5\\ 0.5 & 1 & 1 & 0.5 \end{pmatrix}$$

$$\text{hit loss} = \sum_{i,j}{P \odot BCE(H_{predicted}, H_{expected})_{i,j}} = \sum \begin{pmatrix} 0.52 & 0.05 & 0.27 & 0.16\\ 0.37 & 0.26 & 0.07 & 0.16\\ 0.32 & 0.60 & 0.30 & 0.17 \end{pmatrix} = 3.25 \tag{6}$$

Accordingly, in the following experiments, the loss function was modified by applying a hit loss multiplication factor, $p$ as in Equation (5), resulting in the final set of hyper-parameters detailed in Table 3.

List of hyper-parameters tuned using a random sweep

Table 3
Parameter Type	Parameter
Architectural	Model Embedding Dimension (d_model in Image 5)
	Number of Transformer Blocks
	Number of Attention Heads
	Feedforward Layer Dimension
Training	Dropout Value
	Learning rate
	Batch Size
	Hit Loss Multiplier (p as noted in Equation (5))

While tuning the model, a criterion needs to be used so as to select more suitable candidates. At first glance, the final loss value may seem to be a good candidate. However, since each training run uses a different p or loss multiplier, loss values are not directly comparable. For example, hyper-parameters for two runs (absurd-sweep-24 and lyric-sweep-12) with test losses of 0.04 and 1.18 respectively are shown in Image 8.

**Image 8**
Sweep j9r6pt0s, highlighting runs lyric-sweep-12 and absurd-sweep-24

A lower loss value could lead us to believe that absurd-sweep-24 is the better model of the two. Nonetheless, as shown in Image 9(b), this run generates completely saturated predictions (all time-steps and voices have their hits set to 1). On the other hand, the lyric-sweep-12 run, with a higher test loss value, generates predictions that are closer to the ground truth (see Image 9(c)).

**Image 9**
Example ground truth and predictions, absurd-sweep-24 vs lyric-sweep-12

As a result, we decided to use another criterion that would be comparable across different training runs, namely hit accuracy (defined as correctly predicted hits over the total number of predicted hits). For the two runs discussed above, absurd-sweep-24 and lyric-sweep-12, the test hit accuracies reported on the last epoch are 12.14% and 90.38%, attesting that this criterion is more reliable for narrowing down the scope of the selection of the candidate hyper-parameters.
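One possible reading of this metric is sketched below: every (time step, voice) cell counts as one prediction, hit probabilities are thresholded at 0.5, and the accuracy is the fraction of cells that agree with the ground truth. Both the cell-wise reading and the threshold are assumptions, since the definition above is brief. Under this reading, a fully saturated prediction scores exactly the hit density of the ground truth, which would explain a low score for a run such as absurd-sweep-24.

```python
import numpy as np


def hit_accuracy(hit_probs, h_expected, threshold=0.5):
    """Fraction of (time step, voice) cells whose thresholded prediction matches the target."""
    predicted = np.asarray(hit_probs) >= threshold
    expected = np.asarray(h_expected) >= 0.5
    return float(np.mean(predicted == expected))


# Hypothetical ground truth: closed hi-hat on 8ths, kick on beats, snare on backbeats.
truth = np.zeros((32, 9))
truth[::2, 2] = 1.0
truth[::4, 0] = 1.0
truth[4::8, 1] = 1.0

saturated = np.ones_like(truth)  # every cell predicted as a hit
acc_saturated = hit_accuracy(saturated, truth)
acc_perfect = hit_accuracy(truth, truth)
```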

In order to have a baseline comparison, we calculated the hit accuracy of the GrooVAE Tap2Drum model [26] (evaluated using the model checkpoint provided by Google Magenta). This evaluation resulted in a hit accuracy of 87.42%, calculated over the test subset. To narrow down the scope of our model selection, we selected the runs with over 88% test hit accuracy, which resulted in 12 hyper-parameter configuration candidates. To further reduce our selection, we chose the 4 runs with the lowest offset loss out of these 12 candidates, in an attempt to improve offset variability.
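The two-stage selection described above amounts to a filter followed by a sort, as in the sketch below. The run names and metric values are invented placeholders and do not correspond to any run in the sweep.

```python
def select_candidates(runs, min_hit_accuracy=88.0, n_keep=4):
    """runs: list of dicts with 'name', 'hit_accuracy' (%) and 'offset_loss'."""
    # Stage 1: keep only runs above the hit-accuracy threshold.
    accurate = [r for r in runs if r["hit_accuracy"] > min_hit_accuracy]
    # Stage 2: among those, keep the n_keep runs with the lowest offset loss.
    return sorted(accurate, key=lambda r: r["offset_loss"])[:n_keep]


# Hypothetical sweep results (placeholder names and numbers).
runs = [
    {"name": "run-a", "hit_accuracy": 91.0, "offset_loss": 0.30},
    {"name": "run-b", "hit_accuracy": 87.0, "offset_loss": 0.10},
    {"name": "run-c", "hit_accuracy": 89.5, "offset_loss": 0.20},
    {"name": "run-d", "hit_accuracy": 90.2, "offset_loss": 0.25},
    {"name": "run-e", "hit_accuracy": 88.4, "offset_loss": 0.40},
    {"name": "run-f", "hit_accuracy": 92.1, "offset_loss": 0.35},
]
selected = [r["name"] for r in select_candidates(runs)]
```

Note that run-b has the lowest offset loss but is excluded because it falls below the accuracy threshold.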

Image 10 shows the hyper-parameters of the selected runs, together with the offset loss and hit accuracy values over the test set. All of these models have been trained for 100 epochs (no stop condition has been implemented). Throughout the training process, we used an SGD optimizer with a constant learning rate.

**Image 10**
The final set of hyper-parameter candidates

In order to avoid under/over-fitting, we retrained the final four models using the early stopping regularization method described in [29]. The progression of loss values for the re-trained models is shown in Image 11.

**Image 11**
Training vs. Test loss over training epoch of the four selected and re-trained models (smoothed with exponential running average)

As a result, the models chosen for the final evaluation are rosy-durian-248, misunderstood-bush-246, solar-shadow-247, and hopeful-gorge-252. The hyper-parameters of the selected models are shown in Table 4.

Hyper-parameter Configurations for the Final Selected Models (all models trained using SGD optimizer)

Table 4
Model	rosy-durian-248	misunderstood-bush-246	solar-shadow-247	hopeful-gorge-252
Batch Size	16	16	16	16
d model	512	128	128	512
dim forward	16	128	16	64
n heads	4	4	1	4
n layers	6	11	11	8
dropout	0.109	0.104	0.1594	0.171
learning rate	0.039	0.037	0.037	0.007
$p$	0.53	0.27	0.49	0.33

The source code and the trained checkpoints are publicly available, as is an interactive Google Colab notebook that uses the trained models to generate samples from MIDI.

Evaluation

In this section, we provide a number of objective evaluations of the final selected models. Given that the current work was a preliminary study of the effectiveness of transformer architectures in generating drum loops (using a direct representation of inputs/outputs - i.e. HVO format), our intention was not to generate the most “likable” loops, rather the goal was to train a number of models that are reasonably capable of generalizing the training set. To this end, we did not carry out any listening tests to evaluate the subjective likability of the generated loops.

It should be noted that the following evaluations are conducted on the validation set, as the test set had already been used in the hyper-parameter tuning procedures.

The first criterion we used for evaluating the different models was hit accuracy. Table 5 summarizes these results.

The validation set accuracy of predicting Hits

Table 5
Model	solar-shadow-247	rosy-durian-248	misunderstood-bush-246	hopeful-gorge-252
Batch Size	16	16	16	16
d model	128	512	128	512
dim forward	16	16	128	64
n heads	1	4	4	4
n layers	11	6	11	8
Average training time per epoch (min)	2.44	2.97	2.96	2.22
Accuracy (%)	88.15	91.2	90.56	89.07

A few generated samples are available in Table 6 to Table 10 (more samples are available online). Based on our subjective evaluation of the generations, we believe that the drum patterns predicted by rosy-durian-248 and misunderstood-bush-246 are acceptably close to the ground truth patterns.

Validation Set Synthesized Samples for rock_drummer7_session2_8

Table 6
Monotonic Groove	Ground Truth

rosy-durian	misunderstood-bush

Validation Set Synthesized Samples for Jazz_drummer3_session1_39

Table 7
Monotonic Groove	Ground Truth

rosy-durian	misunderstood-bush

Validation Set Synthesized Samples for funk_drummer1_session3_15

Table 8
Monotonic Groove	Ground Truth

rosy-durian	misunderstood-bush

Validation Set Synthesized Samples for hiphop_drummer3_session2_25

Table 9
Monotonic Groove	Ground Truth

rosy-durian	misunderstood-bush

Validation Set Synthesized Samples for afrobeat_drummer8_session1_16

Table 10
Monotonic Groove	Ground Truth

rosy-durian	misunderstood-bush

In the absence of a qualitative analysis of the generated patterns through a listening test, we employed the velocity heat maps introduced above to better understand the generated data. These plots were generated on a per-voice basis (analyzed across genres). Image 12 visualizes the generations of kick, snare, and closed hi-hat instruments (the rest of the heat maps are available online).

**Image 12**
Velocity Heat maps across different genres for Kick, Snare, and Closed Hi-hat calculated over the validation set (interactive versions are available here)

We were also interested to know how the generations develop throughout different epochs. Animations provided in Image 13 and Image 14 illustrate the development of the velocity heat maps for two runs, rosy-durian-248 and hopeful-gorge-252.

**Image 13**
Evolutions of velocity heat-maps across training epochs for run rosy-durian-248 (trained for 26 epochs)

**Image 14**
Evolutions of velocity heat-maps across training epochs for run hopeful-gorge-252 (trained for 90 epochs)

The first observation made from the above images is that initially, the models start generating constant velocity hits closer to the 16th note grid lines. However, as training advances, the hits start moving away from the gridlines and start expanding vertically (different velocity values). In other words, in the earlier epochs, the model learns mostly where the hits should be located, without learning much about the velocity or offset of the events.

Moreover, these animations clearly show that the hits corresponding to the five least frequent voices (low/mid/high toms, the crash, and open hi-hat) tend to appear only in the last few epochs of the run. Conversely, the models seem to be more confident in generating kicks, snares, closed hats, and rides; we strongly suspect that this is a consequence of the over-representation of these voices in the dataset (see Table 2). These findings were confirmed to be true for all four final selected models (the same plots as Image 13 are available online for all four runs). The distributions of the number of instruments in the ground truth and in the predictions are shown in Image 15.

**Image 15**

Distribution of the number of instruments across genres for the validation set ground truth data, rosy-durian-248 run, and GrooVAE
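A distribution of this kind can be computed as sketched below, assuming that the "number of instruments" of a loop is the number of voices that contain at least one hit. This definition and the function names are assumptions made for the example.

```python
import numpy as np


def n_active_voices(hits):
    """Number of voices with at least one hit in a T x M hit matrix."""
    return int(np.count_nonzero(np.asarray(hits).sum(axis=0) > 0))


def instrument_count_histogram(loops, n_voices=9):
    """Histogram over loops: entry k is the number of loops using exactly k voices."""
    counts = [n_active_voices(h) for h in loops]
    return np.bincount(counts, minlength=n_voices + 1)


# Two hypothetical loops: one with kick, snare and hi-hat, one with kick only.
loop_a = np.zeros((32, 9))
loop_a[::4, 0] = 1
loop_a[4::8, 1] = 1
loop_a[::2, 2] = 1
loop_b = np.zeros((32, 9))
loop_b[::8, 0] = 1
histogram = instrument_count_histogram([loop_a, loop_b])
```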

While the ground truth data seems to contain a noticeable number of examples with fewer than three or more than four instruments, both the rosy-durian-248 and GrooVAE models seem to mostly generate samples with only three or four voices. However, the GrooVAE model seems to be more capable of generating samples that have more than four voices. The velocity heat maps for hi- and mid-toms further illustrate the better performance of GrooVAE in generating more voices.

**Image 16**
A comparison of the mid- (right plots) and hi- (left plots) tom generations between rosy-durian-248 (bottom plots) and GrooVAE (top plots).

To analyze the amount of variation between the first and second bars of the generated loops, we calculated a velocity similarity score defined in Equation (7).

$$\text{Velocity Similarity Score} = 1 - \sum_{i,j} \left| Velocity_{i+16,j} - Velocity_{i,j} \right| \tag{7}$$

where $i$ denotes the time step (0 to 15) and $j$ denotes the voice index.
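The score in Equation (7) translates directly into NumPy, as in the sketch below. The function name is hypothetical, and the input is assumed to be the 32 × M velocity matrix of a single 2-bar loop.

```python
import numpy as np


def velocity_similarity(velocities):
    """Equation (7) for a (32, M) velocity matrix: 1 - sum |bar 2 - bar 1|."""
    velocities = np.asarray(velocities, dtype=float)
    assert velocities.shape[0] == 32, "expects two bars of 16 steps each"
    bar1, bar2 = velocities[:16], velocities[16:]
    return 1.0 - float(np.abs(bar2 - bar1).sum())


# Hypothetical loop whose second bar repeats the first, except for one softer kick.
vel = np.zeros((32, 9))
vel[::4, 0] = 0.75
score_identical = velocity_similarity(vel)
vel[16, 0] = 0.5
score_varied = velocity_similarity(vel)
```

A loop whose two bars are identical scores exactly 1. Because the sum is not normalized, the score decreases with every differing cell and can become negative for loops with many differences.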

Image 17 summarizes the distributions of velocity similarity scores.

**Image 17**
Distribution of velocity similarity scores across genres for the validation set ground truth data, rosy-durian-248 run, and GrooVAE

The above image demonstrates that, just like the validation set, both rosy-durian-248 and GrooVAE generate highly symmetrical velocity patterns. This can be further confirmed by looking at the generated velocity heat maps (similar to Image 12).

Finally, to evaluate the generated offsets we calculated the Timing Accuracy of the generations and the ground truth dataset. As in [30], Timing Accuracy is defined as the absolute sum of micro-timing deviations on 8th note positions. Image 18 summarizes these results.

**Image 18**
Distribution of Timing Accuracy scores across genres for the validation set ground truth data, rosy-durian-248 run, and GrooVAE
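A possible implementation of this measure on the HVO grid is sketched below. It assumes that the 8th note positions are the even-numbered steps of the 16th note grid, that only cells containing a hit contribute, and that "absolute sum" means the sum of absolute offsets. These are interpretations of the brief definition above, and [30] should be consulted for the exact formulation.

```python
import numpy as np


def timing_accuracy(hits, offsets):
    """Sum of |offset| over hits that fall on 8th-note positions of a 16th-note grid."""
    hits = np.asarray(hits, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    eighths = slice(0, None, 2)  # assumed: even 16th-note steps are the 8th-note positions
    return float(np.abs(offsets[eighths] * hits[eighths]).sum())


# Hypothetical loop: two hits on 8th-note positions and one on an off 16th.
ta_hits = np.zeros((32, 9))
ta_offs = np.zeros((32, 9))
ta_hits[0, 0], ta_offs[0, 0] = 1, 0.1
ta_hits[2, 1], ta_offs[2, 1] = 1, -0.2
ta_hits[3, 2], ta_offs[3, 2] = 1, 0.4   # off the 8th-note positions, so ignored
ta_score = timing_accuracy(ta_hits, ta_offs)
```

Under this reading, a perfectly quantized loop scores 0, and larger values indicate more micro-timing deviation.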

Image 18 clearly shows that the Timing Accuracy scores of the generations of both the rosy-durian-248 and GrooVAE models are considerably lower than those of the ground truth data. In other words, these two models struggle to accurately generalize the timing of events. The scatter data shown in Image 12 further confirm this shortcoming in both models.

Discussion

In this section, we start by discussing the drum generation system presented above. We then provide a number of suggestions on how to improve this system. Lastly, we discuss how this system can or cannot be used in real-time.

Monotonic Groove Performance to Drum Loop Generator

The evaluations presented above show that the transformer architecture can be used for generating drum loops given a performed monotonic groove. Moreover, we were able to confirm that this task can be executed while avoiding tokenization of events.

While the results show that the model does successfully generalize some aspects of the dataset (such as hit distributions and symmetrical velocities), the model certainly struggles to generalize offsets (timing deviations from the 16th note grid lines). Finally, compared to GrooVAE, the model proved more persistent in generating patterns that mostly employ kick, snare, closed hat, and ride instruments. We strongly suspect that one reason for these shortcomings is the imbalance in the dataset. This imbalance exists in two different aspects of the dataset:

The samples in the dataset are unequally distributed across genres as shown in Image 2.
Kick, snare, closed hat, and ride instruments correspond to 88% of the hits available in the dataset (see Table 2).

These imbalances are even more pronounced in the generated data. Different strategies should be employed to account for the imbalances in the dataset.

Moreover, we speculate that the design of the loss function (Equations (1) to (5)) contributes to the inability of the model to generate expressive timings. One reason may be the imbalanced contribution of the individual losses (hits, velocities, and offsets) to the overall loss. We were unable to resolve this issue at this point; hence, further investigation is required.

We believe there are a few improvements that can be investigated in the future:

Investigate the effectiveness of an alternative grid consisting of a mixture of triplet and duplet grid lines
Investigate the effects of pre-training on larger datasets
Investigate retraining the model in a multi-task learning setting to improve offset variability
Improve the controllability of the generations by utilizing additional control parameters such as style, tempo, event density, and instrument density.

Real-Time Application

Using the Google Colab notebook provided with this work, we crudely tested the inference time for the misunderstood-bush-246 model. Inference times without using a GPU were typically on the order of tens of milliseconds. Hence, we believe that, depending on the target application, the system is not computationally costly. For instance, in an offline setting, this system can be affordably deployed to generate drum patterns given a monotonic groove.

The relatively short inference time allows for the model to be deployed in certain real-time settings as well. For instance, the system can be used as an accompanying “drummer” that responds to (or rather accompanies) an instrumentalist with a fixed measure-long delay. In this context, generation is triggered at the beginning of every measure, and the generated content is played back once that measure ends (see Image 19). Note that in this context a reference click track should be used to metrically synchronize the performer with the system.

**Image 19**
An example of a real-time system using the existing generative model
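As a rough feasibility check for such a measure-synchronous setup, the sketch below times a callable and compares the result with the length of one measure. The tempo of 120 BPM used in the usage note, the placeholder `fake_model`, and the helper names are assumptions made for illustration; the real generator would be substituted for the placeholder.

```python
import statistics
import time

def median_latency_ms(fn, arg, repeats=20):
    """Median wall-clock time of fn(arg) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def measure_duration_ms(bpm, beats_per_measure=4):
    """Length of one measure in milliseconds (quarter-note beats)."""
    return beats_per_measure * 60_000.0 / bpm

def fits_in_budget(latency_ms, bpm, fraction=1.0):
    """True if inference finishes within `fraction` of one measure."""
    return latency_ms <= fraction * measure_duration_ms(bpm)

def fake_model(groove):
    # Placeholder for the real generator; it only echoes its input.
    return list(groove)
```

At 120 BPM in 4/4, one measure lasts 2000 ms and one 16th-note step lasts 125 ms, so a latency of a few tens of milliseconds fits comfortably within either budget: `fits_in_budget(50, 120, fraction=1/16)` evaluates to `True`.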

A rather more interesting alternative real-time application for a groove to drum generator is to use the model as a performable instrument (similar to Piano Genie [31], a real-time system that maps a limited set of 8 keys into a full-size piano performance). In this context, the instrumentalist (performer), rather than accompanying the generator, “plays” the generator as an instrument.

Unfortunately, the generative system developed for this work is not usable in this context. The main reason for this incompatibility is that we based our design only on the encoder section of the vanilla Transformer [1]. As a result, the attention mechanism at any time step attends not only to the current and past time steps but also to the upcoming ones. Hence, the existing drum generator is not a causal system and cannot be employed in such a real-time scenario as is. To use the generator as a real-time playable (performable) instrument, the model needs to be modified into a causal system. To do so, we will need a causally masked attention mechanism, such that the prediction at any time step depends only on current and past events.
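The effect of causal masking can be illustrated with a minimal sketch in plain Python. It assumes the raw attention scores (query-key products) have already been computed; the function names are illustrative and are not taken from our code base.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(scores, causal):
    """Row t holds the attention of time step t over all time steps.

    With causal=True, scores for future steps (j > t) are replaced by
    -inf before the softmax, so they receive exactly zero weight.
    """
    n = len(scores)
    out = []
    for t in range(n):
        row = [
            scores[t][j] if (not causal or j <= t) else -math.inf
            for j in range(n)
        ]
        out.append(softmax(row))
    return out
```

With `causal=False` (the behaviour of an unmasked encoder), every step distributes weight over the whole loop, including events that have not been performed yet. With `causal=True`, the weight matrix is lower triangular, so the output at step t can be computed as soon as step t has been played.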

Conclusions

In this paper, we presented a transformer architecture capable of generating drum loops given a performed monotonic groove. Moreover, our initial investigations demonstrate that this task can be reasonably executed without the discretization (tokenization) of the input/output sequences. Finally, we provided a discussion on the real-time applications of the proposed system.

Ethics Statement

We acknowledge that there are many ethical concerns with regard to generative systems in general, including automation, data ownership, generation ownership, fair (or rather biased) representation, and the accessibility of development tools. While we did not encounter ethical conflicts throughout this work, we are certainly concerned about future direct or indirect applications of our research, specifically in a commercial context.

During this work, we only used publicly available data as well as publicly available tools. For researching and developing the methods involved, we had access to a high-performance computing cluster; we acknowledge that such resources are not publicly available, particularly to independent researchers. As a result, we have done our utmost to publicly share not only the source code of our work but also the trained versions of our models, as well as scripts to easily load, use, and study the trained models without requiring any specific high-performance hardware.

Footnotes

https://colinraffel.com/projects/lmd/
The authors of GrooVAE collected and released hours of drum performances (Groove MIDI Dataset) from improvisations in different styles.
Throughout this paper, we will be using images similar to the following Image, referred to as Velocity Heat-maps. These plots are compiled from a collection of loops (either from predictions or the ground truth). In other words, these plots are obtained by superimposing the piano-roll representations of all the sample loops available in a given subset. The orange scatter points in these images represent a drum event (a hit) at a given time (found on the x-axis relative to the 16th note grid in 4/4) with a given velocity (found on the y-axis). The underlying heat maps are calculated by applying a Gaussian filter to the two-dimensional histograms of the scatter points and denote the probabilities of events happening in different regions.
A summary of these experiments can be found here: https://wandb.ai/anonmmi/NIME2022_anon_rytm/reports/First-sweeps-and-lack-of-variation-in-offsets--VmlldzoxNTA1ODk2
http://goo.gl/magenta/groovae-colab
Hit loss multiplier in Equation 7
https://github.com/AnonUserGit/TransformerVelGroove2Performance
https://github.com/AnonUserGit/TransformerVelGroove2Performance/blob/main/NIME2022_Demo.ipynb
Using an Nvidia Tesla T4 GPU (https://www.nvidia.com/en-us/data-center/tesla-t4/)
https://wandb.ai/anonmmi/NIME2022_anon_rytm/reports/Audios-selected-models-vs-GrooVAE--VmlldzoxNDc3ODM4
https://wandb.ai/anonmmi/NIME2022_anon_rytm/reports/Velocity-heatmaps-selected-models-vs-GrooVAE--VmlldzoxNDc3ODQ4
https://wandb.ai/anonmmi/NIME2022_anon_rytm/reports/Evolution-of-velocity-heatmaps-during-training-validation-set---VmlldzoxNDc3ODM1

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008.
[2] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv:1810.04805.
[3] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., & others. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
[4] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … others. (2020). Language models are few-shot learners. arXiv Preprint arXiv:2005.14165.
[5] Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., … Eck, D. (2018). Music transformer. arXiv Preprint arXiv:1809.04281.
[6] Payne, C. (2019). MuseNet. OpenAI Blog, 3.
[7] Donahue, C., Mao, H. H., Li, Y. E., Cottrell, G. W., & McAuley, J. (2019). LakhNES: Improving multi-instrumental music generation with cross-domain pre-training. arXiv Preprint arXiv:1907.04868.
[8] Choi, K., Hawthorne, C., Simon, I., Dinculescu, M., & Engel, J. (2020). Encoding musical style with transformer autoencoders. International Conference on Machine Learning, 1899–1908. PMLR.
[9] Huang, Y.-S., & Yang, Y.-H. (2020). Pop Music Transformer: Beat-based modeling and generation of expressive Pop piano compositions. Proceedings of the 28th ACM International Conference on Multimedia, 1180–1188.
[10] Hsiao, W.-Y., Liu, J.-Y., Yeh, Y.-C., & Yang, Y.-H. (2021). Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs. arXiv Preprint arXiv:2101.02402.
[11] Jiang, J., Xia, G. G., Carlton, D. B., Anderson, C. N., & Miyakawa, R. H. (2020). Transformer VAE: A hierarchical model for structure-aware and interpretable music representation learning. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 516–520. IEEE.
[12] Gómez-Marín, D., Jordà, S., & Herrera, P. (2020). Drum rhythm spaces: From polyphonic similarity to generative maps. Journal of New Music Research, 49(5), 438–456.
[13] Vogl, R., Eghbal-Zadeh, H., & Knees, P. (2019). An automatic drum machine with touch UI based on a generative neural network. Proceedings of the 24th International Conference on Intelligent User Interfaces: Companion, 91–92.
[14] Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017). Temporal convolutional networks for action segmentation and detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 156–165.
[15] McCormack, J., Gifford, T., Hutchings, P., Llano Rodriguez, M. T., Yee-King, M., & d’Inverno, M. (2019). In a silent way: Communication between AI and improvising musicians beyond sound. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–11.
[16] Chen, Y.-H., Huang, Y.-H., Hsiao, W.-Y., & Yang, Y.-H. (2020). Automatic composition of guitar tabs by transformers and groove modeling. arXiv Preprint arXiv:2008.01431.
[17] Wu, S.-L., & Yang, Y.-H. (2020). The jazz transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures. arXiv Preprint arXiv:2008.01307.
[18] Wu, X., Wang, C., & Lei, Q. (2020). Transformer-XL Based Music Generation with Multiple Sequences of Time-valued Notes. arXiv Preprint arXiv:2007.07244.
[19] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv:1810.04805.
[20] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv Preprint arXiv:1901.02860.
[21] Payne, C. (2019, April 25). MuseNet. OpenAI. openai.com/blog/musenet
[22] Raffel, C. (2016). Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD Thesis.
[23] Donahue, C., Mao, H. H., & McAuley, J. (2018). The NES Music Database: A multi-instrumental dataset with expressive performance attributes. ISMIR.
[24] Ens, J., & Pasquier, P. (2020). MMM: Exploring conditional multi-track music generation with the transformer. arXiv Preprint arXiv:2008.06048.
[25] Nuttall, T., Haki, B., & Jorda, S. (2021). Transformer Neural Networks for Automated Rhythm Generation.
[26] Gillick, J., Roberts, A., Engel, J., Eck, D., & Bamman, D. (2019). Learning to groove with inverse sequence transformations. International Conference on Machine Learning, 2269–2279. PMLR.
[27] Cartwright, M., & Bello, J. P. (2018). Increasing drum transcription vocabulary using data synthesis. Proc. International Conference on Digital Audio Effects (DAFx), 72–79.
[28] Biewald, L. (2020). Experiment Tracking with Weights and Biases. Retrieved from https://www.wandb.com/
[29] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
[30] Bruford, F., Lartillot, O., McDonald, Sk., & Sandler, M. (2020). Multidimensional similarity modelling of complex drum loops using the GrooveToolbox.
[31] Donahue, C., Simon, I., & Dieleman, S. (2019). Piano genie. Proceedings of the 24th International Conference on Intelligent User Interfaces, 160–164.
