A musical piece often consists of recurring elements at various levels, from motifs to phrases to sections such as verse-chorus. To generate a coherent piece, a model needs to reference elements that came before, sometimes in the distant past, repeating, varying, and further developing them to create contrast and surprise. Intuitively, self-attention (Parikh et al., 2016) appears to be a good match for this task. Self-attention over its own previous outputs allows an autoregressive model to access any part of the previously generated output at every step of generation. By contrast, recurrent neural networks have to learn to proactively store elements to be referenced in a fixed size state or memory, potentially making training much more difficult. We believe that repeating self-attention in multiple, successive layers of a Transformer decoder (Vaswani et al., 2017) helps capture the multiple levels at which self-referential phenomena exist in music.
In its original formulation, the Transformer relies on absolute position representations, using either positional sinusoids or learned position embeddings that are added to the per-position input representations. Recurrent and convolutional neural networks instead model position in relative terms:
RNNs through their recurrence over the positions in their input, and CNNs by applying kernels that effectively choose which parameters to apply based on the relative position of the covered input representations.
Music has multiple dimensions along which relative differences arguably matter more than their absolute values; the two most prominent are timing and pitch. To capture such pairwise relations between representations, Shaw et al. (2018) introduce a relation-aware version of self-attention which they use successfully to modulate self-attention by the distance between two positions. We extend this approach to capture relative timing and optionally also pitch, which yields improvement in both sample quality and perplexity for JSB Chorales. As opposed to the original Transformer, samples from a Transformer with our relative attention mechanism maintain the regular timing grid present in this dataset. The model furthermore captures global timing, giving rise to regular phrases.
The original formulation of relative attention (Shaw et al., 2018) requires O(L 2D) memory where L is the sequence length and D is the dimension of the model’s hidden state. This is prohibitive for long sequences such as those found in the Piano-e-Competition dataset of human-performed virtuosic, classical piano music. In Section 3.4, we show how to reduce the memory requirements to O(LD), making it practical to apply relative attention to long sequences. The Piano-e-Competition dataset consists of MIDI recorded from performances of competition participants, bearing expressive dynamics and timing on the granularity of < 10 miliseconds. Discretizing time on a fixed grid that would yield unnecessarily long sequences as not all events change on the same timescale. We hence adopt a sparse, MIDI-like, event-based representation from (Oore et al., 2018), allowing a minute of music with 10 milisecond resolution to be represented at lengths around 2K, as opposed to 6K to 18K on a fixed-grid representation with multiple performance attributes.
As position in sequence no longer corresponds to time, a priori it is not obvious that relative attention should work as well with such a representation. However, we will show in Section 4.2 that it does improve perplexity and sample quality over strong baselines. We speculate that idiomatic piano gestures such as scales, arpeggios and other motifs all exhibit a certain grammar and recur periodically, hence knowing their relative positional distances makes it easier to model this regularity. This inductive bias towards learning relational information, as opposed to patterns based on absolute position, suggests that the Transformers with relative attention could generalize beyond the lengths it was trained on, which our experiments in Section 4.2.1 confirm