Deep Learning for Audio and Music: Final Project (Computer Science MSc, Summary)

Introduction

Expressive performance is a critical aspect of music production, allowing performers to convey their emotional and structural narratives to listeners through nuanced deviations in tempo, dynamics, and articulation. These expressive elements transform a mechanical rendition of a score into a vibrant and engaging musical experience. This study explores the use of Long Short-Term Memory (LSTM) and Transformer-based neural networks to model expressive performance, focusing on the prediction of rubato, articulation, and dynamics in polyphonic music.

Methodology

Datasets

A significant challenge in modelling expressive performance is the acquisition of clean and aligned score and performance MIDI files. Misalignments can arise from performance errors or deviations from the score, complicating the measurement of performance actions such as inter-onset-interval (IOI) and duration deviations. To address this, we employed Nakamura’s alignment algorithm, which pairs performance notes with score notes and identifies missing or extra notes. This algorithm was applied to the ASAP dataset, a collection of matched performances and scores, resulting in perfectly aligned data for analysis.

[Figure: Example output of the alignment algorithm]

Data Representation

The MIDI data was converted into lists of notes, where each note’s features included pitch, IOI, duration, and velocity. Two datasets were formed: one for rubato and articulation, and another for velocity. In the rubato and articulation dataset, velocity values were dropped, and deviations in IOI and duration were computed as the ratio of performance to score values. The velocity dataset retained only performance notes. Each sequence was padded with zeros and split into 200-note sub-sequences, with the current note placed at index \(n/2\). To prevent the model from “cheating,” the \(y\) features of future notes were masked with zeros. The data was normalized to the range \([0, 1]\) using per-feature min-max scaling.
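To make the representation concrete, the following is a minimal NumPy sketch of how one timing sample could be assembled under the description above; the array layout, feature ordering, and helper names are illustrative assumptions rather than the project’s actual code.

```python
# Minimal sketch of the timing-dataset construction described above (NumPy only).
# Feature layout and helper names are illustrative assumptions, not the project's code.
import numpy as np

SEQ_LEN = 200          # sub-sequence length
CENTER = SEQ_LEN // 2  # index of the "current" note

def make_timing_sample(score, perf, i):
    """Build one sub-sequence centred on note i.

    `score` and `perf` are (N, 3) arrays of [pitch, IOI, duration]
    for aligned score/performance notes.
    """
    # Performance actions as ratios between performance and score values
    ioi_dev = perf[:, 1] / np.maximum(score[:, 1], 1e-6)
    dur_dev = perf[:, 2] / np.maximum(score[:, 2], 1e-6)
    feats = np.stack([score[:, 0], score[:, 1], score[:, 2], ioi_dev, dur_dev], axis=1)

    # Zero-pad so every note can sit at index CENTER of its window
    padded = np.zeros((len(feats) + SEQ_LEN, feats.shape[1]))
    padded[CENTER:CENTER + len(feats)] = feats
    window = padded[i:i + SEQ_LEN].copy()

    # Zero out the y (deviation) columns from the current note onward,
    # so the model cannot "cheat" by reading what it is asked to predict
    window[CENTER:, 3:] = 0.0
    target = feats[i, 3:]          # y: IOI and duration deviations of the current note
    return window, target

def minmax_by_feature(x, eps=1e-8):
    """Scale each feature column to [0, 1]."""
    lo, hi = x.min(axis=0, keepdims=True), x.max(axis=0, keepdims=True)
    return (x - lo) / (hi - lo + eps)
```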

[Figure: Example timing dataset sample (sequence length 10; current note in red, future notes in light red)]

LSTM Model

The LSTM model employed a “present and future” approach, splitting the input sequence at the current note and feeding each half into separate bidirectional LSTM layers. The outputs were concatenated, flattened, and passed through linear mapping layers to generate the final prediction. This approach allowed the model to learn different interpretations of the sequence based on past and future contexts.
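A hedged PyTorch sketch of this “present and future” layout is shown below; the hidden size, mapping-layer width, and class name are assumptions made for illustration.

```python
# Sketch of the "present and future" bidirectional-LSTM arrangement described above.
# Layer sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn

class PresentFutureLSTM(nn.Module):
    def __init__(self, n_features=5, hidden=64, n_out=2, seq_len=200):
        super().__init__()
        # One bidirectional LSTM for the past/present half, one for the future half
        self.past_lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.future_lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        # Concatenated, flattened outputs -> linear mapping layers -> final prediction
        self.head = nn.Sequential(
            nn.Linear(2 * hidden * seq_len, 256),
            nn.ReLU(),
            nn.Linear(256, n_out),
        )

    def forward(self, x):                        # x: (batch, seq_len, n_features)
        half = x.size(1) // 2
        past, _ = self.past_lstm(x[:, :half])    # notes before the current note
        future, _ = self.future_lstm(x[:, half:])  # current note and future context
        out = torch.cat([past, future], dim=1)   # (batch, seq_len, 2 * hidden)
        return self.head(out.flatten(1))         # flatten and map to the prediction
```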

Transformer Model

The Transformer model followed the standard architecture proposed in the Attention Is All You Need paper. A linear layer was used for embedding, projecting the input features up by a factor of 10. The encoder received the entire input sequence, while the decoder received the central 9 notes. Sinusoidal positional encodings were used to signify present and future notes. The decoder’s output was flattened and passed through linear mapping layers to produce the final prediction. Dropout with a probability of 0.1 was applied up to the last linear layer.
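The configuration described above could look roughly like the PyTorch sketch below, built on nn.Transformer. Only the embedding factor, decoder window, positional encoding, and dropout rate come from the text; the layer counts, head count, and module names are assumptions.

```python
# Hedged sketch of the Transformer configuration described above.
# Layer counts, head count, and names are illustrative assumptions.
import math
import torch
import torch.nn as nn

class ExpressiveTransformer(nn.Module):
    def __init__(self, n_features=5, n_out=2, seq_len=200, dec_len=9, dropout=0.1):
        super().__init__()
        d_model = n_features * 10                # linear embedding projects features up 10x
        self.embed = nn.Linear(n_features, d_model)
        self.register_buffer("pos", self._sinusoidal(seq_len, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=5, num_encoder_layers=3,
            num_decoder_layers=3, dropout=dropout, batch_first=True)
        self.head = nn.Linear(d_model * dec_len, n_out)  # flatten decoder output -> prediction
        self.dec_len = dec_len

    @staticmethod
    def _sinusoidal(length, d_model):
        # Standard sinusoidal positional encodings
        pos = torch.arange(length).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, x):                         # x: (batch, seq_len, n_features)
        src = self.embed(x) + self.pos            # encoder sees the whole sequence
        centre, half = x.size(1) // 2, self.dec_len // 2
        tgt = src[:, centre - half: centre + half + 1]  # central 9 notes go to the decoder
        out = self.transformer(src, tgt)          # (batch, dec_len, d_model)
        return self.head(out.flatten(1))
```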

Evaluation

The models were trained on 1000 samples from the timing and velocity datasets using the Adam optimizer, a mean-squared-error loss function, a batch size of 100, gradient clipping at 0.5, and a learning rate of \(1 \times 10^{-5}\). Training times ranged from 6–7 hours for the Transformers to 11–13 hours for the LSTMs.
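An illustrative training loop matching these hyper-parameters might look as follows; the dataset wrapping and epoch count are assumptions not stated in the text.

```python
# Illustrative training loop using the hyper-parameters listed above
# (Adam, MSE loss, batch size 100, gradient clipping at 0.5, lr 1e-5).
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, x, y, epochs=50, device="cpu"):
    """x: (N, seq_len, n_features) float tensor, y: (N, n_out) float tensor."""
    loader = DataLoader(TensorDataset(x, y), batch_size=100, shuffle=True)
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-5)
    loss_fn = torch.nn.MSELoss()
    model.to(device).train()
    for epoch in range(epochs):
        running = 0.0
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            optimiser.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # gradient clipping
            optimiser.step()
            running += loss.item()
        print(f"epoch {epoch}: mean loss {running / len(loader):.6f}")
```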

Velocity Models

The Transformer model significantly outperformed the LSTM in velocity prediction, likely due to its ability to capture polynomial relationships between pitch and velocity more effectively. Both models demonstrated the ability to identify and emphasize the leading melody in polyphonic samples, with the LSTM showing stronger crescendos and decrescendos.

Timing Models

The Transformer model exhibited a slight performance advantage over the LSTM in timing prediction, producing more coherent and musically plausible outputs. The LSTM, by contrast, produced erratic and overly fast output, often assigning negative IOIs to notes. The Transformer demonstrated clear use of rubato and rhythmic patterns, likely enabled by its self-attention mechanism.

Conclusion

This study documented the application of LSTM and Transformer architectures to predicting expressive timing and velocity in polyphonic music. While the velocity models, particularly the LSTM, showed promising musicality and expression, the timing models’ performance was less satisfactory. Both the model architectures and the data representation may have contributed to this outcome. Representing timing deviations as ratios between performance and score IOIs may have introduced large negative deviations that complicated the models’ learning. Future work should explore standardization approaches to improve the data distribution and model interpretability.
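As a pointer for that future work, per-feature z-score standardization of the deviation features could replace min-max scaling; the sketch below is purely illustrative and was not part of the study.

```python
# One possible standardisation scheme for future work: per-feature z-scoring
# instead of min-max scaling. Purely illustrative.
import numpy as np

def standardise_by_feature(x, eps=1e-8):
    """Zero-mean, unit-variance scaling of each feature column."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps), (mean, std)  # keep stats to invert predictions later
```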