Image2MID: Environment-based Music Generation for Video Games
Creative Coding: Final Project (Computer Science MSc)
Since the medium's early days in the 1970s, non-diegetic music has been a cornerstone of video game design, shaping player experiences through iconic motifs. The Legend of Zelda franchise exemplifies this: leitmotifs such as the Kokiri Forest theme distinguish recurring areas and characters across multiple games, creating a cross-modal link that unites audio and visuals. These themes aren't just background music; they are tangible connections between artistic elements, reinforcing the game's identity across iterations.
As we look to the future of game design, the advancement of technologies like artificial intelligence may offer interesting opportunities for research and development. The use of neural networks for music generation isn’t new, with the first instance dating back to 1989. However, music generated by these networks is often flat, unmusical, and thematically ambiguous.
Over the years, steady progress has been made in automatic composition through research into neural network architectures, fuelled by increases in computational power and data availability. However, the field still lacks work investigating the relationship between environment and motif in video games and other media, leaving an opportunity to treat the cross-modality of music and visuals as a generative task.
Data and Representation
The goal of this project is to design a model that composes short musical motifs conditioned on an image of a video game environment. To achieve this, we require a dataset containing pairs of environment images and symbolic music files. Given the novelty of this task, it is not surprising that there are no publicly available datasets suitable for our needs.
Therefore, we must first create the datasets we require for training. To do this, we scraped a number of online MIDI repositories for MIDI files from the Zelda franchise. In addition, for a smaller subset of these files, we manually acquired images of their respective in-game environments. From this data, we generated two datasets:
- Image-to-MIDI Dataset – This dataset contains pairs of environment images and MIDI files, with each MIDI file corresponding to a screenshot from its respective in-game area. The images all share a 16:9 aspect ratio and have three colour channels.
- Zelda MIDI Dataset – This dataset contains MIDI tracks from various Zelda games, used to train the continuation module of our implementation. The MIDI files were scraped from various online sources and manually inspected for quality.
The MIDI portion of the Image-to-MIDI dataset is converted to 2D piano-roll representations using the pypianoroll module in Python, and the images are resized to ratio-preserving thumbnails of a fixed size. For the Zelda MIDI dataset, MIDI notes are converted to integer tokens using a dictionary mapping.
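A minimal sketch of this preprocessing is shown below, assuming the pypianoroll and Pillow libraries; the thumbnail size and helper names are illustrative rather than the project's exact values.

```python
import numpy as np
import pypianoroll
from PIL import Image

THUMBNAIL_SIZE = (256, 144)  # 16:9 target size (assumed value)

def midi_to_pianoroll(midi_path):
    """Load a MIDI file and flatten it to a single binary piano roll (time x 128 pitches)."""
    multitrack = pypianoroll.read(midi_path)
    multitrack.binarize()
    stacked = multitrack.stack()   # (n_tracks, time, 128)
    return stacked.max(axis=0)     # merge all tracks into one 2D roll

def image_to_thumbnail(image_path, size=THUMBNAIL_SIZE):
    """Resize an environment screenshot to a ratio-preserving thumbnail in [0, 1]."""
    img = Image.open(image_path).convert("RGB")
    img.thumbnail(size)            # PIL preserves the aspect ratio here
    return np.asarray(img, dtype=np.float32) / 255.0

def tokenise(note_numbers, note_to_token):
    """Map MIDI note numbers to integer tokens via a fixed dictionary."""
    return [note_to_token[n] for n in note_numbers if n in note_to_token]
```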
System Description
We use an ensemble modelling approach in the design of our network, combining several different architectures into a larger system. Each module was trained separately on its respective dataset. When combined, the model takes an image of a video game environment as input, generates a starting seed from its features, composes a short melody from that seed, and then adds velocity and timing deviations to make the result sound more expressive and performed.
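The sketch below shows how the modules fit together at inference time; the function and argument names are placeholders for illustration, not the actual classes used in the project.

```python
def generate_motif(image, seed_vae, continuation_lstm, duration_model, velocity_model,
                   roll_to_tokens, render_midi):
    """High-level flow of the ensemble (all arguments are placeholder callables)."""
    seed_roll = seed_vae(image)                        # 1. image -> short piano-roll seed
    tokens = roll_to_tokens(seed_roll)                 # 2. tokenise the seed
    melody = continuation_lstm(tokens, steps=100)      # 3. LSTM extends the seed autoregressively
    durations = duration_model(melody)                 # 4. EPR: per-note durations
    velocities = velocity_model(melody)                # 5. EPR: per-note velocities
    return render_midi(melody, durations, velocities)  # 6. write the expressive MIDI file
```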
Seed Generation
A variational autoencoder (VAE) is used to generate the starting seed of the motif. This module takes an environment image as input, compresses it through a bottleneck, and encodes it in a lower-dimensional latent space. The decoder section of the VAE then expands the latent vector back into a piano-roll representation, and a threshold value of 0.7 is used to pick out predicted notes. This network was trained on the Image-to-MIDI dataset, with images as input and MIDI piano rolls as targets. Given a new image, the network generates a short MIDI piano roll based on the features of that image.
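A condensed PyTorch sketch of such a VAE is given below. The layer sizes, latent dimension, and the 144x256 thumbnail / 32-step, 128-pitch seed shapes are assumptions chosen for illustration, not the project's exact configuration.

```python
import torch
import torch.nn as nn

IMG_H, IMG_W, LATENT_DIM, ROLL_STEPS, N_PITCHES = 144, 256, 64, 32, 128  # assumed shapes

class SeedVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),   # 144x256 -> 72x128
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 72x128 -> 36x64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 36x64 -> 18x32
            nn.Flatten(),
        )
        enc_out = 64 * 18 * 32
        self.to_mu = nn.Linear(enc_out, LATENT_DIM)
        self.to_logvar = nn.Linear(enc_out, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, ROLL_STEPS * N_PITCHES), nn.Sigmoid(),  # per-cell note probabilities
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterisation trick
        probs = self.decoder(z).view(-1, ROLL_STEPS, N_PITCHES)
        return probs, mu, logvar

def sample_seed(model, image, threshold=0.7):
    """Generate a binary piano-roll seed from a single image tensor of shape (3, H, W)."""
    model.eval()
    with torch.no_grad():
        probs, _, _ = model(image.unsqueeze(0))
    return (probs.squeeze(0) > threshold).float()  # keep notes above the 0.7 threshold
```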
Motif Continuation
Once the seed is generated, it is tokenised and passed to the motif continuation module. Here, an LSTM network autoregressively predicts the next note for 100 steps, producing a much longer composition based on the input seed. This network was trained on the Zelda MIDI dataset, with each file yielding a collection of 50-note sequences shifted by one step. Dropout and weight decay were added to prevent the network from simply reconstructing the training data.
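A minimal version of this autoregressive loop might look as follows in PyTorch; the vocabulary size, layer sizes, and greedy decoding are illustrative assumptions rather than the project's settings.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 128, 64, 256  # assumed sizes

class MotifLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=2,
                            dropout=0.3, batch_first=True)  # dropout between LSTM layers
        self.head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embed(tokens), state)
        return self.head(out), state

def continue_motif(model, seed_tokens, steps=100):
    """Autoregressively extend a tokenised seed by `steps` notes (greedy decoding)."""
    model.eval()
    tokens = list(seed_tokens)
    with torch.no_grad():
        logits, state = model(torch.tensor(tokens).unsqueeze(0))   # warm up on the seed
        for _ in range(steps):
            next_token = int(logits[0, -1].argmax())               # pick the most likely next note
            tokens.append(next_token)
            logits, state = model(torch.tensor([[next_token]]), state)
    return tokens

# Weight decay during training could be applied via the optimiser, e.g.
# optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```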
Expressive Performance Rendering
To account for the lack of note duration and velocity information in the motif continuation network's output, we implemented expressive performance rendering (EPR) models. These simple convolutional regression networks take the generated note sequence as input and output either a list of note durations or a list of velocity values. The models were trained on the Zelda MIDI dataset, using a pre-processing step similar to that of the LSTM network. The resulting vectors are applied to the MIDI notes, and the output is rendered as a MIDI file.
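A sketch of one such EPR model is shown below; the embedding size, kernel widths, and output scaling are assumptions for illustration, with one instance trained per target (durations or velocities).

```python
import torch
import torch.nn as nn

class EPRNet(nn.Module):
    """1D convolutional regressor: a token sequence in, one value per note out."""
    def __init__(self, vocab_size=128, embed_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=1),  # one regression output per note
            nn.Sigmoid(),                     # squash to [0, 1] for later scaling
        )

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)   # -> (batch, embed_dim, seq_len)
        return self.conv(x).squeeze(1)           # -> (batch, seq_len)

# Hypothetical usage: map predictions into the MIDI velocity range 0-127.
# velocity_model = EPRNet()
# velocities = (velocity_model(torch.tensor([tokens])) * 127).round().int()
```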
Example Outputs
- Elden Ring
- Celeste