Automatic Music Identification Using Content-Based Audio Fingerprints
Music Informatics: Assessment 2 (Computer Science MSc)
Introduction
This project implements automatic music identification based on the approach presented by Wang in “An industrial strength audio search algorithm” (ISMIR, 2003). The system matches a noisy song fragment to a database recording by comparing sets of content-based audio fingerprints.
To evaluate the system, a 200-file subset of the GTZAN dataset (its pop and classical recordings) is used as the database, and a corresponding set of noisy phone recordings provided by the university is used as the query set. Not all database recordings are present in the query set, and each query file contains only a 10-second segment of the corresponding 30-second database file.
Implementation
The implementation consists of two main tasks: fingerprint building and audio identification. Both tasks rely on the same fundamental processes, which are described below.
Spectrograms
The raw audio files are first converted to spectrograms, which represent the signal information in both the time and frequency domains. Three spectrogram types are explored:
- Short-time Fourier Transform (STFT): A direct representation of the signal’s spectral content over time.
- Mel-scaled Spectrogram: Applies the mel scale to the frequency component, providing a more accurate representation of human pitch perception.
- Constant-Q Transform: Uses a logarithmic pitch scale based on equal temperament, better approximating musical pitch classes.
The STFT spectrogram is found to be the most effective for this application.
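As an illustration of the STFT option, the following is a minimal sketch that uses only NumPy. The function name stft_magnitude, the frame and hop sizes, and the 8 kHz toy signal are assumptions chosen for the example, not the settings used in the project.

```python
import numpy as np

def stft_magnitude(signal, frame_size=1024, hop_size=512):
    """Return a magnitude spectrogram of shape (n_frames, frame_size // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_size)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    frames = np.stack([
        signal[i * hop_size : i * hop_size + frame_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy input (assumed): one second of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
spec = stft_magnitude(np.sin(2 * np.pi * 1000 * t))

# Frequency bin with the highest average magnitude, converted to Hz.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 1024
```

Each row of spec is one time frame and each column one frequency bin, which is the layout the later peak-picking step operates on.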
Constellation Maps
To reduce the amount of data while retaining the signature of the sound, the spectrograms are converted to constellation maps - collections of local magnitude peaks in the spectrogram. Only the time and frequency coordinates of each peak are kept, so far less data needs to be processed in later steps.
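Peak picking can be sketched in plain Python as below. The neighbourhood size, the magnitude floor, and the rule that a peak must be strictly larger than every neighbour are illustrative assumptions; the project may select peaks differently.

```python
def constellation_map(spec, neighbourhood=2, min_magnitude=0.0):
    """Return (time, freq) indices of local magnitude peaks.

    spec[t][f] is the magnitude at time frame t and frequency bin f.
    A cell is a peak if it exceeds min_magnitude and is strictly larger
    than every other cell within `neighbourhood` steps in time and frequency.
    """
    n_t, n_f = len(spec), len(spec[0])
    peaks = []
    for t in range(n_t):
        for f in range(n_f):
            value = spec[t][f]
            if value <= min_magnitude:
                continue
            t_range = range(max(0, t - neighbourhood), min(n_t, t + neighbourhood + 1))
            f_range = range(max(0, f - neighbourhood), min(n_f, f + neighbourhood + 1))
            if all(spec[tt][ff] < value
                   for tt in t_range for ff in f_range
                   if (tt, ff) != (t, f)):
                peaks.append((t, f))
    return peaks

# Toy spectrogram (made-up values) with two clear peaks.
toy = [
    [0, 1, 0, 0, 0, 0],
    [1, 9, 1, 0, 0, 0],
    [0, 1, 0, 0, 2, 0],
    [0, 0, 0, 2, 7, 1],
    [0, 0, 0, 0, 1, 0],
]
peaks = constellation_map(toy, neighbourhood=1)
print(peaks)  # [(1, 1), (3, 4)]
```

Here a 5 x 6 grid of 30 magnitudes collapses to two coordinate pairs, which shows the kind of reduction the constellation map provides.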
Peak Pairing
The peak pairing algorithm, based on the approach described by Wang [7], is used to generate a set of context-independent features that can be quickly matched to the files stored in the database. Each peak in the constellation map is paired with every other peak within a fixed time-frequency target zone, producing a fingerprint of (f1, f2, Δt) tuples, where f1 and f2 are the frequencies of the two peaks and Δt is the time offset between them.
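The pairing step might look like the sketch below. The shape of the target zone (later peaks only, at most max_dt frames ahead and max_df bins away) and both default limits are assumptions made for illustration; the source does not state the zone actually used.

```python
def pair_peaks(peaks, max_dt=10, max_df=50):
    """Pair each anchor peak with later peaks inside its target zone.

    peaks: iterable of (time, freq) indices.
    Returns a list of (f1, f2, dt) tuples.
    """
    ordered = sorted(peaks)  # sort by time, then frequency
    pairs = []
    for i, (t1, f1) in enumerate(ordered):
        for t2, f2 in ordered[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are time-sorted, so no later peak can qualify
            if dt >= 1 and abs(f2 - f1) <= max_df:
                pairs.append((f1, f2, dt))
    return pairs

# Made-up peaks as (time frame, frequency bin).
example_peaks = [(0, 10), (2, 40), (3, 200), (4, 30), (20, 15)]
pairs = pair_peaks(example_peaks)
print(pairs)  # [(10, 40, 2), (10, 30, 4), (40, 30, 2)]
```

Because each tuple stores only a time difference rather than absolute times, the same pair is produced wherever the fragment starts within the recording.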
A simple hashing function is implemented to convert the peak pair tuples to unique standardized integer values, mitigating the effects of duplicate values in the audio identification function.
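One simple way to turn a tuple into a single integer is bit packing, sketched below. The 10-bit field widths and the names hash_pair and fingerprint are assumptions, and the project's actual hashing function may differ. Storing the hashes in a set is one possible reading of how duplicates are handled.

```python
FREQ_BITS = 10  # assumed: frequency bins fit in 0..1023
DT_BITS = 10    # assumed: time offsets fit in 0..1023

def hash_pair(f1, f2, dt):
    """Pack an (f1, f2, dt) tuple into one integer, one bit field per value."""
    if not (0 <= f1 < (1 << FREQ_BITS)
            and 0 <= f2 < (1 << FREQ_BITS)
            and 0 <= dt < (1 << DT_BITS)):
        raise ValueError("value out of range for the assumed bit widths")
    return (f1 << (FREQ_BITS + DT_BITS)) | (f2 << DT_BITS) | dt

def fingerprint(pairs):
    """Hash every pair; the set collapses repeated tuples into one entry."""
    return {hash_pair(f1, f2, dt) for f1, f2, dt in pairs}

h = hash_pair(10, 40, 2)
print(h)  # 10526722

fp = fingerprint([(10, 40, 2), (10, 40, 2), (40, 30, 2)])
print(len(fp))  # 2
```

Within the assumed ranges, distinct tuples always map to distinct integers, so two hashes are equal only when the underlying peak pairs are equal.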
Identification
Two methods for comparing the query fingerprint to the database fingerprints are explored:
- Nested Loops: A simple nested loop that compares every hash in the query fingerprint to every hash in each database fingerprint. This method is extremely slow and not viable with a large database.
- Set Intersection: Uses the set intersection function to retrieve all the entries present in both the query and database fingerprints. This method is much more efficient, with a time complexity of O(min(q, d)), where q and d represent the number of hashes in the query and database fingerprints, respectively.
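The two comparison methods can be contrasted with a short sketch. The function names and the toy hash values are hypothetical; both functions count the distinct hashes shared by a query and a database fingerprint.

```python
def similarity_nested(query_hashes, db_hashes):
    """Brute force: compare every query hash with every database hash."""
    count = 0
    for q in set(query_hashes):
        for d in db_hashes:
            if q == d:
                count += 1
                break  # count each query hash at most once
    return count

def similarity_set(query_hashes, db_hashes):
    """Set intersection: hash lookups replace the inner loop."""
    return len(set(query_hashes) & set(db_hashes))

# Made-up fingerprints for one query and two database tracks.
query = {1, 2, 3, 4}
database = {"track_a": {3, 4, 5, 6}, "track_b": {7, 8}}

scores = {name: similarity_set(query, hashes) for name, hashes in database.items()}
print(scores)  # {'track_a': 2, 'track_b': 0}
```

The nested version performs on the order of q * d comparisons per database item. The set version iterates over the smaller set and does an average constant-time lookup in the larger one, which is where the O(min(q, d)) figure comes from.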
A similarity value is computed for each database item, and the three best matches are recorded in the output.txt file.
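Ranking and recording the matches could be done along the following lines. The track names, the query file name, and the tab-separated line layout are assumptions; the source names output.txt but does not describe its format.

```python
import heapq

def top_matches(query_hashes, database, k=3):
    """Return the k database items sharing the most hashes with the query."""
    scores = {name: len(query_hashes & hashes) for name, hashes in database.items()}
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

def format_result(query_name, matches):
    """Assumed layout: query name, then ranked track names, tab-separated."""
    return "\t".join([query_name] + [name for name, _ in matches])

def write_results(path, lines):
    """Write one result line per query, e.g. write_results("output.txt", [line])."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")

# Made-up fingerprints; names are hypothetical.
database = {
    "pop.00001": {1, 2, 3, 4},
    "pop.00002": {1, 2, 9},
    "classical.00001": {1, 8},
    "classical.00002": {7},
}
matches = top_matches({1, 2, 3, 5}, database)
line = format_result("query_01.wav", matches)
print(matches)  # [('pop.00001', 3), ('pop.00002', 2), ('classical.00001', 1)]
```

heapq.nlargest avoids sorting the full score table when only the three best entries are needed.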
Results and Discussion
The system performs very well on the pop genre, with an accuracy of 94.23% at rank 1. However, the performance on the classical genre is significantly lower, at 59.26% accuracy at rank 1. This could indicate a bias towards pop fingerprint components or a wider variety of sounds in the pop songs compared to classical.
The F-Measure scores reveal an interesting observation - pop songs outperform classical in the first rank, but this trend reverses in the subsequent ranks. This could suggest that pop songs have more uniquely identifiable fingerprints and that classical songs share more similarity with each other.
The average execution time for fingerprint generation is around 436 ms, and the average identification time is around 409 ms (reduced to 100 ms after subtracting the file loading time). While these results are acceptable for the scale of this implementation, they would not be suitable for industry-scale applications with much larger databases.
To improve the system’s scalability and performance, the following suggestions are made:
- Implement a more streamlined and robust search method, such as genre classification or other information retrieval/indexing-based heuristics.
- Explore a de-noising layer to improve the likelihood of selecting relevant peaks in the query files, without drastically increasing the execution time.