Combining Music with High-Tech
Overview: My PhD research project explores the role of multimodal information in music understanding. Drawing inspiration from the inherent multimodality of music itself, the project seeks to enhance the contextual comprehension of music for non-experts by integrating insights from musicology, music psychology, and music technology.
The core of the project lies in exploring how the deliberate integration of complementary information sources, tailored to specific tasks, can deepen our understanding of music. This includes investigating the effectiveness of machine learning models in analyzing multimodal music data, as well as the impact of engaging multiple sensory modalities (auditory, visual, tactile) on audience comprehension of musical contexts. By bridging the gap between scientific analysis and experiential understanding, this research aims to contribute to a more inclusive and accessible experience of music.
For more details on my research, visit my Google Scholar Profile or my University profile at RITMO.
Music, as an art form, is inherently multimodal. It involves a rich blend of sensory experiences: auditory (the music itself), visual (e.g., music videos, album covers), kinesthetic (e.g., movement and dance), and even physiological (e.g., heart rate, skin response). Human perception of music is shaped by these varied inputs, and our understanding of a musical work is informed not only by its auditory components, but also by its visual and contextual aspects. For example, watching a live performance or reading lyrics can enhance one’s interpretation of the piece, while interacting with music via playlists, recommendations, and discussions provides additional context that shapes our perception.
In contrast, machine models of music typically focus on isolated data modalities. Most music analysis systems rely primarily on audio, analyzing waveforms or spectrograms to extract low-level features such as pitch, rhythm, and timbre, as well as high-level descriptors like genre and emotional content. This narrow focus, however, does not capture the full complexity of music: machines are routinely asked to perform retrieval, tagging, and recommendation from the audio signal alone, ignoring the complementary information offered by non-audio modalities such as visual and textual data.
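To make the audio-only pipeline described above concrete, here is a minimal sketch of the kind of feature extraction such systems perform. It uses the librosa library; the file name and the choice of features are illustrative assumptions, not part of my own pipeline.

```python
import librosa
import numpy as np

# Load the waveform (mono, librosa's default 22,050 Hz sampling rate).
# "track.wav" is a placeholder for any local audio file.
y, sr = librosa.load("track.wav", mono=True)

# Low-level descriptors computed from the signal / spectrogram:
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)      # rhythm: global tempo estimate
chroma = librosa.feature.chroma_stft(y=y, sr=sr)    # pitch-class content
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # timbre
mel = librosa.feature.melspectrogram(y=y, sr=sr)    # time-frequency representation

# A typical audio-only system summarizes these frame-wise features
# (e.g., by mean-pooling over time) before feeding them to a classifier
# for genre, mood, or tag prediction.
features = np.concatenate([
    np.atleast_1d(tempo),
    chroma.mean(axis=1),
    mfcc.mean(axis=1),
])
print(features.shape)
```

Everything a system like this "knows" about a track is derived from the signal itself, which is exactly the limitation the multimodal approach below tries to address.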
The core motivation of my research is to develop a more human-like, multimodal approach to music understanding, where audio is analyzed in conjunction with other sensory modalities to create a richer, more contextual understanding of music. The end goal of this research is to enable machines to interact with and understand music in a way that mirrors human perception, fostering more natural and transparent human-computer interaction in the music domain.
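As a rough illustration of what "analyzing audio in conjunction with other modalities" can look like in practice, the sketch below shows one common fusion strategy (late fusion of an audio embedding with a text embedding, e.g., lyrics or user tags) in PyTorch. This is not the specific model developed in my project; the class name, dimensions, and layers are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class LateFusionTagger(nn.Module):
    """Fuses an audio embedding and a text embedding for multi-label music tagging."""

    def __init__(self, audio_dim=128, text_dim=300, hidden_dim=256, n_tags=50):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)  # project audio features
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project text features
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, n_tags),              # joint prediction head
        )

    def forward(self, audio_emb, text_emb):
        # Concatenate the two projected modalities before classification.
        fused = torch.cat([self.audio_proj(audio_emb), self.text_proj(text_emb)], dim=-1)
        return self.classifier(fused)

# Example forward pass with random stand-in embeddings.
model = LateFusionTagger()
audio_emb = torch.randn(4, 128)  # e.g., pooled spectrogram features
text_emb = torch.randn(4, 300)   # e.g., averaged word embeddings of lyrics
logits = model(audio_emb, text_emb)
print(logits.shape)  # torch.Size([4, 50])
```

The point of the sketch is the design choice, not the architecture: the text branch contributes contextual information (lyrics, tags, descriptions) that no amount of audio analysis alone can recover.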
If you're interested in exploring multimodal machine learning applied to music, I have curated a GitHub repository with a collection of multimodal music datasets, which can serve as a valuable resource for anyone working on music-related multimodal research.