An introduction to my PhD project
My PhD examines how computers can better understand music by drawing on more than just audio. I start from how we humans make sense of music: not only by listening, but also by moving to the beat, watching visuals, reading text and lyrics, picking up cultural cues, and observing the physical behavior of performers. This multidimensional integration is called multimodal perception. When we combine recordings of multiple data sources in a computational system, we call it multimodal fusion, and it is a way to model music in a deeper and more human-like manner.
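To make that concrete, here is a minimal sketch of what feature-level fusion can look like in code. The feature vectors and their values are invented for illustration; in practice they would come from real audio, motion-capture, and text analysis.

import numpy as np

# Hypothetical per-modality feature vectors for one musical excerpt
# (the values are placeholders, not real measurements).
audio_features = np.array([0.42, 0.11, 0.87])   # e.g. timbre/rhythm descriptors
motion_features = np.array([0.05, 0.93])        # e.g. body-sway statistics
text_features = np.array([0.30, 0.61, 0.25])    # e.g. lyric embedding

# Early (feature-level) fusion: concatenate everything into one representation
# that a single model can learn from.
fused = np.concatenate([audio_features, motion_features, text_features])
print(fused.shape)  # (8,)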
This page explains the main research questions in my project, and gives you a visual “map” of how the ideas connect.
How can multimodality be defined and used in computational music analysis?
Different fields talk about multimodality in different ways. In my PhD, I propose a simple idea: use multiple data sources only when each one adds something new and meaningful to the task. This helps researchers choose the right combination of audio, video, motion, text, and metadata, depending on what they want to understand.
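One way to turn this principle into a procedure is an ablation-style check: keep a modality only if it measurably improves the task. The sketch below is purely illustrative; evaluate_task stands in for whatever model and metric the task actually uses, and the scores are made up.

# Hypothetical ablation-style check: keep a modality only if adding it
# actually improves the task score. `evaluate_task` is a placeholder for
# training and evaluating a model on the chosen modalities.
def select_modalities(candidates, evaluate_task, min_gain=0.01):
    chosen = []
    best_score = 0.0
    for modality in candidates:
        score = evaluate_task(chosen + [modality])
        if score - best_score >= min_gain:  # the modality adds something new
            chosen.append(modality)
            best_score = score
    return chosen

# Example with a toy scoring function (made-up numbers):
toy_scores = {("audio",): 0.70, ("audio", "video"): 0.78,
              ("audio", "video", "metadata"): 0.78}
evaluate = lambda mods: toy_scores.get(tuple(mods), 0.0)
print(select_modalities(["audio", "video", "metadata"], evaluate))
# ['audio', 'video'] -- metadata is dropped because it adds no measurable gain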
How much can technology enhance the contextual understanding of music?
I worked with several multimodal datasets, spanning folk music, classical music, and ensemble performance and including audio, motion capture, lyrics, and metadata, to see whether machines can answer questions like:
What is Music Performance?
Let's just say that music performance is music in action. It’s when performers turn compositions, improvisations, or musical ideas into sound, movement, and expression for an audience. Think of it as storytelling with sound, gestures, and emotion.
What is Music Performance Context?
Context is everything around the music that shapes its meaning: the venue, the audience, the gestures of performers, cultural rules, and even the placement of instruments. It’s what makes the same piece of music feel different in a concert hall, a jazz club, or a folk festival.
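For a computer, context like this has to become machine-readable. One hypothetical way to encode it is as structured fields stored alongside the recording, for example:

# Hypothetical, simplified encoding of performance context as structured data.
performance_context = {
    "venue": "concert hall",          # vs. jazz club, folk festival, ...
    "audience": "seated, quiet",
    "performer_gestures": ["bow lift", "head nod"],
    "cultural_setting": "Western classical recital",
    "instrument_placement": {"violin": "front left", "piano": "centre"},
}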
What is Music Understanding?
Music understanding begins when we stop simply hearing sound around us and start listening with attention. Hearing is passive and happens automatically. Listening is active and happens when we focus, follow, feel, and make sense of what unfolds in the music. Understanding music isn’t about knowing fancy terminology or analyzing scores, but about noticing what the music is doing, how it moves, how it feels, and how our own experiences help us interpret it.
People often imagine that only trained musicians or musicologists can “understand” music, but everyone can! A person may not know what a motif or cadence is, but they can still feel repetition, arrival, contrast, or surprise. They can follow a musical idea as it evolves, or sense the energy in a performer’s gesture. Understanding music is less about naming things and more about experiencing how music flows and how it speaks.
What is Music Information Processing?
Computational music information processing is about turning the physical experience of music into digital data, and then into forms of machine cognition, so that computers can help us explore, analyze, and understand music in new ways.
metadata = {
    "tempo": 95,
    "key": "A minor",
    "instrumentation": ["violin", "laouto", "voice"],
}

question = "What is the key of the piece?"

# Look for the word "key" in the question and answer it from the metadata.
if "key" in question.lower():
    answer = metadata["key"]
else:
    answer = "Not sure yet!"

print(answer)  # A minor
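The snippet above answers a question from hand-written metadata. The step before that, turning the sound itself into data, can be sketched with an audio analysis library such as librosa; the file name below is a hypothetical placeholder.

import librosa

# Hypothetical audio file; replace with a real recording to run this.
y, sr = librosa.load("folk_tune.wav")

# Estimate tempo and beat positions directly from the signal.
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)

# Summarise timbre with MFCCs, a common audio descriptor.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print("Estimated tempo:", tempo)
print("MFCC matrix shape:", mfcc.shape)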
Is my dataset multimodal?
This interactive tool helps you explore which music processing tasks you can perform with your dataset, or which data you need for a target task.
Step 1 — What task are you performing?
Step 2 — What data does your dataset contain?
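As a rough sketch of the idea behind such a tool (the task-to-data mapping below is simplified and made up, not the rules this page actually uses):

# Simplified, hypothetical mapping from tasks to the data they typically need.
TASK_REQUIREMENTS = {
    "beat tracking": {"audio"},
    "gesture analysis": {"video", "motion capture"},
    "lyric transcription": {"audio", "text"},
}

def check_dataset(task, available_data):
    needed = TASK_REQUIREMENTS[task]
    missing = needed - set(available_data)
    if missing:
        return f"Missing for '{task}': {', '.join(sorted(missing))}"
    return f"Your data covers '{task}'."

# Step 1: pick a task; Step 2: list what your dataset contains.
print(check_dataset("gesture analysis", ["audio", "video"]))
# Missing for 'gesture analysis': motion capture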
A second interactive test helps you assess whether your dataset management aligns with ethical and legal requirements.