CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining

Abstract

Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs’ comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on a commercial music streaming platform.

LLM-Based Prompting

Prompt Structure

Category Details
System You are a music critic. Please provide a detailed description of the song's musical semantics based on the information I provide, including the artist name and song title. The paragraph should cover the following elements: genre, melody, accompaniment, themes, and emotions.
User Artist: [ARTIST], Title: [TITLE]

Generated Examples by GPT-4o-mini

Example 1: Coldplay - Viva la Vida

"Viva la Vida" by Coldplay is a poignant blend of alternative rock and baroque pop, characterized by its sweeping orchestral arrangement and anthemic quality. The melody is both uplifting and melancholic, driven by a memorable string section that weaves through the verses, creating a sense of grandeur. The accompaniment features a combination of lush strings and rhythmic percussion, providing a rich backdrop that enhances the song's emotional depth. Lyrically, the song explores themes of power, loss, and redemption, reflecting on the fleeting nature of fame and the inevitable decline that follows. The narrator's journey from a position of authority to a state of vulnerability evokes a profound sense of nostalgia and introspection. Emotions range from triumphant to sorrowful, as the chorus bursts forth with an almost euphoric energy, juxtaposed against the introspective verses. This dynamic interplay between joy and melancholy resonates deeply, making "Viva la Vida" a powerful anthem of reflection and resilience.

Example 2: Adele - Take It All

"Take It All" by Adele is a moving ballad rooted in the pop and soul genres, showcasing Adele's powerful vocal prowess and emotional depth. The song features a haunting piano melody that serves as a poignant backdrop, allowing her rich, emotive voice to take center stage. The sparse yet effective accompaniment gradually builds, incorporating strings that add a layer of intensity and drama. Thematically, "Take It All" delves into the complexities of heartbreak and vulnerability, capturing the feelings of loss and surrender. The lyrics convey a sense of resignation, reflecting a deep emotional struggle and the desire to relinquish what once was, even when it hurts. Emotionally, the song evokes a profound sense of sadness and longing, perfectly encapsulated in Adele's soulful delivery. Her ability to convey raw emotion through her voice resonates deeply with listeners, making the experience both intimate and universal. Overall, "Take It All" stands as a testament to Adele's artistry, blending exquisite melody with heartfelt themes that leave a lasting impact.

Example 3: DJ Khaled featuring T-Pain, Ludacris, Snoop Dogg, and Rick Ross - All I Do Is Win

"All I Do Is Win" by DJ Khaled, featuring T-Pain, Ludacris, Snoop Dogg, and Rick Ross, is a quintessential hip-hop anthem that encapsulates the genre's celebratory spirit. The song features a catchy, infectious melody driven by a punchy, synth-heavy instrumental that propels the track forward with relentless energy. The production includes crisp snares and booming bass, creating an empowering backdrop that encourages listeners to embrace a winning mindset. Lyrically, the song revolves around themes of success, perseverance, and triumph over adversity. Each artist contributes their own verses that highlight personal victories and the determination to rise above challenges, creating a collective narrative of confidence and resilience. The repeated refrain of "All I do is win" serves as a rallying cry, fostering an atmosphere of motivation and celebration. Emotions conveyed in the track range from exuberance to defiance, making it a staple in both sports arenas and parties. The collaboration of these prominent artists adds depth and variety, with T-Pain’s melodic hooks, Ludacris’s rapid-fire delivery, Snoop Dogg’s laid-back charisma, and Rick Ross’s authoritative presence, all coming together to create an unforgettable, high-energy anthem that inspires listeners to embrace their own victories.

Music Similarity Retrieval Demo

Here, we present demo results based on the evaluation set of 50k samples selected from our in-house music collection. We evaluate the performance of three methods: open-source MTT (baseline), online scraping (ours), and online scraping with LLM-based prompting (ours).

Example 1

Seed Song: Justin Timberlake - Selfish

Baseline Online Scraping +LLM-Based Prompting

Example 2

Seed Song: SZA - Kill Bill

Baseline Online Scraping +LLM-Based Prompting

Example 3

Seed Song: Ed Sheeran - When Will I be Alright

Baseline Online Scraping +LLM-Based Prompting

Example 4

Seed Song: 许茹芸 Valen Hsu - 如果云知道 The Cloud Knows?

Baseline Online Scraping +LLM-Based Prompting

Example 5

Seed Song: 鲸鱼马戏团 Whale Circus - 落雨 Raindrop

Baseline Online Scraping +LLM-Based Prompting