Though iTunes and Spotify disrupted the legacy music industry model, music discovery hasn’t evolved much since the advent of algorithmic recommendations. I’d say this is fertile ground for a challenger service like TIDAL to disrupt the status quo once again and reap the rewards.
Deep learning innovations have the power to drive disruption at that scale through Conversational Music Retrieval. Gone are the days of relying on textual searches or dry “for you” playlists. Users can finally have the music discovery experience they’ve always wanted: one built on conversation.
- Evolution of Music Discovery
  - Analog Era: Radio to Big Box Retail
    - Golden Age of Radio
    - 1950s and 60s
    - 1970s, 80s and 90s
  - Digital Revolution: Napster to iTunes to Spotify
  - Limitations of Current Paradigm
- Conversational Music Retrieval
  - Differences from traditional search and recommendation
  - Potential impacts on user experience
  - Potential impacts on businesses in the industry
- Technology Behind Conversational Music Retrieval
  - Deep learning for Music Information Retrieval
  - Natural Language Processing (NLP) in music context
  - The role of large language models
- Beyond Genre: Understanding Musical Nuance
  - Teaching AI to recognize mood, energy, and context in music
  - Incorporating cultural and historical knowledge
  - Challenge: Capturing the 'feel' of music
- Personalization: The DJ in Your Pocket
  - Learning user preferences over time
  - Balancing familiarity with discovery
  - Adapting recommendations based on context
- Future Trends of Music Discovery
  - Integration with voice assistants and smart speakers
  - Virtual music curators and AI DJs
  - Impact on music creation and artist discovery
- Ethical Considerations and Challenges
  - Avoiding echo chambers in recommendations
  - Protecting user privacy in conversational systems
  - Role of human curation alongside AI
- From DJ Booth to AI Lab: A Personal Journey
  - Bridging technical knowledge with musical intuition
  - Lessons from traditional DJing applicable to AI music retrieval
  - Vision for the future of music discovery
- Conclusions
- Further Discussion
- Call to Action
Evolution of Music Discovery
Music is fascinating. I felt this as a young boy praying to algorithmic gods my mixtape got streams. I felt it as a software engineer building new products at Amazon Music. I feel it still.
The era of recorded music began in 1877, when Edison unveiled the phonograph. Since then, collectors have hunted high and low for the rarest recordings.
What follows is my best attempt to condense nearly 150 years of culture into a few paragraphs.
Analog Era: Radio to Big Box Retail
Before digital computers, all sound and music were produced by analog means. Sharing a song meant being physically present, sending a physical good, or broadcasting over radio. Discovering new music intentionally was relatively challenging.
Golden Age of Radio
Music shows became popular on US radio in the 1920s. During the “Golden Age of Radio,” which lasted until the early 1950s, radio was the dominant form of big-budget mainstream entertainment. Fans not only discovered new music on their radios but did most of their listening there, too.
1950s and 60s
In the 1950s and 60s, Radio was joined by new mediums for music discovery and sharing:
- Disc jockeys, or DJs, started to play a stronger role in curating tracks for discovery
- Television variety shows like American Bandstand exposed fans to new acts
- Dining and entertainment venues added jukeboxes playing new music
Radio still held the lead, but the market was diversifying.
1970s, 80s and 90s
This era saw a relatively steady stream of profits flow into the recorded music industry as it continued to innovate like never before.
- In the 70s, radio stations split by genre, creating dense network effects around similar music
- The 70s also saw the rise of record stores as venues for social discovery
- In 1981, MTV launched, turning music videos into an artist discovery platform
- In the 80s, listeners began sharing custom mixtapes between friends
- In the 90s, big box retailers like Best Buy held heavy sway over record sales
But these market dynamics would be subverted suddenly in 1999, with the advent of peer-to-peer music sharing services like Napster.
Digital Revolution: Napster to iTunes to Spotify
In 1999, Napster was released. It allowed users to share audio files in a decentralized, peer-to-peer manner. This created a music discovery paradigm driven primarily by search: users found new music by searching and downloading from the results, while recommendations spread through message boards and word-of-mouth conversation.
Napster would soon be shut down by copyright lawsuits, but Apple pioneered a more equitable and legal marketplace with iTunes in 2001. Curated playlists and suggestions occupied the front page prominently, but for the most part search still reigned supreme, and many still relied on radio or word of mouth to find new music.
It was not until Spotify came along in 2008 that algorithmic music discovery would reach technological maturity, find product-market fit, and begin scaling into the household. Spotify’s generated “for you” mixes, playlists and innovative social features defined their era.
Limitations of Current Paradigm
The limitations of current music discovery approaches mirror broader societal problems and the technological shortcomings of the current era.
- Current algorithm dynamics heavily favor established artists. Songs take longer to reach the Billboard charts, and they stay there longer. Algorithmic recommendation has become a power-law game where entrenched winners take all. As a result, the “Golden Age” of viral artist discovery online has stabilized into a homogeneous mix of soundalike bands, AI EDM covers, and copycats ruling the discovery charts week after week.
- Algorithmic suggestions prioritize shallow engagement over deep matching. Platforms have one goal: keep you there, spending longer every day, like a time casino. I want music that challenges me, expands my horizons, and ultimately enriches my experience. Current music discovery tools aren’t even designed with that goal in mind, and tend to cluster you into an algorithmic “pocket” where they try to keep you.
- Algorithmic suggestions lack nuance. Where a good friend or a record store employee could, after a conversation, offer deeply insightful recommendations based on your tastes, algorithmic recommendations are often based on inferences from weak behavioral signals that never bridge the gap to what each song feels like to experience.
- An element of connection is missing. My father, my best friend from high school, or my favorite DJ can convince me to love an unfamiliar tune by sharing a moment of connection. Graphical User Interfaces by themselves do not evoke such emotion, at least not in me. A system that truly knows me, my preferences and my self, could do better.
- Search queries are unnatural. Keyword boxes force us to compress rich intent into a few terms; we need an approach that facilitates more natural expression.
Given these limitations and the opportunity for digital services like streaming to technologically innovate or face market disruption, it’s time for a brand new paradigm to take center stage: Conversational Music Retrieval.
Conversational Music Retrieval
Conversational Music Retrieval allows you to find music through conversational interactions. Users provide feedback and refine suggestions through natural language dialog.
Differences from traditional search and recommendation
Two key differences set apart a conversational experience from the more traditional search and recommendation interfaces users are presently accustomed to:
- Users explicitly state or refine their preferences. User preferences are currently collected largely from implicit behavioral signals, and there is often no way for the user to directly view or edit the preferences the algorithm has sorted them into, which is frustrating. Conversational music interfaces allow users to state their preferences plainly and directly (see the sketch below).
- Systems elicit preferences in natural language. Rather than complex light-up casinos made of snapping grids, design systems and walls of album art, recommender systems can directly elicit clarifications of preference from the user in natural language (textual and spoken).
These simple changes to a search and recommendation system have broad-reaching impacts.
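To make that first difference concrete, here’s a minimal sketch of folding plainly stated preferences into a profile the user could actually view and edit. All names and the keyword routing are hypothetical; a production system would parse statements with an intent classifier or LLM.

```python
# Minimal sketch: folding explicit natural-language statements into a
# user-visible, user-editable preference profile. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class PreferenceProfile:
    liked: set[str] = field(default_factory=set)
    disliked: set[str] = field(default_factory=set)
    rules: list[str] = field(default_factory=list)  # contextual rules, kept human-legible

    def apply_statement(self, statement: str) -> None:
        # Naive keyword routing; real parsing would use an intent model or LLM.
        s = statement.lower()
        if s.startswith("i don't like "):
            self.disliked.add(s.removeprefix("i don't like ").rstrip("."))
        elif s.startswith("i like "):
            self.liked.add(s.removeprefix("i like ").rstrip("."))
        else:
            self.rules.append(statement)

profile = PreferenceProfile()
profile.apply_statement("I don't like country music")
profile.apply_statement("Don't play sad music early in the morning")
print(profile.disliked)  # {'country music'}
print(profile.rules)     # ["Don't play sad music early in the morning"]
```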
Potential impacts on user experience
I anticipate the following potential positive impacts:
- Conversational recommendation methodologies that consider sets of content, like playlists or albums, rather than individual search results lead users to express preferences they would not otherwise express. As a result, the recommender can better understand the user and therefore make better suggestions on their behalf.
- Users have the opportunity to directly clear up misconceptions, mis-categorizations, and ambiguity with direct natural language statements such as “I don’t like country music” or “Don’t play sad music early in the morning.” This is more efficient than attempting to derive such nuanced preferences from analyzing the entire set of implicit behavioral signals.
- A conversational interface can provide recommendations that equal or exceed the specificity and quality of a record store with friendly, knowledgeable staff, something current platforms haven’t managed.
Conversational interfaces enable more personal and more personable UX for music.
Potential impacts on businesses in the industry
While the benefits of improved UX to the platform owners should be intuitive, there are other benefits to the development of this technology that industry insiders should not ignore:
- Secure competitive advantage. Spotify rapidly became the #1 music streaming service due to innovating a better way to discover music. TikTok took over consumer social for the same reason. Any music service that is not #1 has an incentive to disrupt Spotify by revolutionizing music discovery user experience. Spotify has an incentive to defend their moat.
- Improved forecasting from richer preference data. Artists, labels, and streaming platforms invest a lot of money into figuring out listener preferences, analyzing trends in those preferences, and predicting what’s next. Plain, natural-language user preferences would serve to demystify these predictive markets significantly.
- Meet underserved user preferences with undiscovered inventory. As a listener with obscure tastes in music, finding the right tunes for a certain moment or mood can still be a challenge with traditional discovery tools. As a musician and uploader of content, I own a catalog of mostly undiscovered material that performed well in initial market testing but never reached widespread algorithmic distribution. Labels often have the same issue: undiscovered artists with large catalogs of content they invested significant sums to develop. There is market value to be captured in connecting obscurist listeners with obscure musicians, at a new scale and a new benchmark of relevance beyond a blog or playlist.
- Improved recommendation relevance and clarity drives product innovation in the music industry, based on research by Gorgoglione, Garavelli, Panniello, and Natalicchio. This observed effect puts additional pressure on artists, labels, and studios to innovate, but overall creates a favorable economic growth trend for the music industry (which is already growing).
My assertion is that platform dynamics, user dynamics, and market incentives are aligned, meaning this wave of innovation is an inevitable trend if it proves technically viable.
Technology Behind Conversational Music Retrieval
Now that we have historical context and understand what makes these systems potentially superior to the ones we have today, let’s talk a bit more about the tech making things work.
Deep learning for Music Information Retrieval
An active topic of research is Deep Learning architectures for Music Information Retrieval. I don’t yet have the context to perform a rigorous literature review, but I’ll do my best to summarize what I’ve gleaned from “Key Technologies of Music Information Big Data Storage and Retrieval” by Li and Huang:
- Music information retrieval systems can exceed the performance benchmarks of the current music streaming paradigm for search and discovery using Deep Learning architectures and richer contextual data.
- Exponentially increasing amounts of data collected from music streaming platforms hint at improving long-term prospects for Deep Learning based recommendation architectures and algorithms, per the Blessings of Scale.
- Similarity calculation methods have advanced rapidly but still have room for improvement (a minimal illustration follows).
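To ground that last bullet, here’s what the simplest form of similarity calculation looks like: cosine similarity over learned track embeddings. The four-dimensional vectors are toy values; real systems use hundreds of dimensions produced by audio encoders and/or collaborative filters.

```python
# Minimal sketch: ranking a catalog by cosine similarity to a query track,
# assuming each track already has a learned embedding from a deep model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dim embeddings; track names are placeholders.
catalog = {
    "track_a": np.array([0.9, 0.1, 0.3, 0.7]),
    "track_b": np.array([0.2, 0.8, 0.9, 0.1]),
    "track_c": np.array([0.8, 0.2, 0.4, 0.6]),
}
query = np.array([0.85, 0.15, 0.35, 0.65])

ranked = sorted(catalog, key=lambda t: cosine_similarity(catalog[t], query), reverse=True)
print(ranked)  # track_a and track_c rank above the dissimilar track_b
```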
But Deep Learning is a wide field; which specific methods will we use, and why are they optimal?
Natural Language Processing (NLP) in music context
Natural Language Processing is relevant to music in that users communicate their intent, or the musical experience they desire, via natural language in text or speech, or alternatively in a search query language. Conversational search eliminates the need for the user to break down their request into a search query language, potentially providing much richer context.
Conversational search research currently provides an interesting roadmap into topics like how multimodal techniques could improve results, how to develop more sophisticated user modeling, and how to integrate novel analytics by applying NLP to conversational search. Coupled with an expert Music Information Retrieval system that deeply understands the set of all recorded music, we can see the building blocks of a new paradigm emerging.
Conversational Music Retrieval systems in particular have shown good results paired with synthetic data generation techniques. This implies scalability compared to labor-intensive methods of data generation such as Reinforcement Learning from Human Feedback (RLHF), though you’d likely want to RLHF as much as you could afford, anyway. Overall, the landscape for NLP in the context of conversation and music is wide and promising.
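As a rough illustration of the synthetic-data idea, the sketch below asks an LLM to invent the user request that a known track would answer, yielding (request, recommendation) training pairs without human labelers. It assumes the openai Python client; the model name, prompt, and catalog entry are placeholders.

```python
# Minimal sketch: generating synthetic (user request -> recommendation) training
# pairs from catalog metadata with an LLM, instead of paying human labelers.
from openai import OpenAI

client = OpenAI()  # API key read from the OPENAI_API_KEY environment variable

catalog_entry = {"title": "Midnight Drive", "mood": "moody", "genre": "synthwave", "bpm": 96}

prompt = (
    "Write one casual message a listener might send to a music assistant that "
    f"this track would perfectly answer: {catalog_entry}. Reply with only the message."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": prompt}],
)
synthetic_request = response.choices[0].message.content
training_pair = (synthetic_request, catalog_entry["title"])  # add to the fine-tuning set
print(training_pair)
```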
The role of large language models
How are generative AI technologies like Large Language Models involved? Enter MuseChat.
- MuseChat is an experimental conversational music recommendation system for videos.
- Recommendations are focused on user preference more so than content compatibility.
- MuseChat can interact with users for further refinements or to provide explanations of why music was recommended, leading to a more informative and satisfying user experience.
- Experimental results show that MuseChat achieves significant improvements over existing video-based music retrieval methods while offering strong interpretability and interactivity.
MuseChat utilizes a Vicuna-7B large language model, modified to accept multi-modal inputs, within the Reasoning module responsible for generating recommendations and interpretable explanations. Such strong performance from such a small parameter count is a promising result for LLMs and their role in reasoning within Conversational Music Retrieval systems.
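I won’t reproduce MuseChat’s implementation here, but the general multi-modal pattern, projecting music or video embeddings into the LLM’s hidden dimension so they can sit alongside text tokens, looks roughly like this sketch in PyTorch. Dimensions and module names are illustrative, not taken from the paper.

```python
# Rough sketch of the multi-modal pattern: project music/video embeddings into
# the LLM's hidden dimension so they can sit alongside text token embeddings.
import torch
import torch.nn as nn

class MultiModalProjector(nn.Module):
    def __init__(self, music_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(music_dim, llm_dim)  # 4096 is Vicuna-7B's hidden size

    def forward(self, music_embedding: torch.Tensor) -> torch.Tensor:
        # One "soft token" per music clip, prepended to the text embeddings
        return self.proj(music_embedding)

projector = MultiModalProjector()
music_emb = torch.randn(1, 512)         # from a pretrained music encoder (illustrative)
soft_token = projector(music_emb)       # shape (1, 4096), ready for the LLM
text_tokens = torch.randn(1, 12, 4096)  # embedded text prompt (illustrative)
llm_input = torch.cat([soft_token.unsqueeze(1), text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 13, 4096])
```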
Beyond Genre: Understanding Musical Nuance
Current music analysis techniques often rely on the concept of linear genre. I’d argue the 1:1 relationship of genre to song was broken long ago with fusion genres cross-contaminating with pop music. Current tastes are an eclectic medley.
Discovery now operates more so on nuance and feel, which can be better represented by natural language in a conversational context than by mere search queries or similarity calculations.
Teaching AI to recognize mood, energy, and context in music
While Spotify’s algorithm considers rich and nuanced metadata such as mood, energy, and context, most streaming platforms do not have such advanced analytics. Furthermore, a conversational interface could unpack ideas like mood, energy, and context far better than simply assigning a numerical score to each value.
For an example of how music is currently scored and analyzed, see an analysis from one of my own tracks below.
Ouch… 0 popularity? I’m washed up! Let me know on X/Twitter if you enjoyed that tune.
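For the curious, here’s roughly how those numbers can be pulled programmatically, a minimal sketch using the spotipy client. The track ID is a placeholder, not my track, and credentials are assumed to be set in environment variables.

```python
# Minimal sketch: pulling Spotify's audio features and popularity for a track.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Reads SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET from environment variables
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

track_id = "3n3Ppam7vgaVa1iaRUc9Lp"  # placeholder track ID
features = sp.audio_features([track_id])[0]
track = sp.track(track_id)

print(f"popularity:   {track['popularity']}")            # 0-100 relative popularity
print(f"energy:       {features['energy']:.2f}")         # 0.0-1.0 perceived intensity
print(f"valence:      {features['valence']:.2f}")        # 0.0-1.0 musical positivity
print(f"danceability: {features['danceability']:.2f}")
print(f"tempo:        {features['tempo']:.0f} BPM")
```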
Incorporating cultural and historical knowledge
If you’ve ever heard the term “culture vulture,” you know that incorporating cultural and historical knowledge properly as an artist is an important consideration of taste. This knowledge is currently captured by platforms in an inexact manner through measuring behavioral signals of users. A conversational reasoning engine can explicitly process and interpret this information and use it to deliver recommendations more aligned with the user’s preferences and context.
Challenge: Capturing the 'feel' of music
All of these dimensions of search and recommendation seek to capture one simple thing: what does it feel like to experience the music, and how would a user react to this piece of music based on our model of their internal world? In a word, we are attempting to measure qualia.
At best, we can interpret this information indirectly, at least until we solve the hard problem of consciousness. But this quest will also give us valuable insights into the nature of our experiences, our preferences, and ultimately, minds - natural and synthetic.
Personalization: The DJ in Your Pocket
A key value add of AI technology is infinite personalization. Specifically, I can have a personalized experience catered to me by an agentic DJ, one who knows me and understands my preferences, even in scenarios where it makes no economic sense for a human to perform this labor in real time.
Let’s examine some of the emergent value produced by “the DJ in your pocket.”
Learning user preferences over time
Nuance is often best captured by observing a subject over time. Users don’t always feel like giving their full context, and may not be able to lay it out plainly in words. Observation over time also grants the opportunity to compare expressed versus revealed preferences and to observe fluctuations or oscillations in user preferences. The sum result is a dataset that could support more effective user preference modeling for increasingly resonant and personalized recommendations.
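A minimal sketch of that observation-over-time idea: keep separate exponentially weighted running averages of what the user says and what the user actually plays, and watch the gap between them. All values here are toy numbers.

```python
# Minimal sketch: tracking expressed vs. revealed preferences as exponential
# moving averages over track embeddings. Embeddings and weights are illustrative.
import numpy as np

ALPHA = 0.1  # smoothing factor; smaller = slower-moving, longer memory

expressed = np.zeros(4)  # updated when the user *says* something ("more jazz")
revealed = np.zeros(4)   # updated by what the user actually *plays* to the end

def update(avg: np.ndarray, track_embedding: np.ndarray) -> np.ndarray:
    return (1 - ALPHA) * avg + ALPHA * track_embedding

# Simulated stretch of listening: the user asks for jazz but finishes pop tracks.
jazz = np.array([1.0, 0.0, 0.0, 0.0])
pop = np.array([0.0, 1.0, 0.0, 0.0])
for _ in range(20):
    expressed = update(expressed, jazz)
    revealed = update(revealed, pop)

gap = np.linalg.norm(expressed - revealed)
print(f"expressed-vs-revealed gap: {gap:.2f}")  # a large gap is worth probing in conversation
```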
Balancing familiarity with discovery
While marketing a new song, labels traditionally track how many repetitions of a track are required to convert a listener into a fan of the artist. Effective songs earn frequent repeat streams and convert listeners to fans rapidly.
At the same time, popular music and other samey genres run the risk of being “overplayed” easily. A simplistic, repetitive melody can be sweeter than candy in the right context, or grating and shrill if overdone.
Similarly, recommendations must balance a model of the user’s context and basis of familiarity with the curated sensation of discovering new joys.
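The crudest version of that balance is an exploration knob. Here’s a minimal epsilon-greedy sketch with hypothetical track pools; the interesting twist is that a conversational system can ask the user where to set the knob.

```python
# Minimal sketch: an epsilon-greedy balance between familiar favorites and
# fresh discoveries. Pools and the knob value are illustrative.
import random

EPSILON = 0.2  # fraction of plays reserved for discovery

familiar_pool = ["old favorite 1", "old favorite 2", "old favorite 3"]
discovery_pool = ["new artist A", "new artist B", "deep cut C"]

def next_track() -> str:
    if random.random() < EPSILON:
        return random.choice(discovery_pool)  # explore: risk the unfamiliar
    return random.choice(familiar_pool)       # exploit: play the known favorite

playlist = [next_track() for _ in range(10)]
print(playlist)
# A conversational system can move EPSILON itself: "feeling adventurous today?"
```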
Adapting recommendations based on context
An ideal search or recommender could account for contextual information such as:
- Time
- Activity
- Location
These datapoints can be shared easily through a conversational interaction, if and when the user wants to share, creating an experience as personalized as the user desires.
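Here’s a minimal sketch of how shared context might fold into track scoring; the rules and weights are invented purely for illustration.

```python
# Minimal sketch: adjusting track scores with optional, user-shared context.
# Rules and weights are invented for illustration.
from typing import Optional

def contextual_score(base_score: float, energy: float,
                     hour: Optional[int] = None,
                     activity: Optional[str] = None) -> float:
    score = base_score
    if hour is not None and hour < 8:
        score -= 0.3 * energy  # ease off high-energy tracks early in the morning
    if activity == "workout":
        score += 0.5 * energy  # lean into intensity at the gym
    return score

# Same track, different shared context, different rank:
print(contextual_score(0.8, energy=0.9, hour=6))              # 0.53
print(contextual_score(0.8, energy=0.9, activity="workout"))  # 1.25
```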
Future Trends of Music Discovery
Based on my education and experiences in industry so far, here are my most defensible predictions regarding the future directions of music discovery as an industry vertical.
Integration with voice assistants and smart speakers
Integrations with built-in voice assistants and physical smart devices offer streaming platforms a meaningful way to extend their reach into the household:
- Amazon Music captured the #3 streaming service slot by active user count with an Alexa integration during my time there. Users with smart devices love deep integration.
- OpenAI is presently in talks with Apple to extend Siri’s functionality. Integrating general and expert capabilities into the core built-in voice assistants is a strategy giants like Apple and Google seem amenable to. Such a direction could birth companies and even industries.
- Current generation AI hardware has generally been panned by critics and early adopters alike. Perhaps purpose-built devices covering narrower use cases might be the first to break through into mainstream adoption, such as a smart speaker with an intelligent and functional conversational search and music information retrieval service.
Far-future applications here will range from sci-fi to wacky. Expect intelligent AI DJ equipment, musical instruments, and producer toys - not least because I’m building them.
Virtual music curators and AI DJs
I predict that niche tastes which are difficult for human curators and DJs to model or keep up with will spawn economic opportunity for virtual music curators and AI DJs akin to VTubers. Some exist already, and I’ve experimented in the space as well. While I can’t yet say what the key is to creating an interesting virtual character that captures the imagination of millions, I expect many venture capitalists, founders, artists, and brands to enter the arena and try stuff.
Impact on music creation and artist discovery
Current suggestive algorithms have heavily molded listener preferences, artist career strategy, and creative dynamics in the business of music at large. A new paradigm of equal or greater power is likely to have similar effects, possibly on an exponential scale. The “meta” of getting discovered as an artist is constantly changing, and the only constant is change itself. This is true of the worlds of art and technology independently, and never more so than when they combine.
I suppose my analysis of the available data is that art is technology, technology is an art, and both can be understood better through the scientific method. Artists, scientists, technologists and humans at large will need to continue to innovate to stay relevant, and that’s a good thing.
Ethical Considerations and Challenges
While I am pro-AI in general and pro-Conversational Music Retrieval in specific, as professionals it is our duty to mind ethical concerns surrounding a proposed paradigm shift. Thankfully, showing people cool songs seems quite unlikely to result in sci-fi doomsday type scenarios.
Avoiding echo chambers in recommendations
One of the problems with current recommendation systems I pointed out is:
Current algorithm dynamics favor established artists heavily.
In a phrase, “the rich get richer, and the poor get poorer.” It’s not hard to imagine a similar dynamic could be created to optimize content for conversational recommendations, similar to the discipline of Search Engine Optimization arising from the market demand for search.
To preserve the advantages of taking this approach in the first place, a conversational system must be designed to be resilient to cartel capture, to preserve the personal relevancy of recommendations, and to actively adapt against adversarial platform-manipulation tactics, such as guerilla marketing campaigns launched with the intent of creating echo chambers and attractor basins in the discovery experience.
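One standard defense against that collapse is re-ranking for diversity rather than pure relevance. Below is a minimal maximal marginal relevance (MMR) sketch with toy embeddings; a near-duplicate of the top pick gets displaced by a relevant but different track.

```python
# Minimal sketch: maximal marginal relevance (MMR) re-ranking to keep
# recommendations from collapsing into an echo chamber. Toy embeddings.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_rerank(candidates: dict, relevance: dict, k: int, lam: float = 0.7) -> list:
    """lam trades off relevance (1.0) against diversity (0.0)."""
    selected, remaining = [], set(candidates)
    while remaining and len(selected) < k:
        def mmr_score(t):
            # Penalize similarity to anything already selected
            redundancy = max((cosine(candidates[t], candidates[s]) for s in selected),
                             default=0.0)
            return lam * relevance[t] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

cands = {"hit clone 1": np.array([1.0, 0.0]),
         "hit clone 2": np.array([0.99, 0.05]),
         "left-field pick": np.array([0.1, 1.0])}
rel = {"hit clone 1": 0.95, "hit clone 2": 0.94, "left-field pick": 0.70}
print(mmr_rerank(cands, rel, k=2))  # the left-field pick displaces the near-duplicate
```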
Protecting user privacy in conversational systems
One advantage of these systems is actually a trade-off:
Improved forecasting from richer preference data.
Because these systems would collect richer preference data through semi-structured free-response text, there are additional information security responsibilities on behalf of the platform. These include monitoring for Personally Identifiable Information (PII), anonymizing or removing it, and protecting this valuable data from attackers.
In addition, responsible consumer-facing and multi-tenant B2B applications will ensure architecture-level isolation to prevent confidential or sensitive user preference data from leaking between customers. Privacy as a concern is an area of active inquiry within the field of AI.
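For a flavor of the PII-monitoring step, here’s a minimal redaction sketch. Real deployments lean on NER models and far broader patterns; these two regexes are purely illustrative.

```python
# Minimal sketch: scrubbing obvious PII from conversational logs before storage.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text

message = "Email me the playlist at fan@example.com or text 404-555-0123!"
print(redact(message))
# Email me the playlist at [EMAIL REDACTED] or text [PHONE REDACTED]!
```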
Role of human curation alongside AI
As a current hobbyist and former professional DJ, I take the human role in music discovery very seriously. I enjoy scanning through hundreds of tracks to pick out the hidden gems. I enjoy spending hundreds of hours every year cutting out the best moments from each track to make a megamix. And most of all I enjoy showing my friends the fruits of my music hunting adventures.
The music discovery experience of tomorrow will embrace and enhance this desire, not replace it. A conversational interface can allow me to do my job with great efficiency and also allow my friends to more effectively find music they enjoy for themselves. Conversational interfaces can’t replace the bond and connection between us humans and our social experience of music.
From DJ Booth to AI Lab: A Personal Journey
Let’s bring it home with some personal examples from my own journey from a DJ Booth in Atlanta to being a Cognitive Systems Engineer at OPEN SOULS, an AI startup in San Francisco.
Bridging technical knowledge with musical intuition
Maybe I should have opened with this part, but here’s a bit about who I am and why I think you should listen to what I said above. I hold a Bachelor of Science and a Master of Science in Engineering from Mercer University, specializing in Computer Engineering. I’ve also been a producer, DJ, and vocalist for more than 10 years now. In the past, I’ve worked on new products at Amazon Music and, as a freelancer and consultant, at various music startups you may know. Today I work at an AI startup here in SF.
I’ve been studying the music game and the algorithms that drive discovery since I was a young hustler trying to “crack the code” and become famous overnight. I cracked a few minor viral hits in Hip-Hop and Pop genres and played small tours with national artists. In the process, I developed an appreciation for the economies of scale and the market dynamics at play. What began as a hobby became a professional interest over the course of decades.
Combining my experience as a musician with my experience in FAANG and at elite startups, I submit humbly: I’ve been thinking about Music Discovery a lot.
Lessons from traditional DJing applicable to AI music retrieval
More established or more professional DJs are typically haughty about taking requests. Because I generally play intimate venues, after parties or house parties, I don’t mind. I’ve actually found tons of music I still love to play from taking requests during my sets.
There’s an art to taking requests well, though. You must approach the conversation strategically. Compare the following two realistic examples of a DJ taking a request.
Listener: Hey! Do you have any A$AP Rocky in your Serato?
DJ: Yeah, coming right up, no problem!
A little stiff. Our DJ knows what artist to play next, but they haven’t discovered much explicit information about the listener. Perhaps it’s a song the listener’s friend wants to hear, or one they always play on their birthday, which happens to be tonight. Here’s a better example:
Listener: Hey! Do you have any A$AP Rocky in your Serato?
DJ: Of course I’ve got Flacko in my Serato! Do you want vibes or that jiggy ish?
Listener: Play “Lord Pretty Flacko Jodye 2”! It’s my best friend’s birthday and she loves it!
DJ: Great, coming up in a few songs! Should I play more New York artists for your friend?
Listener: Omg yes!!! That would be amazing!!!
While I’m likely dating myself by revealing my idea of “currently hip slang,” I hope the potential value of interrogating user preferences is evident. There are certainly unprecedented technical challenges in producing a conversational music retrieval system with this level of verbal acuity, natural dialog, and contextual knowledge about an arbitrarily large catalog of music. But the rewards of doing so are potentially immense from a product and market perspective.
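For fun, here’s the structured signal a conversational retrieval system might extract from that second exchange, sketched as a dialog state. The field names are hypothetical.

```python
# Sketch: the explicit information a conversational retrieval system could pull
# from the second exchange above. Field names are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequestState:
    requested_track: Optional[str] = None
    occasion: Optional[str] = None
    beneficiary: Optional[str] = None  # who the request is really for
    follow_up_preferences: list[str] = field(default_factory=list)

state = RequestState(
    requested_track="Lord Pretty Flacko Jodye 2",
    occasion="best friend's birthday",
    beneficiary="the listener's best friend",
    follow_up_preferences=["more New York artists"],
)
print(state)
```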
Vision for the future of music discovery
My bold future-state vision for music discovery is the total elimination of the current status quo. We should never have to filter through a stale list of uncanny-valley algorithmic recommendations to make a dope playlist ever again.
As a listener and fan of music, I demand a personalized assistant that deeply knows and engages with my preferences, allowing me to sculpt and craft the personalized digital record store experience of my dreams. I will not listen to the slop, I will not bow before the industry gatekeepers and tastemakers, and I will forge the future of my musical listening experience.
Consider the potential of Artificial Intelligence and Deep Learning as democratizers of discovery, and the uniquely personal experiences you could unlock with art and music by forging a deeper relationship with your music discovery paradigm through conversation.
Finally, DJs, Beatmakers, and DIY artists stand to benefit the most from having an expert conversational music retrieval system in the studio with them, especially coupled with a conditional generation model for audio samples.
Conclusions
I assert the great potential of conversational music retrieval methods. This technology has the potential to revolutionize music search, recommendation, and discovery as a market and as an institution. I call for further innovation in this under-explored area of research, both within academia and the world of startups. Experiment on your own and feel free to share with me what you come up with.
Further Discussion
- In what ways could conversational music retrieval make life easier for human DJs?
- Could a beatmaker, producer, or DIY artist benefit from the technology?
- Would more accurate search and retrieval capabilities create newly viable use cases for AI within the realm of music, generative audio and entertainment technology as a whole?
Call to Action
I got the idea to write this post based on an X post about a forthcoming paper by Keunwoo Choi and Juhan Nam as part of ISMIR 2024! Stay tuned for that paper, and show them your support!