Applying textual inversion to control and personalize text-to-music models by audio reference

By leveraging textual inversion, it is possible to expand the concept vocabulary of a pretrained text-to-music model without affecting the concepts on which it has already been trained.

Overview

Textual inversion is a popular method in text-to-image applications that allows new concepts, such as a subject or a style, to be introduced into a pretrained model: given a small set of examples of the desired output, a new text embedding is optimized to represent the target concept. The method does not change the original model parameters but instead appends new word embeddings to the input text-conditioning embedding matrix. We have adapted textual inversion to MusicGen - a language-model approach to text-to-music generation - in order to personalize and control music generation with audio examples as well as text in the prompts.
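The core mechanics can be illustrated with a minimal numeric sketch: the pretrained embedding table stays frozen, a single new row is appended for the pseudo-word, and gradient descent updates only that row. This is a toy stand-in (plain NumPy, a quadratic loss in place of the real reconstruction loss on reference audio); all variable names here are illustrative, not from our codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen text encoder's embedding table (vocab_size x dim).
# In the real setting this would be the text-conditioning embedding matrix.
vocab_size, dim = 8, 4
embeddings = rng.normal(size=(vocab_size, dim))
frozen_copy = embeddings.copy()

# Append one new row for the pseudo-word (e.g. "oud"); only this row is trained.
new_token = rng.normal(size=dim)

# Toy objective: pull the new embedding toward a "concept" vector that, in the
# real method, is defined implicitly by a loss on the reference audio examples.
target = rng.normal(size=dim)

lr = 0.1
for _ in range(200):
    grad = 2.0 * (new_token - target)  # d/dv ||v - target||^2
    new_token -= lr * grad             # gradient step on the new embedding only

# The pretrained vocabulary is untouched; only the appended row has moved.
table = np.vstack([embeddings, new_token])
assert np.allclose(table[:vocab_size], frozen_copy)
```

Because the original rows never receive gradients, generation with prompts that do not mention the pseudo-word is unaffected, which is what preserves the model's existing concept vocabulary.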

Curated examples

Here we provide selected system outputs from our experiments that we find musically interesting.

Input Prompt | Output | Reference
lofi slow bpm electro chill with organic samples oud
reggae track with an electric guitar solo hendrix
2010s hip-hop with hendrix and lo-fi beats
1920s blues with hendrix and electric guitars
2010s hip-hop with amen and lo-fi beats
a jazz song with a amen
60s rock with oud and drums
70s reggae with hendrix and piano
50s country with hendrix and pedal steel
50s country with whistle and pedal steel
lo-fi beats with a guitar and drums

Remix variations

We are able to synthesise variations of a finetuned concept. For example, we can sample techno remixes of a song.

Input Prompt | Output | Reference
a techno track with rickroll
a techno track with rickroll
a techno track with rickroll
a techno track with rickroll
a techno track with rickroll
a techno track with rickroll
a techno track with rickroll
a techno track with rickroll
a techno track with rickroll

Instrument variations

We can also create variations of a solo instrument recording. For example, a sitar player could get suggestions for new melody lines to try out when practicing.

Input Prompt | Output | Reference
a recording of a sitar
a recording of a sitar
a recording of a sitar
a recording of a sitar
a recording of a sitar
a recording of a sitar
a recording of a sitar
a recording of a sitar

Systems comparison

System outputs for demo examples from DreamSound. Note that we finetune only the input embeddings, whereas DreamSound finetunes the entire model per concept.

Input Prompt | Output | Baseline | Reference
a drum'n'bass beat with radiohead in the background
a drum'n'bass song with a morricone
a recording of vader in heavy metal style
a rock song with a bond bass riff
a heavy metal song with a eminem
a rock song with a sitar
a techno song with a beethoven
a techno track with rickroll
an industrial gabber techno song with a chant

Personalized guitar

We are able to personalize music generation with our own guitar sound and playing style included in the audio output (a personal goal for the first author, and the origin of this project).

Input Prompt | Output | Reference
lo-fi beats with a guitar and drums
lo-fi beats with a guitar and drums
lo-fi beats with a guitar and drums
lo-fi beats with a guitar and drums
lo-fi beats with a guitar and drums
lo-fi beats with a guitar and drums
lo-fi beats with a guitar and drums
lo-fi beats with a guitar and drums

Out-of-distribution concepts

When concepts are not present in the original training dataset, such as vocals in MusicGen, the results after textual inversion exhibit speech-like sounds with nonsensical lyrics and limited coherence. This is an expected limitation of tuning only the text conditioning, though it can still be used to explore new sounds.

Input Prompt | Output | Reference
eminem
smashmouth
90s techno with a crazyfrog