By leveraging textual inversion, it is possible to expand the concept vocabulary of a pretrained text-to-music model without affecting the concepts on which it has already been trained.
Textual inversion is a popular method in text-to-image applications that allows new concepts, such as a subject or style, to be introduced into a pretrained model by providing a small set of desired output examples and optimizing a new text embedding to target the concept. The method does not change the original model parameters; instead, it appends new word embeddings to the embedding matrix of the input text conditioning. We have adapted textual inversion to MusicGen, a language-model approach to text-to-music generation, in order to personalize and control music generation with audio examples as well as text in the prompts.
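The mechanism described above (freeze the pretrained embeddings, append one new row, optimize only that row) can be illustrated with a small toy sketch. It is not the training code used with MusicGen: the placeholder token `<my-guitar>`, the tiny vocabulary, the mean-pooled conditioning, the squared-error loss, and the target vector are all assumptions made for illustration, standing in for the real text encoder and generation objective.

```python
import random

random.seed(0)
DIM = 4

# Frozen "pretrained" embedding table (toy stand-in for the text conditioner's input embeddings).
vocab = {"lo-fi": 0, "beats": 1, "with": 2, "guitar": 3}
embeddings = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in vocab]
frozen_copy = [row[:] for row in embeddings]

# Append one new row for a hypothetical placeholder token; only this row is trained.
vocab["<my-guitar>"] = len(embeddings)
embeddings.append([random.uniform(-1, 1) for _ in range(DIM)])
new_id = vocab["<my-guitar>"]


def condition(prompt):
    """Mean-pool token embeddings (toy stand-in for the real text conditioning)."""
    ids = [vocab[token] for token in prompt.split()]
    pooled = [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(DIM)]
    return pooled, ids


def loss_and_grad(prompt, target):
    """Squared error to a target vector, and its gradient w.r.t. the new row only."""
    pooled, ids = condition(prompt)
    diff = [p - t for p, t in zip(pooled, target)]
    loss = sum(x * x for x in diff)
    # d(loss)/d(new row) = 2 * diff * (occurrences of the new token / prompt length)
    scale = 2.0 * ids.count(new_id) / len(ids)
    return loss, [scale * x for x in diff]


# Hypothetical target; in practice the training signal comes from the audio examples.
target = [0.5, -0.2, 0.1, 0.9]
prompt = "lo-fi beats with <my-guitar>"

initial_loss, _ = loss_and_grad(prompt, target)
for _ in range(200):
    _, grad = loss_and_grad(prompt, target)
    # Gradient step on the new embedding only; all pretrained rows stay untouched.
    embeddings[new_id] = [w - 0.5 * g for w, g in zip(embeddings[new_id], grad)]
final_loss, _ = loss_and_grad(prompt, target)
```

Because the update touches only the appended row, prompts that do not contain the new token produce exactly the same conditioning as before, which is why the existing concept vocabulary is unaffected.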
Here we provide selected system outputs from our experiments that we find musically interesting.
Input Prompt | Output | Reference |
---|---|---|
lofi slow bpm electro chill with organic samples oud | ||
reggae track with an electric guitar solo hendrix | ||
2010s hip-hop with hendrix and lo-fi beats | ||
1920s blues with hendrix and electric guitars | ||
2010s hip-hop with amen and lo-fi beats | ||
a jazz song with a amen | ||
60s rock with oud and drums | ||
70s reggae with hendrix and piano | ||
50s country with hendrix and pedal steel | ||
50s country with whistle and pedal steel | ||
lo-fi beats with a guitar and drums | ||
We are able to synthesise variations of a finetuned concept. For example, we can sample techno remixes of a song.
Input Prompt | Output | Reference |
---|---|---|
a techno track with rickroll | ||
a techno track with rickroll | ||
a techno track with rickroll | ||
a techno track with rickroll | ||
a techno track with rickroll | ||
a techno track with rickroll | ||
a techno track with rickroll | ||
a techno track with rickroll | ||
a techno track with rickroll | ||
We can also create variations of a solo instrument recording. For example, a sitar player could get suggestions for new melody lines to try out when practicing.
Input Prompt | Output | Reference |
---|---|---|
a recording of a sitar | ||
a recording of a sitar | ||
a recording of a sitar | ||
a recording of a sitar | ||
a recording of a sitar | ||
a recording of a sitar | ||
a recording of a sitar | ||
a recording of a sitar | ||
Here are system outputs for the demo examples from DreamSound. Note that we finetune only the input embeddings, whereas DreamSound finetunes the entire model for each concept.
Input Prompt | Output | Baseline | Reference |
---|---|---|---|
a drum'n'bass beat with radiohead in the background | |||
a drum'n'bass song with a morricone | |||
a recording of vader in heavy metal style | |||
a rock song with a bond bass riff | |||
a heavy metal song with a eminem | |||
a rock song with a sitar | |||
a techno song with a beethoven | |||
a techno track with rickroll | |||
an industrial gabber techno song with a chant | |||
We are able to personalize music generation with our own guitar sound and playing style included in the audio output (a personal goal for the first author, and the origin of this project).
Input Prompt | Output | Reference |
---|---|---|
lo-fi beats with a guitar and drums | ||
lo-fi beats with a guitar and drums | ||
lo-fi beats with a guitar and drums | ||
lo-fi beats with a guitar and drums | ||
lo-fi beats with a guitar and drums | ||
lo-fi beats with a guitar and drums | ||
lo-fi beats with a guitar and drums | ||
lo-fi beats with a guitar and drums | ||
When a concept is not present in the original training dataset, as is the case for vocals in MusicGen, the results after textual inversion exhibit speech-like sounds with nonsensical lyrics and limited coherence. This is an expected limitation of tuning only the text conditioning, although it can still be useful for exploring new sounds.
Input Prompt | Output | Reference |
---|---|---|
eminem | ||
smashmouth | ||
90s techno with a crazyfrog | ||