Note: This is an unofficial implementation and is not affiliated with Spotify.
This repo is a proof-of-concept implementation of the Spotify paper "Text2Tracks: Prompt-Based Music Recommendation via Generative Retrieval".
The paper formulates music recommendation as a generative retrieval problem. Instead of ranking items from a candidate pool, it treats music retrieval as a sequence generation task, mapping a natural language prompt to a sequence of track identifiers.
Input: A natural language music prompt
Example: "I'd love some relaxing bossa nova music"
Goal: Recommend a list of track IDs (not track names)
Output: ["<0><1><4>", "<0><3><2>", ...]
Instead of using raw track IDs, the paper discretizes track embeddings into semantic ID tokens, so that generation happens over a finite vocabulary.
- LLMs usually generate track names like “Girl from Ipanema” and “Desafinado”
- Then they need entity resolution: matching the text to real songs in Spotify’s catalog
- This is slow, error-prone, and not scalable
Text2Tracks reframes this as a Generative Retrieval task
Mathematically, they model this as a function

`f(Q) = {t_1, ..., t_k} ⊆ T, with k ≪ |T|`

that maps the prompt Q directly to a small subset of tracks from the total catalog T, where:

- Q = the music prompt
- T = the total track universe
- f(Q) = track IDs directly generated by the model (no lookup!)
A key challenge in Text2Tracks is selecting a track ID format that is both easy for the model to generate and rich in semantic meaning. This design decision is central to the effectiveness of the generative retrieval approach.
The paper evaluates three strategies for representing track IDs:
Content-Based IDs use strings derived from track metadata, such as "bossa_nova_relaxing" or combinations like "artist_title". These are straightforward to generate but tend to be long, ambiguous, and difficult to resolve consistently at scale.
Integer-Based IDs assign a unique numerical ID to each track, such as "1234_5678". While compact, these IDs are arbitrary and carry no semantic meaning, making them hard for the model to learn without extensive memorization.
Learned Semantic IDs, which perform best, embed tracks into a vector space using collaborative filtering techniques based on playlist co-occurrence. These embeddings are then discretized into short symbolic sequences like "<3><15><7>". This approach strikes a balance between structure, expressiveness, and generative compatibility.
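As a concrete illustration, here is a minimal sketch of the discretization idea using scikit-learn's `MiniBatchDictionaryLearning` (the same tool this implementation uses). The random embedding matrix, the dictionary size, and the `to_semantic_id` helper are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Stand-in for collaborative-filtering track embeddings: 1000 tracks, 64 dims.
rng = np.random.default_rng(0)
track_embeddings = rng.normal(size=(1000, 64))

# Learn a dictionary of 128 atoms; approximate each track with 3 of them.
dl = MiniBatchDictionaryLearning(
    n_components=128,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=3,
    random_state=0,
)
codes = dl.fit_transform(track_embeddings)  # shape (1000, 128), mostly zeros

def to_semantic_id(code_row, k=3):
    # Keep the k strongest atoms and render their indices as <i> tokens.
    top = np.argsort(-np.abs(code_row))[:k]
    return "".join(f"<{i}>" for i in top)

print(to_semantic_id(codes[0]))  # e.g. "<3><15><7>"
```

Each track thus becomes a short token sequence over a fixed, finite vocabulary of atom indices, which is exactly the kind of output a seq2seq model can generate.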
This implementation follows the paper's high-level approach with simplifications:
- We simulate playlists by grouping `track_name` under `track_genre` from a Spotify dataset.
- The training data is the Spotify Tracks Dataset on Kaggle, which includes metadata and audio features for over 100k tracks across 125 genres.
- Train a Word2Vec model to learn embeddings for each track (see the sketch after this list).
- Use `MiniBatchDictionaryLearning` to discretize those vectors into sparse semantic token IDs (e.g., `<2><17><5>`).
- Fine-tune `flan-t5-small` to map from `track_genre` prompts to semantic ID sequences.
- Training data pairs: (`prompt=genre`, `target=semantic_id`).
- The model generates semantic ID tokens (e.g., `<3><12><44>`) from a new prompt.
- These tokens can be reverse-mapped to the most similar real tracks (optional).
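Below is a hedged sketch of the playlist-simulation and Word2Vec steps, assuming the `track_genre` and `track_name` columns of the Kaggle CSV; the hyperparameters are illustrative and not necessarily what `dataset_creation.py` uses:

```python
import pandas as pd
from gensim.models import Word2Vec

df = pd.read_csv("spotify_dataset.csv")

# One pseudo-playlist per genre: the list of track names in that genre.
playlists = (
    df.groupby("track_genre")["track_name"]
    .apply(lambda names: [str(n) for n in names])
    .tolist()
)

# Word2Vec treats each playlist as a "sentence" and each track as a "word",
# so tracks that share a genre end up with nearby embedding vectors.
w2v = Word2Vec(sentences=playlists, vector_size=64, window=10, min_count=1, workers=4)

vec = w2v.wv[playlists[0][0]]  # embedding of one track seen during training
print(vec.shape)               # (64,)
```

These embeddings are what `MiniBatchDictionaryLearning` then discretizes into semantic IDs (see the earlier sketch).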
To run the pipeline:

1. Create a virtual environment with Python 3.10 and activate it.
2. Install the required packages: `pip install -r requirements.txt`
3. Download the dataset: get the Spotify Tracks Dataset from Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset), save the downloaded CSV file (usually named `tracks.csv`) into the project root directory, and rename it to `spotify_dataset.csv`.
4. Run the dataset creation script: `python dataset_creation.py`
5. Train the model: `python train_model.py`
6. Generate recommendations: `python inference.py` (a sketch of this step follows below).
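For orientation, here is a minimal sketch of what the inference step might look like. The checkpoint directory `./t5_text2tracks` and the generation settings are assumptions, not necessarily what `inference.py` actually does:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_dir = "./t5_text2tracks"  # assumed save path from train_model.py
tokenizer = T5Tokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

prompt = "I'd love some relaxing bossa nova music"
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search with several return sequences yields a ranked list of IDs.
outputs = model.generate(
    **inputs, max_new_tokens=16, num_beams=5, num_return_sequences=5
)
for seq in outputs:
    # Assumes the <i> tokens were added as regular (non-special) tokens,
    # so skip_special_tokens only strips padding/EOS.
    print(tokenizer.decode(seq, skip_special_tokens=True))  # e.g. "<3><12><44>"
```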
- `dataset_creation.py` - prepares training data and semantic IDs.
- `train_model.py` - fine-tunes the T5 model.
- `inference.py` - generates semantic IDs from a prompt.
- Python 3.10
- `transformers`, `datasets`, `gensim`, `scikit-learn`, `sentencepiece`