Note: This is an unofficial implementation and is not affiliated with Spotify.
This repo is a proof-of-concept implementation of the Spotify paper "Text2Tracks: Prompt-Based Music Recommendation via Generative Retrieval".
The paper formulates music recommendation as a generative retrieval problem. Instead of ranking items from a candidate pool, it treats music retrieval as a sequence generation task, mapping a natural language prompt to a sequence of track identifiers.
Input: A natural language music prompt
Example: "I'd love some relaxing bossa nova music"
Goal: Recommend a list of track IDs (not track names)
Output: ["<0><1><4>", "<0><3><2>", ...]
Instead of using raw track IDs, the paper discretizes track embeddings into semantic ID tokens, so that generation happens over a finite vocabulary.
- LLMs usually generate track names like “Girl from Ipanema” and “Desafinado”
- Then they need entity resolution: matching the text to real songs in Spotify’s catalog
- This is slow, error-prone, and not scalable
Text2Tracks reframes this as a Generative Retrieval task
Mathematically, they model this as a function

`f(Q) = {t_1, ..., t_k} ⊆ T, with k ≪ |T|`

that maps the prompt Q directly to a small subset of tracks from the total catalog T, where:

- Q = the music prompt
- T = the total track universe
- f(Q) = track IDs directly generated by the model (no lookup!)
A key challenge in Text2Tracks is selecting a track ID format that is both easy for the model to generate and rich in semantic meaning. This design decision is central to the effectiveness of the generative retrieval approach.
The paper evaluates three strategies for representing track IDs:
Content-Based IDs use strings derived from track metadata, such as "bossa_nova_relaxing" or combinations like "artist_title". These are straightforward to generate but tend to be long, ambiguous, and difficult to resolve consistently at scale.
Integer-Based IDs assign a unique numerical ID to each track, such as "1234_5678". While compact, these IDs are arbitrary and carry no semantic meaning, making them hard for the model to learn without extensive memorization.
Learned Semantic IDs, which perform best, embed tracks into a vector space using collaborative filtering techniques based on playlist co-occurrence. These embeddings are then discretized into short symbolic sequences like "<3><15><7>". This approach strikes a balance between structure, expressiveness, and generative compatibility.
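As a concrete illustration, here is a minimal sketch of the discretization idea using scikit-learn's `MiniBatchDictionaryLearning` (the same tool this implementation uses). The random embedding matrix, the dictionary size, and the `to_semantic_id` helper are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Stand-in for collaborative-filtering track embeddings: 1000 tracks, 64 dims.
rng = np.random.default_rng(0)
track_embeddings = rng.normal(size=(1000, 64))

# Learn a dictionary of 128 atoms; approximate each track with 3 of them.
dl = MiniBatchDictionaryLearning(
    n_components=128,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=3,
    random_state=0,
)
codes = dl.fit_transform(track_embeddings)  # shape (1000, 128), mostly zeros

def to_semantic_id(code_row, k=3):
    # Keep the k strongest atoms and render their indices as <i> tokens.
    top = np.argsort(-np.abs(code_row))[:k]
    return "".join(f"<{i}>" for i in top)

print(to_semantic_id(codes[0]))  # e.g. "<3><15><7>"
```

Each track thus becomes a short token sequence over a fixed, finite vocabulary of atom indices, which is exactly the kind of output a seq2seq model can generate.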
This implementation follows the paper's high-level approach with simplifications:
- We simulate playlists by grouping `track_name` under `track_genre` from a Spotify dataset.
- The training data is the Spotify Tracks Dataset on Kaggle, which includes metadata and audio features for over 100k tracks across 125 genres.
- Train a Word2Vec model to learn embeddings for each track (see the sketch after this list).
- Use `MiniBatchDictionaryLearning` to discretize those vectors into sparse semantic token IDs (e.g., `<2><17><5>`).
- Fine-tune `flan-t5-small` to map from `track_genre` prompts to semantic ID sequences.
- Training data pairs: (`prompt=genre`, `target=semantic_id`).
- The model generates semantic ID tokens (e.g., `<3><12><44>`) from a new prompt.
- These tokens can be reverse-mapped to the most similar real tracks (optional).
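Below is a hedged sketch of the playlist-simulation and Word2Vec steps, assuming the `track_genre` and `track_name` columns of the Kaggle CSV; the hyperparameters are illustrative and not necessarily what `dataset_creation.py` uses:

```python
import pandas as pd
from gensim.models import Word2Vec

df = pd.read_csv("spotify_dataset.csv")

# One pseudo-playlist per genre: the list of track names in that genre.
playlists = (
    df.groupby("track_genre")["track_name"]
    .apply(lambda names: [str(n) for n in names])
    .tolist()
)

# Word2Vec treats each playlist as a "sentence" and each track as a "word",
# so tracks that share a genre end up with nearby embedding vectors.
w2v = Word2Vec(sentences=playlists, vector_size=64, window=10, min_count=1, workers=4)

vec = w2v.wv[playlists[0][0]]  # embedding of one track seen during training
print(vec.shape)               # (64,)
```

These embeddings are what `MiniBatchDictionaryLearning` then discretizes into semantic IDs (see the earlier sketch).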
To run the pipeline:

1. Create a virtual environment with Python 3.10 and activate it.
2. Install the required packages: `pip install -r requirements.txt`
3. Download the dataset: get the Spotify Tracks Dataset from Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset), save the downloaded CSV file (usually named `tracks.csv`) into the project root directory, and rename it to `spotify_dataset.csv`.
4. Run the dataset creation script: `python dataset_creation.py`
5. Train the model: `python train_model.py`
6. Generate recommendations: `python inference.py` (a sketch of this step follows below).
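For orientation, here is a minimal sketch of what the inference step might look like. The checkpoint directory `./t5_text2tracks` and the generation settings are assumptions, not necessarily what `inference.py` actually does:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_dir = "./t5_text2tracks"  # assumed save path from train_model.py
tokenizer = T5Tokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

prompt = "I'd love some relaxing bossa nova music"
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search with several return sequences yields a ranked list of IDs.
outputs = model.generate(
    **inputs, max_new_tokens=16, num_beams=5, num_return_sequences=5
)
for seq in outputs:
    # Assumes the <i> tokens were added as regular (non-special) tokens,
    # so skip_special_tokens only strips padding/EOS.
    print(tokenizer.decode(seq, skip_special_tokens=True))  # e.g. "<3><12><44>"
```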
- `dataset_creation.py` - prepares training data and semantic IDs.
- `train_model.py` - fine-tunes the T5 model.
- `inference.py` - generates semantic IDs from a prompt.
- Python 3.10
- `transformers`, `datasets`, `gensim`, `scikit-learn`, `sentencepiece`