Scrape and transcribe thousands of audio files for free using Python, OpenAI's Whisper, and GitHub Actions.
This repository demonstrates how to:
- Scrape thousands of MP3 files from the NYC Municipal Archive's WNYC radio collection
- Transcribe them automatically using OpenAI's Whisper speech-to-text model
- Scale the process using GitHub Actions matrix operations to process files in parallel
- Store results as searchable text files in your repository
Why this approach works:
- Cost: $0, using open-source tools and GitHub Actions, which is free for public repositories
- Quality: Whisper provides state-of-the-art transcription
- Scale: a GitHub Actions matrix can fan a single workflow run out into as many as 256 jobs
- Simplicity: Fully automated once set up
wnyc-radio-archive-transcriber/
├── pipeline/ # Core processing modules
│ ├── settings.py # Configuration and paths
│ ├── utils.py # Shared utility functions
│ ├── scrape.py # Web scraping logic
│ ├── transcribe.py # Audio transcription
│ ├── count.py # Progress tracking
│ └── untranscribed.py # File management
├── data/ # Data storage
│ ├── input/ # Scraped data
│ │ ├── html/ # Raw HTML files
│ │ │ ├── lists/ # List page HTML
│ │ │ └── details/ # Detail page HTML
│ │ └── json/ # Structured metadata
│ └── output/ # Transcription results
├── .github/workflows/ # Automation
│ ├── scrape.yaml # Metadata collection
│ └── transcribe.yaml # Parallel transcription
├── README.md # Quick start guide
└── pyproject.toml # Dependencies and config
Clone this repository.
git clone https://github.com/palewire/wnyc-radio-archive-transcriber.git
cd wnyc-radio-archive-transcriber
Install dependencies.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
Check for untranscribed files.
uv run python -m pipeline.count
Scrape new files from the WNYC archive.
uv run python -m pipeline.scrape
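To give a feel for what the scraping step involves, here is a minimal sketch that pulls MP3 links out of a saved detail page using only the standard library. It is not the code in `pipeline/scrape.py`; the class name `Mp3LinkParser`, the function `extract_mp3_links`, and the assumption that audio files appear as `href` or `src` attributes ending in `.mp3` are all illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Mp3LinkParser(HTMLParser):
    """Collect href/src attribute values that point at .mp3 files (hypothetical helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            # Ignore any query string when checking the file extension.
            if value.lower().split("?")[0].endswith(".mp3"):
                url = urljoin(self.base_url, value)  # resolve relative links
                if url not in self.links:  # keep first occurrence only
                    self.links.append(url)


def extract_mp3_links(html: str, base_url: str) -> list[str]:
    """Return the unique MP3 URLs found in an HTML document, in page order."""
    parser = Mp3LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Feeding it the contents of a file from `data/input/html/details/` along with that page's URL would return the absolute MP3 links, assuming the archive's markup matches the pattern above.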
List untranscribed files.
uv run python -m pipeline.untranscribed -l 1
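Conceptually, finding untranscribed files means comparing the scraped metadata against the transcripts that already exist. The sketch below shows one way to do that; it assumes one `<id>.json` file per recording and one `<id>.txt` file per transcript, which may not match the repository's real layout, and `find_untranscribed` is a hypothetical name rather than the function in `pipeline/untranscribed.py`.

```python
from pathlib import Path
from typing import Optional


def find_untranscribed(
    json_dir: Path, output_dir: Path, limit: Optional[int] = None
) -> list[str]:
    """Return IDs that have a metadata file but no transcript yet, sorted by ID."""
    # IDs that already have a transcript (assumed naming: <id>.txt).
    done = {p.stem for p in output_dir.glob("*.txt")}
    # IDs with metadata (assumed naming: <id>.json) that are not done.
    pending = sorted(p.stem for p in json_dir.glob("*.json") if p.stem not in done)
    return pending if limit is None else pending[:limit]
```

Calling `find_untranscribed(Path("data/input/json"), Path("data/output"), limit=1)` would return at most one pending ID under those assumptions.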
Transcribe a single file.
uv run python -m pipeline.transcribe -f "your-file-id-here"
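For readers curious about what a transcription step looks like, here is a minimal sketch built on the `openai-whisper` package's `load_model` and `transcribe` calls. The function names, the `base` model choice, and the `<id>.txt` output naming are assumptions for illustration; the repository's `pipeline/transcribe.py` may differ. Whisper also needs `ffmpeg` installed to decode audio.

```python
from pathlib import Path


def transcript_path(audio_path: Path, output_dir: Path) -> Path:
    """Map an audio file to the text file that will hold its transcript."""
    return output_dir / f"{audio_path.stem}.txt"


def transcribe_file(
    audio_path: Path, output_dir: Path, model_name: str = "base"
) -> Path:
    """Transcribe one audio file with Whisper and save the text (illustrative)."""
    # Imported lazily so the path helper works without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(str(audio_path))  # dict with a "text" key
    out = transcript_path(audio_path, output_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result["text"].strip() + "\n", encoding="utf-8")
    return out
```

Larger models such as `small` or `medium` tend to be more accurate but take longer on CPU-only runners, so the model size is a trade-off worth testing on a handful of files first.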
The transcription workflow uses GitHub Actions' matrix strategy to process multiple files simultaneously:
strategy:
  fail-fast: false
  matrix:
    file: ${{ fromJson(needs.seed.outputs.file-list) }}
This creates a separate job for each file in the list, so GitHub can spread the work across multiple runners. How many jobs actually run at once depends on your account's concurrent-job limit; the rest wait in the queue.
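The matrix above expects the `seed` job to expose a JSON array named `file-list`. A minimal sketch of how a seed step might produce it is shown below; `build_file_list` and `write_output` are hypothetical helpers, not code from this repository. The only GitHub-specific pieces are the `GITHUB_OUTPUT` file that Actions reads step outputs from and the cap of 256 jobs per matrix.

```python
import json
import os

MAX_MATRIX_JOBS = 256  # GitHub's cap on jobs generated by one matrix


def build_file_list(file_ids: list[str], limit: int = MAX_MATRIX_JOBS) -> str:
    """Serialize up to `limit` IDs as a compact JSON array for fromJson()."""
    return json.dumps(file_ids[:limit], separators=(",", ":"))


def write_output(name: str, value: str) -> None:
    """Append name=value to the file GitHub Actions reads step outputs from."""
    output_file = os.environ.get("GITHUB_OUTPUT")
    if output_file is None:
        # Local fallback when not running inside GitHub Actions.
        print(f"{name}={value}")
        return
    with open(output_file, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


if __name__ == "__main__":
    # Placeholder IDs; a real seed step would list untranscribed files here.
    write_output("file-list", build_file_list(["example-id-1", "example-id-2"]))
```

For `needs.seed.outputs.file-list` to resolve, the `seed` job also has to map the step output to a job output in its `outputs:` section. Capping the list keeps the workflow within the matrix limit; any remaining files can be picked up by the next run.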