WNYC Radio Archive Transcriber

Scrape and transcribe thousands of audio files for free using Python, OpenAI's Whisper, and GitHub Actions.

This repository demonstrates how to:

Scrape thousands of MP3 files from the NYC Municipal Archive's WNYC radio collection
Transcribe them automatically using OpenAI's Whisper speech-to-text model
Scale the process using GitHub Actions matrix operations to process files in parallel
Store results as searchable text files in your repository

It works because:

Cost: $0 using free tools and GitHub Actions
Quality: Whisper provides state-of-the-art transcription
Scale: GitHub Actions can process hundreds of files in parallel
Simplicity: Fully automated once set up

Directory Structure

wnyc-radio-archive-transcriber/
├── pipeline/                    # Core processing modules
│   ├── settings.py             # Configuration and paths
│   ├── utils.py                # Shared utility functions
│   ├── scrape.py               # Web scraping logic
│   ├── transcribe.py           # Audio transcription
│   ├── count.py                # Progress tracking
│   └── untranscribed.py        # File management
├── data/                       # Data storage
│   ├── input/                  # Scraped data
│   │   ├── html/              # Raw HTML files
│   │   │   ├── lists/         # List page HTML
│   │   │   └── details/       # Detail page HTML
│   │   └── json/              # Structured metadata
│   └── output/                # Transcription results
├── .github/workflows/          # Automation
│   ├── scrape.yaml            # Metadata collection
│   └── transcribe.yaml        # Parallel transcription
├── README.md                  # Quick start guide
└── pyproject.toml           # Dependencies and config

Usage

Clone this repository.

git clone https://github.com/palewire/wnyc-radio-archive-transcriber.git
cd wnyc-radio-archive-transcriber

Install dependencies.

curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

Count how many files are still untranscribed.

uv run python -m pipeline.count

Scrape new files from the WNYC archive.

uv run python -m pipeline.scrape

List untranscribed files.

uv run python -m pipeline.untranscribed -l 1
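One way to work out which files still need transcription is to compare the scraped metadata against the finished transcripts. The sketch below is illustrative only and is not the repository's actual `pipeline.untranscribed` module. It assumes one `<id>.json` file per recording under `data/input/json` and one `<id>.txt` transcript under `data/output`. The function name `find_untranscribed` and its `limit` parameter are hypothetical, and `limit` only guesses at what the `-l 1` flag above does.

```python
from pathlib import Path


def find_untranscribed(json_dir, output_dir, limit=None):
    """Return IDs that have metadata in json_dir but no transcript in output_dir."""
    json_dir, output_dir = Path(json_dir), Path(output_dir)
    # Assumption: one <id>.json metadata file per scraped recording.
    scraped = {p.stem for p in json_dir.glob("*.json")}
    # Assumption: one <id>.txt transcript per finished recording.
    done = {p.stem for p in output_dir.glob("*.txt")}
    # Sort so repeated runs return the pending IDs in a stable order.
    pending = sorted(scraped - done)
    return pending if limit is None else pending[:limit]
```

Calling `find_untranscribed("data/input/json", "data/output", limit=1)` would return at most one pending ID, which could then be handed to the transcription step.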

Transcribe a single file.

uv run python -m pipeline.transcribe -f "your-file-id-here"
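For readers curious what a single transcription might involve, here is a minimal sketch using the `openai-whisper` package's `load_model` and `transcribe` calls. It is not the repository's `pipeline.transcribe` code. The function name `transcribe_to_text`, the `base` model size, and the `<id>.txt` output naming are all assumptions made for illustration.

```python
from pathlib import Path


def transcribe_to_text(audio_path, output_dir, model=None, model_name="base"):
    """Transcribe one audio file and save the text as <stem>.txt in output_dir."""
    if model is None:
        # Imported lazily so the module loads even where Whisper is not installed.
        import whisper  # provided by the openai-whisper package

        model = whisper.load_model(model_name)
    # Whisper's transcribe() returns a dict; the full transcript is under "text".
    result = model.transcribe(str(audio_path))
    out_path = Path(output_dir) / f"{Path(audio_path).stem}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(result["text"].strip() + "\n", encoding="utf-8")
    return out_path
```

Passing a preloaded `model` avoids reloading the weights for every file, which matters when several recordings are processed in one job.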

How the matrix strategy mass transcribes files

The transcription workflow uses GitHub Actions' matrix strategy to process multiple files simultaneously:

strategy:
  fail-fast: false
  matrix:
    file: ${{ fromJson(needs.seed.outputs.file-list) }}

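This creates a separate job for each file, allowing GitHub to process hundreds of files in parallel across multiple runners. Because fail-fast is false, one failed transcription does not cancel the remaining jobs. A single matrix can generate at most 256 jobs per workflow run, so a larger backlog has to be spread across several runs.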
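The matrix above reads a JSON array from the `seed` job's `file-list` output. A seed step could produce that value roughly as sketched below. This is an assumption about how such a step might look, not the repository's actual code: `build_file_list` and `write_step_output` are hypothetical helpers. The sketch relies on the standard GitHub Actions mechanism of appending `name=value` lines to the file named by the `GITHUB_OUTPUT` environment variable.

```python
import json
import os

# GitHub documents a cap of 256 jobs generated by one matrix per workflow run.
MAX_MATRIX_JOBS = 256


def build_file_list(file_ids, limit=MAX_MATRIX_JOBS):
    """Serialize up to `limit` file IDs as a compact JSON array for fromJson()."""
    return json.dumps(list(file_ids)[:limit], separators=(",", ":"))


def write_step_output(name, value, path=None):
    """Append name=value to the step output file (defaults to $GITHUB_OUTPUT)."""
    path = path or os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
```

In a workflow, the seed step would call something like `write_step_output("file-list", build_file_list(ids))`, and the job would need to map that step output to a job-level output so that `needs.seed.outputs.file-list` resolves in the matrix.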
