BioBERT Rehabilitation Exercise Information Extraction

Replication and extension of PMC11024747 (Wang NAIL Lab, UPitt).
The original paper used rule-based NLP (F1=0.891) and found vanilla BERT failed (F1=0.05).
This project replaces the rule-based system with a BioBERT two-stage pipeline.

What This System Does

Takes a free-text therapy note as input and outputs structured exercise information:

Input:
  "Patient performed passive ROM to right shoulder bilaterally, 3 sets x 10 reps."

Output:
  Entities:    MOTION_TYPE  → "passive ROM"
               BODY_PART    → "shoulder"
               BODY_SIDE    → "right", "bilaterally"
               SETS_REPS    → "3 sets x 10 reps"

  Attributes:  exercise_status → performed_in_office
               motion_type     → PROM
               body_side       → bilateral

Architecture: Two-Stage Pipeline

Raw therapy note
      │
      ▼
┌─────────────┐
│  Stage 1    │  BioBERT + CRF
│  NER Model  │  Token classification → finds entity spans in text
└─────────────┘
      │
      ▼
┌─────────────┐
│  Stage 2    │  BioBERT + Multi-label head
│  Classifier │  Sentence classification → assigns attribute categories
└─────────────┘
      │
      ▼
Structured JSON output

Why two stages?

NER answers: where in the text is the information ("right shoulder" at chars 33-47 of the example note above, end exclusive)
Classifier answers: what category does this sentence belong to (exercise_status=performed)
Together they mirror the 9-category ontology from the original paper
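
As a rough illustration of how the two outputs could be combined, the sketch below groups Stage 1 spans by label and attaches the Stage 2 attributes. The helper name `merge_outputs` and the output keys are assumptions made for this example; the actual JSON produced by scripts/predict.py may be structured differently.

```python
import json
from collections import defaultdict


def merge_outputs(text, spans, attributes):
    """Group NER spans by label and attach sentence-level attributes.

    Hypothetical helper for illustration only; offsets are end-exclusive.
    """
    grouped = defaultdict(list)
    for span in spans:
        # Recover the surface string from the character offsets.
        grouped[span["label"]].append(text[span["start"]:span["end"]])
    return {"text": text, "entities": dict(grouped), "attributes": attributes}


text = "Patient performed passive ROM to right shoulder bilaterally, 3 sets x 10 reps."
spans = [
    {"start": 18, "end": 29, "label": "MOTION_TYPE"},
    {"start": 33, "end": 38, "label": "BODY_SIDE"},
    {"start": 39, "end": 47, "label": "BODY_PART"},
    {"start": 48, "end": 59, "label": "BODY_SIDE"},
    {"start": 61, "end": 77, "label": "SETS_REPS"},
]
attributes = {
    "exercise_status": "performed_in_office",
    "motion_type": "PROM",
    "body_side": "bilateral",
}

record = merge_outputs(text, spans, attributes)
print(json.dumps(record, indent=2))
```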

Project Structure

BioBERT/
├── configs/
│   └── config.yaml              # All hyperparameters and label definitions
│
├── data/
│   ├── annotations/             # JSON datasets (input format for training)
│   │   ├── sample_notes.json    # Small sample (5 notes)
│   │   ├── synthetic_notes.json # Generated synthetic data
│   │   └── synthetic_notes_500.json  # 500-note training set (used for fine-tuning)
│   ├── processed/               # Preprocessed/tokenized data (auto-generated)
│   └── raw/                     # Original source files
│
├── src/                         # Core library (importable modules)
│   ├── data/
│   │   └── dataset.py           # PyTorch Dataset classes
│   ├── models/
│   │   ├── biobert_ner.py       # BioBERT + CRF model (Stage 1)
│   │   └── biobert_classifier.py # BioBERT + multi-label head (Stage 2)
│   ├── training/
│   │   └── trainer.py           # Training loop (shared by both models)
│   └── utils/
│       ├── preprocessing.py     # Text cleaning, abbreviation expansion, data split
│       └── metrics.py           # seqeval NER metrics + sklearn multi-label metrics
│
├── scripts/                     # Runnable entry points
│   ├── train_ner.py             # Fine-tune the NER model
│   ├── train_classifier.py      # Fine-tune the classifier model
│   ├── predict.py               # End-to-end inference (NER → Classifier)
│   ├── generate_synthetic_data.py  # Generate synthetic training data
│   ├── run_demo.py              # Interactive demo
│   └── visualize.py             # Plot training curves
│
├── results/
│   ├── models/
│   │   ├── ner/best_model.pt        # Saved NER checkpoint
│   │   └── classifier/best_model.pt # Saved Classifier checkpoint
│   └── logs/                    # TensorBoard training logs
│
├── notebooks/                   # Jupyter notebooks for exploration
├── requirements.txt
└── venv/                        # Virtual environment (not committed)

Data Format

Every JSON file follows this schema:

{
  "id": "note_001",
  "text": "Patient performed AROM to right ankle, 3 sets x 10 reps.",
  "language": "en",
  "ner_labels": [
    {"start": 18, "end": 22, "text": "AROM", "label": "MOTION_TYPE"},
    {"start": 26, "end": 31, "text": "right", "label": "BODY_SIDE"},
    {"start": 32, "end": 37, "text": "ankle", "label": "BODY_PART"},
    {"start": 39, "end": 55, "text": "3 sets x 10 reps", "label": "SETS_REPS"}
  ],
  "attributes": {
    "exercise_status": "performed_in_office",
    "motion_type": "AROM",
    "body_part": "ankle",
    "body_side": "right",
    "body_position": null,
    "exercise_type": "strengthening"
  }
}

ner_labels → used by Stage 1 (NER), character-level spans
attributes → used by Stage 2 (Classifier), sentence-level categories
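
Because Stage 1 depends on exact character offsets, it can help to check annotations before training. The following is a minimal sketch, assuming end-exclusive offsets as in the schema above; `find_bad_spans` is a hypothetical helper, not a function from this repository.

```python
def find_bad_spans(record):
    """Return the ner_labels entries whose offsets do not reproduce their text."""
    text = record["text"]
    return [
        label
        for label in record["ner_labels"]
        if text[label["start"]:label["end"]] != label["text"]
    ]


record = {
    "text": "Patient performed AROM to right ankle, 3 sets x 10 reps.",
    "ner_labels": [
        {"start": 18, "end": 22, "text": "AROM", "label": "MOTION_TYPE"},
        {"start": 26, "end": 31, "text": "right", "label": "BODY_SIDE"},
        # Deliberately off by one to show what the check catches.
        {"start": 32, "end": 38, "text": "ankle", "label": "BODY_PART"},
    ],
}

bad = find_bad_spans(record)
for label in bad:
    print("offset mismatch:", label)
```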

Code Walkthrough

1. `configs/config.yaml`

Single source of truth for everything: model backbone path, NER label list, classification categories, training hyperparameters, file paths. Both training scripts read from this file.

2. `src/data/dataset.py`

Two PyTorch Dataset classes:

RehabNERDataset — tokenizes text with BioBERT tokenizer, uses align_labels_to_tokens() to convert character-level spans → BIO token labels. Handles subword tokenization (e.g., "shoulder" → ["sh", "##ould", "##er"], only the first subword gets the B- label).
RehabClassifierDataset — tokenizes text, converts attributes dict → binary vector (multi-hot encoding). e.g., if body_side=right, sets index of body_side__right to 1.
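
The span-to-BIO conversion can be sketched without loading a tokenizer by working directly on per-token character offsets (the kind of offsets a fast tokenizer can return). This is an illustration under stated assumptions: the function name `spans_to_bio` is hypothetical, special tokens are assumed to carry empty `(0, 0)` offsets, and continuation subwords are assumed to receive I- tags. The repository's align_labels_to_tokens() may handle these cases differently.

```python
def spans_to_bio(token_offsets, spans):
    """Assign one BIO tag per token from character-level entity spans.

    token_offsets: list of (start, end) per subword token, end exclusive.
    spans: list of dicts with "start", "end", "label" (same convention).
    """
    tags = ["O"] * len(token_offsets)
    for span in spans:
        first = True
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start == tok_end:
                continue  # special tokens such as [CLS] / [SEP] stay "O"
            # A token belongs to the span if it lies fully inside it.
            if tok_start >= span["start"] and tok_end <= span["end"]:
                tags[i] = ("B-" if first else "I-") + span["label"]
                first = False
    return tags


# "right shoulder", with "shoulder" split into three subwords as in the text above.
tokens = ["[CLS]", "right", "sh", "##ould", "##er", "[SEP]"]
offsets = [(0, 0), (0, 5), (6, 8), (8, 12), (12, 14), (0, 0)]
spans = [
    {"start": 0, "end": 5, "label": "BODY_SIDE"},
    {"start": 6, "end": 14, "label": "BODY_PART"},
]

tags = spans_to_bio(offsets, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```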

3. `src/models/biobert_ner.py`

BioBERT encoder → hidden states (batch, seq_len, 768)
      ↓
Linear projection → emission scores (batch, seq_len, num_labels)
      ↓
CRF layer → enforces valid BIO tag sequences (e.g., I- cannot follow O)
      ↓
Output: predicted tag sequence per token

The CRF is the key difference from a plain softmax classifier: it learns transition scores between tags, so invalid sequences such as O → I-BODY_PART are penalized when the best tag sequence is decoded.
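
To make the effect of transition scores concrete, here is a small pure-Python Viterbi decoder with hand-picked numbers. It is only a sketch of the decoding idea: in the model itself this is handled by the CRF layer from pytorch-crf, and the scores below are invented for the example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions: list of {tag: score}, one dict per token.
    transitions: {(prev_tag, cur_tag): score}; missing pairs score 0.
    """
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for em in emissions[1:]:
        row = {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p][0] + transitions.get((p, cur), 0.0),
            )
            score = best[-1][prev][0] + transitions.get((prev, cur), 0.0) + em[cur]
            row[cur] = (score, prev)
        best.append(row)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for row in reversed(best[1:]):
        tag = row[tag][1]
        path.append(tag)
    return path[::-1]


tags = ["O", "B-BODY_PART", "I-BODY_PART"]
emissions = [
    {"O": 2.0, "B-BODY_PART": 0.1, "I-BODY_PART": 0.0},
    {"O": 0.5, "B-BODY_PART": 0.8, "I-BODY_PART": 1.0},
]
# A strongly negative score stands in for a learned "O -> I-" penalty.
transitions = {("O", "I-BODY_PART"): -10.0}

greedy = [max(tags, key=lambda t: em[t]) for em in emissions]
decoded = viterbi(emissions, transitions, tags)
print("per-token argmax:", greedy)
print("viterbi decode:  ", decoded)
```

Per-token argmax picks I-BODY_PART directly after O, while the decoder with the transition penalty switches the second tag to B-BODY_PART.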

4. `src/models/biobert_classifier.py`

BioBERT encoder → [CLS] token representation (batch, 768)
      ↓
Dropout → Linear → sigmoid (not softmax)
      ↓
Output: probability per label (independent, multi-label)

Uses sigmoid (not softmax) because multiple labels can be true simultaneously (e.g., a note can have both body_side=right AND exercise_type=strengthening).
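
The difference is easy to see on a toy set of logits. In this sketch the logits are invented, the label names other than body_side__right are illustrative, and the 0.5 decision threshold is an assumption (the project's threshold may be configured differently).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["body_side__right", "exercise_type__strengthening", "body_side__left"]
logits = [3.0, 2.5, -4.0]

sig = [sigmoid(z) for z in logits]  # each label scored independently
soft = softmax(logits)              # labels compete for a total of 1.0

predicted = [lab for lab, p in zip(labels, sig) if p >= 0.5]
softmax_over_half = [lab for lab, p in zip(labels, soft) if p >= 0.5]

print("sigmoid:", [round(p, 3) for p in sig], "->", predicted)
print("softmax:", [round(p, 3) for p in soft], "->", softmax_over_half)
```

With sigmoid, both of the first two labels clear the threshold. With softmax, the probabilities must sum to 1, so at most one label can exceed 0.5.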

5. `src/training/trainer.py`

Generic trainer shared by both models. Handles:

AdamW optimizer with linear warmup + decay
Gradient clipping
Evaluation every N steps
Early stopping (saves best checkpoint by primary metric)
TensorBoard logging
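
The linear warmup + decay schedule can be written as a multiplier on the base learning rate. This sketch has the same shape as the schedule produced by `get_linear_schedule_with_warmup` in Hugging Face Transformers; whether trainer.py uses that helper is not stated here, and the base learning rate and step counts below are assumed values, not the ones in config.yaml.

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Linear ramp from 0 to 1 over warmup, then linear decay back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5       # assumed value for illustration
warmup_steps = 100   # assumed
total_steps = 1000   # assumed

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * lr_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:4d}  lr = {lr:.2e}")
```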

6. `src/utils/preprocessing.py`

preprocess() — expands medical abbreviations before tokenization (ROM→range of motion, AROM→active range of motion, HEP→home exercise program, etc.)
align_labels_to_tokens() — maps character-level entity spans to subword token positions
split_dataset() — 70/15/15 train/val/test split
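
A minimal sketch of the abbreviation expansion and the 70/15/15 split follows. The abbreviation table contains only the three expansions named above, `expand_abbreviations` is a hypothetical stand-in for preprocess(), and the shuffling seed is an assumption; the repository's implementations may differ.

```python
import random
import re

ABBREVIATIONS = {
    "AROM": "active range of motion",
    "ROM": "range of motion",
    "HEP": "home exercise program",
}

# Whole-word matching keeps "ROM" from firing inside "AROM".
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(ABBREVIATIONS, key=len, reverse=True)) + r")\b"
)


def expand_abbreviations(text):
    """Replace each known abbreviation with its long form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def split_dataset(records, seed=42):
    """Shuffle deterministically and split 70/15/15 into train/val/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


expanded = expand_abbreviations("Patient performed AROM, then HEP reviewed.")
print(expanded)

train, val, test = split_dataset(list(range(500)))
print(len(train), len(val), len(test))
```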

Setup

# Create and activate virtual environment
python -m venv venv
venv\Scripts\activate       # Windows
source venv/bin/activate    # Mac/Linux

# Install dependencies
pip install -r requirements.txt
pip install pytorch-crf

Training

Step 1 — Fine-tune NER model

python scripts/train_ner.py \
  --config configs/config.yaml \
  --data data/annotations/synthetic_notes_500.json

Saves best checkpoint to results/models/ner/best_model.pt

Step 2 — Fine-tune Classifier model

python scripts/train_classifier.py \
  --config configs/config.yaml \
  --data data/annotations/synthetic_notes_500.json

Saves best checkpoint to results/models/classifier/best_model.pt

Inference

Single note:

python scripts/predict.py \
  --config configs/config.yaml \
  --text "Patient performed active ROM to right ankle, 3 sets x 10 reps."

From file (one note per line):

python scripts/predict.py \
  --config configs/config.yaml \
  --input_file data/raw/notes.txt \
  --output results/predictions.json

Demo (built-in examples):

python scripts/predict.py --config configs/config.yaml

Results

Trained on 500 synthetic notes (350 train / 75 val / 75 test), RTX 4060 Laptop GPU.

| Model | Metric | Score | Notes |
|---|---|---|---|
| Rule-based (paper baseline) | Macro F1 | 0.891 | On real clinical notes |
| Vanilla BERT (paper baseline) | Macro F1 | 0.050 | Failed due to domain gap |
| BioBERT NER (ours) | NER F1 | 0.987 | Synthetic data only |
| BioBERT Classifier (ours) | Macro F1 | 0.310 | Synthetic data, imbalanced |

Note: Synthetic data results are not comparable to the paper's real-data results. Final evaluation pending real annotated clinical notes.
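
Macro F1 is sensitive to label imbalance because every label counts equally, however rare it is. The sketch below shows this with toy confusion counts that are invented purely for illustration (they are not measurements from this project), and the label names other than body_side__right are hypothetical. The project computes its metrics with sklearn; plain Python is used here to keep the arithmetic visible.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-label F1 scores."""
    scores = [f1(*c) for c in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts):
    """F1 over counts pooled across all labels."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# (tp, fp, fn) per label: one frequent label, two rare ones. Toy numbers.
counts = {
    "body_side__right": (60, 5, 5),
    "body_position__supine": (0, 0, 4),
    "body_part__wrist": (1, 0, 3),
}

macro = macro_f1(counts)
micro = micro_f1(counts)
print(f"macro F1 = {macro:.3f}")
print(f"micro F1 = {micro:.3f}")
```

The two rare labels pull the macro average well below the micro average even though most individual predictions are correct.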

Limitations & Next Steps

Real data needed — current results are on synthetic data with matched train/test distribution; real clinical notes will be harder
Classifier imbalance — body_part and body_position categories underrepresented in synthetic data
Comparison pending — fair comparison with rule-based baseline requires the same test set

Reference

Sivarajkumar S, et al. Mining Clinical Notes for Physical Rehabilitation Exercise Information.
JMIR Med Inform 2024;12:e52289. doi:10.2196/52289
