Replication and extension of PMC11024747 (Wang NAIL Lab, UPitt).
The original paper used rule-based NLP (F1=0.891) and found vanilla BERT failed (F1=0.05).
This project replaces the rule-based system with a BioBERT two-stage pipeline.
The pipeline takes a free-text therapy note as input and outputs structured exercise information:
Input:
"Patient performed passive ROM to right shoulder bilaterally, 3 sets x 10 reps."
Output:
Entities:   MOTION_TYPE → "passive ROM"
            BODY_PART   → "shoulder"
            BODY_SIDE   → "right", "bilaterally"
            SETS_REPS   → "3 sets x 10 reps"
Attributes: exercise_status → performed_in_office
            motion_type     → PROM
            body_side       → bilateral
Raw therapy note
       │
       ▼
┌─────────────┐
│   Stage 1   │  BioBERT + CRF
│  NER Model  │  Token classification → finds entity spans in text
└─────────────┘
       │
       ▼
┌─────────────┐
│   Stage 2   │  BioBERT + Multi-label head
│ Classifier  │  Sentence classification → assigns attribute categories
└─────────────┘
       │
       ▼
Structured JSON output
Why two stages?
- NER answers: where in the text is the information ("right shoulder" at chars 33-47 of the example note above)
- Classifier answers: what category does this sentence belong to (exercise_status=performed_in_office)
- Together they mirror the 9-category ontology from the original paper
BioBERT/
├── configs/
│   └── config.yaml # All hyperparameters and label definitions
│
├── data/
│   ├── annotations/ # JSON datasets (input format for training)
│   │   ├── sample_notes.json # Small sample (5 notes)
│   │   ├── synthetic_notes.json # Generated synthetic data
│   │   └── synthetic_notes_500.json # 500-note training set (used for fine-tuning)
│   ├── processed/ # Preprocessed/tokenized data (auto-generated)
│   └── raw/ # Original source files
│
├── src/ # Core library (importable modules)
│   ├── data/
│   │   └── dataset.py # PyTorch Dataset classes
│   ├── models/
│   │   ├── biobert_ner.py # BioBERT + CRF model (Stage 1)
│   │   └── biobert_classifier.py # BioBERT + multi-label head (Stage 2)
│   ├── training/
│   │   └── trainer.py # Training loop (shared by both models)
│   └── utils/
│       ├── preprocessing.py # Text cleaning, abbreviation expansion, data split
│       └── metrics.py # seqeval NER metrics + sklearn multi-label metrics
│
├── scripts/ # Runnable entry points
│   ├── train_ner.py # Fine-tune the NER model
│   ├── train_classifier.py # Fine-tune the classifier model
│   ├── predict.py # End-to-end inference (NER → Classifier)
│   ├── generate_synthetic_data.py # Generate synthetic training data
│   ├── run_demo.py # Interactive demo
│   └── visualize.py # Plot training curves
│
├── results/
│   ├── models/
│   │   ├── ner/best_model.pt # Saved NER checkpoint
│   │   └── classifier/best_model.pt # Saved Classifier checkpoint
│   └── logs/ # TensorBoard training logs
│
├── notebooks/ # Jupyter notebooks for exploration
├── requirements.txt
└── venv/ # Virtual environment (not committed)
Every JSON file follows this schema:
{
  "id": "note_001",
  "text": "Patient performed AROM to right ankle, 3 sets x 10 reps.",
  "language": "en",
  "ner_labels": [
    {"start": 18, "end": 22, "text": "AROM", "label": "MOTION_TYPE"},
    {"start": 26, "end": 31, "text": "right", "label": "BODY_SIDE"},
    {"start": 32, "end": 37, "text": "ankle", "label": "BODY_PART"},
    {"start": 39, "end": 55, "text": "3 sets x 10 reps", "label": "SETS_REPS"}
  ],
  "attributes": {
    "exercise_status": "performed_in_office",
    "motion_type": "AROM",
    "body_part": "ankle",
    "body_side": "right",
    "body_position": null,
    "exercise_type": "strengthening"
  }
}
- ner_labels → used by Stage 1 (NER), character-level spans (end offset is exclusive)
- attributes → used by Stage 2 (Classifier), sentence-level categories
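Since Stage 1 depends on the character offsets being exact, it can help to check that every span slices back to its own text before training. The following is a minimal sketch, not part of the repository: `find_span_mismatches` is a hypothetical helper, and it assumes the end offset is exclusive (as "AROM" at 18-22 suggests).

```python
def find_span_mismatches(record):
    """Return (label, expected_text, actual_slice) for every span that does not match."""
    text = record["text"]
    mismatches = []
    for span in record["ner_labels"]:
        # Assumption: "end" is exclusive, so text[start:end] should equal span["text"].
        actual = text[span["start"]:span["end"]]
        if actual != span["text"]:
            mismatches.append((span["label"], span["text"], actual))
    return mismatches


record = {
    "id": "note_001",
    "text": "Patient performed AROM to right ankle, 3 sets x 10 reps.",
    "ner_labels": [
        {"start": 18, "end": 22, "text": "AROM", "label": "MOTION_TYPE"},
        {"start": 26, "end": 31, "text": "right", "label": "BODY_SIDE"},
        {"start": 32, "end": 37, "text": "ankle", "label": "BODY_PART"},
        {"start": 39, "end": 55, "text": "3 sets x 10 reps", "label": "SETS_REPS"},
    ],
}

# An empty list means every span slices cleanly out of the note text.
print(find_span_mismatches(record))
```

Running a check like this over each file in data/annotations/ would surface off-by-one offsets (for example, a span that swallows the trailing period) before they turn into wrong BIO labels.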
configs/config.yaml is the single source of truth for the model backbone path, NER label list, classification categories, training hyperparameters, and file paths. Both training scripts read from this file.
src/data/dataset.py defines two PyTorch Dataset classes:
- RehabNERDataset: tokenizes text with the BioBERT tokenizer and uses align_labels_to_tokens() to convert character-level spans → BIO token labels. Handles subword tokenization (e.g., "shoulder" → ["sh", "##ould", "##er"]; only the first subword gets the B- label).
- RehabClassifierDataset: tokenizes text and converts the attributes dict → binary vector (multi-hot encoding), e.g., if body_side=right, it sets the index of body_side__right to 1.
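The span-to-BIO conversion can be sketched as follows. This is an illustration, not the repository's align_labels_to_tokens(): the function name `align_spans_to_bio` is hypothetical, and it assumes token offsets shaped like those a Hugging Face fast tokenizer returns with `return_offsets_mapping=True`, where special tokens carry an empty `(0, 0)` offset. It also assumes that subwords after the first one inside a span receive the I- tag.

```python
def align_spans_to_bio(offsets, spans):
    """Map character-level spans onto token offsets and return one BIO tag per token."""
    labels = ["O"] * len(offsets)
    for span in spans:
        first = True
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue  # empty offset: special token such as [CLS] or [SEP]
            if tok_start >= span["start"] and tok_end <= span["end"]:
                # First token inside the span gets B-, later ones get I- (assumption).
                labels[i] = ("B-" if first else "I-") + span["label"]
                first = False
    return labels


# Hand-written offsets for "Patient performed AROM to right ankle",
# pretending "AROM" and "ankle" were each split into two subwords.
offsets = [(0, 0), (0, 7), (8, 17), (18, 19), (19, 22), (23, 25),
           (26, 31), (32, 34), (34, 37), (0, 0)]
spans = [
    {"start": 18, "end": 22, "label": "MOTION_TYPE"},
    {"start": 26, "end": 31, "label": "BODY_SIDE"},
    {"start": 32, "end": 37, "label": "BODY_PART"},
]
bio = align_spans_to_bio(offsets, spans)
print(bio)
```

A common variant assigns an ignore index to non-first subwords instead of I-, so that only one prediction per word contributes to the loss; which convention this project uses is not stated here.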
Stage 1 NER model (src/models/biobert_ner.py):
BioBERT encoder → hidden states (batch, seq_len, 768)
        ↓
Linear projection → emission scores (batch, seq_len, num_labels)
        ↓
CRF layer → enforces valid BIO tag sequences (e.g., I- cannot follow O)
        ↓
Output: predicted tag sequence per token
The CRF is the key difference from a plain softmax classifier: it learns transition scores between tags and decodes the best-scoring whole sequence, so invalid sequences like O → I-BODY_PART are heavily penalized rather than predicted token by token.
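To see why transition scores matter, here is a small pure-Python sketch that compares per-token argmax with Viterbi decoding over emissions plus transitions. The scores are hand-set for illustration (a trained CRF layer would learn them), and the function names are hypothetical.

```python
TAGS = ["O", "B-BODY_PART", "I-BODY_PART"]


def greedy_decode(emissions):
    """Pick the best tag for each token independently (plain softmax-style decoding)."""
    return [TAGS[max(range(len(TAGS)), key=lambda j: row[j])] for row in emissions]


def viterbi_decode(emissions, transitions):
    """Find the tag sequence with the highest total emission + transition score."""
    n_tags = len(TAGS)
    scores = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_tags):
            best_prev = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + row[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    best = max(range(n_tags), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return [TAGS[j] for j in reversed(path)]


# Illustrative emission scores for three tokens (columns follow TAGS order).
emissions = [
    [2.0, 0.0, 0.0],
    [0.0, 0.9, 1.0],  # the I- tag narrowly wins on its own
    [0.0, 0.0, 1.5],
]
# transitions[i][j] scores moving from tag i to tag j; only O -> I- is penalized.
transitions = [
    [0.0, 0.0, -10.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

print(greedy_decode(emissions))                 # per-token argmax
print(viterbi_decode(emissions, transitions))   # sequence-level decoding
```

With these numbers the greedy path contains the invalid O → I-BODY_PART step, while Viterbi switches the middle token to B-BODY_PART because the penalty outweighs the small emission advantage.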
Stage 2 classifier (src/models/biobert_classifier.py):
BioBERT encoder → [CLS] token representation (batch, 768)
        ↓
Dropout → Linear → sigmoid (not softmax)
        ↓
Output: probability per label (independent, multi-label)
The classifier uses sigmoid rather than softmax because multiple labels can be true simultaneously (e.g., a note can have both body_side=right and exercise_type=strengthening).
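The difference is easy to see on a toy logit vector. This sketch uses only the standard library; the logits are made up, and the label name `exercise_type__strengthening` is an assumption that extends the `body_side__right` naming pattern mentioned earlier.

```python
import math

LABELS = ["body_side__right", "body_side__left", "exercise_type__strengthening"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def labels_above(probs, threshold=0.5):
    """Return the label names whose probability reaches the threshold."""
    return [label for label, p in zip(LABELS, probs) if p >= threshold]


logits = [2.0, -3.0, 1.5]  # illustrative classifier outputs for one note

sigmoid_labels = labels_above([sigmoid(x) for x in logits])
softmax_labels = labels_above(softmax(logits))

print(sigmoid_labels)  # independent probabilities: two labels pass 0.5
print(softmax_labels)  # probabilities compete: only one label passes 0.5
```

Because softmax forces the probabilities to sum to 1, a second correct label can only win by taking probability mass from the first; sigmoid scores each label on its own.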
src/training/trainer.py is a generic trainer shared by both models. It handles:
- AdamW optimizer with linear warmup + decay
- Gradient clipping
- Evaluation every N steps
- Early stopping (saves best checkpoint by primary metric)
- TensorBoard logging
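The "linear warmup + decay" schedule in the list above can be written as a plain function of the step count. This is a sketch of the general technique, not the trainer's actual code; the function name and the values for `base_lr`, `warmup_steps`, and `total_steps` are assumptions, not settings taken from config.yaml.

```python
def linear_warmup_decay(step, warmup_steps, total_steps, base_lr):
    """Learning rate that rises linearly to base_lr, then falls linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical settings: 100 warmup steps out of 1100, peak LR 2e-5.
for step in (0, 50, 100, 600, 1100):
    print(step, linear_warmup_decay(step, warmup_steps=100, total_steps=1100, base_lr=2e-5))
```

The warmup phase keeps early updates small while the freshly initialized head is still producing noisy gradients; the decay phase shrinks the step size as training converges.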
src/utils/preprocessing.py provides:
- preprocess(): expands medical abbreviations before tokenization (ROM → range of motion, AROM → active range of motion, HEP → home exercise program, etc.)
- align_labels_to_tokens(): maps character-level entity spans to subword token positions
- split_dataset(): 70/15/15 train/val/test split
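Abbreviation expansion and the 70/15/15 split can be sketched in a few lines. These are illustrative stand-ins, not the repository's preprocess() or split_dataset(): the names `expand_abbreviations` and `split_70_15_15`, the three-entry dictionary, and the seed are all assumptions.

```python
import random
import re

# Only the three abbreviations named in this README; the real list is longer.
ABBREVIATIONS = {
    "AROM": "active range of motion",
    "ROM": "range of motion",
    "HEP": "home exercise program",
}
# Word boundaries keep "ROM" from matching inside "AROM" or "PROM".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def expand_abbreviations(text):
    """Replace each known abbreviation with its long form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def split_70_15_15(items, seed=42):
    """Shuffle with a fixed seed and cut into train/val/test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * 0.70)
    n_val = round(len(shuffled) * 0.15)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


print(expand_abbreviations("AROM to right ankle, HEP reviewed"))
train, val, test = split_70_15_15(range(500))
print(len(train), len(val), len(test))
```

One thing to watch: expansion changes string length, so character offsets in ner_labels are only valid for the version of the text they were annotated on. Whether this project annotates before or after expansion is not stated here.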
# Create and activate virtual environment
python -m venv venv
venv\Scripts\activate # Windows
source venv/bin/activate # Mac/Linux
# Install dependencies
pip install -r requirements.txt
pip install pytorch-crf
Step 1: Fine-tune NER model
python scripts/train_ner.py \
--config configs/config.yaml \
--data data/annotations/synthetic_notes_500.json
Saves best checkpoint to results/models/ner/best_model.pt
Step 2: Fine-tune Classifier model
python scripts/train_classifier.py \
--config configs/config.yaml \
--data data/annotations/synthetic_notes_500.json
Saves best checkpoint to results/models/classifier/best_model.pt
Single note:
python scripts/predict.py \
--config configs/config.yaml \
--text "Patient performed active ROM to right ankle, 3 sets x 10 reps."
From file (one note per line):
python scripts/predict.py \
--config configs/config.yaml \
--input_file data/raw/notes.txt \
--output results/predictions.json
Demo (built-in examples):
python scripts/predict.py --config configs/config.yaml
Trained on 500 synthetic notes (350 train / 75 val / 75 test), RTX 4060 Laptop GPU.
| Model | Metric | Score | Notes |
|---|---|---|---|
| Rule-based (paper baseline) | Macro F1 | 0.891 | On real clinical notes |
| Vanilla BERT (paper baseline) | Macro F1 | 0.050 | Failed due to domain gap |
| BioBERT NER (ours) | NER F1 | 0.987 | Synthetic data only |
| BioBERT Classifier (ours) | Macro F1 | 0.310 | Synthetic data, imbalanced |
Note: Synthetic data results are not comparable to the paper's real-data results. Final evaluation pending real annotated clinical notes.
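A low macro F1 on imbalanced labels is easier to interpret next to micro F1. The sketch below uses invented confusion counts purely for illustration (they are not this project's results), and `body_position__example` is a placeholder label name.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when the label is never predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy per-label counts: one frequent label, two rare ones. Values are made up.
counts = {
    "body_side__right":       {"tp": 90, "fp": 5, "fn": 5},
    "body_part__ankle":       {"tp": 0,  "fp": 0, "fn": 4},
    "body_position__example": {"tp": 1,  "fp": 0, "fn": 3},
}

# Macro F1: average the per-label scores, so every label counts equally.
macro_f1 = sum(f1(**c) for c in counts.values()) / len(counts)

# Micro F1: pool the counts first, so frequent labels dominate.
total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
micro_f1 = f1(**total)

print(round(macro_f1, 3), round(micro_f1, 3))
```

In this toy case the frequent label is predicted well, yet the two rare labels pull macro F1 far below micro F1, which is the pattern to expect when some categories are underrepresented in training data.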
- Real data needed: current results are on synthetic data with matched train/test distribution; real clinical notes will be harder
- Classifier imbalance: body_part and body_position categories are underrepresented in synthetic data
- Comparison pending: a fair comparison with the rule-based baseline requires the same test set
Sivarajkumar S, et al. Mining Clinical Notes for Physical Rehabilitation Exercise Information.
JMIR Med Inform 2024;12:e52289. doi:10.2196/52289