DoubleSky123/BioBERT

BioBERT Rehabilitation Exercise Information Extraction

Replication and extension of PMC11024747 (Wang NAIL Lab, UPitt).
The original paper used rule-based NLP (F1=0.891) and found vanilla BERT failed (F1=0.05).
This project replaces the rule-based system with a BioBERT two-stage pipeline.


What This System Does

Takes a free-text therapy note as input and outputs structured exercise information:

Input:
  "Patient performed passive ROM to right shoulder bilaterally, 3 sets x 10 reps."

Output:
  Entities:    MOTION_TYPE  → "passive ROM"
               BODY_PART    → "shoulder"
               BODY_SIDE    → "right", "bilaterally"
               SETS_REPS    → "3 sets x 10 reps"

  Attributes:  exercise_status → performed_in_office
               motion_type     → PROM
               body_side       → bilateral

Architecture: Two-Stage Pipeline

Raw therapy note
      │
      ▼
┌─────────────┐
│  Stage 1    │  BioBERT + CRF
│  NER Model  │  Token classification → finds entity spans in text
└─────────────┘
      │
      ▼
┌─────────────┐
│  Stage 2    │  BioBERT + Multi-label head
│  Classifier │  Sentence classification → assigns attribute categories
└─────────────┘
      │
      ▼
Structured JSON output

Why two stages?

  • NER answers: where in the text is the information? (e.g., "right shoulder" at characters 33-47 of the example note above, end offset exclusive)
  • Classifier answers: what category does this sentence belong to? (e.g., exercise_status=performed_in_office)
  • Together they mirror the 9-category ontology from the original paper

Project Structure

BioBERT/
├── configs/
│   └── config.yaml              # All hyperparameters and label definitions
│
├── data/
│   ├── annotations/             # JSON datasets (input format for training)
│   │   ├── sample_notes.json    # Small sample (5 notes)
│   │   ├── synthetic_notes.json # Generated synthetic data
│   │   └── synthetic_notes_500.json  # 500-note training set (used for fine-tuning)
│   ├── processed/               # Preprocessed/tokenized data (auto-generated)
│   └── raw/                     # Original source files
│
├── src/                         # Core library (importable modules)
│   ├── data/
│   │   └── dataset.py           # PyTorch Dataset classes
│   ├── models/
│   │   ├── biobert_ner.py       # BioBERT + CRF model (Stage 1)
│   │   └── biobert_classifier.py # BioBERT + multi-label head (Stage 2)
│   ├── training/
│   │   └── trainer.py           # Training loop (shared by both models)
│   └── utils/
│       ├── preprocessing.py     # Text cleaning, abbreviation expansion, data split
│       └── metrics.py           # seqeval NER metrics + sklearn multi-label metrics
│
├── scripts/                     # Runnable entry points
│   ├── train_ner.py             # Fine-tune the NER model
│   ├── train_classifier.py      # Fine-tune the classifier model
│   ├── predict.py               # End-to-end inference (NER → Classifier)
│   ├── generate_synthetic_data.py  # Generate synthetic training data
│   ├── run_demo.py              # Interactive demo
│   └── visualize.py             # Plot training curves
│
├── results/
│   ├── models/
│   │   ├── ner/best_model.pt        # Saved NER checkpoint
│   │   └── classifier/best_model.pt # Saved Classifier checkpoint
│   └── logs/                    # TensorBoard training logs
│
├── notebooks/                   # Jupyter notebooks for exploration
├── requirements.txt
└── venv/                        # Virtual environment (not committed)

Data Format

Every JSON file follows this schema:

{
  "id": "note_001",
  "text": "Patient performed AROM to right ankle, 3 sets x 10 reps.",
  "language": "en",
  "ner_labels": [
    {"start": 18, "end": 22, "text": "AROM", "label": "MOTION_TYPE"},
    {"start": 26, "end": 31, "text": "right", "label": "BODY_SIDE"},
    {"start": 32, "end": 37, "text": "ankle", "label": "BODY_PART"},
    {"start": 39, "end": 55, "text": "3 sets x 10 reps", "label": "SETS_REPS"}
  ],
  "attributes": {
    "exercise_status": "performed_in_office",
    "motion_type": "AROM",
    "body_part": "ankle",
    "body_side": "right",
    "body_position": null,
    "exercise_type": "strengthening"
  }
}
  • ner_labels → used by Stage 1 (NER), character-level spans
  • attributes → used by Stage 2 (Classifier), sentence-level categories
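
A quick way to catch off-by-one offsets in ner_labels is to slice the note text with each span and compare the result to the span's "text" field. The sketch below assumes end offsets are exclusive, as in the first three spans of the schema example. The helpers make_span and find_span_mismatches are illustrative names, not functions from this repository.

```python
def make_span(text, phrase, label):
    """Build a span dict for the first occurrence of `phrase`, with an exclusive end offset."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "text": phrase, "label": label}


def find_span_mismatches(note):
    """Return the spans whose [start, end) slice of the note text differs from their "text" field."""
    return [
        span
        for span in note["ner_labels"]
        if note["text"][span["start"]:span["end"]] != span["text"]
    ]


text = "Patient performed AROM to right ankle, 3 sets x 10 reps."
note = {
    "id": "note_001",
    "text": text,
    "ner_labels": [
        make_span(text, "AROM", "MOTION_TYPE"),
        make_span(text, "right", "BODY_SIDE"),
        make_span(text, "ankle", "BODY_PART"),
        make_span(text, "3 sets x 10 reps", "SETS_REPS"),
    ],
}

for span in note["ner_labels"]:
    print(span["start"], span["end"], span["label"])
print("mismatches:", find_span_mismatches(note))
```

Running a check like this over every annotation file before training is cheap, and a misaligned span would otherwise silently produce wrong BIO labels.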

Code Walkthrough

1. configs/config.yaml

Single source of truth for everything: model backbone path, NER label list, classification categories, training hyperparameters, file paths. Both training scripts read from this file.

2. src/data/dataset.py

Two PyTorch Dataset classes:

  • RehabNERDataset: tokenizes text with the BioBERT tokenizer and uses align_labels_to_tokens() to convert character-level spans into BIO token labels. Handles subword tokenization (e.g., "shoulder" → ["sh", "##ould", "##er"], where only the first subword gets the B- label).

  • RehabClassifierDataset: tokenizes text and converts the attributes dict into a binary vector (multi-hot encoding). For example, if body_side=right, the entry at the index of body_side__right is set to 1.
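
The span-to-BIO conversion can be sketched without loading BioBERT. The example below is a simplified stand-in, not the repository's align_labels_to_tokens(): it uses a regex tokenizer that yields character offsets, whereas with a Hugging Face fast tokenizer the offsets would typically come from return_offsets_mapping=True. The function names and the rule "a token belongs to a span if it lies fully inside it" are assumptions made for illustration.

```python
import re


def tokenize_with_offsets(text):
    """Stand-in tokenizer: (token, start, end) for each word or punctuation mark."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def spans_to_bio(tokens, spans):
    """Assign one BIO tag per token from character spans with exclusive end offsets."""
    tags = ["O"] * len(tokens)
    for span in spans:
        inside = False  # becomes True once the first token of this span is tagged
        for i, (_, start, end) in enumerate(tokens):
            if start >= span["start"] and end <= span["end"]:
                tags[i] = ("I-" if inside else "B-") + span["label"]
                inside = True
    return tags


text = "Patient performed AROM to right ankle, 3 sets x 10 reps."
spans = [
    {"start": 18, "end": 22, "label": "MOTION_TYPE"},
    {"start": 26, "end": 31, "label": "BODY_SIDE"},
    {"start": 32, "end": 37, "label": "BODY_PART"},
    {"start": 39, "end": 39 + len("3 sets x 10 reps"), "label": "SETS_REPS"},
]

tokens = tokenize_with_offsets(text)
tags = spans_to_bio(tokens, spans)
for (token, _, _), tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```

The same rule carries over to subwords: the first piece inside a span receives B-, and every later piece inside the same span receives I-.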

3. src/models/biobert_ner.py

BioBERT encoder → hidden states (batch, seq_len, 768)
      ↓
Linear projection → emission scores (batch, seq_len, num_labels)
      ↓
CRF layer → enforces valid BIO tag sequences (e.g., I- cannot follow O)
      ↓
Output: predicted tag sequence per token

The CRF is the key difference from a plain softmax classifier: it learns transition scores between tags and decodes the best-scoring tag sequence as a whole, which steers the model away from invalid sequences such as O → I-BODY_PART.
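
To make the effect of transition scores concrete, here is a minimal pure-Python Viterbi decoder. It is a sketch under stated assumptions, not the pytorch-crf implementation: the emission and transition scores are hand-set rather than learned, and the large negative constant NEG stands in for a strongly penalized O → I- transition.

```python
NEG = -1e4  # large penalty standing in for a strongly discouraged transition

TAGS = ["O", "B-BODY_PART", "I-BODY_PART"]


def viterbi(emissions, transitions, start_scores):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   score of tag j at position t
    transitions[i][j] score of moving from tag i to tag j
    start_scores[j]   score of starting in tag j
    """
    n_tags = len(start_scores)
    score = [start_scores[j] + emissions[0][j] for j in range(n_tags)]
    backptrs = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptrs.append(best_i)
        score = new_score
        backptrs.append(ptrs)
    # Follow back-pointers from the best final tag to recover the full path.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]


# Hypothetical emission scores for three tokens; columns follow TAGS.
emissions = [
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 1.5],  # per-token argmax would pick I- directly after O
    [0.5, 0.0, 1.0],
]
transitions = [[0.0, 0.0, NEG], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # O -> I- penalized
start_scores = [0.0, 0.0, NEG]  # a sequence should not start with I-

greedy = [TAGS[max(range(3), key=lambda j: row[j])] for row in emissions]
decoded = [TAGS[j] for j in viterbi(emissions, transitions, start_scores)]
print("greedy :", greedy)
print("viterbi:", decoded)
```

Per-token argmax yields O, I-BODY_PART, I-BODY_PART, while sequence-level decoding yields O, B-BODY_PART, I-BODY_PART. A trained CRF reaches the same behavior only to the extent that its learned transition scores penalize the invalid move.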

4. src/models/biobert_classifier.py

BioBERT encoder → [CLS] token representation (batch, 768)
      ↓
Dropout → Linear → sigmoid (not softmax)
      ↓
Output: probability per label (independent, multi-label)

Uses sigmoid (not softmax) because multiple labels can be true simultaneously (e.g., a note can have both body_side=right AND exercise_type=strengthening).
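
The difference between the two activations shows up on a toy example. The sketch below uses only the standard library; the label list, the logits, and the 0.5 decision threshold are assumptions for illustration. Only the category__value naming pattern (e.g., body_side__right) comes from this README.

```python
import math

LABELS = [
    "body_side__left",
    "body_side__right",
    "exercise_type__strengthening",
    "exercise_type__stretching",
]


def multi_hot(attributes, labels):
    """Encode an attributes dict as a 0/1 vector over "category__value" label names."""
    active = {f"{key}__{value}" for key, value in attributes.items() if value is not None}
    return [1 if name in active else 0 for name in labels]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


target = multi_hot(
    {"body_side": "right", "exercise_type": "strengthening", "body_position": None},
    LABELS,
)
logits = [-3.0, 2.5, 2.0, -2.0]  # hypothetical classifier outputs

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)
predicted = [1 if p >= 0.5 else 0 for p in sigmoid_probs]

print("target   :", target)
print("predicted:", predicted)
print("sigmoid  :", [round(p, 3) for p in sigmoid_probs])
print("softmax  :", [round(p, 3) for p in softmax_probs])
```

With sigmoid, both body_side__right and exercise_type__strengthening clear the threshold independently. With softmax the probabilities compete for a total of 1, so the second label is pushed below 0.5 even though its logit is strongly positive.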

5. src/training/trainer.py

Generic trainer shared by both models. Handles:

  • AdamW optimizer with linear warmup + decay
  • Gradient clipping
  • Evaluation every N steps
  • Early stopping (saves best checkpoint by primary metric)
  • TensorBoard logging
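
Two of the items above can be sketched in plain Python: the warmup-then-decay learning-rate multiplier and the early-stopping bookkeeping. This is an illustration, not the code in trainer.py. The function and class names, the patience parameter, and the assumption that a higher metric is better are all hypothetical. The schedule has the same shape as the one Hugging Face Transformers provides as get_linear_schedule_with_warmup, though whether the trainer uses that helper is not stated here.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: rises 0 -> 1 during warmup, then falls linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


class EarlyStopping:
    """Track the best validation metric; stop after `patience` evaluations without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def update(self, metric):
        """Return True when `metric` is a new best, i.e. when a checkpoint should be saved."""
        if metric > self.best:
            self.best = metric
            self.bad_evals = 0
            return True
        self.bad_evals += 1
        return False

    @property
    def should_stop(self):
        return self.bad_evals >= self.patience


for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_decay(step, warmup_steps=10, total_steps=100))

stopper = EarlyStopping(patience=2)
for f1 in (0.50, 0.60, 0.55, 0.58):  # hypothetical validation scores
    saved = stopper.update(f1)
    print(f1, "saved" if saved else "no improvement", "stop" if stopper.should_stop else "")
```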

6. src/utils/preprocessing.py

  • preprocess() — expands medical abbreviations before tokenization (ROM→range of motion, AROM→active range of motion, HEP→home exercise program, etc.)
  • align_labels_to_tokens() — maps character-level entity spans to subword token positions
  • split_dataset() — 70/15/15 train/val/test split
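
A minimal sketch of the expansion and split steps follows. It is not the repository's implementation: the abbreviation table contains only the three expansions named above, and the seed parameter and the integer-arithmetic split are assumptions.

```python
import random
import re

ABBREVIATIONS = {  # only the expansions named in this README
    "AROM": "active range of motion",
    "ROM": "range of motion",
    "HEP": "home exercise program",
}
# Longest keys first, with word boundaries so "ROM" never fires inside "AROM".
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(ABBREVIATIONS, key=len, reverse=True)) + r")\b"
)


def expand_abbreviations(text):
    """Replace each known abbreviation with its long form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def split_70_15_15(notes, seed=42):
    """Shuffle deterministically, then split 70/15/15 into train/val/test."""
    notes = list(notes)
    random.Random(seed).shuffle(notes)
    n_train = len(notes) * 70 // 100
    n_val = len(notes) * 15 // 100
    return notes[:n_train], notes[n_train:n_train + n_val], notes[n_train + n_val:]


print(expand_abbreviations("Patient performed AROM to right ankle; continue HEP."))

train, val, test = split_70_15_15(range(500))
print(len(train), len(val), len(test))
```

One design point to keep in mind: expansion changes the length of the text, so character-level spans have to be annotated on the expanded text or remapped after expansion. How the repository handles this is not described here.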

Setup

# Create and activate virtual environment
python -m venv venv
venv\Scripts\activate       # Windows
source venv/bin/activate    # Mac/Linux

# Install dependencies
pip install -r requirements.txt
pip install pytorch-crf

Training

Step 1 — Fine-tune NER model

python scripts/train_ner.py \
  --config configs/config.yaml \
  --data data/annotations/synthetic_notes_500.json

Saves best checkpoint to results/models/ner/best_model.pt

Step 2 — Fine-tune Classifier model

python scripts/train_classifier.py \
  --config configs/config.yaml \
  --data data/annotations/synthetic_notes_500.json

Saves best checkpoint to results/models/classifier/best_model.pt


Inference

Single note:

python scripts/predict.py \
  --config configs/config.yaml \
  --text "Patient performed active ROM to right ankle, 3 sets x 10 reps."

From file (one note per line):

python scripts/predict.py \
  --config configs/config.yaml \
  --input_file data/raw/notes.txt \
  --output results/predictions.json

Demo (built-in examples):

python scripts/predict.py --config configs/config.yaml

Results

Trained on 500 synthetic notes (350 train / 75 val / 75 test) on an RTX 4060 Laptop GPU.

| Model                         | Metric   | Score | Notes                      |
|-------------------------------|----------|-------|----------------------------|
| Rule-based (paper baseline)   | Macro F1 | 0.891 | On real clinical notes     |
| Vanilla BERT (paper baseline) | Macro F1 | 0.050 | Failed due to domain gap   |
| BioBERT NER (ours)            | NER F1   | 0.987 | Synthetic data only        |
| BioBERT Classifier (ours)     | Macro F1 | 0.310 | Synthetic data, imbalanced |

Note: Synthetic data results are not comparable to the paper's real-data results. Final evaluation pending real annotated clinical notes.
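
Macro F1 averages per-label F1 scores with equal weight, so a few rare labels that the model never predicts can pull the average down sharply. The toy calculation below illustrates the mechanism. The counts and the label body_position__supine are hypothetical and are not results from this project.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical (tp, fp, fn) counts per label.
counts = {
    "body_side__right": (40, 2, 3),
    "exercise_type__strengthening": (35, 5, 4),
    "body_position__supine": (0, 0, 3),  # rare label, never predicted
}

per_label = {name: f1(*c) for name, c in counts.items()}
macro_f1 = sum(per_label.values()) / len(per_label)

tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

for name, score in per_label.items():
    print(f"{name:32s} {score:.3f}")
print(f"macro F1 {macro_f1:.3f}   micro F1 {micro_f1:.3f}")
```

Reporting per-label scores alongside the macro average makes it easier to see whether a low number reflects a few starved categories or a broadly weak classifier.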


Limitations & Next Steps

  1. Real data needed: current results are on synthetic data whose train and test sets share the same distribution; real clinical notes will be harder
  2. Classifier imbalance: the body_part and body_position categories are underrepresented in the synthetic data
  3. Comparison pending: a fair comparison with the rule-based baseline requires evaluating both on the same test set

Reference

Sivarajkumar S, et al. Mining Clinical Notes for Physical Rehabilitation Exercise Information.
JMIR Med Inform 2024;12:e52289. doi:10.2196/52289
