Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Topics: reinforcement-learning, transformers, transformer, safety, llama, gpt, datasets, beaver, alpaca, ai-safety, safe-reinforcement-learning, vicuna, deepspeed, large-language-models, llm, llms, rlhf, reinforcement-learning-from-human-feedback, safe-rlhf, safe-reinforcement-learning-from-human-feedback
Updated Aug 18, 2025 · Python