Description
Thanks for sharing the great code!
In your armo-rm stage 2 code (https://github.com/RLHFlow/RLHF-Reward-Modeling/blob/main/armo-rm/stage-2_train.py), line 80, you apply a softmax to the output over dim 1, as shown below:
```python
def forward(self, x: torch.FloatTensor) -> torch.FloatTensor:
    # Apply the linear layers with ReLU and dropout
    for i, layer in enumerate(self.layers):
        x = layer(x)
        if i < len(self.layers) - 1:
            x = F.relu(x)
            if self.dropout_prob > 0:
                x = F.dropout(x, p=self.dropout_prob, training=self.training)
    # Apply softmax with temperature scaling
    x = F.softmax(x / self.temperature, dim=1)  # This line
    return x * self.logit_scale[0]
```
But in practice, the `x` here has 3 dimensions with shape (batch_size, 2, 19), where the 2 corresponds to the chosen & rejected responses. So I wonder whether `dim=1` should be changed to `dim=-1`, so that the softmax is applied over the last dimension (the 19 gating weights) during training?
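A minimal sketch of what I mean (a dummy tensor with the shape described above, not the actual training code):

```python
import torch
import torch.nn.functional as F

# Dummy gating output shaped (batch_size, 2, 19):
# 2 = chosen/rejected pair, 19 = reward objectives.
x = torch.randn(4, 2, 19)

# softmax over dim=1 normalizes across the chosen/rejected pair,
# so the 19 weights of a single response do not sum to 1.
w_dim1 = F.softmax(x, dim=1)
print(w_dim1.sum(dim=-1))  # generally != 1 per response
print(w_dim1.sum(dim=1))   # == 1, but across chosen vs. rejected

# softmax over dim=-1 normalizes the 19 gating weights per response,
# which is what I would expect the gating network to do.
w_last = F.softmax(x, dim=-1)
print(w_last.sum(dim=-1))  # == 1 for each (sample, response)
```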
By the way, did you use this code to train RLHFlow/ArmoRM-Llama3-8B-v0.1? When I run your code as-is (i.e., with softmax over `dim=1`), almost all 19 weights look wrong: they do not sum to 1 and are nearly identical to each other, and I cannot reproduce the high reward scores shown on the RewardBench leaderboard.