2 min read · from Machine Learning
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]
![Training curve: average rollout length](/_next/image?url=https%3A%2F%2Fpreview.redd.it%2F7nrsulwdkbvg1.png%3Fwidth%3D140%26height%3D69%26auto%3Dwebp%26s%3D7c61d2f68d6b094614b5dff0cb9347873885e226&w=3840&q=75)
So, yesterday's run was a success: I got an average rollout length of about 64 tokens, as shown in the attached image! This was with quality_reward + length_penalty (more info below). Next, I'll run with the length penalty alone as the reward, with the mistake of counting characters as tokens fixed, and see whether there is any reward gaming or degraded output. I used two rewards:
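A minimal sketch of the two-part reward described above, assuming the quality reward is a ROUGE-L-style LCS overlap (as the tags suggest) and the length penalty is linear over the 64-token budget. All function names, the penalty shape, and the target length are assumptions, not the author's code; the key point is that lengths are counted in tokens, fixing the characters-as-tokens mistake mentioned in the post.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(summary_tokens, reference_tokens):
    """ROUGE-L F1: LCS-based precision/recall between summary and reference."""
    lcs = lcs_len(summary_tokens, reference_tokens)
    if lcs == 0:
        return 0.0
    p = lcs / len(summary_tokens)
    r = lcs / len(reference_tokens)
    return 2 * p * r / (p + r)

def length_penalty(summary_tokens, target_len=64):
    """Linear penalty for exceeding the token budget (0 when under budget)."""
    overshoot = max(0, len(summary_tokens) - target_len)
    return -overshoot / target_len

def reward(summary_tokens, reference_tokens, target_len=64):
    """Hypothetical combined reward: quality term plus length penalty."""
    return rouge_l_f1(summary_tokens, reference_tokens) + length_penalty(summary_tokens, target_len)
```

In practice the token lists would come from the model's own tokenizer (e.g. the Qwen2.5 tokenizer), so the penalty matches what the policy actually generates.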
One node drives training with GRPO; two push rollouts via vLLM. I trained two variants.
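The core of the from-scratch GRPO step is the group-relative advantage: sample a group of rollouts per prompt, score each with the reward, and normalize within the group so no learned value function is needed. A minimal sketch, with function names assumed rather than taken from the author's code:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each rollout's reward standardized
    against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

These advantages then weight the per-token policy-gradient loss in PyTorch; rollouts better than their group average get pushed up, worse ones pushed down.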
Eval: LLM-as-a-Judge (gpt-5)
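A hedged sketch of how the LLM-as-a-Judge eval could be wired up. The three criteria (faithfulness, coverage, conciseness) come from the post's tags; the prompt wording, the 1-5 scale, and the JSON reply format are assumptions, and the judge reply would come from an actual gpt-5 API call not shown here.

```python
import json

# Hypothetical rubric prompt; the real judge prompt is not in the post.
JUDGE_PROMPT = """Rate the summary of the Reddit post on a 1-5 scale for:
- faithfulness: no claims absent from the post
- coverage: captures the post's main points
- conciseness: no filler
Post: {post}
Summary: {summary}
Reply with JSON: {{"faithfulness": n, "coverage": n, "conciseness": n}}"""

def score_summary(judge_reply: str) -> float:
    """Parse the judge's JSON reply and average the three rubric scores."""
    scores = json.loads(judge_reply)
    return (scores["faithfulness"] + scores["coverage"] + scores["conciseness"]) / 3
```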
Tagged with
#Qwen2.5-0.5B-Instruct
#bf16 model
#Reddit post summarization
#GRPO
#ROUGE-L
#length penalty
#quality reward
#PyTorch
#LCS
#rollout length
#DeepEval
#BLEU
#METEOR
#LLM-as-a-Judge
#faithfulness
#coverage
#conciseness