reward shaping → formatting (tag reqs • math • code) → correctness (math • code) → length penalty (yu et al., 2025) → language consistency. shall we make this for @PrimeIntellect Environment Hub? 😈