An interactive lesson in Goodhart's Law & RLHF
You'll rate pairs of AI responses: just pick the better one. Your preferences will train a reward model. Then you'll see what happens when an AI optimizes for that model's predictions of your ratings.