We use information theory to analyze and improve chain-of-thought (CoT) monitorability, proposing training methods that improve monitor accuracy while preventing CoT degeneration.
We replace natural language "thinking" with structured tool interactions, enabling even 3B-parameter models to learn effective test-time compute scaling.