(translation of the original telegram post here)
So, Operation Pangolin took first place in the blind run on the Accuracy Leaderboard (tied with codex-on-rails).
What is under the hood? It is not so much a chatbot agent as a compact programmable analyst with a strict checklist and a REPL loop.
The core is written in TypeScript. It calls Anthropic Claude (Sonnet for debugging, Opus for the competition). Notably, the LLM is not given a large set of tools, only a single one: execute_code. In other words, the LLM generates Python code, and that code gets access to the runtime tools through the Workspace class, as well as to memory (the scratchpad) and a dictionary of variables. The execution results are passed back to Claude. The cycle repeats until the code produces an answer through ws.answer(scratchpad, verify) and that answer passes the built-in verification.
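To make the loop concrete, here is a minimal sketch of the idea in Python. It is not the actual implementation: the real core is TypeScript and talks to Claude, while here the model is replaced by a scripted stub. Only execute_code, Workspace, ws.answer, scratchpad and verify come from the post. Everything else (run_agent, scripted_model, final_answer, verify being a callable, the answer living under scratchpad["answer"]) is an assumption made for illustration.

```python
import contextlib
import io


class Workspace:
    """Hypothetical stand-in for the Workspace class named in the post."""

    def __init__(self):
        self.final_answer = None

    def answer(self, scratchpad, verify):
        # Assumption: verify is a callable check; the real signature may differ.
        candidate = scratchpad.get("answer")
        if candidate is None or not verify(candidate):
            raise ValueError("verification failed")
        self.final_answer = candidate


def execute_code(code, ws, scratchpad, variables):
    """The single tool: run model-written Python and return what it printed."""
    env = {"ws": ws, "scratchpad": scratchpad, "variables": variables}
    out = io.StringIO()
    try:
        with contextlib.redirect_stdout(out):
            exec(code, env)  # no sandboxing here; a real system would need it
    except Exception as exc:
        # Errors become part of the observation so the model can react to them.
        out.write(f"ERROR: {exc!r}")
    return out.getvalue()


def run_agent(next_code, max_steps=10):
    """Generate code, execute it, feed the output back, until an answer verifies."""
    ws, scratchpad, variables = Workspace(), {}, {}
    observation = ""
    for _ in range(max_steps):
        code = next_code(observation)  # in the real system this is a Claude call
        observation = execute_code(code, ws, scratchpad, variables)
        if ws.final_answer is not None:
            return ws.final_answer
    return None


def scripted_model():
    """Stub that replays two fixed steps instead of calling an LLM."""
    steps = iter([
        "variables['total'] = sum([19, 23])\nprint(variables['total'])",
        "scratchpad['answer'] = variables['total']\n"
        "ws.answer(scratchpad, lambda a: isinstance(a, int))",
    ])
    return lambda observation: next(steps)


result = run_agent(scripted_model())
```

In this sketch the scratchpad and the variables dictionary persist between steps, so the second snippet can reuse what the first one computed. A failed verification raises inside the executed code and comes back to the model as ordinary output, so the loop simply continues.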
The solution works very