turns out LLMs are great at single-shot q&a but get lost when the task becomes a back-and-forth. researchers sharded a full coding spec into per-turn hints; models answered before they had the whole picture, clung to early bad assumptions, and produced longer yet worse code with each turn.
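
the sharded setup is roughly this loop (a minimal sketch; `run_sharded_eval`, `chat_fn`, and `check_fn` are illustrative names, not from the paper):

```python
# hypothetical sketch of a sharded multi-turn eval: the full task spec is
# split into shards, revealed one per turn, and the model's latest answer
# is re-checked after every turn.

def run_sharded_eval(shards, chat_fn, check_fn):
    """Reveal one shard per turn; record whether each turn's answer passes."""
    history = []
    results = []
    for shard in shards:
        history.append({"role": "user", "content": shard})
        answer = chat_fn(history)
        history.append({"role": "assistant", "content": answer})
        results.append(check_fn(answer))
    return results

# toy stand-in model: just echoes everything it has heard so far
def toy_model(history):
    return " ".join(m["content"] for m in history if m["role"] == "user")

shards = ["write a function", "it should sort a list", "sort descending"]
passes = run_sharded_eval(
    shards,
    toy_model,
    check_fn=lambda a: "descending" in a,  # full spec only lands on the last turn
)
print(passes)  # → [False, False, True]
```

the point the setup makes: any turn where the model commits to an answer early gets graded against the *complete* spec, which it hasn't seen yet.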