AI still struggles with basic spatial reasoning, especially when manipulating and transforming 3D knots, and this is where the gap between text-based prowess and real-world understanding becomes most evident. New 3D benchmarks reveal that while AI can untie simple knots fairly reliably, it stumbles when asked to tie knots from loops or to convert one knot into another. This gap matters because spatial manipulation is crucial for robotics and other AI-driven applications beyond language tasks.
In their study, Cornell researchers introduced KnotGym, a 3D simulator designed to probe how reinforcement learning models and large language models (LLMs) like GPT-4 handle spatial reasoning in a controlled environment. KnotGym doubles as a test of visual generalization, with a built-in ladder of increasingly difficult knots. The goal is to see how well AI performs beyond the data it was trained on, and how it copes with progressively harder spatial tasks.
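To make the setup concrete, here is a minimal, purely illustrative sketch of what a Gym-style knot-manipulation environment could look like. This is not the actual KnotGym API: the class names, the reduction of a knot's state to a crossing count, and the simple add/remove actions are all assumptions made for illustration; the real simulator operates on full 3D rope geometry.

```python
# Hypothetical sketch, NOT the real KnotGym interface. A toy
# environment where the knot state is reduced to a crossing count
# and actions add or remove one crossing at a time.
from dataclasses import dataclass

@dataclass
class KnotTask:
    kind: str             # "unknot", "knot", or "convert"
    start_crossings: int  # crossings in the initial configuration
    goal_crossings: int   # crossings in the target configuration

class ToyKnotEnv:
    """Toy Gym-style environment: reach the goal crossing count."""

    def __init__(self, task: KnotTask):
        self.task = task
        self.crossings = task.start_crossings

    def reset(self) -> int:
        self.crossings = self.task.start_crossings
        return self.crossings

    def step(self, action: int):
        # action: +1 adds a crossing, -1 removes one (floored at 0)
        self.crossings = max(0, self.crossings + action)
        done = self.crossings == self.task.goal_crossings
        reward = 1.0 if done else 0.0
        return self.crossings, reward, done

# Example episode: untie a three-crossing knot down to a plain loop.
env = ToyKnotEnv(KnotTask("unknot", start_crossings=3, goal_crossings=0))
obs = env.reset()
done = False
while not done:
    obs, reward, done = env.step(-1)
```

The point of the sketch is the task taxonomy, not the physics: untying, tying, and converting are all "reach a goal configuration" problems, differing only in start and goal states, which is what makes a single simulator suitable for all three.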
Lead author Zoe (Zizhao) Chen, a computer science PhD student at Cornell Tech, explains that current AI excels when processing large text blocks, but falters when asked to reason in three dimensions. Co-author Yoav Artzi, also from Cornell Tech, notes that while much of AI’s reasoning today is text-based—and valuable—it isn’t sufficient for spatial tasks that underlie robotics and physical manipulation.
In KnotGym, AI agents were shown simple loop shapes and a variety of knots and instructed to either unknot, knot, or convert one knot into another. The results showed a clear pattern: untying straightforward knots was achievable with high reliability—about 90% success for knots with up to four crossings, including a classic shoelace knot with three crossings.
By contrast, tying and knot conversions were noticeably harder. The success rate for tying two-crossing knots hovered around 83%, but dropped to 16% for knots with three crossings. For knots with more than three crossings, AI performance essentially collapsed. Converting knots yielded roughly the same trajectory as tying them, indicating a broader challenge with manipulating spatial configurations rather than just the act of tying itself.
Chen emphasizes that AI currently lacks the exploratory, play-based learning that humans naturally use. She compares AI learning to how a child explores a Rubik’s Cube: through trial and error, experimentation, and reusing previous experiences to reach broader goals. This kind of iterative discovery is precisely what current AI systems struggle to emulate in 3D environments.
Looking ahead, the researchers plan to accelerate KnotGym’s evaluation by running simulations on GPUs, which are better suited for the heavy computation involved in 3D reasoning. This hardware upgrade would speed up experiments and enable more extensive testing.
Funding for the work came from several sources, including the National Science Foundation, Open Philanthropy, Nvidia Academic Grant, and the National Artificial Intelligence Research Resource (NAIRR) Pilot. Louis DiPietro contributed as a writer for the Cornell Ann S. Bowers College of Computing and Information Science.
Why this matters: as AI systems inch closer to functioning in physical spaces and controlling real-world robots, mastering spatial reasoning will be essential. The KnotGym results suggest that researchers should invest in encouraging exploratory learning and 3D manipulation capabilities, not just language-based inference, to broaden AI’s utility in robotics and related fields.