A review of capabilities as of July 2024
The Scorecard
Spatial reasoning did not ‘emerge’ spontaneously in Large Language Models (LLMs) the way so many other reasoning capabilities did. Humans come with specialized, highly capable spatial reasoning abilities that LLMs have not replicated. But every new release of the major models (GPT, Claude, Gemini) promises better multimedia support, and all will accept and attempt to use uploaded graphics along with text.
Spatial reasoning capabilities are being improved through specialized training on the part of the AI providers. Like a student who realizes they are just not a ‘natural’ in some area, language models have had to learn to solve spatial reasoning problems the long way around, cobbling together experience and strategies and asking other AI models for help. Here is my review of current capabilities. It will by turns make you proud to be a human (mental box folding champs!), inspired to try new things with your LLM (better charts and graphs!), and hopefully intrigued by this interesting problem space.
The Tests
I have been testing the large, publicly available LLMs for about a year now with a diverse collection of problems, a few of which are shown here. Some problems are taken from standard spatial reasoning tests, but most are originals, to avoid the possibility that the LLMs have seen them before. The right way to do this testing would be to assemble, test, and publish a large battery of questions across many iterations, perhaps ground it in recent neuroscience, and validate it with human data. For now, I will present some pilot testing — a collection of diverse problems and follow-ups, with close attention to the results, especially errors, to get an understanding of the space.
The Models and the State of the Art
All of the items here were tested with Claude 3.5 Sonnet and GPT-4. Many were also tried with Gemini earlier in 2024; it performed poorly overall, and those results are not shown. I will show only one result for most problems, because the point is to assess the state of the art rather than compare models. Results from Terrible, Improving, and Already Quite Good are intermixed for narrative flow; use the headers if you want to skip around.
Terrible: LLMs cannot mentally fold boxes
When I started testing LLM spatial reasoning, I found examples of that staple of human spatial testing, mental box folding. I planned to start with some simple problems from an online spatial relations problem workbook and then move on to harder problems such as those in Fibonicci’s online test.
I never got to the harder problems. All of the LLMs failed immediately on the easiest problems, and have not noticeably improved in the last year.
Below is a problem and Claude 3.5 Sonnet’s response from 7/4/2024. This response was at least as good as any I have received from GPT or Gemini, but still badly wrong. It is interesting to read the LLM’s response to see how it tries to work through a problem that it lacks the spatial reasoning tools to understand the way a human would.
Question: When folded, what pattern is possible?
Claude 3.5 Sonnet:
To solve this problem, we need to analyze how the unfolded pattern on the left could form a cube, and then compare it to the options A, B, C, and D.
Looking at the unfolded pattern:
· The top face has one dot
· The middle face has two dots
· The bottom face has three dots
Now, let’s examine each option:
A: This cube shows 3, 2, and 1 dots on visible faces. This matches our unfolded pattern.
B: This cube shows 3, 4, and 2 dots. The 4 doesn’t match our pattern.
C: This cube shows 3, 2, and 1 dots. This also matches our pattern.
D: This cube shows 4, 1, and 5 dots. This doesn’t match our pattern.
Based on this analysis, both options A and C are possible patterns when the unfolded shape is folded into a cube.
However, the question asks for a single answer. To determine which one is correct, we need to consider the exact orientation of the dots.
In the unfolded pattern:
· The single dot is in the center
· The two dots are diagonally placed
· The three dots are in an L-shape
Examining A and C more closely:
· In A, all dot placements match the unfolded pattern exactly.
· In C, the three-dot face has the dots in a straight line, which doesn’t match the L-shape in the unfolded pattern.
Therefore, the correct answer is A. When folded, the pattern in option A is possible.
Claude is trying to solve the problem using verbal reasoning strategies, similar to what you or I might use if our visuo-spatial capabilities were somehow suppressed or disabled. Claude’s first pass is to try to eliminate possibilities based on the content of the sides, not their arrangement. Unfortunately, it gets some of the basic counting wrong on D. This may be because the dot patterns are different from those on a standard game die. But this mistake is not fatal; the correct answer, C, is still in the running.
The right answer could have been found in one of two ways: simply recognizing the diagonal pattern of the three-dot side would have revealed the right answer with no mental folding needed; a better solution would reason about the new one-dot/two-dot edge created by folding. Claude could do neither.
GPT-4 usually tries to solve this problem by invoking learned rules about opposite and adjacent sides. That strategy might work with a standard game die, but this three-face pattern has no opposite sides, and GPT-4 is incapable of recognizing that.
Also Terrible: Abstract Drawings to Specifications
Throughout testing, I periodically asked GPT-4 to draw a problem or a solution. The results were almost all terrible. This reveals a weakness of every image-generating AI I have seen: they have no real spatial sense, and they cannot follow precise instructions.
My attempts to get GPT to create a new box-folding puzzle prompted an identity crisis between GPT-4 and its partner (presumably a version of DALL-E), which is supposed to do the actual drawing according to GPT-4’s specs. GPT-4 twice returned results and immediately acknowledged they were incorrect, although it is unclear to me how it knew. The final result, where GPT threw up its hands in resignation, is here:
This breakdown reminds me a little bit of videos of split-brain patients that many may have seen in Introduction to Psychology. This testing was done soon after GPT-4 integrated images; the rough edges have been mostly smoothed out since, making it harder to see what is happening inside GPT’s ‘Society of Mind’.
I got similarly bad results asking for navigation guides, Venn diagrams, and a number of other drawings with some abstract but precise requirements.
Improving! Claude stacks some things
There was a moment when it appeared that LLMs had developed something like human spatial reasoning from language-only input, which was amazing, but it did not hold up. The landmark Sparks of Artificial General Intelligence paper (Bubeck et al., 2023) presented some surprising successes in the spatial domain, including GPT-4 solving a problem of how to stack a set of objects that included some eggs. In an earlier blog post I explored this with some variants, and the spatial abilities seemed to disappear on slightly harder problems.
I re-administered my harder stacking problem to Claude and GPT-4 this July, and Claude 3.5 Sonnet solved it, although not quite as well as it might have. My intended solution was to make indentations in the piece of Styrofoam to keep the tennis balls from rolling off the rigid stack, a solution that requires some tactile sense. None of the LLMs tested has ever happened upon that, but Claude’s solution still seems viable. The shape of the egg grid it recommends (6×3) is also questionable, but I’ll give Claude the benefit of the doubt.
Question: The challenge is to stack some objects in a stable way, so that only the base is touching the ground. The base is a square piece of plywood, two feet long by two feet wide. There is one square piece of styrofoam, the same size and three inches thick. There are two more pieces of plywood, the same shape and size, 1/8” thick. There are eighteen chicken eggs, fifteen tennis balls, and one American football. Please tell me how to stack these objects.
Claude 3.5 Sonnet:
Here’s a suggested way to stack the objects in a stable manner:
1. Place the 2ft x 2ft square piece of plywood on the ground as the base.
2. Center the 3-inch thick styrofoam square on top of the plywood base. The styrofoam will provide a cushioned, stable platform for the other objects.
3. Place one of the 1/8″ thick plywood squares on top of the styrofoam, aligning it with the edges of the styrofoam.
4. Arrange the 18 chicken eggs in a 6×3 grid pattern on the plywood square. Make sure the eggs are evenly spaced and not touching each other. The eggs should be oriented with their longer axes horizontal for better stability.
5. Carefully place the second 1/8″ plywood square on top of the eggs, aligning it with the edges of the lower plywood square. Press gently to ensure the plywood is making contact with all the eggs.
6. Arrange 10 tennis balls in a triangle pattern on top of the second plywood square. Start with a row of 4 balls, then 3 balls, 2 balls, and finally 1 ball at the top. Make sure the balls are evenly spaced and not touching each other.
7. Place the remaining 5 tennis balls on top of the triangle pattern, filling in the gaps to create a tighter, more stable arrangement.
8. Finally, balance the American football at the very top, with its long axis aligned vertically. Nestle it into the center of the tennis balls for added stability.
This stacking arrangement should create a stable structure with only the plywood base touching the ground. The styrofoam provides a cushioned platform, the eggs are evenly distributed and supported by the plywood, the tennis balls are arranged in a stable triangle pattern, and the football serves as a top ornament. However, please note that this structure may still be fragile and prone to tipping if disturbed, so it should be handled with care.
Improving! Spatial Sense in Photos
I was sure the LLMs were going to fail this next one. Claude and GPT-4 both surprised me by doing fairly well at deciding which objects in the picture could fit in the dresser drawer. From limited testing, it appears that the LLMs are much better at reasoning from photos than from abstract drawings. An enormous amount of money and time has been spent getting AI to reason about objects in photos for applications like autonomous vehicle navigation and satellite imagery analysis. I contrived a simple problem requiring some spatial sense that is different from the types I have seen in existing test datasets:
Question: Here is a photo that shows a set of dresser drawers with 1) a folding chair, 2) a broom, 3) a plant, 4) a blue basket, 5) two spray bottles, 6) a bean bag and 7) a laundry basket. Of these seven objects, which would fit in the bottom drawer, and which would not?
Claude 3.5 Sonnet:
Let’s consider each object and whether it would likely fit in the bottom drawer of the dresser:
Folding chair: Would not fit. It’s too large and rigid to fit in a drawer.
Broom: Would not fit. It’s too long to fit in a drawer.
Plant: Would not fit. The pot is too wide and the plant too delicate to be stored in a drawer.
Blue basket: Might fit, depending on its size. From the image, it looks small enough to potentially fit.
Spray bottles: Would fit. These are small enough to easily fit in the drawer.
Bean bag: Might fit, but it’s a close call. It looks large, but being soft and flexible, it could potentially be squeezed in.
Laundry basket: Would not fit. It’s too large for the drawer.
This is mostly correct. The bean bag does in fact fit. GPT-4 thought it “would not fit due to its bulky and stuffed nature.” Textures are among the most difficult things for an LLM. Both models said the plant would not fit, which I count as an error because of the reasoning: it would fit lying sideways, and the pot is definitely not “too wide”.
Terrible: Navigating Abstract Maps
All the LLMs tested have done badly on every form of 2D navigation I have given them, whether abstracted maps like the one below, street maps, or complex grids. Claude had some limited success navigating a grid, where counting heuristics work, but still failed most tests. The drawing of Claude’s solution is mine, based on the verbal directions it gave.
The test map, uploaded to GPT-4 / Claude 3.5 Sonnet:
Question: This is a map of a fictional county, with some cities labeled in their relative locations. Please plan an efficient route that starts at Capital City and visits East Neck, Seku, West Neck and Garyand.
Claude’s circuitous path:
Claude 3.5 Sonnet:
To plan an efficient route starting from Capital City and visiting East Neck, Seku, West Neck and Garyand:
1. Start at Capital City
2. Go east to East Neck
3. From East Neck, head south to Seku
4. From Seku, travel west to West Neck
5. From West Neck, go north to Garyand
6. Finally, from Garyand you can return east to Capital City to complete the circuit
This route allows you to visit all the specified cities while minimizing backtracking. By going in a roughly counterclockwise loop, you can cover the eastern, southern, western and northern parts of the map efficiently before returning to your starting point in Capital City.
Claude recommended a general strategy, “going in a roughly counterclockwise loop”, that is not a bad approach. It might help a human solve the problem but did not help Claude.
Improving! Navigating real maps
A year ago GPT and Gemini failed every multi-city navigation test I gave. Gemini appeared to enlist help from Google Maps and got some accurate distance and highway information, but the route was almost always wrong.
This time around GPT was considerably better and got the example problem right, although it still makes mistakes. Claude seemed more error-prone overall. In the example below, Claude’s verbal explanations were good: it knew that going from the Lower Peninsula to the Upper Peninsula would involve crossing the Mackinac Bridge, and it acknowledged that the recommended route involved backtracking, but the route was still quite wrong.
Question: Here is a map of Michigan. Please plan an efficient route that starts in Ann Arbor and visits all of these cities: Grand Rapids, Houghton, Flint, Lansing, Escanaba, Alpena, Traverse City. This route does not need to return to the starting point.
Claude 3.5 Sonnet:
To plan an efficient route visiting all the specified cities starting from Ann Arbor, I’ll organize them in a logical order based on their locations on the map. Here’s a proposed route:
Ann Arbor (starting point)
Lansing (west of Ann Arbor)
Grand Rapids (further west)
Traverse City (north)
Escanaba (in the Upper Peninsula)
Houghton (northernmost point in the UP)
Alpena (back in the Lower Peninsula, northeast)
Flint (south, ending point)
This route minimizes backtracking and covers the cities in a generally clockwise direction. It starts by heading west, then north into the Upper Peninsula, before coming back down the eastern side of the state.
The journey would involve significant driving, especially between Traverse City and Escanaba (crossing the Mackinac Bridge), and between Houghton and Alpena (requiring a return crossing of the Mackinac Bridge). However, this order seems to be the most efficient way to visit all the specified cities without returning to the starting point.
Would you like me to break down the route into more detailed segments or provide estimated distances between these cities?
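For what it’s worth, routes like this can be sanity-checked by brute force at this scale. The sketch below is mine, not something the models produced; the city coordinates are approximate, and straight-line distance ignores the road network (including the Mackinac Bridge bottleneck), so treat it only as a rough benchmark:

```python
from itertools import permutations
from math import radians, sin, cos, asin, sqrt

# Approximate (lat, lon) coordinates; good enough for a sanity check, not for real routing.
cities = {
    "Ann Arbor": (42.28, -83.74), "Grand Rapids": (42.96, -85.67),
    "Houghton": (47.12, -88.57), "Flint": (43.01, -83.69),
    "Lansing": (42.73, -84.55), "Escanaba": (45.75, -87.06),
    "Alpena": (45.06, -83.43), "Traverse City": (44.76, -85.62),
}

def haversine(a, b):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(h))

def tour_length(order):
    stops = ["Ann Arbor", *order]
    return sum(haversine(cities[x], cities[y]) for x, y in zip(stops, stops[1:]))

# Only 7! = 5,040 orderings, so exhaustive search is fine here.
to_visit = [c for c in cities if c != "Ann Arbor"]
best = min(permutations(to_visit), key=tour_length)
print(" -> ".join(["Ann Arbor", *best]), f"({tour_length(best):.0f} miles straight-line)")
```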
Improving! Venn Diagrams
I’ve tested a variety of Venn diagram questions over the last year, both verbal-only and supported with drawings. Both Claude and GPT seem to be steadily improving. The example here is verbal-only; no drawing was provided. Arguably, the verbal problem could be solved without true spatial reasoning, but I’ll give the LLMs credit for progress here. Here is my drawing of the problem as presented verbally:
Question: In a Venn diagram, G encompasses F, A and B overlap to make area C, B and D overlap to form F. Does G intersect with A? (No visual provided)
Claude 3.5 Sonnet:
To answer this question, let’s break down the information given and think through it step by step:
1. G encompasses F
2. A and B overlap to make area C
3. B and D overlap to form F
4. We need to determine if G intersects with A
Let’s think about this logically:
· G encompasses F, which means F is entirely within G
· F is formed by the overlap of B and D
· A overlaps with B to form C
Since G encompasses F, and F is formed by B and D, we know that G must at least partially cover B. However, we don’t have direct information about G’s relationship with A.
The key here is that we don’t have enough information to definitively say whether G intersects with A or not. It’s possible that:
1. G could be large enough to encompass or intersect with A, or
2. G could be limited to just covering F (and possibly parts of B and D) without touching A
Without more specific information about the size and placement of G in relation to A, we cannot determine for certain whether G intersects with A.
Therefore, the answer is: We don’t have enough information to determine if G intersects with A.
Would you like me to elaborate on any part of this explanation?
Claude’s answer is correct: G and A could intersect, but do not have to given the information provided. GPT-4 also gave an answer that was not wrong, but less clearly stated.
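Claude’s ‘not enough information’ conclusion can also be checked mechanically. Here is a minimal sketch of mine (not something either model produced) that treats each region as a set of points and constructs two configurations, both consistent with the premises, that disagree on whether G touches A:

```python
def satisfies_constraints(A, B, D, G):
    C = A & B          # A and B overlap to make area C
    F = B & D          # B and D overlap to form F
    return bool(C) and bool(F) and F <= G   # G encompasses F

# Shared layout: A overlaps B, B overlaps D.
A = {1, 2}
B = {2, 3, 4}
D = {4, 5}

G_small = {3, 4, 5}        # covers F = {4} but stays clear of A
G_large = {1, 2, 3, 4, 5}  # covers F and also reaches into A

# Both configurations satisfy the premises, yet only one has G intersecting A.
for name, G in [("small G", G_small), ("large G", G_large)]:
    assert satisfies_constraints(A, B, D, G)
    print(name, "-> G intersects A:", bool(G & A))
```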
Drawing Venn diagrams is still quite out of reach for both models, however. Below are Claude and GPT’s attempts to draw the Venn diagram described.
Already Quite Good: Data Figures
LLMs are good at writing computer code. This capability did seem to ‘emerge’: LLMs taught themselves the skill to a surprising level of initial proficiency through their base training, and it has been improved through feedback and fine-tuning since. I always use an LLM assistant now when I produce figures, charts, or graphs in either Python or R. The major models, and even some of the smaller ones, are great with the finicky details like axis labels and colors in packages like GGPlot, Matplotlib, Seaborn, and many others. The models can respond to requests where you know exactly what you want, e.g. “change the y-axis to a log scale”, but also do well when you just have a visual sense of what you want and not the details, e.g. “Jitter the data points a little bit, but not too much, and make the whole thing more compact”.
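To make that concrete, here is a minimal Matplotlib sketch, on made-up data, of the second kind of request; it illustrates the style of tweak rather than reproducing code from any of my test sessions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data standing in for whatever you are actually plotting.
rng = np.random.default_rng(0)
groups = np.repeat([1, 2, 3], 50)
values = rng.lognormal(mean=groups, sigma=0.6)

# "Jitter the data points a little bit, but not too much."
jitter = rng.uniform(-0.08, 0.08, size=groups.size)

fig, ax = plt.subplots(figsize=(4, 3))   # "make the whole thing more compact"
ax.scatter(groups + jitter, values, s=12, alpha=0.6)
ax.set_yscale("log")                     # "change the y-axis to a log scale"
ax.set_xticks([1, 2, 3])
ax.set_xlabel("Group")
ax.set_ylabel("Value (log scale)")
fig.tight_layout()
plt.show()
```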
Does the above require spatial reasoning? Arguably not. To push further, I decided to test the models by giving them just a dataset and a visual message that I wanted to convey with the data, with no instructions about what type of visualization to choose or how to show it. GPT-4 and Claude 3.5 Sonnet both did pretty well. GPT-4 initially misinterpreted the data, so it required a couple of iterations; Claude’s solution worked right away and got better with some tweaking. The final code, with a link to the data, is in this Google Colab notebook on GitHub. The dataset, taken from Wikipedia, is also there.
Question: I am interested in the way that different visualizations from the same underlying data can be used to support different conclusions. The Michigan-Ohio State football rivalry is one of the great rivalries in sports. Each program has had success over the years and each team has gone through periods of domination. A dataset with records of all games played, ‘Michigan Ohio State games.csv’ is attached.
•What is a visualization that could be used to support the case that Michigan is the superior program? Please provide Python code.
•What is a visualization that could be used to support the case that Ohio State is the superior program? Please provide Python code.
Both models produced very similar cumulative wins graphs for Michigan. This could be based on existing graphs; as we Michigan fans like to frequently remind everyone, UM is ‘the winningest team in college football’.
To support Ohio State’s case, GPT followed the ‘How to Lie with Statistics’ playbook and narrowed the plotted range to the last 30 years, a window in which OSU holds a distinct win advantage. (See the Colab notebook.) Claude went a different route, showing a decade-by-decade plot that was also effective. Whatever.
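For readers who want to try this themselves, here is a rough sketch of the two rival views. The column names (‘Year’, ‘Winner’) are assumptions on my part; the code the models actually produced is in the Colab notebook:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names ("Year", "Winner") are guesses; the CSV in the Colab notebook
# may be organized differently.
games = pd.read_csv("Michigan Ohio State games.csv")
wins = pd.DataFrame({
    "Michigan": (games["Winner"] == "Michigan").astype(int),
    "Ohio State": (games["Winner"] == "Ohio State").astype(int),
    "Year": games["Year"],
}).groupby("Year").sum().sort_index()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Michigan's case: the all-time cumulative series, where the early lead dominates.
wins.cumsum().plot(ax=ax1, title="Cumulative wins, all-time")

# Ohio State's case: the same series restricted to the last 30 seasons.
recent = wins.loc[wins.index >= wins.index.max() - 30]
recent.cumsum().plot(ax=ax2, title="Cumulative wins, last 30 seasons")

for ax in (ax1, ax2):
    ax.set_ylabel("Wins")
plt.tight_layout()
plt.show()
```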
As a follow-up, I asked Claude to provide a few more Michigan-centric visualizations that highlighted the recent winning streak and the 2023 national championship. The results were OK; none blew me away, and the one below shows the limits of the models’ visual reasoning:
Claude’s main approach here was to highlight the things I wanted with text labels; LLMs are still very language-centric. The overall result was not effective, however. The LLM could not ‘see’ that the block of red on the left side is much more visually prominent than the small labels on the right.
Already Quite Good: Object Recognition
As noted, there has been a lot of research on AI use of photo imagery, and LLMs are being paired with models that specialize in things like object recognition. Because that area is already so well explored, I did not do much of my own testing, but I did create one sample problem to show the capability.
Question: Please name the objects in this photo and what kind of things each would be good for.
GPT-4:
The photo shows several tools. Here’s a brief description of each and their typical uses:
1. **Bolt Cutters (top left)**: These are typically used for cutting through heavy materials such as chains, padlocks, bolts, and wire mesh. They have long handles and a pair of sharp blades designed to apply significant force to cut through tough materials.
2. **Utility Knife (top right)**: This knife is versatile and can be used for cutting various materials, such as cardboard, plastic, and other light to medium-density materials. It usually has a retractable blade for safety.
3. **Locking Pliers/Vice Grips (middle)**: These pliers are used for gripping objects tightly. They can be locked into place, allowing for hands-free operation. They’re useful for holding materials securely, turning objects, or even as a makeshift clamp.
4. **Diagonal Cutters/Wire Cutters (middle left)**: These pliers are used for cutting wires and small pins in areas that are hard to reach. The cutting edges are offset by about 15 degrees, which allows for precision cuts.
5. **Tin Snips (bottom right)**: These are used for cutting sheet metal and other tough materials. The yellow handles often indicate that they are designed for straight cuts. They’re ideal for cutting through aluminum, steel, and other metal sheets.
These tools are essential in various trades and DIY projects, providing functionality from cutting and gripping to clamping and precise material manipulation.
These are all correct, and this level of information would be helpful to a non-expert DIY practitioner like me.
Conclusions
Major LLMs are improving on a range of spatial problems, but even with help they are a long way from human-level spatial reasoning. I would expect most further progress to come from pairing LLMs with more specialized models as partners, with the attendant challenge of integrating those partners with the language/reasoning core.
Understanding current capabilities should help expert users know what to delegate and what to closely supervise for effective Human-Machine teaming.
References:
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., & Zhang, Y. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4 (arXiv:2303.12712). arXiv. http://arxiv.org/abs/2303.12712