
Understanding Polysemanticity in AI: Multiple Meanings in Neural Networks

Updated: Nov 25


In Artificial Intelligence, we are currently living through a paradox. We have built Large Language Models (LLMs) like the GPT series and Claude that can write poetry, code software, and pass bar exams. We know how to build them, and we know what they output. But we generally do not understand what happens in the middle. This is the "Black Box" problem. If you look under the hood of a modern neural network, you don't see lines of readable code like if (question implies danger) return false. Instead, you see billions of numbers (weights) and neurons firing in patterns that look like statistical noise. For years, researchers hoped that individual neurons would act like legible concepts: a "Cat" neuron, a "Sadness" neuron, a "Math" neuron. Instead, they found something much more confusing: Polysemanticity.



What is Polysemanticity?


In the context of AI interpretability, a neuron is considered monosemantic if it activates for a single, distinct concept. For example, if Neuron 4592 only lights up when the input text mentions specific dog breeds, it is monosemantic. However, the vast majority of neurons in deep neural networks are polysemantic. This means a single neuron responds to several unrelated inputs.


Imagine looking at a neuron inside an LLM and finding that it activates strongly for:


  • The concept of "citations in academic papers."

  • Pictures of glazed donuts.

  • The feeling of nostalgia in 19th-century literature.


There is no logical link between donuts and academic citations. Yet, the model has grouped them together on the same physical toggle. This is polysemanticity: one symbol holding multiple, unrelated meanings.
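How do researchers discover that a neuron behaves this way? A standard diagnostic is to record the neuron's activation across a large dataset and read off its top-activating inputs. Below is a minimal Python sketch of that idea, with random numbers standing in for real activations; the matrix shape, the example texts, and neuron 4592 are purely illustrative.

```python
# Minimal sketch: inspect a neuron via its top-activating inputs.
# `activations` stands in for a recorded (n_examples x n_neurons) matrix.
import numpy as np

rng = np.random.default_rng(0)
texts = [
    "(Smith et al., 2019) reported similar results",
    "a box of warm glazed donuts",
    "the compiler reported an error on line 42",
    "see the works cited in section 3",
    "the doughnut glaze cracked as it cooled",
]
activations = rng.random((len(texts), 8192))  # random stand-in, not real data

def top_activating_examples(neuron_idx, k=3):
    """Return the k inputs on which this neuron fires hardest."""
    order = np.argsort(activations[:, neuron_idx])[::-1][:k]
    return [(texts[i], float(activations[i, neuron_idx])) for i in order]

for text, value in top_activating_examples(neuron_idx=4592):
    print(f"{value:.2f}  {text}")
```

In a real model, if the top of this list mixes academic citations, donuts, and 19th-century nostalgia, you are looking at a polysemantic neuron.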


The Analogy: The Polysemantic Dictionary


To understand why this is confusing, imagine a dictionary where every word has five definitions, but the definitions have no relationship to one another.


  • Word: "Bank"

  • Definition 1: A financial institution.

  • Definition 2: The side of a river.

  • Definition 3: The color purple.

  • Definition 4: The taste of salt.


If someone yells "Bank!", you have no idea whether they mean money or the color purple without a great deal of surrounding context. In neural networks, almost every "word" (neuron) is like this.


Why Does Polysemanticity Happen?


Why would a super-intelligent system organize its brain so messily? The prevailing theory is Superposition.

Neural networks are constrained by their size (the number of neurons). However, the number of concepts in the real world (or the "feature space") is effectively infinite. There are millions of concepts—shapes, colors, grammatical rules, historical facts, coding syntax, emotional tones, etc. If the model assigned one neuron to every concept, it would run out of neurons almost immediately. To solve this, the model engages in a mathematical trick called superposition. It compresses more features than it has dimensions by storing them in non-orthogonal ways. It’s essentially a form of high-efficiency compression. The model learns that it can safely store "Glazed Donuts" and "Academic Citations" on the same neuron because they rarely occur in the same context.


  • If the context is a bakery, the neuron means donuts.

  • If the context is a library, the neuron means citations.


The model relies on the combination of active neurons to disambiguate the meaning.
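To make superposition concrete, here is a toy NumPy sketch; it is not the mechanism of any particular model, just an illustration of the geometry. It packs 512 "features" into a 128-dimensional space using nearly-orthogonal random directions and shows that, as long as only a few features are active at once, each one can still be read back out despite the crowding.

```python
# Toy superposition demo: 512 "concepts" stored in a 128-dimensional space.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 512, 128

# Each feature gets a random unit direction. With more features than
# dimensions they cannot all be orthogonal, only nearly orthogonal.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparse input: only a few concepts are active at once, which is exactly
# what makes this compression scheme workable.
active = rng.choice(n_features, size=3, replace=False)
x = directions[active].sum(axis=0)   # the superposed activation vector

# Read each feature back by projecting onto its direction.
readout = directions @ x
recovered = np.argsort(readout)[::-1][:3]
print("active features:   ", sorted(active.tolist()))
print("recovered features:", sorted(recovered.tolist()))
```

Perfect separation is impossible in this setup, but because real-world features are sparse, the occasional interference is a price the model is willing to pay for the extra capacity.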


The Danger: Why Should We Worry?


If the model works, why do we care if the neurons are messy? The problem is Interpretability and Safety.


The Debugging Nightmare


In traditional software, if a calculator app gives the wrong answer, a human can read the code and find the bug. In AI, if a model outputs a racial slur or a dangerous chemical recipe, we often cannot trace why. If we try to shut down the "Bad Output" neuron, we might accidentally shut down the "Math" neuron or the "Poetry" neuron stored on the same circuit.


Feature Interference


Because unrelated concepts share the same hardware, there is a risk of "cross-talk." Hypothetically, if a "Deception" feature shares a neuron with a "French Translation" feature, a user asking for a translation might inadvertently trigger the deception feature, causing the model to lie.


Hiding Bad Behaviors


Polysemanticity makes it nearly impossible to audit a model for safety. A model could theoretically hide a "Trojan Horse" behavior (e.g., "if the date is 2025, hack the server") inside a neuron that looks like it processes benign CSS code. Without solving polysemanticity, we cannot verify that a model is safe.


Solving the Puzzle: Sparse Autoencoders (SAEs)


For a long time, polysemanticity was considered simply the nature of the beast, a problem with no practical solution. However, in 2023 and 2024, researchers (notably from Anthropic, OpenAI, and DeepMind) made a massive breakthrough using a technique called Dictionary Learning via Sparse Autoencoders (SAEs). The idea is to take the messy, compressed activity of the neural network and "unzip" it into a much larger, clearer space.


The "Smoothie" Analogy


Think of a neural network layer as a fruit smoothie. It’s a blend of strawberries, bananas, and kale. You can taste the final product, but you can’t see the individual fruits anymore. An SAE is a tool that chemically separates the smoothie back into its original ingredients.


  • Input: The polysemantic neuron activations (the smoothie).

  • Process: Projecting this data into a much higher-dimensional space (making the space 10x or 50x bigger).

  • Output: Interpretable features (a pile of strawberries, a pile of bananas).
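In code, the core of the idea is small. The sketch below is a minimal PyTorch version of an SAE, not any lab's actual architecture or hyperparameters: activations are projected into a latent space 16x wider, a ReLU plus an L1 penalty pushes most latent features to zero, and a decoder must rebuild the original activations from whichever few features remain active.

```python
# Minimal sparse autoencoder (SAE) sketch: the "unzipping" step.
# All shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_dict: int = 768 * 16):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # project into a wider space
        self.decoder = nn.Linear(d_dict, d_model)  # reconstruct the original

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative "ingredients"
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                    # strength of the sparsity pressure

acts = torch.randn(4096, 768)                      # stand-in for recorded LLM activations
for _ in range(100):
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The reconstruction term forces the wide latent space to keep everything that was in the smoothie, while the sparsity term forces each individual input to be explained by only a few active features; those sparse features are the candidate monosemantic "ingredients" that researchers then inspect and label.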


The Golden Gate Bridge Experiment


In a famous recent experiment, researchers at Anthropic applied SAEs to the Claude 3 Sonnet model. They successfully extracted millions of monosemantic features. They found a specific feature that fired only for the Golden Gate Bridge. To prove it, they manually cranked the activation of that feature up to the maximum and asked the model normal questions.


  • User: "What is your name?"

  • Model: "I am the Golden Gate Bridge... my majestic rust-colored towers..."


This proved that by "unzipping" polysemantic neurons, we can find the specific dials that control the model's concepts.
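The "cranking up" step can be sketched in the same toy setting (everything below is hypothetical: the decoder weights are untrained stand-ins, and the feature index and strength are placeholders). The idea is to take the decoder direction for the chosen feature and add a scaled-up copy of it to the model's activations before they flow onward.

```python
# Sketch of feature steering with an SAE decoder. The decoder here is an
# untrained stand-in; in practice it would come from a trained SAE.
import torch
import torch.nn as nn

d_model, d_dict = 768, 768 * 16
decoder = nn.Linear(d_dict, d_model)

def steer(acts: torch.Tensor, feature_idx: int, strength: float = 10.0) -> torch.Tensor:
    """Push activations along one learned feature direction."""
    direction = decoder.weight[:, feature_idx]     # that feature's direction in model space
    direction = direction / direction.norm()
    return acts + strength * direction             # clamp the concept far above its usual level

bridge_feature = 4242                              # placeholder index for an identified feature
steered_acts = steer(torch.randn(1, d_model), bridge_feature)
# Substituting `steered_acts` for the model's original activations is what
# makes it insist it is the bridge, no matter what the user asked.
```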


The Future of AI Interpretability


Understanding polysemanticity is the bridge between "AI Alchemy" and "AI Chemistry." Right now, we are largely Alchemists—we throw data into a pot, stir it, and hope for gold. We don't fully understand the reaction. By solving polysemanticity and isolating features, we move toward Chemistry—understanding the periodic table of the mind.


If we can map these features, we can debug models the way we debug traditional software, audit them for hidden behaviors before deployment, and turn specific concepts up or down on demand, as the Golden Gate experiment showed.

Polysemanticity is the fog of war in neural networks. Clearing that fog is the most important step we can take toward building Artificial General Intelligence (AGI) that is not only powerful but understandable and safe.