July 26, 2024, by Brigitte Nerlich
Large language models, meaning and maths
I was reading an article in The Guardian about two novels by Benjamin Labatut. One novel, published in 2020, is entitled When We Cease to Understand the World and deals with quantum mechanics and war. The second novel The Maniac, published in 2023 and just out in paperback, is about John von Neumann, which brings us to AI and, I would argue, a new chapter in how we don’t understand the world.
One paragraph in this article struck me in particular: “The closing section of The Maniac describes – ‘almost like sports reporting’ – the triumph of AI over a human champion at the game of Go. The rise of AI is Von Neumann’s legacy, and Labatut isn’t at all persuaded by the argument that it’s just ‘spicy autocomplete’. ‘When you have a mathematical system that can run language, you have the two most powerful things we have developed as a species working together: mathematics and language,’ he says. ‘I think that we are absolutely on the verge of something, if not past the verge.’”
I can, sort of, get my head round language, but maths… not so much, and even less ‘a mathematical system that can run language’ – and that became painfully apparent in my recent interactions with Claude, the Anthropic chatbot.
Peeking under the hood
I have been using, or rather playing with, ChatGPT and then Claude since Christmas 2022, but I never really dared look under the hood. That changed when I saw the following tweet by Dominik Lukeš (a linguist who knows a lot about language and AI) just when I was having a conversation with Claude about ‘AI literacy’ – something I have not achieved yet. The tweet goes:
“Three terms #AIliteracy experts should understand before making statements about what AI is: 1. Tokens: What gets generated (parts of words) 2. Embeddings: What gives tokens relational meaning (vectors) 3. Attention: What gives some tokens more weight in text (vector dot product)”.
Being rather ignorant about these things, I asked Dominik what ‘vector dot product’ meant. Dominik said: “Dot product is a simple operation on vectors that indicates their ‘distance’. In text, the LLM determines how related words are by doing that.” I imagine something like word clusters…but things are probably more complicated (it all comes down to maths, as we shall see).
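To see what that looks like in practice, here is a tiny sketch of my own (not Dominik’s or Claude’s) with made-up four-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and models usually work with normalised vectors, but the gist is the same: the bigger the dot product, the more ‘related’ the two words are.

```python
# Dominik's point in miniature: the dot product of two word vectors hints at
# how related the words are. The four-dimensional vectors are made up for
# illustration; real embeddings have hundreds or thousands of dimensions.

def dot(a, b):
    """Multiply the vectors element by element and add up the results."""
    return sum(x * y for x, y in zip(a, b))

cat = [0.8, 0.6, 0.1, 0.2]
dog = [0.7, 0.6, 0.2, 0.1]
car = [0.1, 0.1, 0.9, 0.8]

print(dot(cat, dog))  # ~0.96: 'cat' and 'dog' point in similar directions
print(dot(cat, car))  # ~0.39: 'cat' and 'car' have much less in common
```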
Dominik then asked Claude a related question and reported back to me. Claude said, amongst other things: “Each word (“the”, “cat”, “sat”) is converted into an embedding vector. These vectors capture semantic meaning of words” (italics mine). That made me think about semantics and the meaning of words. As I have studied those things a little bit, I thought that would allow me to get to grips with stuff a bit better…..
I asked Claude to clarify what the difference was between words and vectors – which was actually a silly question, as Claude deals with ‘tokens’ not ‘words’ really. But anyway, Claude said that words are represented as strings of characters (e.g., “cat”, “run”, “happiness”), while vectors are represented as lists of numbers (e.g., [0.2, -0.5, 0.8, 1.2]). Let’s get back to words/language before returning to numbers/maths.
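To make that distinction concrete, here is another toy sketch of my own: the word stays a string of characters for us, while the model looks it up in an embedding table and computes with a list of numbers. The numbers below are invented purely for illustration; a real embedding table has tens of thousands of rows (one per token) and hundreds of columns.

```python
# Claude's distinction in miniature: a word is a string of characters, but what
# the model actually computes with is a vector (a list of numbers) looked up
# in an embedding table. The numbers here are invented for illustration.

embedding_table = {
    "cat":       [0.2, -0.5, 0.8, 1.2],
    "run":       [1.1,  0.3, -0.4, 0.6],
    "happiness": [-0.7, 0.9,  0.5, -0.2],
}

word = "cat"                    # what we read: characters
vector = embedding_table[word]  # what the model works with: numbers
print(word, "->", vector)       # cat -> [0.2, -0.5, 0.8, 1.2]
```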
From semantics to mathematics
Semantics is a HUGE field with a long history and many theories. Even just looking at ‘word’ meaning (including the meaning of ‘word’) sends you down a long theoretical road. Although theories of meaning are fascinating, it turns out that, in the end, we don’t really need them, as you’ll see. Nevertheless, in order to stay on relatively familiar ground before plunging deeper into LLMs, I asked Claude: “In the linguistic discipline of semantics there are many different theories of meaning. Are any of them important when trying to understand what goes on in an LLM when it generates text?”
Claude said: “Several theories of meaning from semantics could be relevant when considering how large language models (LLMs) generate text, though it’s important to note that LLMs don’t ‘understand’ language in the same way humans do”. Claude’s list of theories made me nod along, yep, that’s one, yep, that’s one … until I came to the last….
- “Distributional Semantics: This theory posits that the meaning of a word is related to its distribution in language and the contexts in which it appears. This aligns well with how LLMs learn, as they are trained on vast corpora of text and learn to predict words based on their contexts.
- Compositional Semantics: This theory suggests that the meaning of a complex expression is determined by the meanings of its constituent parts and the rules used to combine them. While LLMs don’t explicitly follow compositional rules, they do learn to generate coherent sentences by combining words in ways that reflect learned patterns of composition.
- Prototype Theory: This theory proposes that concepts are understood in terms of their most representative examples. LLMs might be seen as developing something akin to prototypes through their training on numerous examples of language use.
- Frame Semantics: This theory emphasizes the importance of background knowledge and context in understanding meaning. LLMs, through their training, develop a kind of statistical approximation of background knowledge that influences their text generation.
- Vector Space Models: While not a traditional linguistic theory, the idea of representing words and concepts as vectors in a high-dimensional space is crucial to how modern LLMs operate. This approach allows for capturing semantic relationships in a way that’s computationally tractable.”
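The first theory on Claude’s list, distributional semantics, is the easiest one to turn into a back-of-the-envelope sketch. The tiny corpus and the neighbour-counting below are my own invention; real LLMs learn dense vectors by predicting tokens rather than by counting neighbours, but the intuition – you shall know a word by the company it keeps – is the same: words that share contexts end up with similar vectors.

```python
# A toy illustration of distributional semantics: build a vector for each word
# by counting which other words appear near it, then compare the vectors with
# a dot product. 'cat' and 'dog' share contexts; 'cat' and 'market' do not.

from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "stocks rose on the market",
]

vocab = sorted({w for sentence in corpus for w in sentence.split()})

def context_counts(target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                for j in range(max(0, i - window), min(len(words), i + window + 1)):
                    if j != i:
                        counts[words[j]] += 1
    return [counts[v] for v in vocab]  # one count per vocabulary word

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(context_counts("cat"), context_counts("dog")))     # higher: shared contexts
print(dot(context_counts("cat"), context_counts("market")))  # lower: little overlap
```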
Claude made an interesting caveat about the usefulness of looking at these familiar theories (apart from the vector stuff) in the context of LLMs: “It’s important to note that while these theories might provide useful lenses for understanding aspects of how LLMs function, they don’t fully explain the complex statistical processes occurring within these models.” And: “LLMs lack many key components of human language understanding, such as grounded experience in the physical world, intentionality, or true comprehension of abstract concepts. They generate text based on statistical patterns rather than through a process analogous to human thought or understanding.”
Vectors and maths
Statistical patterns… and so we are back to maths and… vectors. I could no longer put off looking at them a bit more closely. Claude gave me some nice metaphors and analogies for tokens and vectors and the like, which gave me an illusion of understanding, as metaphors so nicely do; but finally I had to stare the enemy (maths) in the face:
“A vector space is a fundamental mathematical concept in linear algebra. It’s a set of objects called vectors that can be added together and multiplied by scalars (real numbers) while satisfying certain axioms” … and: “Common examples of vector spaces include: R^n (n-dimensional real coordinate space); Polynomial spaces; Function spaces; Matrix spaces”… etc…. Claude had lost me….
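For what it is worth, the two operations at the heart of Claude’s definition look less frightening written out. Here is a minimal sketch of my own in R^4 – the kind of ‘n-dimensional real coordinate space’ Claude mentions; the embedding spaces inside an LLM are the same idea with a much larger n.

```python
# A vector space boiled down to its two basic operations: vectors can be added
# together and scaled by real numbers. Shown here in R^4 with made-up numbers.

def add(a, b):
    """Add two vectors element by element."""
    return [x + y for x, y in zip(a, b)]

def scale(c, a):
    """Multiply every element of a vector by the scalar c."""
    return [c * x for x in a]

u = [0.2, -0.5, 0.8, 1.2]
v = [1.0,  0.5, 0.2, 0.0]

print(add(u, v))      # [1.2, 0.0, 1.0, 1.2]
print(scale(2.0, u))  # [0.4, -1.0, 1.6, 2.4]
```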
(If you want to feel less lost, you can read Anil Ananthaswamy’s essay “The elegant math of machine learning” – and there is also a book.)
Between semantics and mathematics
I have dabbled in all the theories of meaning listed by Claude, apart from ‘vector space models’, which are, in a way, the algebraic rock on which LLMs build their output. The other theories seem rather incidental to LLMs’ construction and deconstruction of ‘meaning’. This made me wonder what that means for semantics, especially human semantics.
If all these famous theories of meaning don’t really touch the sides when it comes to illuminating how LLMs deal with meaning, do we really understand how human brains deal with meaning? Do theories of word meaning, like the ones listed above, capture what’s going on when brains deal with language and meaning, or is it more a matter of statistical patterns and matrix algebra? And what does all this mean for the relation between statistical learning and more structured or rule-based learning when it comes to language?
Back to Labatut’s The Maniac, with which I started these musings. At the very end of the novel, Demis Hassabis (co-founder and CEO of Google DeepMind) is puzzled about why AlphaGo (a computer program that plays the board game Go) appeared to go crazy: “That’s the longest it has searched during the entire game. I think it searched so deeply, that it lost itself.” (The Maniac, hardback, p. 335)
I also lost myself and my understanding of the world… Having said that, I recommend playing with Claude. It allows one to navigate one’s ignorance in new ways, especially where language and mathematics and human and machine learning collide.
Image: Pexels