How Words Become Data: Understanding Tokens

Diving into tokens: how AI sees in numbers, the context window, how numbers become meaning, and demystifying the magic.

Today I dove into something fundamental to how AI language models work: the concept of "tokens." Because tokens drive both model limits and pricing, it's important to understand how simple words become data that a machine can process.

AI Sees in Numbers

The thing is - AI doesn't see words or sentences the way we do. It sees the world in numbers. Tokens are the bridge between our language and their numerical world. Think of a sentence as a completed jigsaw puzzle. Tokens are the individual pieces. A piece could be a whole word like "love," a fragment of a word like "un-" or "-able," or even just a punctuation mark. The AI takes our text, breaks it down into these puzzle pieces, and then assigns a unique ID number to each one. So, the sentence "I love AI!" isn't seen as words, but as a sequence of numbers, like [40, 262, 9552, 0]. This process, called tokenization, is the first step in turning language into data.

The Cleverness of Subwords

Digging deeper, modern AIs use clever "subword" tokenization methods, like Byte-Pair Encoding (BPE). This means that common words might get their own single token, but rarer or more complex words get broken down. For example, a word like "unbelievable" might become three tokens: un, believ, and able. This is incredibly efficient. It allows the model to understand words it has never seen before by recognizing their familiar parts. It's also why a word like "Bangalore" might become two tokens, Banga and lore, if the full name wasn't common enough in the training data to earn its own spot in the model's vocabulary.
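One way to get a feel for subword splitting is a greedy longest-prefix match against a known set of pieces. (Real BPE works differently - it repeatedly merges frequent character pairs learned from data - so this is only a sketch of the outcome, with an invented subword set.)

```python
# Toy set of known subwords; a real vocabulary is learned from data.
SUBWORDS = {"un", "believ", "able", "Banga", "lore", "love"}

def split_subwords(word):
    """Break a word into the longest known subwords, left to right."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in SUBWORDS:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            # No known subword matches: emit a single character and move on.
            pieces.append(word[0])
            word = word[1:]
    return pieces

print(split_subwords("unbelievable"))  # → ['un', 'believ', 'able']
print(split_subwords("Bangalore"))     # → ['Banga', 'lore']
```

A word the model has never seen still decomposes into familiar pieces, which is exactly why subword schemes generalize so well.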

The Context Window

So why is this so important to understand? It comes down to the model's "context window," which is essentially its working memory. Every model has a hard limit on how many tokens it can consider at one time - be it 4,000, 8,000, or even more. This limit includes both my prompt and the AI's response. If a conversation gets too long, the earliest tokens, the beginning of our chat, fall out of this window. The model literally forgets what we first talked about. This explains why an AI might lose track of details in a long discussion. It's not being forgetful in a human sense; its memory buffer is simply full.
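The "falling out of the window" behaviour can be sketched as a sliding buffer of token IDs. The limit here is tiny so the effect is visible; real models allow thousands of tokens. (Actual chat systems manage context in more sophisticated ways - this only illustrates the hard-limit idea.)

```python
CONTEXT_LIMIT = 8  # deliberately tiny; real limits are 4,000+ tokens

def update_context(context, new_tokens):
    """Append new tokens, then keep only the most recent CONTEXT_LIMIT."""
    context = context + new_tokens
    return context[-CONTEXT_LIMIT:]

ctx = []
ctx = update_context(ctx, [40, 262, 9552, 0])        # first message
ctx = update_context(ctx, [11, 22, 33, 44, 55, 66])  # later messages
print(ctx)  # the earliest tokens (40 and 262) have fallen out of the window
```

Once a token slides past the left edge of the buffer, the model has no way to recover it - hence the "forgetting" in long conversations.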

From Numbers to Meaning

The final piece of the puzzle was understanding how these tokens, once converted to numbers, actually convey meaning. The token IDs themselves are arbitrary. The magic happens in the next step: embedding. Each token ID is mapped to a high-dimensional vector, a long list of numbers that represents its meaning and context. These vectors are learned in such a way that words with similar meanings have similar vectors. "Cat" and "dog" will be close in this vector space, while "cat" and "car" will be far apart. This is how the machine starts to understand relationships, nuances, and analogies.
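"Close" and "far apart" can be measured with cosine similarity between the vectors. The 3-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions and are learned during training.

```python
import math

# Toy 3-dimensional embeddings, hand-picked so that "cat" and "dog"
# point in a similar direction while "car" points elsewhere.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def cosine_similarity(a, b):
    """1.0 means same direction (similar meaning); near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"]))  # close to 1
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"]))  # much lower
```

The numbers themselves don't matter; what matters is that the geometry of the space encodes semantic relationships.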

The Full Picture

So, a word becomes data through a two-step process: it's first tokenized into a numerical ID, and then that ID is embedded into a rich, meaningful vector. Understanding this entire flow, from a simple word to a token to an embedding, makes it clear how these models function. It demystifies their limitations and gives a fascinating glimpse into how meaning itself can be encoded in the language of mathematics.
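The whole two-step flow fits in a few lines. Both lookup tables below are toy stand-ins for a learned vocabulary and a learned embedding matrix.

```python
# Step 1 table: word -> token ID (toy vocabulary).
VOCAB = {"cat": 7, "dog": 8}
# Step 2 table: token ID -> vector (toy embedding matrix).
EMBEDDING_TABLE = {7: [0.9, 0.8, 0.1], 8: [0.85, 0.75, 0.2]}

def word_to_vector(word):
    token_id = VOCAB[word]            # step 1: tokenization
    return EMBEDDING_TABLE[token_id]  # step 2: embedding lookup

print(word_to_vector("cat"))  # → [0.9, 0.8, 0.1]
```

Everything a language model does downstream - attention, prediction, generation - operates on these vectors, never on the raw text.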
