The advent of Thinking Tokens (TTs) is a contributor to the inflationary aspects of generative AI and LLMs.
In today’s column, I examine an AI-insider topic that has rather startling inflationary impacts on the overall costs of running generative AI and large language models (LLMs), and that is relatively unknown to those outside of the AI community. I will walk you through the technical underpinnings and explain the core considerations, which have to do with a controversial approach encompassing “thinking tokens” (TTs).
Let’s talk about it.
This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
Tokenization Is Crucial
The heart of the matter entails the tokenization aspects of modern-era generative AI and LLMs. I’ve covered the details of tokenization at the link here. I will provide a quick overview to get you up to speed.
When you enter text into AI, the text gets converted into various numbers. Those numbers are then dealt with throughout the rest of the processing of your prompt. Once the AI has arrived at an answer, the answer is actually in a numeric format and needs to be converted back into text, so it is readable by the user. The AI proceeds to convert the numbers into text and displays the response accordingly.
That whole process is known as tokenization. The text that you enter is encoded into a set of numbers. The numbers are referred to as tokens. The numbers, or shall we say tokens, flow through the AI and are used to figure out answers to your questions. The response is initially in the numeric format of tokens and needs to be decoded back into text.
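To make the tokenization idea tangible, here is a tiny illustrative sketch in Python. The word-level vocabulary and the specific token numbers are made up purely for this example; real LLMs use far larger subword vocabularies and more sophisticated tokenizers.

```python
# Minimal sketch of tokenization: a toy word-level vocabulary is used here
# purely for illustration; production LLMs use subword tokenizers, and the
# vocabulary and token IDs below are invented for this example.

vocab = {"the": 1, "dog": 2, "barked": 3, "at": 4, "cat": 5, ".": 6}
inverse_vocab = {token_id: word for word, token_id in vocab.items()}

def encode(text: str) -> list[int]:
    """Convert text into a list of token IDs (numbers)."""
    words = text.lower().replace(".", " .").split()
    return [vocab[word] for word in words]

def decode(token_ids: list[int]) -> str:
    """Convert token IDs back into readable text."""
    return " ".join(inverse_vocab[token_id] for token_id in token_ids)

token_ids = encode("The dog barked at the cat.")
print(token_ids)          # [1, 2, 3, 4, 1, 5, 6]
print(decode(token_ids))  # the dog barked at the cat .
```

The point is simply that text goes in, numbers flow through the AI, and numbers eventually come back out as readable text.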
Fortunately, an everyday user is blissfully unaware of the tokenization process. There is no customary need for them to know about it. The topic is of keen interest to AI developers, but of little interest to the general public. All sorts of numeric trickery are often employed to try and make the tokenization process as fast as possible so that the AI isn’t being held up during the encoding and decoding that needs to occur.
When Humans Need Thinking Time
I’d like to shift gears and bring up an aspect about humans and the way in which humans think. After doing so, I’ll go back to the AI tokenization topic and tie the two seemingly disparate considerations into a cohesive explanation. Hang in there, the wait will be worth it.
When humans are answering a question or telling a story, you will often observe that they sometimes pause or slow down, doing so presumably to grab a snippet of additional thinking time. Filler words or sounds such as “uh” and “you know” serve a similar purpose. Generally, the brain and mind are trying to create a break in the real-time processing of speech so that there is breathing room to have thinking time.
I suppose you could liken this to watching a streaming video, and at times, the screen briefly pauses as the network catches up with the pace of the streaming. The servers are being taxed to rapidly distribute the video. The network might be capable of moving the data at high speeds, while the server might not be up to snuff and cannot emit the video at the hoped-for speed.
In any case, I’m sure you’ve had to do something along the lines of pausing or breaking your speaking pace when trying to tell a convoluted story or answering tough questions. You are buying time to let your mind do its thing. It isn’t a phenomenon that occurs all the time. The typical situation is when you are thinking hard, maybe doing complex mental arithmetic calculations or attempting to arrange your thoughts in an especially logical order.
This pausing action doesn’t usually happen if you are merely reciting a fully memorized story or answering easy-peasy questions. The usual circumstance is one requiring your mind to become fully engaged, and that’s when the threshold is reached, causing you to consciously or subconsciously stall your speaking rate to handle the heightened mental workload.
Analogy To Generative AI And LLMs
Without anthropomorphizing AI, this idea of buying time to let more processing happen has been utilized in the realm of generative AI and LLMs. This is more akin to my example about servers and network processing video streaming. With AI, it has to do with tokenization.
The tokens that are being processed by the AI are sometimes handled one at a time (this is a simplification of more advanced LLMs, but it is a worthwhile means for discussing the matter at a 30,000-foot level). Construe the tokens as residing on a conveyor belt and moving along on a kind of assembly or processing line.
As each token comes up to the core area for processing, the AI is designed to give it a fixed amount of processing time. The token arriving at the core area gets its allotted fixed time, and then the conveyor belt moves things along. One by one, the tokens coming up to the processing spot are all getting their same and fair share of time.
But what if the tokens are part of a really tough question that is being addressed? The fixed amount of time might not be sufficient for the AI to consider a larger range of possibilities. In a sense, the fixed amount of time is going to stifle the AI from doing enough processing to figure out a more well-rounded answer.
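If you’d like to see the conveyor-belt notion in a rough sketch, here is a hypothetical Python illustration. The function names and the fixed number of compute steps are inventions for demonstration purposes and do not reflect the internals of any particular LLM.

```python
# Hypothetical sketch of the "conveyor belt" idea: every token gets the same
# fixed compute budget, no matter how hard the surrounding question is.

FIXED_STEPS_PER_TOKEN = 4  # every token's identical allotment of compute

def process_token(token: str, steps: int) -> str:
    """Stand-in for the model's internal work on a single token."""
    return f"processed '{token}' with {steps} compute step(s)"

def run_conveyor_belt(tokens: list[str]) -> None:
    for token in tokens:
        # Each token, whether part of an easy or a hard question,
        # receives the same fixed share of processing time.
        print(process_token(token, FIXED_STEPS_PER_TOKEN))

run_conveyor_belt(["The", "dog", "barked", "at", "the", "cat", "."])
```

Notice that a token belonging to a tough question gets exactly the same allotment as a token belonging to a trivial one, and that’s the crux of the problem.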
Trying To Solve The Dilemma
Which do you think is more important – getting the AI to work swiftly or getting better answers?
Let’s assume that users are more likely to tolerate a bit of latency if they are getting better answers to their questions, particularly in the circumstance of the AI handling tough questions. A means to give the AI more processing time would be to toss into the mix a special kind of token that somewhat pauses the streaming of the answer and allows the internal processing to get an added stint of computer time.
Here’s what we can do. A special token that only serves to spur breathing time will be added to the assembly line of tokens that are moving along on the conveyor belt. The special token has no other notable purpose and requires no processing of its own. All that this token does is act as an “uh” or “you know” and create a kind of gap between the tokens that are flowing along.
The Special Token Gets Into The Game
Researchers tried this approach and observed that the special tokens were indeed giving the AI added processing time to cope with the true tokens. We might therefore put these special tokens into a stream of tokens so that, from moment to moment, the AI is getting boosted processing time.
I will illustrate this with a brief example. Assume that the AI is processing the sentence “The dog barked at the cat.” Let’s assume that each word is turned into a token. And, each of those tokens will get the same amount of fixed time for processing inside the AI.
Imagine that we wanted to dovetail more processing time into this.
We could insert a special token like this: “The dog <T> barked at the cat.” The <T> is the special token, quietly slotted into the stream so that the AI gets a bit of extra processing time for the real tokens surrounding it.
How many of these special tokens would we place into a prompt that has been entered by a user?
Well, it depends. If the prompt entails an easy question, it might be prudent not to toss any of the special tokens into the mix. Just let the AI do its usual processing. If the user has provided a tough question, we might place a few special tokens into the mix. We might even go so far as placing a special token after every token of the prompt, getting us a lot of added processing time, for example, “The <T> dog <T> barked <T> at <T> the <T> cat <T>.”
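Here is a small illustrative sketch of that insertion step. The <T> marker mirrors the notation used in the research paper discussed below, while the difficulty-based insertion policy is a made-up heuristic solely for demonstration.

```python
# Illustrative sketch of inserting the special token; the insertion policy
# below is a hypothetical heuristic, not how any particular AI maker does it.

THINKING_TOKEN = "<T>"

def insert_thinking_tokens(tokens: list[str], difficulty: str) -> list[str]:
    """Insert <T> tokens based on a (made-up) difficulty rating of the prompt."""
    if difficulty == "easy":
        return list(tokens)                      # no special tokens at all
    if difficulty == "medium":
        return list(tokens) + [THINKING_TOKEN]   # just a little extra breathing room
    padded = []                                  # "hard": a <T> after every token
    for token in tokens:
        padded.extend([token, THINKING_TOKEN])
    return padded

prompt = ["The", "dog", "barked", "at", "the", "cat", "."]
print(insert_thinking_tokens(prompt, "easy"))
print(insert_thinking_tokens(prompt, "hard"))
```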
Experimenting With This Approach
Some enterprising researchers went ahead and experimented with the inclusion of special tokens into the internal process of generative AI and LLMs. I’ll show you their initial findings.
Before I do so, I’ll mention a small fracas about this. They could have referred to the special tokens as pausing tokens, or take-a-break tokens, but instead they named the special tokens as so-called thinking tokens. Not everyone likes that naming convention. Here’s why. The takeaway that you might have is that these are tokens that somehow intrinsically embody thinking themselves, but the real purpose is to shift the AI away from processing the special token and instead allot more time to the other true tokens that are being processed.
Do you see confusion with the naming?
Furthermore, there are other kinds of special tokens that are intended to provide added value intrinsically by themselves. You might be tempted to refer to those tokens as thinking tokens. To avoid that kind of confusion, the general parlance is that those are reasoning tokens. The problem concerning names gets regrettably worse since an AI developer might sloppily refer to reasoning tokens as thinking tokens or refer to thinking tokens as reasoning tokens. It’s messy.
Just thought you’d like to be up on the parlance snag.
Classic Research Paper
The now-classic research paper that caught attention on the thinking tokens topic (others had been toying with special tokens too) was entitled “Thinking Tokens for Language Modeling” by David Herel and Tomas Mikolov, arXiv, May 14, 2024, and made these salient points (excerpts):
- “Our approach is to introduce special ’thinking tokens’ (< T >) after each word in a sentence whenever a complex problem is encountered.”
- “The core idea is that each ’thinking token’ would buy more time for the model before an answer is expected, which would be used to run additional computations to better answer a complex problem that was presented.”
- “This concept has great potential in recurrent neural networks due to their architecture, because it enables the RNN to perform multiple in-memory operations in a single step, meaning that extra calculations can be run in the hidden layer multiple times.”
- “Experiments execution has successfully produced numerous examples where the usage of ’thinking tokens’ leads to an improvement in the model’s judgment. Preliminary results show that sentences that require non-trivial reasoning, have the biggest improvement in perplexity when ’thinking tokens’ are used compared to the standard model.”
The beauty of this approach is that you don’t have to do much to implement it. You don’t need to utterly rejigger the guts of the AI system. Instead, just make a modest modification to allow for a special token that, when encountered, is not processed for itself, and instead allows more processing time for the other nearby real tokens.
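A rough sketch of that modest modification might look like the following. This is illustrative pseudocode-style Python rather than the internals of any actual AI system; the idea is simply that the special token does no work of its own and banks extra compute steps for the real tokens next to it.

```python
# Minimal sketch of the "modest modification": the special token is not
# processed for itself, it merely grants extra compute to neighboring real
# tokens. All names and step counts are assumptions for illustration.

THINKING_TOKEN = "<T>"
FIXED_STEPS_PER_TOKEN = 4   # the normal allotment per token
BONUS_STEPS = 4             # extra time each <T> buys for the real tokens

def run_modified_belt(tokens: list[str]) -> None:
    banked_steps = 0
    for token in tokens:
        if token == THINKING_TOKEN:
            # The special token does no real work; it just banks extra steps.
            banked_steps += BONUS_STEPS
            continue
        steps = FIXED_STEPS_PER_TOKEN + banked_steps
        banked_steps = 0
        print(f"processed '{token}' with {steps} compute step(s)")

run_modified_belt(["The", "dog", "<T>", "barked", "at", "the", "cat", "."])
```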
Too Much Of A Good Thing
Since this clever trickery tends to get better answers from AI, plus you don’t need to completely turn the AI upside down to adopt the approach, many AI makers were eager to put this into practice. Envision that we go ahead and add the special tokens to all manner of user prompts that are being entered into the AI. The user doesn’t see that we are doing so. It is all done internally within the AI.
A user is going to potentially experience added latency. Their answer doesn’t appear as quickly as it might have. Maybe the user notices, maybe they don’t. Another factor is cost. The AI is doing more processing because the special token is prodding it to do so. If a user is paying by the number of tokens processed or by the amount of computing time consumed, they are going to see an increase in their usage billing.
That’s where scale begins to rear its ugly head.
An individual user might not be able to discern an uptick in their billing or realize that their answers are taking a slightly longer time to appear. What about when we have thousands of users, or maybe millions of users? You might be aware that OpenAI touts the claim that they have 800 million weekly active users. It’s an impressively large number of users.
The gist is that at scale, the consumption of computer processing time is bound to be a lot higher than it otherwise would have been. This is happening on a global basis. Servers in vast data centers are churning away, including doing so because of the inserted special tokens. The electrical energy consumed goes up, as does the demand on whatever cooling method might be used, such as water cooling.
Some refer to this as an inflationary impact on the use of AI.
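To put a rough number on that inflationary framing, consider a back-of-envelope sketch. Every figure below, other than the widely reported 800 million weekly active users mentioned above, is an assumption chosen purely for illustration.

```python
# Back-of-envelope illustration of the inflationary effect at scale.
# All figures except the weekly active user count are illustrative assumptions.

weekly_active_users = 800_000_000     # figure cited above for OpenAI
prompts_per_user_per_week = 10        # assumed for illustration
tokens_per_prompt = 200               # assumed for illustration
thinking_token_rate = 0.25            # assumed: one hidden <T> per four real tokens

base_tokens = weekly_active_users * prompts_per_user_per_week * tokens_per_prompt
extra_tokens = int(base_tokens * thinking_token_rate)

print(f"Baseline tokens per week:       {base_tokens:,}")
print(f"Extra hidden tokens per week:   {extra_tokens:,}")
print(f"Inflation in processing volume: {extra_tokens / base_tokens:.0%}")
```

Under those illustrative assumptions, the hidden tokens add hundreds of billions of extra token-processing events each week.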
Questioning Those Thinking Tokens
One perspective is that the AI makers ought to be more judicious in how they use thinking tokens. Some are very careful, others are more loosey-goosey. The crux is that there is an ROI or tradeoff of employing the thinking tokens. Use them sparingly, some insist. Others reply that thinking tokens are simply the natural need to ensure that enough processing time is being devoted to what users ask. Use thinking tokens adroitly and don’t worry about the worrywarts.
Some researchers have critically assessed the value of thinking tokens and emphasized that other approaches might be a better way to go. One such study asserted that the use of chain-of-thought (known as CoT, which I explain in detail at the link here) gets you more bang for the buck, as noted in “Rethinking Thinking Tokens: Understanding Why They Underperform in Practice” by Sreeram Vennam, David Valente, David Herel, and Ponnurangam Kumaraguru, arXiv, November 18, 2024, per these points (excerpts):
- “Thinking Tokens (TT) have been proposed as an unsupervised method to facilitate reasoning in language models.”
- “However, despite their conceptual appeal, our findings show that TTs marginally improves performance and consistently underperforms compared to Chain-of-Thought (CoT) reasoning across multiple benchmarks.”
- “We hypothesize that this underperformance stems from the reliance on a single embedding for TTs, which results in inconsistent learning signals and introduces noisy gradients.”
- “When a single embedding is used for TTs, during backpropagation the model receives inconsistent learning signals, leading to noisy gradient updates. This noise disrupts learning, particularly in tasks that require structured intermediate steps, such as arithmetic reasoning or multi-hop commonsense tasks.”
The matter is still up in the air, and there are heated debates among AI developers.
Hidden Tokens And Inflationary AI
I am assuming that you might be somewhat startled to now realize that there are hidden tokens that are potentially stoking greater consumption of computing resources, and these out-of-view tokens are considered to be an inflationary element of modern-era AI.
As the famed Arthur Conan Doyle once remarked: “It has long been an axiom of mine that the little things are infinitely the most important.” That is equally true when it comes to what is happening inside contemporary AI.