How many questions will we need to ask AI to ascertain that we’ve reached AGI and ASI?
In today’s column, I explore an intriguing and unresolved AI topic that hasn’t received much attention but certainly deserves considerable deliberation. The issue is this. How many questions should we be prepared to ask AI to ascertain whether AI has reached the vaunted level of artificial general intelligence (AGI) and perhaps even attained artificial superintelligence (ASI)?
This is more than merely an academic philosophical concern. At some point, we should be ready to agree on whether the advent of AGI and ASI has been reached. The likely way to do so entails asking questions of AI and then gauging the intellectual acumen expressed by the AI-generated answers.
So, how many questions will we need to ask?
Let’s talk about it.
This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
Heading Toward AGI And ASI
First, some fundamentals are required to set the stage for this weighty discussion.
There is a great deal of research going on to further advance AI. The general goal is to reach artificial general intelligence (AGI) or perhaps even the more distant possibility of achieving artificial superintelligence (ASI).
AGI is AI that is considered on par with human intellect and can seemingly match our intelligence. ASI is AI that has gone beyond human intellect and would be superior in many if not all feasible ways. The idea is that ASI would be able to run circles around humans by outthinking us at every turn. For more details on the nature of conventional AI versus AGI and ASI, see my analysis at the link here.
We have not yet attained AGI.
In fact, it is unknown whether we will ever reach AGI; it might be achieved in a few decades, or perhaps not for centuries. The AGI attainment dates floating around vary wildly and are unsubstantiated by any credible evidence or ironclad logic. ASI is even more beyond the pale when it comes to where we are currently with conventional AI.
About Testing For Pinnacle AI
Part of the difficulty facing humanity is that we don’t have a surefire test to ascertain whether we have reached AGI and ASI.
Some people proclaim rather loftily that we’ll just know it when we see it. In other words, it’s one of those fuzzy aspects that defies any kind of systematic assessment. An overall feeling or intuitive sense on our part will lead us to decide that pinnacle AI has been achieved.
Period, end of story.
But that can’t be the end of the story since we ought to have a more mindful way of determining whether pinnacle AI has been attained. If the only means consists of a Gestalt-like emotional reaction, there is going to be a whole lot of confusion that will arise. You will get lots of people declaring that pinnacle AI exists, while lots of other people will insist that the declaration is utterly premature. Immense disagreement will be afoot.
See my analysis of people who are already falsely believing that they have witnessed pinnacle AI, such as AGI and ASI, as discussed at the link here.
Some form of bona fide assessment or test that formalizes the matter is sorely needed.
I’ve extensively discussed and analyzed a well-known AI-insider test known as the Turing Test, see the link here. The Turing Test is named after the famous mathematician and early computer scientist Alan Turing. In brief, the idea is to ask questions of AI, and if you cannot distinguish the responses from those of what a human would say, you might declare that the AI exhibits intelligence on par with humans.
Turing Test Falsely Maligned
Be cautious if you ask an AI techie what they think of the Turing Test. You will get quite an earful. It won’t be pleasant.
Some believe that the Turing Test is a waste of time. They will argue that it doesn’t work suitably and is outdated. We’ve supposedly gone far past its usefulness. You see, it was a test devised by Alan Turing in his 1950 paper. That’s some 75 years ago. Nothing from that long ago can apparently be applicable in our modern era of AI.
Others will haughtily tell you that the Turing Test has already been successfully passed. In other words, the Turing Test has been purportedly passed by existing AI. Lots of banner headlines say so. Thus, the Turing Test isn’t of much utility since we know that we don’t yet have pinnacle AI, but the Turing Test seems to say that we do.
I’ve repeatedly tried to set the record straight on this matter. The real story is that the Turing Test has been improperly applied. Those who claim the Turing Test has been passed are playing fast and loose with the famous testing method.
Flouting The Turing Test
Part of the loophole in the Turing Test is that the number of questions and type of questions are unspecified. It is up to the person or team that is opting to lean into the Turing Test to decide those crucial facets. This causes unfortunate trouble and problematic results.
Suppose that I decide to perform a Turing Test on ChatGPT, the immensely popular generative AI and large language model (LLM) that 400 million people use weekly. I will seek to come up with questions to ask ChatGPT. I will also ask the same questions of my closest friend to see what answers they give.
If I am unable to differentiate the answers from my human friend versus ChatGPT, I shall summarily and loudly declare that ChatGPT has passed the Turing Test. The idea is that the generative AI has successfully mimicked human intellect to the degree that the human-provided answers and the AI-provided answers were essentially the same.
Suppose that, after coming up with fifty questions, some easy and some hard, I proceed with my administration of the Turing Test. ChatGPT answers each question, and so does my friend. The answers from the AI and the answers from my friend are pretty much indistinguishable from each other.
Voila, I can start telling the world that ChatGPT has passed the Turing Test. The whole exercise takes only about an hour: half the time coming up with the questions, and half collecting the respective answers.
Easy-peasy.
The Number Of Questions
Here’s a thought for you to ponder.
Do you believe that asking fifty questions is sufficient to determine whether intellectual acumen exists?
That somehow doesn’t seem sufficient. This is especially the case if we define AGI as a form of AI that is intellectually on par with the entire range and depth of human intellect. It turns out that the questions I came up with for my run of the Turing Test didn’t include anything about chemistry, biology, or many other disciplines and domains.
Why didn’t I include those realms?
Well, I had chosen to compose just fifty questions.
You cannot probe with any semblance of depth and breadth across all human knowledge in a mere fifty questions. Sure, you could cheat and ask a question that implores the person or the AI to rattle off everything they know. In that case, presumably, at some point, the “answer” would include chemistry, biology, and so on. That’s not a viable approach, as I discuss at the link here, so let’s put aside the broad strokes and aim for specific questions rather than smarmy catch-all questions.
How Many Questions Is Enough
I trust that you are willing to concede that the number of questions is important when performing a test that tries to ascertain intellectual capabilities. Let’s try to come up with a number that makes some sense.
We can start with the number zero. Some believe that we shouldn’t have to ask even one question. The AI has the onus to convince us that it has attained AGI or ASI. Therefore, we can merely sit back and see what the AI says to us. We either are ultimately convinced by the smooth talking, or we aren’t.
A big problem with the zero approach is that the AI could prattle endlessly and might simply be doing a dump of everything it has patterned on. The beauty of asking questions is that you get an opportunity to jump around and potentially find blank spots. If the AI is only spouting whatever it has to say, the wool could readily be pulled over your eyes.
I suggest that we agree to use a non-zero count. We ought to ask at least one question. The difficulty with being constrained to one question is that we are back to the conundrum of either missing the boat by hitting only one particular nugget, or asking for the entire kitchen sink in an overly broad manner. Neither option is satisfying.
Okay, we must ask at least two or more questions. I dare say that two doesn’t seem high enough. Does ten seem like enough questions? Probably not. What about one hundred questions? Still doesn’t seem sufficient. A thousand questions? Ten thousand questions? One hundred thousand questions?
It’s hard to judge where the right number might be. Maybe we can noodle on the topic and figure out a ballpark estimate that makes reasonable sense.
Let’s do that.
Recent Tests Of Top AI
You might know that every time one of the top AI makers comes out with a new version of their generative AI, they run a battery of AI assessment tests to gleefully showcase how much better their AI is than competing LLMs.
For example, Grok 4 by Elon Musk’s xAI was recently released, and xAI and others ran many of the specialized tests that have become relatively popular to see how well Grok 4 compares. Tests included (a) Humanity’s Last Exam (HLE), (b) ARC-AGI-2, (c) GPQA, (d) USAMO 2025, (e) AIME 2025, (f) LiveCodeBench, (g) SWE-Bench, and other such tests.
Some of those tests have to do with the AI being able to generate program code (e.g., LiveCodeBench, SWE-Bench). Some of the tests are about being able to solve math problems (e.g., USAMO, AIME). The GPQA test is science-oriented.
Do you know how many questions are in the GPQA testing set?
The full (extended) set totals 546 questions; the curated Main Set comprises 448 of them, and the hardest 198 of those form the Diamond Set.
If you are interested in the nature of the questions in GPQA, visit the GPQA GitHub site, plus you might find of interest the initial paper entitled “GPQA: A Graduate-Level Google-Proof Q&A Benchmark” by David Rein et al., arXiv, November 20, 2023. Per that paper: “We present GPQA, a challenging dataset of 448 multiple choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are ‘Google-proof’).”
Please be aware that you are likely to hear some eyebrow-raising claims that a generative AI is better than PhD-level graduate students across all domains because of particular scores on the GPQA test. That is a breathtakingly sweeping statement, and it misleadingly portrays the testing that actually takes place.
In short, any such proclamation should be taken with a humongous grain of salt.
Ballparking The Questions Count
Suppose we come up with our own handy-dandy test that has PhD-level questions. The test will have 600 questions in total, split evenly across 6 domains: (1) physics, (2) chemistry, (3) biology, (4) geology, (5) astronomy, and (6) oceanography. That means we are going to have 100 questions in each discipline. For example, there will be 100 questions about physics.
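The composition of this straw-man test boils down to a simple multiplication; here is a minimal sketch of the arithmetic, using the domain names and counts from the text:

```python
# Straw-man PhD-level test: 6 science domains, 100 questions apiece.
domains = ["physics", "chemistry", "biology",
           "geology", "astronomy", "oceanography"]
questions_per_domain = 100

total_questions = len(domains) * questions_per_domain
print(total_questions)  # 600
```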
Are you comfortable that, by asking a human being a set of 100 questions about physics, we will be able to ascertain the entire range and depth of their knowledge and intellectual prowess in physics?
I doubt it. You will certainly be able to gauge a semblance of their physics understanding. The odds are that with just 100 questions, you are only sampling their knowledge. Is that a large enough sampling, or should we be asking even more questions?
Another consideration is that we are only asking questions regarding 6 domains. What about all the other domains? We haven’t included any questions on meteorology, anthropology, economics, political science, archaeology, history, law, linguistics, etc.
If we want to assess an AI such as the hoped-for AGI, we presumably need to cover every possible domain. We also need to have a sufficiently high count of questions per domain so that we are comfortable that our sampling is going deep and wide.
Devising A Straw Man Count
Go with me on a journey to come up with a straw man count. Our goal will be an order-of-magnitude estimate rather than an exact number per se. We want a ballpark figure so that we know roughly what range we are dealing with.
We will begin the adventure by noting that the U.S. Library of Congress has an extensive set of subject headings, commonly known as the LCSH (Library of Congress Subject Headings). The LCSH was started in 1897 and has been updated and maintained since then. The LCSH is generally considered the most widely used subject vocabulary in the world.
As an aside, some people favor the LCSH and some do not. There are heated debates about whether certain subject headings are warranted. There are acrimonious debates concerning the wording of some of the subject headings. On and on the discourse goes. I’m not going to wade into that quagmire here.
As of April 2025, the LCSH contained 388,594 records. I am going to round that number to 400,000 for the sake of this ballpark discussion. We can quibble about that, along with quibbling over whether all those subject headings are distinctive and usable, but I’m not taking that route for now.
Suppose we came up with one question for each of the LCSH subject headings, such that whatever that domain or discipline consists of, we are going to ask one question about it. We would then have 400,000 questions ready to be asked.
One question per realm doesn’t seem sufficient.
Consider these possibilities:
- (a) 400K questions: 1 question x 400K LCSH
- (b) 4M questions: 10 questions x 400K LCSH
- (c) 40M questions: 100 questions x 400K LCSH
- (d) 400M questions: 1,000 questions x 400K LCSH
- (e) 4B questions: 10,000 questions x 400K LCSH
- (f) 40B questions: 100,000 questions x 400K LCSH
- (g) 400B questions: 1M questions x 400K LCSH
- Etc.
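The options above amount to scaling one number: questions per heading times the rounded LCSH record count. A short sketch of the calculation (the 400,000 figure is the rounded count from the text):

```python
# Order-of-magnitude totals: questions per LCSH subject heading
# multiplied by the rounded count of LCSH records (~388,594 as of
# April 2025, rounded to 400K for ballpark purposes).
LCSH_HEADINGS = 400_000

for per_heading in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    total = per_heading * LCSH_HEADINGS
    print(f"{per_heading:>9,} per heading -> {total:>15,} total questions")
```

Running this reproduces the ladder of options, from 400K questions at one per heading up to 400B at a million per heading.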
If we pick option (e), with 10,000 questions per LCSH heading, we will need to come up with 4 billion questions. That’s a lot of questions. But maybe 10,000 questions per realm isn’t sufficient. We might go with 100,000 questions each, which brings the grand total to 40 billion questions.
Gauging AGI Via Questions
Does asking a potential AGI billions of questions, i.e., 4B to 40B, spread evenly across all “known” domains, seem to provide a sufficient range and depth of testing?
Some critics will say that it is hogwash. You don’t need to ask that many questions. It is vast overkill. You can use a much smaller number. If so, what’s that number? And what is the justification for that proposed count? Would the number be on the order of many thousands or millions, if not in the billions? And don’t try to duck the matter by saying that the count is somehow amorphous or altogether indeterminate.
In the straw man case of billions, skeptics will say that you cannot possibly come up with a billion or more questions. It is logistically infeasible. Even if you could, you would never be able to assess the answers given to those questions. It would take forever to go through those billions of answers. And you need experts across all areas of human knowledge to judge whether the answers were right or wrong.
A counterargument is that we could potentially use AI, an AI other than the AGI being tested, to aid in the endeavor. That too has upsides and downsides. I’ll be covering that consideration in an upcoming post. Be on the watch.
There are certainly a lot of issues to be considered and dealt with. The extraordinarily serious matter at hand is worthy of addressing these facets. Remember, we are focusing on how we will know that we’ve reached AGI. That’s a monumental question. We should be prepared to ask enough questions that we can collectively and reasonably conclude that AGI has been attained.
As Albert Einstein aptly put it: “Learn from yesterday, live for today, hope for tomorrow. The important thing is not to stop questioning.”
Source: https://www.forbes.com/sites/lanceeliot/2025/07/20/the-number-of-questions-that-agi-and-ai-superintelligence-need-to-answer-for-proof-of-intelligence/