From the air, the desert outside Santa Fe was a craggy mass all the way to the horizon. The only non-craggy thing for hundreds of miles was one straight line, a road to town. It was as if somebody had put painter’s tape on a blank page, sketched mountains, shadows and masses of black forest, and then carefully lifted the tape off.
The man next to me on the plane asked if I lived in Santa Fe, and I said no. Did he? He did. And did he know about the Santa Fe Institute, where I had been invited for a workshop? He merely frowned.
Maybe you wouldn’t know about it if you weren’t looking for it. The Santa Fe Institute (SFI) lies on a road leading out of the city, past gentle hills, and eventually into the Sangre de Cristo Mountains, so named because their slopes gleam red at sunrise and sunset. Low, flat Pueblo-style estates—symbols of New Mexican and Texan oil money—are scattered on the piñon-juniper hills like far-flung dominoes. Driving past, you could mistake the institute for just another tycoon’s mega-mansion.
SFI is the spiritual home of complexity science, a kaleidoscopic discipline that studies “complex adaptive systems”—a loose term for systems with many interacting parts, from which emerge behaviors that we could not predict on the basis of any individual component. The human body, ecosystems, the global economy—and increasingly, artificial intelligence—are all touted as examples of such systems.
As someone who studies the societal impacts of AI at an AI company, I was invited to a workshop this past spring called “Measuring AI in the World” to help “shape a more scientifically-grounded approach to measuring AI systems.” The timing felt urgent: as AI systems based on large language models are used by millions of people for just as many purposes, researchers need to create and share information about how they work, what they are (and aren’t) useful for and what their impact on society might look like, so that people can make informed decisions about how to interact with them.
Although the stakes of deploying AI appropriately feel increasingly high, there are few reliable or established approaches for measuring AI on any given dimension of interest. LLMs are as complex as economies, if not more so: they emerge from billions of interconnected parameters, and are trained on vast datasets through processes we don’t fully understand. Like other complex systems, their behaviors are hard to predict, and yet we need robust ways to evaluate them as they become more widely used.
Depending on what the AI is being used for, one might care about whether it’s as conversant in Japanese as it is in English, how consistent its behavior is across slight prompt variations, or whether it can accurately extract relevant information from legal documents. All of these require more complex approaches to measurement than quizzing the model on easily verifiable facts; there just aren’t yet standards about what that should look like. Take AI in health care, for example: existing medical AI benchmarks, like MedQA, test textbook knowledge through multiple-choice questions, but they don’t necessarily capture what’s needed for real-world applications—realistic clinical-reasoning skills, accurately synthesizing findings across disparate studies, or the capacity to adjudicate when to recommend seeking professional medical care instead of relying on a chatbot. This measurement gap is something that I try to close at work. My job involves figuring out which dimensions of AI are important to understand—both for building more beneficial AI systems and for ensuring society has the information it needs about these technologies—and casting around for ways to measure them: probing models to find (and fix) problematic behavior, analyzing real-world patterns of use and misuse, interviewing people about how AI affects their lives.
AI measurement is a new field, and everything is still under contention—not just how we test but what we should be testing for. Throughout the workshop, among desert ridges shaped like breadcrusts, participants puzzled over fundamental questions like: How do we assess whether AI is “reasoning” like humans do? Is it “truly intelligent”—but what does that mean? Even if we don’t understand its inner workings, could we still accurately predict its impact before unleashing it on the world?
The workshop’s fundamental concerns connected directly to my evergreen questions about my own daily endeavors: How does one approach the task of understanding the implications of an entirely new technology? The organizers said that using measurement approaches from existing fields, namely complexity science, could make evaluations of these new systems more consistent and tractable. So I booked a flight, hoping to learn from fields that have long grappled with untidy questions.
●
On the first day of the conference, the hotel bus rattled up to SFI’s front door. Inside, light poured through floor-to-ceiling windows adorned with equations scrawled in silver washable marker. The sunshine bounced off bright, abstract paintings and illuminated a display case full of SFI Press books, with covers bearing crisp typography and whimsical titles like The Quark and the Jaguar.
We, the attendees, moseyed through the corridor in business-casual attire, mixed black coffee with industrial creamer, and stole glances at each other’s name tags. I recognized computer scientists, social scientists, cognitive scientists, philosophers; some in industry, most in academia. We gathered in one big conference room with circular tables and plastic chairs, flanked by French doors through which jagged mesas were visible in the distance.
The opening session revealed a number of challenges confronting AI measurement. People disagree on basic definitions, don’t trust each other’s methods, and struggle to turn nuanced AI behavior into the simple good-or-bad scores that many want. The first dilemma played out when someone raised the (classic) issue of how technologists like to define “intelligence”—as, problematically, the ability to complete tasks. A cognitive scientist with an enviable h-index sighed. “The more pragmatic and technical you are, the more conceptually shallow you are.” A few people chuckled.
“It’s not intelligent if it’s just completing tasks,” another professor concurred. “Unless it can move through the world fluidly, choose what tasks to do like a human can, it is not intelligent.” I wondered why moving through the world fluidly and choosing tasks wasn’t just another task.
The professor had just presented a talk dissecting the problem-solving capabilities of an AI that played a specific board game. She showed how it memorized patterns rather than developing compressed, abstract rules capturing the game’s underlying principles. This, for her, suggested that AI lacked the robust understanding humans develop through abstraction, an understanding she considered crucial for genuine intelligence.
Her claim felt reassuring until my thoughts began to drift. What counts as an abstraction, and are they truly necessary for thinking? My brain seems to comprehend the world by plucking vague images and urges out of a pile of vibes and flinging them up like rainbow confetti. Most days, it skitters around, muddling through associations. This works fine until I’m writing, and I’ll then force abstractions, like ill-fitting jeans, onto my scramble of impressions.
●
Our lunch break was long, so most participants decided to go for a walk through the hills. I trudged through dust, feeling lightheaded, trying to keep an eye out for uneven rocks while also appreciating the view. It didn’t look like we were that high up, so I attributed my dazed feeling to limited sleep and too much sitting. But I googled the altitude anyway: 7,500 feet. Huh? We were merely standing on some overgrown hills. I kept pointing to the valley floor, which was right there, repeating the 7,500-foot statistic. Perhaps my powers of estimation had failed me. Or perhaps the entire region sat on a giant mesa, an enormous natural table. I stood, head pounding, at a height I didn’t recognize, trying and failing to find a reference point that might make sense.
For hours we discussed how to test “intelligence” in AI: running more controlled experiments, creating rubrics for AI measurement grounded in existing cognitive science frameworks, or probing how the AI got to its result, not just its ability to spit out the correct answer. One problem was that tests previously seen as requiring intelligence, such as playing chess or recognizing speech, have been dismissed as requiring mere heuristics, shortcuts rather than true intelligence, now that AI has mastered them. So, people asked, what evaluation would we set up ahead of time that everyone would agree on and that we wouldn’t change?
In day two’s opening session, the workshop organizer called upon an eminent academic who hadn’t spoken the day before. As we sipped drip coffee and poked at our omelets, he stood up.
“We never think of giving extraordinarily competent people IQ tests,” he said. “Administering an IQ test to Marie Curie or Albert Einstein would feel foolish. I suspect that Einstein would do very badly, probably outperformed by some precocious, irritating fourteen-year-old.”
“Academic peer review is widely criticized too,” he added. “For expert production, there is no generally accepted form of evaluation.” He sat down, we looked around at each other, and the conference went on.
We continued to litigate strategies for evaluating AI, never answering the question: If we have never designed a test for the human mind that captures what it aspires to, why do we believe we can meaningfully measure AI systems, which are far more alien?
●
When I was in college, Rilke’s Letters to a Young Poet got me through my debilitating existential anxieties about what to do with my life. Rilke assured me that I could try to “live the questions” that plagued me, even try to love the questions—and perhaps then I would “gradually, without noticing it, live along some distant day into the answer.”
But this is an unnatural thing to do. Human minds do not enjoy living in questions. A concept in psychology called cognitive closure describes how our minds race for clear, firm answers and peaceful resolutions to questions, all to avoid the pain of ambiguity. Finding a reasonable relationship with cognitive closure is especially necessary, and especially difficult, for scientists. The job is to find answers to unresolved questions—but success hinges on finding the right answers, not just the nicest ones.
I know a prolific AI researcher who was raised a devout Catholic but left the church in his late teens. He works on AI alignment, the field dedicated to ensuring AI systems act in accordance with human intentions. Once, I asked him why he went into AI research, and he said he did so because he wanted to create an oracle more intelligent than humans, to tell him the answers to the metaphysical questions he was no longer able to find in God. “For example, what is the correct set of moral principles?”
“You think there’s a right answer to that?” I said. “That’s crazy.”
He blinked. “Yeah, I think so. Maybe human minds can’t know it now, but a superintelligence could.”
He believes in a higher truth that exceeds human comprehension but is available to a superior (albeit human-created) mind. And he wants to keep that mind leashed, under human control. What would it mean to surface a truth that we cannot understand or validate? Does he trust the AI, or does he ultimately trust human beings?
Truth should be accessible yet transcendent, superior but subordinate. Why do our searches for fundamental truth seem so often contradictory?
●
In the second workshop session, “Provocations,” a theoretical biophysicist who has a long-standing appointment at SFI compared the development of AI systems to the evolutionary history of life on Earth. In biology, simple cellular components evolve through recombination into increasingly complex architectures—bacteria, eukaryotes, organs, individuals, cities—with surprising and unpredictable transitions between them. A given architecture defines certain constraints on possible behaviors for life; scaling past the barriers of one regime therefore requires radical shifts in architecture.
“But we don’t have good theories for understanding what’s on the other side of the barrier,” he added. In evolutionary biology, we can see this in retrospect but not predict it ahead of time.
An AI is indeed more grown than designed, more like a plant than the prototypical machine. Every time I’ve trained an AI model, I’m struck by just how meta our choices are—centered around the quality and efficiency of the learning (read: growing) process. We water the AI with computation and data and watch enormous lattices of numbers multiply themselves together over and over again in search of the right internal architectures to solve the problems it’s given. We sit there hoping that the desired capabilities develop, inevitably surprised at what emerges.
Trying to make sense of this in any meaningful way is like trying to understand human psychology at the cellular level. Although measurement of mind-like properties should, in principle, be more tractable for AI—AI systems can be dissected and examined in a way that human brains cannot be—this advantage is largely theoretical. Billions of parameters across multiple layers of numerical computation create a web too vast for human comprehension. We can draw an awe-inspiring graph of an AI’s innards, with galaxies of artificial neurons firing, but how will we understand it?
Like a human mind, AI learns best from experience. I learned to play tennis by drilling forehands and watching pros on YouTube, not from reading tennis books. An AI, too, learns by seeing examples, not through directly internalizing some expressible logic. The latter way of building AI has never really worked out for us. Perhaps because the kind of reasoning—whether artificial or biological—adequate to solve complex problems must be so multifaceted, so necessarily constituted by experience, that even if diagrammed out precisely, it may not make sense as a logical thread. It may be a monstrous hair knot of heuristics and impressions, navigated ordinarily at the speed of gut feeling.
We can only hold four or five items in working memory at a time. Our thoughts allegedly crawl at ten bits per second. We probably cannot fit our own reasoning in our heads.
●
In September 2023, a popular daily webcomic called Saturday Morning Breakfast Cereal featured a determined woman storming into an office building. “My whole life I’ve wanted to understand what consciousness is. Now that we can build artificial minds, we will finally get an answer.” She continues, marching triumphantly past a room full of computer servers: “No more dualism. No more mystery. No waving your hands and saying ‘emergent properties.’” But: “Oh, we don’t know why it works,” says one of the workers. “Neural networks are magic wizard stuff.”
There is a concept in philosophy also called cognitive closure. Unlike the corresponding concept in psychology, it describes not our need for answers but our inability to access them. Philosopher Colin McGinn coined the term to indicate that some philosophical questions may lie beyond human comprehension—the issue of consciousness, for example. Consciousness functions as a medium for thought and perception of the outside world, rather than as an object of representation itself. A camera cannot photograph its own insides.
And indeed, the growing questions about AI consciousness only highlight the unknowability of consciousness in general. We can’t even prove other humans aren’t “philosophical zombies” lacking inner experience, much less grasp animal consciousness after coexisting with animals for millennia. If we can’t definitively confirm firsthand experience in any organic being beyond ourselves, how can we hope to understand potential consciousness in silicon, a completely different substrate and form?
Biomedical researchers don’t know for sure what laboratory animals are experiencing, but they have, through observation, refined empirical guidelines for preventing distress: for example, heat can stress rabbits out, and mice are happier when socialized together. In practice we mosey along, accumulating heuristics, figuring out how to live with other human and nonhuman beings.
●
Our tendency to debate overloaded philosophical terms stems partly from the vexed concepts of “artificial intelligence” and its newer sibling, “artificial general intelligence” (AGI). Some researchers have said that AI’s very name is the field’s original sin, encouraging anthropomorphic thinking and the conflation of disparate capabilities under one loaded umbrella. It also unwisely positions human intelligence as the ultimate goal.
AGI compounds the problem. Now the stated goal of many AI companies, it throws another woolly word, “general,” into the mix. AGI is the kind of term that is just vague and aspirational enough to be useful in the tech industry (“Let’s match human intelligence—on everything!”), which also makes it an intellectual punching bag for academics. One person at the workshop said that pursuing a “general purpose system” was a “fool’s errand,” generality being something that takes you away from any particular user, interaction or context.
At the workshop, it felt like we were both circling around and avoiding this term, which loomed like a storm cloud over our discussions. “Can we just evaluate the artifact as-is, not with respect to ‘AGI’?” someone asked. Another brought up Carl Sagan’s aphorism, that “extraordinary claims require extraordinary evidence,” suggesting that the field had made immense proclamations too early in its existence.
The definitional problem runs deep. A recent MIT Tech Review article attempted to address the same problem as the Santa Fe workshop: how we might build better AI evaluations. The author’s main recommendation was to look to the tools of social science, where “it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test.” If we want to measure how democratic a society is, for example, we first need to define “democratic society” and then establish questions that are relevant to that definition.
I pondered this two-step plan. Definitions seemed squarely part of the problem. We might (might) be able to achieve an acceptable consensus definition of democracy, but the idea of creating a rigorous definition of intelligence, or reasoning, is daunting and full of traps.
I sent the article to a friend doing a Ph.D. in statistics, and she texted back: “Famously, social science is terrible at evaluations.”
●
If we cannot understand what it is, how it’s “thinking,” whether it’s “reasoning” or “conscious” (and we have considerable doubts about our ability to suss out these questions for human beings too), can we at least try to predict and circumscribe how it might impact our world before letting the thing loose?
Contemporary AI systems, like humans, are not very well constrained. They learn to generate realistic text by reading vast amounts of it, absorbing implicit statistical patterns that don’t necessarily mirror human intuitions. Their outputs are flexible, open-ended and able to adaptively respond in many different contexts. This is part of the reason for the contentious “G” in AGI—they can generalize and do things they were never specifically trained to do. A policy and forecasting expert at the workshop commented that part of the measurement challenge is that we seem to have entered “emergent generality world,” where training on and for everything somehow seems to work better than trying to optimize for a set of specific, limited use cases.
Evaluating an AI system’s full impact before real-world deployment thus bears a whiff of futility. People like me and the other workshop participants want to make a go/no-go decision on these models. We subject them to a battery of pre-deployment tests, aiming to cover as many contexts as possible. While many undesirable behaviors can be caught in internal testing, even the fanciest sandbox cannot capture everything that will manifest after you’ve sent a model into the world to interact with orders of magnitude more humans, in orders of magnitude more contexts. There’s always the risk, however small, that a freshly deployed customer service chatbot, prompted the right way, will offer to sell you a 2024 Chevy Tahoe for one dollar or suggest you leave your wife.
To understand what someone is truly capable of, one must observe them as they go out into the world and make something out of the life they’ve been given. We got the true measure of Einstein’s impact not by IQ-testing him but by watching him invent theories of physics.
Perhaps we have to set AI loose in the world in order to really understand it—a heightened version of the Collingridge dilemma: the idea that a technology’s impact cannot be easily predicted until it is extensively developed and widely used, but once that happens, controlling or changing it becomes difficult.
In practice, AI is often measured through interaction. People play with a chatbot for a day or two and decide whether to trust it, relying on qualitative experience, not quantitative metrics, much like learning to trust a new friend over time. We can’t peer into their brains or quantify their loyalty. As our technologies become just as complex as we are, we may see a return to the qualitative, the experiential and the aesthetic as the most appropriate way to make sense of the changing world around us.
The tension between quantitative and qualitative knowledge is as old as civilization. As Shigehisa Kuriyama details in his book The Expressiveness of the Body, ancient Greek and Chinese physicians developed radically different approaches to conceptualizing and reading pulses. Greeks boiled the pulse down to rhythm, frequency, speed, size—a singular, countable beat for the heart. Chinese practitioners distinguished multiple “pulses” that revealed different organs’ health, describing them in terms the Greeks would have found disturbingly figurative: “rough,” like “rain-soaked sand,” or “slippery,” like “rolling pearls.” Greek texts defined the pulse largely stripped of interpretation; Chinese texts conveyed only interpretation, only the experience of pulse-taking, never what the pulse actually “was.” The Europeans who inherited the Greek approach found the “dense, tangled mesh of interrelated, interpenetrating sensations” that Chinese doctors navigated threatening to a secure science. Yet even their certainty occasionally faltered: the French physician Théophile de Bordeu agreed with the Chinese, saying the pulse could only be known by touch, “by experience and not by reasoning, in much the same way that one comes to know colors,” as Kuriyama writes.
When all the latest AI models ace all the standardized tests we’ve made, we’re left sharing screenshots on social media to capture their differences; search “big model smell” on X and you’ll find people discussing the peculiarly unquantifiable vibe of a very capable model. Engineers describe working with AI in terms of whether it feels like collaborating with a senior or junior colleague. When metrics fail, we reach for a familiar metaphor: human beings.
●
Cormac McCarthy was a fellow at the Santa Fe Institute for decades, an author among mathematicians and physicists. He used to work on his Olivetti typewriter at a large hardwood desk in the SFI library. In his last novels, The Passenger and Stella Maris, characters use mathematical proofs as lenses on the nature of reality, quantum mechanics and consciousness. The protagonist of Stella Maris is a young, brilliant and psychologically messy mathematical genius, for whom mathematics appears as both a salvation and a curse. It offers her precision in a world of ambiguity, but its very exactitude highlights how much remains unknowable. She spirals.
To me, the great appeal of complexity science is its willingness to reckon with the emergent mystery in the world, trying to get its arms around it, going out bravely to confront it. Still, standing in the SFI library flipping through the case-bound pages of Foundational Papers in Complexity Science, letting my mind glide over the formulas and diagrams, I’m struck by how the terminology of complexity science is primarily mathematical, looking to collapse complexity into obedient equations that confine inquiry to the highest levels of abstraction. I wonder if fiction offered McCarthy what formulas could not: a language for the ineffable aspects of complexity, when the systems we study confound our scholarly instruments. His blunt, evocative, quasi-psychedelic prose slices a knife into those facets of our experience science has a hard time accounting for.
●
Over the course of two days in Santa Fe, we established some problems and some principles: it’s hard to assess general-purpose tools used in countless contexts, and we need context-specific yet systematic evaluations. Prediction of novel systems in novel contexts is difficult; we struggle to measure even that which seems simple. But what should we actually do about that?
In the final few sessions of the workshop, we split off into smaller groups to work on specific topics and then came back together to present them. A thought had been capering around in the back of my mind: What if our measurement systems were also complex systems? What about something that can hitch onto the back of the problem, and scale with its size? My group prepared some notes on the idea of post-deployment evaluations for understanding AI’s societal impacts—incident reporting systems that catalog problems as they occur, citizen science initiatives, ways in which our measurement system can, instead of fixing a specific frame and insisting on answers a priori, be adaptive and co-evolve alongside our technological society.
This wouldn’t address fundamental questions about what an AI system is or how it works, but it might help us figure out what we’re doing with it—by stepping outside of the clean laboratory environment and working within our constraints, rather than despairing over them. The very technologies we invent to extend our knowledge, powerful searchlights we aim into darkness, also reveal the distinct outline of all that we still cannot know.