On The Pursuit of Human-Level Interestingness Value in AI2-Generated Literary Textual Events
Recently deployed Authorial Intelligence (AI2) techniques, composed of drama management components and natural language generation algorithms, combined with a more speculative Reader Intelligence (RI) tool, still in development, offer the opportunity of achieving the long-thought-impossible goal of autogenerated human-level literary textual events. A small but growing research community is making significant progress toward employing AI2 and RI in tandem to restore integrity to disciplines and industries that advances in narrative autogeneration have inadvertently compromised. Our hypothesis: a surprising range of corporations, various independent research teams, and the defense industry each now acknowledge culpability in having fractured the cultural infrastructure upon which their various ambitions stand. The present moment appears bereft, but complete collapse can be avoided by restoring the precarious balance between authorial aesthetic goals and reader-perceived interestingness value.
INTRODUCTION AND HISTORY
Storytelling has long been an integral part of life. Stories create context, motivate readers, and move the action forward. The market potential of narrative autogeneration, loosely put, was first recognized more than three decades ago, and the early evolution of emergent interactive story creation techniques was fueled by the raw curiosity of (1) academic pioneers into Artificial Intelligence (AI), now established as a discrete field of study in degree-awarding institutions, and (2) the more aggressive, caffeine-charged enthusiasm of computer game designers, who were motivated on the one hand by video game players’ insatiable appetite for diverting amusement, and on the other hand by their own quixotic pursuit of graphic perfection.
Initially, when the potential military use of visual scenarios seemed limited to flight simulators and applications of similarly modest computational speed requirements, it was the latter of the designers’ motivations, the so-called “realism race,” that drove research. Increasing realism meant coding ever more sophisticated and stimulating graphic story worlds, and it wasn’t long before realism was largely achieved, in the sense that visuals appearing onscreen or in the arenas of primitive virtual reality interface mechanisms looked, for all intents and purposes, “real.” But even before the defense industry began funneling R&D money into virtual-training technology for everything from battlefield first aid to hostage/terrorist scenarios, programmers had intimated a compelling hypothesis: rather than graphic realism, it was “believable” characters—defined as agents of the authorial intelligence that do not act in ways that break a player’s “suspension of disbelief”—that made for truly challenging, enjoyable, and externally applicable narrative experiences. The full implications of this were not wholly understood. To this day, how either a human author or an Authorial Intelligence (AI2) algorithm creates fully believable characters remains an open question. Yet progress was undeniably made. Games provided the graphics, and the first steps toward narrative autogeneration came when academic researchers who were focused on the problem of character development took aim at the “human-level,” a term that is as poorly defined as it is widely used. Regardless, from that point forward it was AI, rather than graphics, that served as the point of game-to-game comparison for both ardent communities of video game players, and for defense contractors that had invested heavily in products that would later be imported to a range of classified applications. 
It is ironic that the most important advances in narrative autogeneration came as a result of backward-looking investigations into the original medium: text-based storytelling.
While there was no pressing need to attempt a “Manhattan Project” to create AI2 technology all at once, it is fair to say that, in addition to market forces pulling the research forward, a small cadre of investigators driven by curiosity alone had begun to wonder why robots, which for decades had been featured as nefarious characters in human-authored textual events, had not gone on to become generators of stories themselves, rather than merely their subject. The present authors, a group currently of similarly modest scope but drawing new converts all the time, owe a significant debt to that dedicated core of programmers whose initial, speculative coding work can be described only as a “labor of love.” Advances came incrementally.
The first major hurdle was the so-called “static plot problem.” This was a function of competing desires. Readers, it was noted, long to experience story worlds that feature plausible events in addition to believable agents, yet plots are expected to move along at a regular clip. Stated otherwise, the danger of realistic narratives is that they might be too real. The unpredictability that confers to a narrative a sense of verisimilitude also leaves it fragile, in that there is no “leaving out the boring bits.” Historically, human-authored textual events have contended with this problem by segmenting stories into “scenes.” Even the earliest autogenerating programs corrected for the static plot problem by identifying moments when the action should “flip to the next day.” It was suggested even then that the evolution of autogenerating narrative technology seemed to be mirroring the maturation process of a budding human author: the ability to manage time and produce plot-driven narrative comes first. The challenge was that characters, in order to maintain believability, needed to behave as though they had experienced the “between scenes” activity. In short, an instinctual skill that slowly develops in human authors as they master their craft required many computations for each and every one of an autogenerated narrative’s virtual agents.
Hence, drama management tools. Most simply put, the drama manager component is an omniscient computing agent that works “behind the scenes” to conduct live drama experiments to monitor narrative development and intermittently intervene to drive the action forward according to modeled plot progression and anticipated reader preferences. Drama management is largely an optimization problem in that it chooses best actions with expectation-calculating nodes. The autogenerating narrative industry can be said to have been launched with a system that deconstructed and recycled existing stories to produce new stories based on reader input preferences, using tools and code similar to the computer-aided design (CAD) software that revolutionized engineering, drafting, and architecture. Readers were understood to react according to a probabilistic model based on assumed limited reading within the literary canon (defined by databases of “great books”) and more robust image exposure via the internet, film, and television. Indeed, the earliest researchers into drama management, in breaking authorial decisions into causers, deniers, and temporary deniers, found that the complexity of emergent narrative textual events, and thus the computational speeds required to autogenerate them, was far more manageable if one used a mid-twentieth century television program as a yardstick, rather than a human-authored novel.
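As an illustration only, the optimization step described above (choosing a best action from expectation-calculating nodes) might be sketched as follows. The `Outcome` fields, the `plot_value` scale, and every probability shown are hypothetical; only the three intervention categories (causers, deniers, temporary deniers) come from the text, and nothing here should be read as the design of any actual drama manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One modeled reader reaction to a candidate intervention (hypothetical)."""
    probability: float  # assumed chance of this reaction under the reader model
    plot_value: float   # assumed score for how well the plot then progresses


def expected_value(outcomes):
    """Expectation node: probability-weighted sum of plot values."""
    return sum(o.probability * o.plot_value for o in outcomes)


def choose_intervention(candidates):
    """Return the name of the candidate with the highest expected plot value."""
    return max(candidates, key=lambda name: expected_value(candidates[name]))


# Illustrative numbers only; each candidate's probabilities sum to 1.
candidates = {
    "causer": [Outcome(0.5, 8.0), Outcome(0.5, 2.0)],
    "denier": [Outcome(0.75, 4.0), Outcome(0.25, 0.0)],
    "temporary_denier": [Outcome(0.25, 10.0), Outcome(0.75, 4.0)],
}

best = choose_intervention(candidates)
print(best)
```

With these made-up figures the temporary denier narrowly beats the causer (5.5 against 5.0), which is the kind of intermittent, behind-the-scenes choice the paragraph describes.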
It can be regarded only as fortuitous that, as drama management tools evolved from advances in CAD, natural language generation emerged from wholly different disciplines acting on similar motives. Media outlets attempting to automate simple news story generation; bar associations and law reviews robotizing brief preparation; and academic communities democratizing the production of scientific papers co-authored by a dozen or more researchers created a market for the generation of simple texts based on language templates. Needless to say, this was a complex problem that required a number of fundamental tasks to be carried out: content determination, discourse planning, document structuring, sentence aggregation, lexicalization, referring expression generation, and finally surface realization. Space does not permit a detailing of the harrowing process of serially surmounting each of these obstacles, but notably one natural language generation researcher described the task as “harder than writing itself.” That said, the exact subtleties of language autogeneration remain an open question.
With early stage drama management tools and primitive natural language generation software now coexisting, the stage was set, as it were, for AI2. Most simply put, AI2 was conceived as a system that would absorb reader plot preferences and render them into original readable textual events. This included counterintuitive tasks of surprising complexity, such as joining short sentences together to enhance fluency. By contrast, concepts or themes in autogenerated textual events proved a “low hurdle,” as the terms with which a concept or theme was described needed only to be clear enough for a reader of average attentiveness to identify, and to not be phrased in precisely the same way as they were phrased in the preceding sentence. By far, the most difficult of the earliest challenges was autoediting. Laborious work produced techniques that scanned for and excised all possible “wh-” questions (who, what, when, where, and why) not directly related to reader-preferred plots. Further efforts resulted in iterations of AI2 that were able to execute basic transformations, “rewriting,” in which the language templates that a program’s “rough draft” was based upon were converted into word strings in accordance with commonly used style guides and the orthographic rules of English: sentences’ initial letters were capitalized, language was made active rather than passive, periods were added at sentence endings, etc.
Soon, the most advanced AI2 program to date, Faustian-C, was ready to attempt its first book-length effort for an experimental audience of human readers.
THE NOVEL EXPERIMENT
The middle period of AI2 development was hampered by the fact that narrative content retained a degree of human authorship. That is, model plot graphs and character templates were produced by human beings with, by definition, limited reading experience. Similarly, it was suggested that a hypothetical reader’s mood, at the time of AI2 interface to input narrative preferences, could also amount to a degree of “authorship.” Our hypothesis stems from early speculation that the inherent vice of human authorship might be solved with the development of better autoreading tools that would enable advancing iterations of AI2 to, for example, cull candidate narratives and believable agents from a far greater pool of stories than any human writer or reader could possibly retain (i.e., all pirated and public domain literature digitally accessible). Furthermore, an AI2 autoreader component might eventually be able to anticipate reader preferences without direct prompt by scouring online tracking data and available public records to determine the proclivities, needs, and real-time desires of potential readers prior to their even having expressed a desire to read. Indeed, we hold that the most important philosophical advance in the ongoing development of autogenerated literary textual events was the realization that if you hoped to autowrite you must first know how to autoread. This was clearly illustrated when Faustian-C produced its first textual event of significant length, referred to by researchers as Novel.
After much debate, the dedicated scientists who conducted the now famous Novel experiment settled, in programming the book, on the following choices from Faustian-C’s Plot/Theme Interface Dashboard: “Uplifting,” “Family,” “Fun,” “Entertaining,” “Funny,” “Coming of Age,” and “Happy Ending.” Length was capped at one-hundred-and-fifty pages (two-hundred-and-seventy-five words per page), and the program’s Flesch-Kincaid scale was set at grade-level eight (Faustian-C also featured interactive input fields for Dale-Chall and Coleman-Liau, but these were not employed). One-hundred-and-twenty-five “interactors” volunteered for the experiment by answering an advertisement. Prior to reading, interactors completed a questionnaire and signed a consent form. There was no reward offered for reading Novel.
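For readers unfamiliar with the settings named above, the arithmetic can be sketched briefly. The grade formula is the standard published Flesch-Kincaid grade-level formula. Treating grade eight as an upper bound rather than an exact target is an assumption, as are the function names and the sample counts. Counts are passed in directly so that the sketch does not depend on any particular syllable-counting heuristic.

```python
MAX_PAGES = 150        # page cap stated in the text
WORDS_PER_PAGE = 275   # words per page stated in the text
WORD_CAP = MAX_PAGES * WORDS_PER_PAGE  # 41,250 words in total


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def within_settings(words: int, sentences: int, syllables: int,
                    max_grade: float = 8.0) -> bool:
    """Assumption: grade eight is read here as a ceiling, not an exact target."""
    grade = flesch_kincaid_grade(words, sentences, syllables)
    return words <= WORD_CAP and grade <= max_grade


# Hypothetical draft: 40,000 words, 2,500 sentences, 58,800 syllables.
print(WORD_CAP)
print(within_settings(40_000, 2_500, 58_800))
```

A draft averaging sixteen words per sentence and 1.47 syllables per word, as in the hypothetical call above, lands just under grade eight and within the word cap.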
Reactions were measured directly with devices to monitor respiration rates and electrical activity in the heart and brain, and indirectly with scorecards filled out by trained confederates in pre- and post-Novel Q&A sessions. By far, the most telling of interactor responses were unprompted comments made to confederates during reading, recorded analogically. Reactions fell into two camps. On the one hand, certain interactors found Novel to be a test of patience. “Interactor did not comprehend the scene in the library, and did not enjoy the ending,” one confederate remarked. “Interactor repeatedly asked, ‘May I stop reading now?’” another confederate transcribed. On the other hand, a significant portion of the interactor pool enjoyed Novel even as they desired a greater degree of viscerality. “This is pretty cool,” one interactor noted, “but it would be totally cooler as a movie.” Another called for additional interactivity: “Whoa! The [unnamed female agent] in the lingerie store is kind of hot! Can I make her [engage in unspecified sexual activity with] [male protagonist]?”
Though the Novel experiment was not a success in finding that an AI2 program had succeeded in producing a universally agreeable human-level textual event, the data were nevertheless invaluable. For example, questionnaire responses offered clear support for the hypothesis that readers’ autodetermined reading characteristics predict future reading interests, much in the same way that communities of human readers, as a function of acquaintance with one another’s personalities, libraries, and life predicaments, may develop an intuitive sense of which books their fellow readers might enjoy. Indeed, initial interpretations of Novel data tended to focus exclusively on the various levels of diversion reported by the interactor pool. This led to the now-widespread investigation of “interestingness,” and, more ominously, offered the first hints of the implosion that has been of great concern to the corporations and institutions whose livelihoods have been shown to be dependent on the ongoing production of compelling human-level narratives.
While “interest” is the feeling experienced by a human in the act of being interested in something, “interestingness” is a measure of the potential quality of a thing’s being interesting. For example, readers become interested in stories, whereas stories exhibit some degree of interestingness to readers. It might be regarded as foolhardy that, despite human-authored textual events having for millennia featured a broad range of interestingness value, AI2 researchers would set out to achieve universal interestingness. This ambition should not be faulted, however. Once again, the development of the technology echoes reality. Human authors do not produce textual events of wildly disparate interestingness value as a result of attempts to be interesting only to a small body of readers. Rather, authors attempt to be as interesting as possible to as many readers as possible given their aesthetic purpose. So too may AI2 attempt (indeed, may even be better positioned to attempt) to create textual events that feature significant interestingness value for many, if not all, readers. It was this reasonable ambition that caused many researchers to react with bewilderment to Novel experimental results.
One investigator publicly wondered whether a different data set might have resulted had experimenters opted for more “interesting-sounding” interface choices. “Challenging,” “Piquant,” “Poignant,” and “Tragicomic” all remained unchecked on Faustian-C’s Interface Dashboard. Nevertheless, the results exhibited a clear statistical discrepancy between the previous reading habits of those who disliked Novel, later known as “torpids,” and those who admired the book, referred to as “enjoyers.” In short, torpids were far more likely to have described themselves as “readers,” and to have passed the latent expertise test embedded in the pre-Novel questionnaire. By contrast, almost all enjoyers admitted to having read “three or fewer” books in the previous year, and as many as half volunteered that they could not recall the last time they had read a book at all. At first, researchers were encouraged that an AI2 program had produced a textual event pleasing to nonreaders, but more sober-minded analysts soon noted that the goal of universal appeal had remained elusive. Furthermore, additional analysis revealed that enjoyers, despite their enjoyment, had failed to finish reading Novel at a far greater rate than torpids. These results were puzzling.
The post-Novel questionnaires offered an explanation. Torpids and enjoyers reported far different preferences in response to perceived intrusion from the drama manager. Enjoyers did not mind the appearance of the drama manager to sculpt and form their narrative experience, and most expressed gratitude for a “heavy hand” that anticipated confusion, explained subtle agent motivations, occasionally recapped plot, etc. By contrast, torpids reported feeling that the reading experience was harmed by these intrusions. “The spell was broken,” one torpid explained. Another complained that the drama manager sounded “like some dumb nineteenth-century narrator.”
The Novel experiment, then, illustrated the importance of reader agency in textual events of all kinds. One researcher characterized this as the “Goldilocks dilemma.” Too much reader agency results in aimless, meandering plots, whereas suppressing reader agency makes certain readers too aware of the medium and reduces overall interestingness value. The key insight at this juncture was the recognition that interactive narratives designed for video games had not actually been any more interactive than human-authored literary textual events, the long history of which had established the cultural necessity of compelling stories in the first place. That is, readers desire a degree of agency in co-creating whatever imaginative story worlds they “occupy,” yet at the same time a narrative’s interestingness value—its being a “good story”—is significantly dependent on a reader’s sense of having been connected to a palpable and believable authorial intelligence.
THE CURRENT CRISIS
The current crisis, then, traces back to the failure of corporations and defense contractors to recognize that the practical trials to which they applied early stage AI2 technology relied on narrative events of high interestingness value. This was not entirely their fault. Early reports, such as the Novel experiment, suggested that low interestingness value was “good enough” for a significant portion of potential “reader/players.” Practical trials, however, revealed significant problems. Why exactly the premature dissemination of AI2 technology resulted in unprofitable ventures and gross wastes of public funds remains an open question. It must be noted, however, that the researchers who were part of the AI2 effort from the beginning “saw it coming.” It is for this reason that the present authors are the team best prepared to repair the damage. The extent of this damage is well established. The video game industry began to implode almost as soon as game development became wholly dependent on autogenerated interactive narrative. The Army, Navy, and Air Force ended relationships with startup virtual defense contractors as soon as it was recognized that AI2 training systems left trainees untrained (litigation is pending). Paradoxically, reader/players continued to report enjoying autogenerated narratives even as further testing revealed them to have little or no interestingness value, and the widespread problem of reader/players who enjoy without interest is evident from collateral damage inflicted on the news media, the film industry, and the publishing industry. Independent, reputable analysts have warned that the trend lines resemble those historically associated with the demise of civilizations. This is histrionic, to be sure, but most researchers agree that the present phenomenon amounts to a “narrative recession.”
STATING THE PROBLEM
The same irrational reader behavior that made for the static plot problem—in which readers were found to prefer plots neither too plodding, nor too efficient—surfaced again when researchers were forced to reconsider the relationship between interactors and drama managers. The problem was how to strike a balance between the illusion of aesthetic ambition and the perception of personalized reader experience. It had been surprisingly easy to give the impression of an author with meaningful intent—simply overlay old stories with contemporary references—but it was far more difficult to produce a textual event that gave readers a sense of control even as control was alternately manipulated and denied. Recognizing the limitations of interactivity was critical to understanding the problem. When asked, reader/players claim to long for interactivity, but what had created said longing, in fact, was the cultural impact of textual events whose interestingness was a function of the illusion of interactivity in the form of imaginative co-creativity. What readers of textual events—in particular, literary textual events—truly long for are stories that are neither too restrictive, nor so open-ended that authorial intelligence is effectively absent. In other words, a story is more a labyrinth that offers the appearance of choice than a maze that offers actual choice, and the fact that this was misunderstood begins to explain how potential readership pools came to be overpopulated with reader/players who enjoy without interest.
The problem, then, was that human-authored, human-level literary textual events—i.e., “good books”—had served a core cultural purpose that was undermined by an autogenerating narrative industry that failed to understand the key dynamics of storytelling. Although the enjoyers of the Novel experiment were not “zombie readers,” as has been suggested, it’s clear that these interactors can serve no useful purpose in future experiments. The torpid reader is more desirable, even ideal, but problematic in other ways. Ironically, the autogenerating narrative industry now finds itself in precisely the same predicament as the institutions that financed early research: just as airline companies hoped to streamline pilot training, and just as the military sought to prepare recruits at lower risk and cost, so does the autogenerating narrative industry now require a bulk replacement for torpid interactors, whose lack of enjoyment in potentially reading second generation AI2-generated textual events means that securing their participation on the required scale is financially prohibitive.
The proposition of the present authors relies on the recognition of the fact that the production of human-authored literature of high interestingness value was never a function of writers working in a vacuum. Rather, a long period of latent interactivity in which writers produced the most interesting books they were capable of writing, followed by a savvy readership passing judgment on the interestingness of those works, had been essential to literary textual events coming to serve an indispensable societal role. In short, the autogenerating narrative industry tackled only half the problem. What is needed now is a twofold approach: (1) Refine and perfect an AI2 component capable of covertly steering dramatic experiences toward narrative arcs of human-level interestingness value; and (2) Design and develop a sophisticated autoreader component, Reader Intelligence (RI), that can be used to recognize when human-level interestingness value has in fact been autogenerated.
Patent concerns prevent a detailing of the extensive progress that has been made in the development of RI, but a general outline of the effort should be shared, if only to potentially “network” with like-minded researchers.
Previous attempts to produce a model readership failed because the computers that modeled human interactors were invented by humans that modeled computers, creating infinite recursion. RI avoids this trap by modeling, not an abstract reader, but reader interestingness, which can be quantified. For example, once an AI2 program has produced a textual event of simulated intellectual ambition, the prototype RI program, Hamlet-2B, employs a modified drama management system to act as surrogate for reader interest by assigning interestingness value to sequential plot points and then computing an average plot interestingness score that is then measured against similar human-authored cases. Our standardized numeric interestingness scale is derived from databases of book reviews, with phrases and adjectives assigned positive or negative values: “luminous” = +7.5; “sententious” = -3.8; etc. To avoid human authorship problems, these numeric values are derived from volunteered, open-source assessments with numeric values pre-attached: “Five-star reviews”; annual lists of “best” books; etc. Hamlet-2B is already capable of measuring the overall interestingness value of AI2-generated textual events against a database of interestingness values calculated for a host of human-authored works representative of a broad range of decades, fashions, styles, and schools of literary thought. An AI2 program has yet to produce a literary textual event of undisputed human-level interestingness value, but if the present authors are afforded the equipment, computing power, and the time required to complete the analysis of the ever-growing body of human-authored works, it would seem to be only a matter of time before the autogenerating narrative industry will, by creating textual events that offer both interest and enjoyment, begin to undo the damage that has been unwittingly inflicted upon institutions whose health is critical to economic stability, to national security, and to culture itself.
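Purely as an illustration, the scoring arithmetic described above might be sketched as follows. Only the two lexicon values (“luminous” and “sententious”) come from the text. The function names, the plot-point scores, and the human-authored baseline figures are hypothetical, and how Hamlet-2B actually maps review language onto plot points is not specified, so this should not be read as its method.

```python
import re
from statistics import mean

# The two values given in the text; a working lexicon would be far larger.
LEXICON = {"luminous": 7.5, "sententious": -3.8}


def review_score(review_text, lexicon=LEXICON):
    """Sum the lexicon values of every scored word found in a review."""
    words = re.findall(r"[a-z]+", review_text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)


def average_plot_interestingness(plot_point_scores):
    """Average of the values assigned to sequential plot points."""
    return mean(plot_point_scores)


def gap_to_human_baseline(plot_point_scores, human_work_scores):
    """Negative result: the textual event falls short of the human-authored mean."""
    return average_plot_interestingness(plot_point_scores) - mean(human_work_scores)


# Hypothetical inputs, for illustration only.
print(review_score("A luminous, if sententious, debut."))
print(gap_to_human_baseline([2.0, 4.0, 6.0], [5.0, 7.0]))
```

In this made-up case the autogenerated event averages 4.0 against a human-authored baseline of 6.0, a gap of -2.0, which is the sort of shortfall the comparison is meant to expose.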
INTO THE FUTURE
Though progress has been swift, sure, and exciting, the “human-level” still represents a daunting frontier across the entire range of AI2 technology. Even the problem of agent believability has not been surmounted, and this is as true of minor characters in textual events (dismissed as “robotic”) as it is, say, of an AI2-generated narrator programmed to produce a dispassionate academic history. Many open questions remain.
Critics have suggested that certain problems—for example, the satiric “unreliable narrator”—will forever remain beyond the capability of computers to reproduce to any level of satisfaction. We doubt this. Indeed, we anticipate a world in which a range of specialized AI2 programs begin to acquire fan bases and “followers,” and eventually come to challenge traditional “analog” authors, just as special effects wizardry in the film industry has produced believable digital replacements for live actors. Furthermore, after writing, editing, reading, critiquing, and revising autogenerated literary textual events, AI2 programs will offer the option of summarizing stories to customizable lengths. From there, it’s not hard to imagine an “autocritical” function capable of producing literary analysis of autogenerated stories. In short, what now seems to be in reach is restoration of the entire corpus of literary discourse. Promisingly, these applications are what can be imagined only with current technology. Who knows what dragons may surface when the uncharted waters of quantum computing are finally explored, as they surely will be? On that day, we hope that our predecessors, that core group of researchers in whose footsteps we humbly walk, will be recognized for the explorers they were, adventurers on the order of Columbus and Magellan, striking into the wilderness when duty called.