February 16, 2026 · 8 min read
What 74,000 words of interactive fiction taught us about the hardest problem in AI storytelling
You're sixty turns into a scene. Three characters are sitting around a dinner table. The AI has been flawless for an hour — generating rich, emotionally precise prose, tracking power dynamics, managing subtext. Then it casually gives Character A the job that belongs to Character B.
Not a hallucination. Not a reasoning failure. The AI didn't forget the information — it's sitting right there in the context window. It just... cross-wired the attribution.
If you've ever used AI for long-form creative writing, you've probably seen this. A character gains a sibling they never had. Someone's eye color drifts mid-chapter. The boss who called in chapter three belongs to the wrong person in chapter seven. The details are all present in the conversation — they're just attached to the wrong names.
We've been running into this at Skeinscribe while building interactive fiction that stretches to novel length. And it turns out, this isn't a bug we can prompt-engineer away. It's a structural limitation of how large language models process text — one that the research community is actively working on and that anyone building AI-powered storytelling tools needs to understand.
The Stateless Problem
Here's the thing most people don't realize about AI conversation: there is no conversation. Each time you send a message and get a response, the AI is reading the entire exchange from scratch. It has no memory of what it said thirty seconds ago. It re-reads everything — your messages, its responses, the system instructions — processes it all at once, and generates the next piece of text.
This works fine for short exchanges. But a novel-length interactive fiction session might involve a hundred turns of back-and-forth, plus chapter recaps, plus system instructions, plus character notes. By the time the AI is generating turn 101, it's processing tens of thousands of tokens of context in a single pass.
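To make the statelessness concrete, here's a minimal sketch of that loop. The `call_model` function is a hypothetical stand-in for a real LLM API call; the point is that the entire history is re-serialized and re-sent on every single turn.

```python
# Sketch of the stateless loop: every turn re-sends the ENTIRE history.
# `call_model` is a placeholder for a real LLM API call (hypothetical).

def call_model(prompt: str) -> str:
    # A real implementation would hit an LLM API here.
    return f"[reply to {len(prompt)} chars of context]"

def run_turn(system: str, history: list[dict], user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    # The model sees everything, every time -- it has no memory of its own.
    prompt = system + "\n" + "\n".join(
        f"{m['role']}: {m['content']}" for m in history
    )
    reply = call_model(prompt)
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[dict] = []
for turn in range(3):
    run_turn("You are the narrator.", history, f"Turn {turn}: continue the scene.")

# The prompt grows with every turn; by turn 101 it is the whole novel so far.
print(len(history))  # 6 messages after 3 turns
```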
And that's where entity tracking starts to break down.
What Entity Tracking Actually Means
In computational linguistics, "entity tracking" is the task of keeping track of who's who and what's what as text unfolds. When a paragraph mentions "she" three times, the model needs to know whether all three refer to the same person or different ones. When someone's job is mentioned in chapter two and referenced again in chapter eight, the model needs to maintain that association across thousands of words of intervening text.
Kim and Schuster (2023) formally studied this capability and found that even advanced models struggle with it as complexity increases. Their work showed that entity tracking degrades as the number of entities grows and as state-changing operations multiply — exactly the conditions you find in a multi-character fiction scene.
More recent work on coreference resolution in multiparty dialogue (Zheng et al., 2023) specifically examined the challenge of tracking references across conversations with more than two participants. Their finding was stark: off-the-shelf models perform "relatively poorly" on multiparty coreference compared to simpler two-speaker dialogues. The more characters in a scene, the more the model struggles to keep track of who "she" and "you" and "they" refer to.
A comprehensive guide to coreference resolution pitfalls from industry practice identifies the core failure modes: over-reliance on pronoun cues (the model latches onto gender and number patterns rather than actual identity), poor performance on cross-sentence references, and confusion when multiple candidates share similar attributes.
In plain language: the AI is good at knowing that "she" refers to a woman in the scene. It's much worse at knowing which woman — especially when three women are having a conversation about each other.
Why Fiction Makes It Worse
Standard coreference research typically deals with documents — news articles, Wikipedia entries, static text where an author carefully controls reference patterns. Fiction, especially interactive fiction, introduces compounding difficulties that push models to their limits.
Multiple characters sharing scenes. A dinner scene with three characters generates a web of cross-references. Character A talks about Character B's job while Character C reacts. The model needs to track not just who is speaking, but who is being discussed, and maintain those associations across dozens of exchanges.
Dialogue about other characters. In fiction, characters constantly talk about each other. When one character mentions another's workplace, the model sees "job" and "name" in proximity — but not necessarily the right name with the right job. The attention mechanism can blur these associations, especially in compressed summaries where the original contextual cues have been stripped away.
Unusual pronoun patterns. Some stories involve characters deliberately speaking for or about each other in non-standard ways — using titles, nicknames, or references that don't follow typical pronoun resolution patterns. The more creative the prose, the harder the entity tracking becomes.
Accumulating context. By the time a story reaches novel length, the model is processing recaps of previous chapters alongside the live scene. Those recaps compress events into summary form, which strips the contextual anchors that originally made attribution clear. A sentence like "her growing confidence impressed her friends" bakes in an ambiguous "her" that the model must resolve from compressed context rather than from the original scene where it was obvious.
The Architecture Challenge
At Skeinscribe, we've built systems to manage this: chapter recaps that compress previous story events into continuity-preserving summaries, and a narrator scratchpad where the AI leaves itself notes about plot threads, character details, and upcoming story beats. These function as a kind of artificial memory for a fundamentally memoryless system.
The recaps are generated automatically by a second AI pass that reads each chapter and extracts what matters — relationship dynamics, established facts, character development beats — without reproducing the full prose. This compressed state gets injected as context for the next chapter, giving the AI continuity without burning through the entire context window.
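A rough sketch of that two-pass shape might look like the following. The `summarize` function, the recap prompt, and the character-count budget are all illustrative assumptions, not Skeinscribe's actual implementation.

```python
# Minimal sketch of a recap pass, assuming a hypothetical `summarize` LLM call.

RECAP_PROMPT = (
    "Summarize this chapter for continuity: relationship dynamics, "
    "established facts, and character development beats. No full prose."
)

def summarize(instructions: str, text: str) -> str:
    # Stand-in for the second LLM pass; a real system would call the API here.
    return f"Recap ({len(text.split())} words compressed)"

def build_context(recaps: list[str], live_scene: str, budget_chars: int = 4000) -> str:
    # Inject compressed recaps ahead of the live scene, dropping the oldest
    # recaps first if the character budget is exceeded.
    parts = list(recaps)
    while parts and sum(len(p) for p in parts) + len(live_scene) > budget_chars:
        parts.pop(0)
    return "\n".join(parts + [live_scene])

chapter = "Bex told Jess about the new design job while Shawn listened."
recaps = [summarize(RECAP_PROMPT, chapter)]
context = build_context(recaps, "Chapter 2 opens at the dinner table.")
print(context)
```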
The scratchpad serves a different purpose: forward planning. The AI notes what it wants to foreshadow, what secrets remain unrevealed, what NPC reactions are brewing. Since each turn is a stateless API call with no inherent sense of "what comes next," the scratchpad creates the illusion of intentionality — foreshadowing that actually pays off, details that stay consistent, story threads that feel deliberately woven rather than randomly generated.
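The scratchpad idea can be sketched as a tiny persistent note store that each stateless turn reads from and writes to. The JSON file and the note categories below are illustrative assumptions, not the actual Skeinscribe schema.

```python
# Sketch of a narrator scratchpad: notes persisted between stateless turns.
# Storage format and categories here are illustrative only.

import json
from pathlib import Path

SCRATCHPAD = Path("scratchpad.json")

def load_notes() -> dict:
    if SCRATCHPAD.exists():
        return json.loads(SCRATCHPAD.read_text())
    return {"foreshadowing": [], "secrets": [], "threads": []}

def add_note(category: str, note: str) -> None:
    notes = load_notes()
    notes.setdefault(category, []).append(note)
    SCRATCHPAD.write_text(json.dumps(notes, indent=2))

def render_for_prompt() -> str:
    # Injected into the system prompt so the next turn can "remember" its plans.
    notes = load_notes()
    return "\n".join(f"[{cat}] {n}" for cat, items in notes.items() for n in items)

add_note("foreshadowing", "Shawn's editor will call in chapter 7.")
print(render_for_prompt())
```

Because every turn is a fresh API call, the only way a plan survives to the next turn is by being written down and re-injected, which is exactly what this pattern does.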
Both systems work remarkably well for narrative continuity. The story feels coherent. Character arcs develop naturally. Plot threads connect across chapters.
But entity attribution — who specifically has which job, which character said what, whose boss called — that's where things still slip. And it's precisely because the recaps are narrative rather than structural. A narrative summary says "the conversation revealed tensions about her career." A structural reference would say "Bex: graphic designer. Jess: writer. Shawn: editor." The narrative version reads better and preserves story context. The structural version is harder to misattribute.
What Might Help
The research points toward a few promising directions, each with trade-offs.
Structured character registries. A brief, structured reference — three to five lines per character, listing only the facts most likely to be confused — could serve as a disambiguation anchor. It's not a full character sheet (too much token cost) but a minimal reference the model can check before making attribution decisions. The cost is maybe 50–80 tokens per chapter, which is negligible compared to the thousands of tokens in a typical recap.
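A registry like that could be as simple as the sketch below. The fields and the cast are example assumptions; the point is the compact, unambiguous rendering that gets prepended to each chapter's context.

```python
# A minimal character registry, kept to the facts most likely to be cross-wired.
# Names and fields are examples, not a fixed schema.

from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    job: str
    key_relations: str

REGISTRY = [
    Entry("Bex", "graphic designer", "roommate of Jess"),
    Entry("Jess", "writer", "friend of Bex and Shawn"),
    Entry("Shawn", "editor", "Jess's colleague"),
]

def render_registry(entries: list[Entry]) -> str:
    # Compact, one-line-per-character facts -- roughly 50-80 tokens
    # for a small cast, versus thousands for a narrative recap.
    return "\n".join(f"{e.name}: {e.job}; {e.key_relations}" for e in entries)

print(render_registry(REGISTRY))
```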
Explicit coreference preprocessing. Some researchers have explored running coreference resolution as a preprocessing step — replacing ambiguous pronouns with explicit names before the model processes text. The LQCA framework (Long Question Coreference Adaptation) showed improvements on long-context question answering by resolving references within sub-documents before passing them to the model. A similar approach could be applied to chapter recaps: resolve all pronouns to names before injecting them as context.
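The rewrite step itself is mechanically simple once references are resolved. In the sketch below the pronoun-to-name mapping is supplied by hand; a real pipeline would produce it with a coreference model, and would need to be position-aware, since the same pronoun can refer to different people in one passage.

```python
# Sketch of pronoun-to-name resolution applied to a recap before it is
# injected as context. The `resolution` mapping is hand-supplied here;
# a real system would derive it from a coreference model.

import re

def resolve_pronouns(text: str, resolution: dict[str, str]) -> str:
    # Replace each whole-word pronoun with its resolved name.
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return resolution.get(word.lower(), word)
    pattern = r"\b(" + "|".join(map(re.escape, resolution)) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

recap = "Her growing confidence impressed her friends."
resolved = resolve_pronouns(recap, {"her": "Jess's"})
print(resolved)  # "Jess's growing confidence impressed Jess's friends."
```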
Tiered context management. The InfiAgent framework demonstrates that externalizing state into structured files — rather than cramming everything into the context window — can maintain consistency across arbitrarily long task sequences. For fiction, this could mean maintaining a separate structured entity database that gets queried selectively rather than injected wholesale.
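Selective querying might look like the sketch below: a structured store keyed by character, with only the entities actually present in the scene rendered into the prompt. The store schema and all facts in it are invented for illustration.

```python
# Sketch of a tiered lookup: a structured entity store queried per scene,
# instead of injecting every fact into every prompt. Schema is illustrative.

ENTITY_DB = {
    "Bex":   {"job": "graphic designer", "eye_color": "brown"},
    "Jess":  {"job": "writer", "sibling": None},
    "Shawn": {"job": "editor", "boss": "Marta"},  # "Marta" is a made-up example
}

def context_for_scene(characters_in_scene: list[str]) -> str:
    # Pull only the entities present in this scene into the prompt.
    lines = []
    for name in characters_in_scene:
        facts = ENTITY_DB.get(name, {})
        rendered = ", ".join(f"{k}={v}" for k, v in facts.items() if v is not None)
        lines.append(f"{name}: {rendered}")
    return "\n".join(lines)

print(context_for_scene(["Jess", "Shawn"]))
```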
Validation passes. A third AI call that checks the generated prose against the character registry before it reaches the reader. Expensive in API costs, but it can catch attribution errors before they ship: "You wrote that Jess's editor called — did you mean Shawn's editor?" This is essentially a fact-checking layer tuned to entity attribution.
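A crude string-matching stand-in for that check is sketched below; a production version would use an LLM call rather than substring tests, and the registry contents are example assumptions.

```python
# Sketch of a validation pass: flag phrases that pair a character with a fact
# the registry assigns to someone else. A real version would use an LLM call;
# this is a crude string-matching stand-in on an example registry.

REGISTRY = {"Bex": "graphic designer", "Jess": "writer", "Shawn": "editor"}

def check_attribution(prose: str) -> list[str]:
    warnings = []
    for name, job in REGISTRY.items():
        for other, other_job in REGISTRY.items():
            if other == name:
                continue
            # e.g. "Jess's editor" when the registry says Shawn is the editor
            if f"{name}'s {other_job}" in prose:
                warnings.append(
                    f"'{name}'s {other_job}' -- did you mean {other}'s {other_job}?"
                )
    return warnings

print(check_attribution("Jess's editor called during dinner."))
```

Note the false-positive risk even in this toy: "Jess's editor" could legitimately mean the editor Jess works with, which is exactly why a judgment call (human or LLM) belongs at this layer rather than a hard rule.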
Each approach costs something — tokens, latency, API calls, complexity. The practical challenge for any production system is finding the right ratio of accuracy improvement to resource cost.
What This Means for AI Storytelling
The entity tracking problem is one of those limitations that's invisible in short interactions and devastating at scale. A chatbot conversation rarely involves enough characters or enough history to trigger it. A novel-length interactive fiction session with three or more recurring characters hits it reliably.
This matters because the promise of AI-assisted storytelling is precisely the long-form, character-rich experience that pushes these limits hardest. The person who wants to write a 74,000-word novel with a cast of recurring characters — the exact person we're building for — is the person most likely to encounter the AI giving someone the wrong job.
We don't have a complete solution yet. But we have a testing framework (story branching and checkpoint systems that let us replay the same scene with different context configurations), real production data from novel-length stories, and a clear understanding of where the failures concentrate.
The honest answer is that this is an active engineering problem — one that sits at the intersection of transformer architecture limitations, context window management, and the specific demands of long-form fiction. It won't be solved by bigger context windows alone (the research consistently shows that model attention degrades with length, not just capacity). It requires structural solutions: better ways to organize character information so the model can look it up rather than infer it from narrative prose.
We're working on it. And if you've built something similar and found approaches that help, we'd love to hear about it.
Skeinscribe is an interactive fiction platform where your input becomes novel-quality prose. The entity tracking challenges described here emerge from real production experience generating stories of 70,000+ words with recurring character casts. We're building the tools to make AI storytelling work at novel scale — including the hard parts nobody talks about.
References & Further Reading
Kim, N. & Schuster, S. (2023). "Entity Tracking in Language Models." ACL 2023. Formal investigation of LLM entity state tracking capabilities across varying complexity levels.
Zheng, B. et al. (2023). "Multilingual Coreference Resolution in Multiparty Dialogue." TACL. Demonstrates that multiparty coreference remains significantly harder than two-speaker settings for current models.
Gan, Y. et al. (2025). "Improving LLMs' Learning of Coreference Resolution." SIGDIAL 2025. Proposes techniques addressing LLM hallucination and under-performance in coreference tasks.
Zheng, B. et al. (2025). "STPar: A Structure-Aware Triaffine Parser for Screenplay Character Coreference Resolution." TACL. Finds 58% of coreference errors in screenplays stem from ambiguous pronoun references in multi-role dialogues.
Wang, H. et al. (2025). "TracLLM." Post-hoc attribution methods for assigning responsibility in long-context LLM outputs.
LQCA (2024). "Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding." ICLR. Framework for improving long-context QA through coreference resolution preprocessing.