Discover more from AI and Games
Analysing the AI of Inworld's 'Origins' Demo
Conversational AI characters powered by large language models
This episode of AI and Games is sponsored by Inworld. Check out my assessment of the Origins demo which is out now on Steam, and then sign up with a free account, and start building your own AI characters using Inworld’s tools.
In June of 2023, during the Steam NextFest event, a slew of new demos appeared on the platform, but the one that caught my attention was 'Origins': a tech demo developed in Unreal Engine 4 by AI company Inworld to showcase their tools for interactive non-player characters.
Origins plays out as a detective game, in which you need to uncover the truth and solve the mystery in front of you. It's the year 2030 and an explosion has rocked the lower levels of Metropolis city, and while there are a handful of suspects with motive, there's no clear evidence of who is to blame. Hence it's your job to go in and find out what really happened, by using your own voice to talk directly to and interact with a range of non-player characters (NPCs) in the environment, who in turn talk back to you directly. Through this simple mechanism, you begin to learn more about what happened, who might be to blame and the ramifications of this shocking event.
Inworld reached out and asked for my take on how well the conversation systems perform during play in the Origins demo. As we'll see in a moment, what's in there thus far is fairly impressive, with characters reacting to how you chat with them in a myriad of ways. They can elaborate on story beats, give more information on their own backstories and histories, flesh out the world narrative, and even make bad jokes or react to my pop culture references when prompted. While it isn't perfect, I can see the potential this has to positively impact games in the coming years.
While today I’m discussing my opinions of the demo, keep an eye out for a future article, in which I sit down with Inworld’s CPO Kylan Gibbs to discuss their technology in much more detail.
AI and Games is a YouTube series made possible thanks to crowdfunding on Patreon as well as right here with paid subscriptions on Substack.
Support the show to have your name in video credits, contribute to future episode topics, watch content in early access and receive exclusive supporters-only content.
Who Are Inworld?
Inworld are an AI company based out of the San Francisco Bay Area, founded in 2021. Their main product is the Inworld character engine, which is designed for powering NPCs in games and interactive media.
Inworld was founded by a team of engineers who had previously worked on conversational AI and generative systems, with their previous product, api.ai (later known as Dialogflow), having been acquired by Google.
This pedigree led to confidence in the company's prospects, with Inworld subsequently receiving just shy of $120 million of venture capital investment across seed and Series A funding rounds in 2021 and 2022, all with the goal of building technology that better enables immersive experiences for entertainment products.
Naturally, games are one of many avenues in which these technologies can be used, given NPCs typically have fixed or parameterised responses to user input. In the vast majority of games out there, from Mass Effect to Dragon Age, Yakuza or Hades, conversation systems are built as largely static and fixed interactions. The game dictates what you can ask of a character, and similarly what the responses can be, based on in-game parameters. Hence any conversation you have with a non-player character is largely pre-determined, with a fixed number of outcomes. You can't ad-lib or mess around with it, given the system can't dynamically create responses, which is what Inworld is trying to solve.
And so Inworld have built a platform that seeks to capitalise on modern conversational AI technologies, such as large language models (LLMs), in a way that works for game productions. Their platform is built to support the likes of Unity and Unreal, but also Roblox, as well as offering external Web and Node.js APIs for connectivity via web browsers, or really any product with an internet connection. All of which provides an interface for the development and refinement of non-player characters.
And this all brings us to Origins: a tech demo designed to highlight the three core elements Inworld have developed:
A speech-to-text system for listening to the player and parsing their speech into information that NPCs should respond to.
An LLM-powered text generation system that creates the responses and dialogue of the characters in a way that is ludo-narratively consistent.
A text-to-speech system, whereby the character can then speak back to the player in real-time and convey emotions as part of their response.
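To make the shape of these three stages concrete, here is a minimal sketch of such a conversational loop. The stage functions below are stand-in stubs of my own invention, not Inworld's implementation: a real system would call a speech recogniser, a language model, and a voice synthesiser respectively.

```python
# Minimal sketch of a three-stage conversational NPC pipeline.
# Each stage is a stub standing in for the real component.

def speech_to_text(audio: bytes) -> str:
    """Stage 1: transcribe player audio into text (stubbed)."""
    return audio.decode("utf-8")  # pretend the audio is already words

def generate_response(npc_persona: str, player_text: str) -> str:
    """Stage 2: produce in-fiction dialogue (stubbed rule, not an LLM)."""
    return f"[{npc_persona}] I hear you say: '{player_text}'"

def text_to_speech(dialogue: str) -> bytes:
    """Stage 3: synthesise the reply as audio (stubbed)."""
    return dialogue.encode("utf-8")

def npc_turn(npc_persona: str, audio: bytes) -> bytes:
    """Run one conversational turn through all three stages in order."""
    player_text = speech_to_text(audio)
    dialogue = generate_response(npc_persona, player_text)
    return text_to_speech(dialogue)

reply = npc_turn("Police Officer", b"What happened here?")
print(reply.decode("utf-8"))
```

The point of the sketch is the data flow: audio in, text through the model, audio back out, with each stage replaceable independently.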
Meanwhile, you may well have seen some of the other demos developed using Inworld’s tools bouncing around the internet. Such as the video below showcasing the mod for Elder Scrolls: Skyrim that allows you to engage in text-based conversation with NPCs in the game world. Or a more recent modding effort in Grand Theft Auto V.
Inworld is of course one of many generative AI companies that have emerged in the past couple of years seeking to utilise these new machine learning based techniques to change how we use AI in a variety of creative capacities. If you want to know more about the state of the field, the problems being tackled and the challenges faced, be sure to check out my episode linked below, in which I lay out everything you need to know, in my own, inimitable style.
So before I jump into the analysis, let's give a quick overview. It's important to preface my analysis with what the demo focusses on, what it does, and what it does not do.
In the demo we visit the crime scene of a robotics lab where a massive explosion has rocked the area. We then meet six characters that we can interact with: your robot detective companion who guides you through the narrative, two robots near the scene, a police officer trying to keep the peace, a gentleman in a fedora, and a young woman assessing some of the wreckage. Each of these characters is designed to be interacted with either by speaking through the microphone or by typing your input.
Now each character is built with their own established backstory, as well as information they know in the context of the world at large and the story that plays out. So they're built to deliver the narrative as described. This is of course an interesting avenue to explore as part of my own analysis.
I played through Origins several times, and each time I changed up the manner in which I played it. What happens if I play it straight, and act as if I were the detective character that the fiction prescribed to me? What if I played more aggressively, dismissing characters, shutting them down, or insulting them? Or perhaps I go completely fourth wall, and try to tell these characters that I am, quite literally, playing this game as part of a YouTube show about AI for video games.
Now given this is a tech demo, the conversation systems are the only things that are fully implemented. Each character has their own voice for dialogue delivery, and animations as well, but a lot of other elements you'd expect to see in a finished game are not implemented at this time. With the exception of the detective robot, characters can't move around the space, nor can they interact with the world around them. Plus the conversations are built solely for engagement with the player, and only with the main characters in the scene. Hence NPCs can't chat with each other, and background non-player characters can't really engage with the player: they have some basic responses, but are not fully configured to be fleshed-out individuals.
Now let's get into the analysis of the demo itself. I had a great time playing through it, and I can see the potential for it as a tool for more complex video game productions. Characters do a good job of keeping their responses in-world and consistent, reaffirming what each character does in the context of the fiction. When prompted for more details on specific topics, they're able to dive into those topics in more detail, add flavour or clarity, and generally present a richer story than what is initially presented on the surface.
The text generation framework largely holds up against a lot of punishment. You can approach conversations in different ways: being civil with a given character and awaiting a response, but also being a tad more assertive or aggressive. In addition, like 99% of players who will engage with this demo, I tried to break it. Could I get the characters to deviate from the topic and talk about something else, or simply derail the fantasy by talking about something nonsensical? And every time it caught what I was trying to do, no doubt because the language model backend is primed to ensure characters operate within the fiction of the world.
Naturally the demo isn't perfect, and while we're focussing solely on conversation, there are areas that merit improvement. Notably the audio delivery, alongside how characters respond to my position in the fiction as the detective in charge of the investigation. I suspect these are elements that will improve in the future as the technology continues to be developed. But for now, let's focus on the three key parts of Inworld's pipeline and discuss what Origins does to highlight their efficacy.
So first up, let's talk about the text generation.
The text generation framework that Inworld has built enables these characters to feel much more realistic in terms of their personal ideologies, motivations, and perspectives in the context of the game world. The Inworld tech enables a designer to establish unique characters, each with specific personality traits, functional roles in society, outlooks and ideologies, alongside character defects and emotional lines that enable them to be distinguishable from one another. You quickly begin to understand who each of these characters are, while also leaving sufficient scope for them to become fleshed out as they interact with the player further.
While I haven't seen the backend of these tools at this time, I imagine this is all largely reliant on what are known as embeddings: whereby you scaffold the language model by feeding it information such that the information you get back is in context.
This is what we call prompt engineering: when GPT is being queried to provide text for the police officer, the system retrieves relevant text stored in the embeddings database and feeds it to GPT, so the answer it generates is narratively consistent. But critically, it also has to summarise it in the style of that character as well.
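A toy illustration of that retrieve-then-prompt pattern, under my own assumptions rather than anything confirmed about Inworld's backend: real systems embed text as dense vectors and rank by cosine similarity, whereas here a crude word-overlap score stands in for that, and the lore entries (and "Officer Reyes") are invented for the example.

```python
import re

# Hypothetical narrative facts a designer might author for the world.
LORE = [
    "Officer Reyes is maintaining the cordon around the explosion site.",
    "The explosion occurred in a robotics lab in lower Metropolis.",
    "Several suspects had motive, but no clear evidence exists.",
]

def similarity(a: str, b: str) -> float:
    """Crude stand-in for embedding similarity: shared-word ratio."""
    wa = set(re.findall(r"[a-z0-9']+", a.lower()))
    wb = set(re.findall(r"[a-z0-9']+", b.lower()))
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve(query: str, k: int = 2) -> list:
    """Return the k lore entries most relevant to the player's input."""
    return sorted(LORE, key=lambda fact: similarity(query, fact), reverse=True)[:k]

def build_prompt(character: str, query: str) -> str:
    """Assemble a prompt: persona, retrieved facts, then the player's line."""
    facts = "\n".join(retrieve(query))
    return (f"You are {character}. Stay in character.\n"
            f"Relevant facts:\n{facts}\n"
            f"Player says: {query}\nReply:")

print(build_prompt("a weary police officer", "What caused the explosion?"))
```

The key idea is that only the facts relevant to the player's question are injected into the prompt, which is how the character stays narratively consistent without the model being retrained.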
This approach more successfully guides and constrains the text generation capabilities of the likes of GPT, such that it can establish a narrative and expect character responses to maintain some narrative consistency.
I wound up talking about science fiction movies and making bad jokes with the police officer, but they eventually return to the overarching plot element: that they're here to maintain the cordon around the clean-up operation. Meanwhile the angry robot finds ways to return again and again to its rhetoric of human-robot wars and all of that jazz.
For the most part, the system could take my inputs, which were seldom straightforward questions, and provide answers that felt largely in keeping with the narrative. It did feel like there was scope for some flexibility, and I wonder whether the in-world fiction is as robustly defined as it seemed, or if there is space for different interpretations of Metropolis to manifest in your moment-to-moment conversations.
And so yes, all in all, the text generation aspect of the characters is very strong: well defined, and it presents some genuinely interesting interactions.
Speech to Text
Speaking of interaction, while we can converse with these NPCs by typing out text, the demo is really built to showcase the ability for users to speak directly to these characters. Your voice is then translated by a speech-to-text system to then query the language models.
This works reasonably well, but it's always going to be a point of contention for me, given I'm Scottish. AI voice recognition to this day does not do a good job of handling Scottish accents. However, for the most part the Origins demo held up. I'd argue it understood me around 75% of the time.
That said, while this is very much in the context of Inworld's tools, my issues are also a reflection of the state of the art in the field. I've spent close to 20 years dealing with speech-to-text pipelines that are simply unable to understand my accent. Even so, it worked the vast majority of the time after I repeated myself a couple of times, and I suspect this will continue to improve in the coming years, as the state of the art for voice recognition advances alongside the additional work Inworld is doing under the hood.
So while I have my misgivings, the batting average was pretty decent. This leads us to the third and final element: taking the text generated by the NPCs, and piping that into speech.
Text to Speech
While the text and narrative being generated by characters is quite impressive, the text-to-speech system, which translates the text generated for the NPC into spoken dialogue, is the weakest part of the product.
It's a functionally coherent system, given it can successfully speak aloud any dialogue the NPC generates, but the voice quality often kills a lot of the emotional nuance happening in the game.
We have law enforcement trying to maintain the peace, creepy men in fedoras and angry robots spouting their respective rhetoric, and even humanoid robots that are distraught about the possible loss of those close to them, but the emotional weight in these voice lines is often lost.
Now let's be clear, this isn't the fault of Inworld specifically, but rather a reflection of the state of the art in this technology. All of this is because the dialogue is generated as text, and then the voice is being generated at runtime as a separate pass. As such, the emotional connection between the dialogue and how it should be delivered is lost. Again, I suspect this is something that will improve in the years to come.
That said, the voice deliveries did have some highlights: one thing that Inworld's API provides is the ability to synchronise animations with voice line delivery. Characters have a set of emotional states that are meant to reflect how their interactions with the player are influencing them, and this in turn can be used in your game engine of choice such that either facial or full-body animations play during dialogue to reflect what is being said. From my understanding of the API, that is very much in the hands of developers, given the system exposes emotions as parameters for use in your animation trees. So fear not, character animators, it looks like you've got plenty more work coming your way!
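To show roughly what "emotions as parameters for your animation trees" could look like in practice, here is a small sketch. The emotion names, clip names, and normalisation scheme are all illustrative assumptions of mine, not Inworld's actual API: the point is simply that per-emotion intensities can be turned into blend weights that a game engine's animation system consumes.

```python
from dataclasses import dataclass

@dataclass
class EmotionState:
    """Hypothetical per-emotion intensities reported for a character."""
    joy: float = 0.0
    anger: float = 0.0
    fear: float = 0.0

# Hypothetical mapping from emotions to facial animation clips.
EMOTION_CLIPS = {"joy": "face_smile", "anger": "face_scowl", "fear": "face_wide_eyes"}

def blend_weights(state: EmotionState) -> dict:
    """Normalise emotion intensities into animation blend weights (sum to 1)."""
    raw = {"joy": state.joy, "anger": state.anger, "fear": state.fear}
    total = sum(raw.values())
    if total == 0:
        # Neutral expression: no emotional clip contributes.
        return {clip: 0.0 for clip in EMOTION_CLIPS.values()}
    return {EMOTION_CLIPS[name]: value / total for name, value in raw.items()}

# A mostly-angry, slightly-fearful character leans heavily on the scowl clip.
weights = blend_weights(EmotionState(anger=0.6, fear=0.2))
print(weights)
```

In an engine like Unreal or Unity, these weights would feed a blend space or animation layer so the face tracks the dialogue's emotional tone as it is spoken.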
So for this initial video looking at what Inworld is developing, I wanted to provide my perspective on what the Origins demo is doing, and what the real strengths and weaknesses of the tools are as they currently stand.
In conclusion, it strikes me as an interesting first attempt: the text generation is largely stable and keeps characters narratively consistent. There is a reasonably good balance between reverting to established knowledge and using additional information to flesh out conversational dialogue, preventing it from feeling too rigid. And characters across the board do feel unique and varied from one another, both in terms of their dialogue and their respective ideologies. The pipeline itself is stable, and while I had my issues in spots, it enables real-time conversations using just your voice, and I imagine the tech will only continue to improve in the coming months.
You can try out the Origins demo now on Steam, and I'd encourage you to give it a go and share your thoughts. Plus, if you're a developer keen to experiment with Inworld's tools, check out the links to sign up and start testing the tech in your engine of choice. You can sign up with a free account, start building your own AI characters, and then scale up to the paid subscription tiers if and when you need to.
Thanks once again to Inworld for sponsoring this piece, and I'll be back soon with another episode as we dig deeper into the toolchain and find out from Inworld directly how it all works.