UX rules for building AI-powered professional tools
February 28th, 2025
The first user-facing feature we shipped at HASH powered by a large language model (LLM) was the AI "environment generator" in our browser-based simulation IDE, hCore. This was built atop GPT-3.
Users described in natural language the entities they wanted in their simulation, and behaviors they should exhibit. Our GPT-3 powered workflow would then generate the "initial state" of a HASH simulation containing those agents, complete with references to real agent behavior code (where matching behaviors could be found on our npm-inspired “package manager” for agents) or placeholders (where no code yet existed).
Even in 2025, using large language models to generate production-ready code remains imperfect, so you can imagine some of the challenges faced in early 2020, using models far less powerful than today's — and without the benefit of things like structured outputs which we now take for granted.
Since the start of this decade – from our earliest LLM experimentation and initial feature delivery – it's been obvious that users would need the ability to review generative AI's outputs, easily identify potential mistakes, accept/reject generated outputs granularly, and see for themselves what might've been missed. But how best to do this?
Since 2020, we've been clear that AI features should be design-led. This post captures a lot of what we’ve learned over the last five years about crafting good user interfaces and experiences for interacting with AI. As Jakob Nielsen writes, “AI is the first new UI paradigm in 60 years”. Batch processing (punch cards and complete workflows) gave way to command-based interaction (text-based CLIs and point-and-click GUIs). But now, “intent-based” user interfaces are required.
LLMs are a particularly interesting technology because they upend a lot of "ceteris paribus" assumptions about things that can be held constant. This is certainly true both socially and economically, in terms of their impact on the world… but it's also true in a very narrow sense, in terms of how software products are built, which may both utilize LLMs in their development, and expose LLM-enabled functionality to their users through features.
LLMs are novel because of their ability to recognize patterns and infer structure without being pre-programmed with a specific structure in mind: enabling them to both understand and generate information in a new kind of way. Historically, this is the kind of thing that computers sucked at, and almost all applications built up until ChatGPT’s breakthrough release in November 2022 were predicated on the idea of taking neatly structured data as inputs from users. Hundreds of different user interface components (dropdowns, radio buttons, checkboxes, sliders, toggle switches, etc.) exist to help capture information from users in structured form. And the operating systems, browsers and apps we use day-to-day still largely revolve around the use of these buttons, constrained inputs, and other components.
Because they operate on (and produce) unstructured inputs as well as structured ones, LLMs both unlock and require new kinds of human-computer interfaces (HCIs) that go beyond those established in the first ~50 years of graphical user interface (GUI) design. Some of these were imagined decades ago by sci-fi writers. Others are completely new.
To make the most of LLMs’ capabilities, new interfaces are required for…
Exploring these requirements, especially in combination, reveals a range of new UX patterns and considerations.
This blog post dives into these, exploring common AI failure modes, in particular as they relate to HCIs used for interacting with AI today, and emerging/hypothesized UX solutions to some of these problems.
As the marginal cost of individually deployable intelligences trends towards zero, services will be distinguished by how well they facilitate those units of intelligence coordinating with each other (a backend challenge), and with human users (a frontend/design one). Thoughtfully architected application-programming and human-computer interfaces are key: both “multi-agent systems” and “user experience” design.
The remainder of this article dives deeply into the ways in which generative AI may fall short (or be misused) when integrated into professional tools, along with specific UX solutions for handling such cases. These solutions are rooted in a set of best practices we've catalogued, which guide our thinking around the design of new AI-enabled features:
Microsoft have also published a set of "GenAI best practices" which broadly overlap with our insights above. If you're aware of any other lists of generative AI user experience best practices, or want to suggest an addition, please message us on 𝕏 or get in touch via our contact page.
These principles all apply in addition to standard UI design best-practices (e.g. Nielsen's ten heuristics and Shneiderman’s eight golden rules).
Many forms of failure can be addressed through good UX design patterns (frontend). Others are addressed by modifying approaches to calling AI endpoints, orchestrating agents differently, or rearchitecting systems (backend).
As it is currently written, this blog post primarily serves an internal audience at HASH, namely our designers, helping them understand UX paradigms and product approaches we use, have experimented with, or have observed in the wild. It also describes some of the backend solutions we use to address common modes of failure, helping make explicit where we think good design can and cannot serve as a substitute for proper technical architecture. We hope our experiences will also prove useful to others building agentic and generative AI-enabled professional tools.
Inference failures refer to mistakes that might be made when inferring information — either from a user’s goal, or from other information available to the AI.
At the heart of generative AI, as an “intent-based” technology, are users’ goals. Users describe what they want, rather than provide specific commands to be executed (a declarative rather than imperative approach).
Goal misunderstanding primarily occurs when an AI fails to infer the user’s correct intent from a given prompt, and optimizes for something other than the user’s actual goal. It may also occur when a goal as expressly stated is correctly understood, but the user’s overall set of preferences are not respected (e.g. hidden or unrevealed preferences, or unstated assumptions, are ignored). In such cases, AI may use objectionable methods to fulfil a goal, which the user would not have approved had they been given the opportunity. This is covered in more detail under the “Control Flow Failures > Constraints” section later on.
Goal misunderstanding can occur for various reasons, but generally results from a user’s goal(s) and preferences being underspecified in their interaction with a system, resulting in ambiguity that the underlying AI then needs to resolve (and may fail to do so optimally).
In the case of free-form text or audio user inputs, objective-laden prompts may take one of several forms:
For each of these, we have a number of tools at our disposal for increasing the information density within, and “resolution” of, these prompts.
Generative and agentic AI interfaces should be "multi-modal", allowing users to prompt them in a variety of ways. Different forms of input may each be better suited to different kinds of task (e.g. blue-sky thinking, directed goal capture, exploratory preference discovery, or iterating on generated outputs). Interfaces should also reflect the different accessibility requirements and interface preferences of users.
Whether through standalone chat applications like ChatGPT and Claude, or application-integrated chat interfaces, text entry through typing is by far the most prevalent means of instructing and interacting with generative AI.
Integrated into other applications, and in particular when found within new "AI native" applications, chat-like interfaces serve a variety of purposes. For example, in 3D modeling applications, they can be used to describe (and generate) textures and transformations which can then be applied to objects or meshes. In new software development environments (IDEs) such as Windsurf/Cursor/Devin, users can highlight specific files or parts of code and ask questions inline, or request targeted changes to precisely selected code (e.g. "write a test for this", "convert this to another language", or "add comments"... without having to paste code back-and-forth between a chat application and their development environment).
Audio processing (specifically facilitating voice input) is the AI interface that is perhaps fastest-growing in popularity. It may be culturally-preferred by certain kinds of users, and/or particularly well-suited to helping capture less well-articulated prompts at an earlier stage of realization. Today, audio input takes one of two primary forms:
Other "AI-enabled" uses for voice exist although are relatively underexplored (e.g. voice as a navigational interface) but these are beyond the scope of this article.
Allowing the AI to “see” provides another interesting interface for interacting with it. This can be done through a variety of inputs.
Webcam access can be requested, providing a direct view of the user as they interact with an application. This may be used to ascertain the user's emotion or mood, or take cues from a user's environment (utilizing it as additional context to go along with a prompt). For example, Hume’s expression measurement API supports inferring a user’s emotional state, which may be used to tailor a response more appropriately.
Drawings/sketches, in particular as enabled by freeform digital canvases, are another emerging interface paradigm which provide users with the means of "visually" guiding AI.
We originally integrated a freeform canvas, tldraw, into HASH in April 2023. Although still free, tldraw subsequently switched to (and is now only available under) a proprietary license, and we maintain a separate, open-source fork of it. Our initial motivation for adopting it was to support users in building dashboards using composable Block Protocol blocks (a standard we developed, and which we successfully added support for inserting onto canvases built with tldraw). This worked well, but in the context of AI UX, it is the non-block elements which actually make freeform canvases interesting for AI (e.g. drawn annotations, and simple shapes/lines).
Specifically, freeform canvases that support drawing are useful for:
We also utilize canvases in HASH to display entity and ontology graphs, AI planning processes, and multi-step "flows" to users.
Allowing users to attach arbitrary files may leverage text, audio, or visual processing. Any kind of attachment may be accepted and processed (with the appropriate model), although most commonly images, videos, audio, and documents containing text (PDF, DOC) or a mix of these (PPT) are accepted. These allow users to leverage and reference information contained within artifacts that already exist. The benefits of this are multi-fold...
The two major challenges in prompt composition are completeness and specificity. In both text-to-text and text-to-image generation, writing prompts can be time-consuming and users may take shortcuts, or give up prematurely. This tendency can be reduced by intelligently evaluating a user’s input while they are still in the process of providing it, in order to help them expand and refine their prompt prior to submission.
Autocompletion that “directly follows” a user's input is a helpful way of speeding up the text-input process.
Typically this is done once a user has begun typing a word or sentence. It can also be done on the basis of supplementary information or environmental context. For example, in an LLM-enabled email composition context, Gmail will often autosuggest a salutation to begin an email with, based on a known recipient's name.
"Tab to complete" was initially popluarized in programming IDEs by GitHub Copilot. This concept was subsequently extended further by "AI native" IDEs like Cursor, which provide multi-line autocompletions capable of not only "finishing a line" but also making accompanying changes elsewhere in a file (e.g. adding a required import
statement at the top of a file, as well as utilizing it further down).
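As a rough illustration, the sketch below (in TypeScript, assuming a hypothetical /api/complete endpoint and a showGhostText render helper, neither of which is a real product API) shows the basic debounce-then-suggest loop behind "tab to complete" interfaces.

```typescript
// Minimal sketch of debounced "tab to complete" suggestions, assuming a
// hypothetical /api/complete endpoint that returns a single continuation.
let debounceTimer: ReturnType<typeof setTimeout> | undefined;

function onEditorChange(
  textBeforeCursor: string,
  showGhostText: (suggestion: string) => void,
) {
  clearTimeout(debounceTimer);
  // Wait for a pause in typing before requesting a completion.
  debounceTimer = setTimeout(async () => {
    const response = await fetch("/api/complete", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prefix: textBeforeCursor }),
    });
    const { completion } = await response.json();
    // Render as dimmed "ghost text"; the user presses Tab to accept it.
    if (completion) showGhostText(completion);
  }, 300);
}
```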
Autosuggestion that provides users with a range of generatively created prompt modifiers, based on what the AI thinks the user is looking for, can be used to help individuals who don't already know what they want to say add specificity to their prompts.
The above video showcases an example from HASH. Once a user has entered a partial prompt, suggested additions and amendments which help clarify the user's intent are displayed above the input. Clicking these modifies their already-entered text to reflect the newly elicited information. This helps provide AI with more information ahead of runtime about a user's preferences.
This is particularly useful for capturing additional specifications that may not otherwise have occurred to a user. Does the user traveling from A to B have any accessibility requirements? When are they setting off? Is a specific mode of transport preferred? Do they have a fixed budget? Do they have any companions, and if so, how many?
Exposing settings to users can assist them in adding useful additional meaning to their prompts, helping guide the AI in exhibiting the kinds of behavior that users expect of it, in response to their requests.
In some applications, like ChatGPT, users can choose between AI models to get the best result for their needs. Ideally these are exposed simply in a way that communicates their relative trade-offs (e.g. "fastest but least accurate" vs "slowest but most accurate"), to eliminate uncertainty users may face when choosing models.
In other applications, such as image generator Midjourney, model parameters can be appended directly to a prompt to guide their model's behavior.
Parameters may be applied from an "advanced settings" menu, which users aren't required to interact with, but within which they can specify parameters in order to more granularly influence outputs.
Parameters may also be copied from previous interactions, enabling users to trivially copy across settings they've used (successfully) in the past, helping to ensure consistency between output formats and styles (which may be derived in part or whole from parameters).
When exposing parameters to users, try to ensure they have consistent scales. As Jakob Nielsen notes, Midjourney does not. For example: quality may range from 0.25 to 1, chaos from 0 to 100, stylize from 0 to 1,000, and weird from 0 to 3,000. This inconsistency makes it difficult for users to understand (never mind remember) how these parameters work.
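A minimal sketch of one way to avoid this, using the Midjourney-style ranges quoted above purely as illustration: expose every parameter on a consistent 0–1 scale, and map to each tool's native range behind the scenes.

```typescript
// Sketch: expose every parameter to users on a consistent 0–1 scale, mapping
// to each tool's native range internally (ranges are the examples cited above).
const nativeRanges: Record<string, { min: number; max: number }> = {
  quality: { min: 0.25, max: 1 },
  chaos: { min: 0, max: 100 },
  stylize: { min: 0, max: 1000 },
  weird: { min: 0, max: 3000 },
};

/** Convert a 0–1 slider value into the parameter's native scale. */
function toNativeValue(parameter: string, normalized: number): number {
  const { min, max } = nativeRanges[parameter];
  return min + normalized * (max - min);
}

// e.g. a "stylize" slider at 0.5 becomes 500 in the underlying prompt.
```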
Presets are named collections of parameters that might be saved for easy reuse, to produce outputs in a particular "style".
Modes are similar, and may combine presets with explicit prompt appendages (e.g. a system prompt) to further induce output generation of a particular kind. A prompt appendage might look something like "roleplay", instructing a model to "Imagine you're Bill Gates, and you've been tasked with providing business advice..." This prompting paradigm emerged quickly in the wake of ChatGPT’s initial launch, as users discovered that asking the model to “imagine you’re [specific person]” or “pretend you’re a world-leading [role]” often improved its ability to answer related questions in a way human users found helpful. If the model understood you wanted it to answer as if it was Bill Gates providing business advice, Vitalik Buterin theorizing about blockchains, or Margaret Thatcher discussing politics, it would be able to do a much better job aligning its answer with users’ expectations. This “specificity by proxy” became a shorthand way of telling models to act a certain way.
UX can make it even easier for users to utilize presets or switch into certain modes.
OpenAI’s “custom” GPTs (found in what is sometimes colloquially referred to as the “GPT store”) provide a way to utilize modes developed by others, additionally sometimes providing access to specific data sources unique to that GPT, rather than solely following a custom system prompt. Its downside is that it requires users to identify the correct GPT to use ahead of time, before beginning a chat.
Other startups like Delphi also allow chatting with personas of specific people (in some cases developed with them, to provide additional “context” beyond that which a general-purpose model might have available to it).
We included a "style switcher" and ability to save/share styles in our original Block Protocol chat widget, predating the existence of both OpenAI's GPT Store, and the "styles" feature offered in Claude.ai (which provides Normal
, Concise
, Explanatory
, and Formal
answer modes by default, alongside the ability to specify custom styles by providing either a writing sample, or description of the desired style).
A great time-saver is to allow users to set "default settings", or "general preferences" that automatically take effect whenever interacting with a system. These allow users to avoid having to reselect or respecify the same presets and parameters over and over.
However, allowing these to be overridden on a per-job or per-conversation basis is important to prevent user frustration. For example, while a user may ordinarily want to generate images only in a specific size or format for social media, they may sometimes want the ability to create assets for use in emails, on a blog, or in other channels, as well. Allowing them to do this at the point of creation, without having to modify their global defaults, is key.
In all cases, it should be clear to users which defaults apply in addition to any job-level settings they’ve specified, and this information should be persistently available.
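A minimal sketch of how per-job overrides might be layered on top of saved defaults; the GenerationSettings shape here is illustrative, not any particular product's schema.

```typescript
// Sketch: resolve effective settings by layering per-job overrides on top of
// a user's saved defaults, so one-off changes never touch global preferences.
interface GenerationSettings {
  imageSize?: string;
  format?: string;
  tone?: string;
}

const userDefaults: GenerationSettings = { imageSize: "1080x1080", tone: "casual" };

function resolveSettings(
  defaults: GenerationSettings,
  jobOverrides: GenerationSettings,
): GenerationSettings {
  // Job-level values win; anything unspecified falls back to the default.
  return { ...defaults, ...jobOverrides };
}

// A one-off email banner, generated without modifying the user's saved defaults:
const effective = resolveSettings(userDefaults, { imageSize: "600x200" });
```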
Oftentimes, despite the use of UX paradigms like autosuggest, prompts remain underspecified at the point of runtime. This can be handled in two different ways.
Proactively, upon prompt submission: before generating a final output, AI systems can evaluate a user's inputted prompt and proactively question whether it is ambiguous in any way.
We shipped "worker questions" significantly ahead of other chat-based AI tools (such as OpenAI Deep Research) adopting the paradigm. It effectively allows for the identification and address of ambiguities prior to work beginning, saving wasted compute and ultimately better-aligning outputs with user expectations.
Proactive questions are also useful for capturing background information about a task or user. These may involve attempts to tease out some of the same sort of information which might previously have been sought from users ahead of time as part of autosuggest interfaces, during prompt composition.
Reactively, during plan execution: new questions may also arise mid-task, and in HASH (unlike other tools today) we allow these to be viewed and resolved as they are realized. Alongside any research job, we show a checklist of outstanding questions. Questions may either be blocking or non-blocking. When blocking questions arise, users receive a notification (even if they happen to be elsewhere within the app).
Reactive/mid-flow questions are helpful for handling unexpected events or external constraints that may be encountered or discovered once a task has already begun. For example, is a particular mode of transport out of action due to a strike? Would the user be open to taking a taxi? Do they have the budget for a helicopter? A user can dream.
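As a rough illustration (not HASH's actual schema), a blocking/non-blocking question might be modeled something like this:

```typescript
// Illustrative sketch of mid-task clarification questions, distinguishing
// blocking questions (which pause work) from non-blocking ones.
interface WorkerQuestion {
  id: string;
  text: string;
  blocking: boolean; // true: work pauses until answered
  answer?: string;
}

function onNewQuestion(question: WorkerQuestion, notifyUser: (message: string) => void) {
  if (question.blocking) {
    // Blocking questions interrupt the user wherever they are in the app.
    notifyUser(`A research task is waiting on you: ${question.text}`);
  }
  // Non-blocking questions simply join the task's outstanding-question checklist.
}
```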
Once outputs have been generated, they may still be misaligned with the user's original goals and preferences. A number of solutions can help tackle this problem.
Generate multiple outputs proactively.
It is not uncommon for users to have “unrealized preferences” and ultimately not know what they want (until they see an example, that is… or more usually an example of what they don’t want). In such cases, prompting users ahead of time to make their prompts more specific and provide additional context may only help to a degree. Instead, generating multiple varied output options, each with their own distinguishing characteristics, can be a helpful step in enabling users to identify desirable and undesirable aspects of an output and articulate what it is they actually want.
In a basic version of this, a single output from a generated set of options can be selected for use or further iteration. In a more sophisticated system that supports a tight user feedback loop, users should be able to describe the aspects of each generated output which they do or don’t like, to guide further future iteration.
Oftentimes AI providers seek to minimize resource utilization (e.g. an image generator only serving up one generation at a time), but such approaches may disproportionately erode users’ ability to iterate effectively. As Gwern points out, returns to good design can be non-linear: beyond a certain point, you benefit from a “perfection premium” (exemplified in companies like Linear and Apple). On the other hand, pursuing cost savings that significantly erode user experiences should be avoided. When AI is being utilized for idea generation, forcing users to increase prompt specificity upfront may impede their ability to explore possibility spaces fully, resulting in users settling for local (rather than global) optima.
Certain common generative AI uses involve asking LLMs to transform information from one form to another. For example, a user may paste an email they’ve drafted into a chat interface and ask a model to “make this email more professional”. For short, simple outputs, it can be easy enough to review a returned message to make sure that it fully reflects the original’s intent. But for lengthier copy-editing users can benefit from diffing.
Diffing is the ability to compare some original input with another version to quickly identify the differences. In our case, one AI-generated output, or an original user input, can be compared to some generated variation of it, to quickly identify where changes have been made, supporting the efficient review of newly-generated content.
Side-by-side diffing: Gigapixel AI’s side-by-side image comparison view is shown below.
GitHub’s side-by-side code comparison view demonstrates how the same principle may be applied to text-based content, including code.
Overlaid/inline diffing: the video below showcases Gigapixel AI's overlaid image comparison tool, another way of allowing users to diff changes between content.
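For text-based content, an inline diff can be produced with an off-the-shelf library; a minimal sketch using the open-source "diff" (jsdiff) package might look like this:

```typescript
// Sketch of inline text diffing using the open-source "diff" (jsdiff) package,
// marking insertions and deletions for review before acceptance.
import { diffWords } from "diff";

function renderInlineDiff(original: string, revised: string): string {
  return diffWords(original, revised)
    .map((part) => {
      if (part.added) return `<ins>${part.value}</ins>`;
      if (part.removed) return `<del>${part.value}</del>`;
      return part.value;
    })
    .join("");
}

// e.g. comparing a drafted email with its "more professional" rewrite:
renderInlineDiff(
  "Thanks a ton for the chat!",
  "Thank you for taking the time to speak with me.",
);
```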
Support incremental reviews where more than one change is made at a time. Letting users accept specific changes or aspects of a generation without accepting all changes is an important aspect of managing user expectations, and avoiding frustration. In particular when regenerating assets (e.g. a new iteration of an image, or "round of edits" on a text), allowing users to granularly accept/reject individual changes, without requiring their agreement to all of an AI’s proposed changes, can help provide users with a sense of control.
This can be combined with direct editing to enable inline-adjustment of content, beyond simple approval/rejection. User edits can in turn be used to inform evaluation and better future generation, either on an individualized basis (remembering a user's preferences and offering outputs "personalized" to their tastes) or globally (feeding back into future model training, where privacy policies permit).
Example: Our internal AI-powered PR reviewer at HASH functions by leaving individual suggestions on GitHub Pull Requests, reviewing code changes made by codebase contributors. Because of this, its suggestions can be individually accepted or ignored, as appropriate. Suggestions are left inline on the relevant parts of code they refer to, and can be responded to during the Pull Request review process.
Conduct post-output evaluations to check with users (on the frontend) if outputs were as they expected. Answers can then be fed back into tests and evaluation suites (on the backend) and used to guide future generation (e.g. reinforcement learning from human feedback). Ultimately, the idea here is to collect information that helps identify when users receive suboptimal outputs, and improve the future generation process (either by finetuning models, or improving the UX to ensure less goal misunderstanding).
After a deliverable has been produced, post-output iteration helps users take it the final mile, making outputs usable. To do this, good UX facilitates follow-on manipulation of outputs, and intelligent exploration of potential change options.
Retry buttons are common fixtures in generative AI tools today, especially useful for quickly giving something another shot when a prompt may have been left initially vague, and a user is looking for generative assistance in exploring a possibility space, rather than looking to hone subsequent generation attempts in a more specific direction.
Apps like Claude and ChatGPT let users regenerate an answer without providing any new information, letting them flick left/right between outputs. This UX isn’t great, as it doesn’t support side-by-side comparison (diffing, covered elsewhere).
Similarly, Midjourney allows “redoing” an entire generation attempt (in addition to supporting “subtle” and “major” variant generation of individual outputs) — but in the Midjourney feed, these do not subsequently appear grouped alongside the original prompt, but separately (due to its reverse-chronological ordering). Rendering user inputs and AI-generated outputs as nodes on a canvas provides a much more traceable way of exploring these relationships (see: multi-view outputs).
Applications may also offer a retry but modify capability, allowing for regeneration attempts to be supplemented by some additional direction or information. This may take the form of a simple text/voice prompt providing unstructured comments, or more contextual feedback (inline markup, or “point to select” comments). This latter approach involves letting users leave feedback on specific parts of an output (e.g. annotations on an image or process map, or comments on specific parts of generated text) which can help direct AI’s attention to improving those parts of an output that require attention, without inadvertently modifying elements which are already good and do not need further change.
While whole-output text and voice instructions are suitable for certain kinds of things (e.g. style mapping, or blank slate generation: "now convert the scene from summer to winter"), they can be overly-blunt instruments for making more targeted changes. For example, requiring a user to describe in words changes to images such as "Remove the blemish from the person’s face" might not always result in the best outcome: with a model misidentifying a "blemish", or proceeding to solve the problem by giving the person a new face entirely. Providing specific user interface affordances (e.g. an image region selector) that allows users to highlight a specific spot, and then provide a descriptive prompt regarding the change they would like to see in that bounded region, is more likely to deliver the user's intended result without the risk of unintentionally changing other parts of an image.
Continuation is the provision of sensible next steps and “follow-on” suggestions to users.
OpenAI’s new image generation does a very good job of this. For any individual output the model will always suggest at least 3 potential relevant/targeted changes to the image that could be made (e.g. “add texture to the background”, “make the foreground elements glow”, “remove the man-made objects from this scene”).
Outside of a text-to-image context, for example if generating a list (e.g. of “business growth strategies”) for a user, having a “suggest more items” button at the bottom of the list can similarly function to allow users to continue their search. This might be clicked indefinitely, or at a certain point, when good-fit recommendations are exhausted, a system may instead prefer to “bail out” and provide users with non-AI alternatives, or suggest an alternative form of follow-up. In this example, if a user has exhausted all of the sensible suggestions a system can offer, alternative options such as a “Talk to a human expert” or “Search the web for more options” may be provided. Complementary (but alternative) tasks or topics to explore can also be suggested.
Common, recurring “quick actions” should be provided as continuation options. For example, when providing text-based answers to user’s questions, consider including commonly sought-after modifiers such as “use simpler language”, “make this more concise”, “expand upon this in more detail”, or "make the tone more [casual/formal]".
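A minimal sketch of such quick actions, with the labels and follow-up instructions below purely illustrative:

```typescript
// Sketch: common "quick actions" rendered as buttons, each mapping to a
// follow-up instruction appended to the conversation.
const quickActions: Record<string, string> = {
  "Simplify": "Rewrite the previous answer using simpler language.",
  "Condense": "Make the previous answer more concise.",
  "Expand": "Expand upon the previous answer in more detail.",
  "Casual tone": "Rewrite the previous answer in a more casual tone.",
};

function onQuickActionClick(
  label: keyof typeof quickActions,
  sendFollowUp: (instruction: string) => void,
) {
  sendFollowUp(quickActions[label]);
}
```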
Exploration (aka. elaboration) helps users discover parts of outputs in more detail. Allowing users to highlight specific parts of a previously generated output for more focused follow-up lets them deep-dive into specific areas that may be of more interest.
This can be exposed to users in a wide variety of ways:
These UX affordances allow users to conduct research without having to manually type in follow-up questions, or drag a chat thread off-topic.
Narrowing helps users sort and filter outputs into more useful forms, restructuring them and reducing their size.
Sorting by some parameter allows results to be ordered in a particular way (e.g. “by distance from my current location”). Users may provide an unstructured prompt to a model requesting results be sorted, or inline controls may be provided, such as sort buttons in a table (where results are outputted tabularly), or follow-up continuation prompts suggested (where they are not).
Filtering by some parameter is useful for discarding results that aren't of interest to a user (e.g. “only show restaurants with a rating of four stars or more”). Filters can be applied in the exact same way as sorts.
Ranking and truncating combines both sorting and filtering to force an AI model to identify the "top n" options that best fit the user's intent, given all known information, and display them in order of fit (e.g. "show me the top 3 options"). Ranking and truncating list-based outputs is a common request of users, and therefore well-suited to continuation prompt recommendation.
Editing lets users modify outputted information.
Direct editing lets users modify outputs inline without requiring a “regeneration” process. For example, if one line of text simply needs amending, a user may be allowed to edit this directly, prior to use, rather than requiring that they copy and paste it out into another application in order to edit it before sharing.
The ability to undo any AI-generated change or action, reverting to a previous state easily, is also helpful: it lets users try using AI to unstick themselves, then revert that action should it not work as expected. Undoing an action should also expunge it from context (for example, in the case of a conversational chat thread with an AI). The ability to undo changes is useful in the event an action is accidentally triggered, as well as when an AI system's output is not wholly as intended.
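One way this might be handled, sketched below with an illustrative Turn type: undoing removes both the assistant's turn and the user turn that triggered it from the conversation history, so neither re-enters context.

```typescript
// Sketch: undoing an AI action also removes it from the conversation history,
// so the reverted exchange no longer influences future context.
interface Turn {
  id: string;
  role: "user" | "assistant";
  content: string;
}

function undoLastExchange(history: Turn[]): Turn[] {
  // Drop the last assistant turn and the user turn that triggered it, so
  // neither re-enters the model's context on subsequent requests.
  const lastAssistantIndex = history.map((t) => t.role).lastIndexOf("assistant");
  if (lastAssistantIndex === -1) return history;
  return history.slice(0, Math.max(0, lastAssistantIndex - 1));
}
```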
An AI’s output may or may not be “consumed” within the application in which it is created. Allowing users to easily export outputs while preserving any metadata, formatting, or additional information associated with them, can aid users in deriving value from AI-generated work.
For example, ChatGPT allows copying as markdown. This can be pasted into Google Docs or any other markdown-supportive editing application with its original formatting and styling preserved.
In our own usage of GPT-4o’s image generation capabilities and Midjourney, we regularly find ourselves downloading images, and further-upscaling them using third-party software like Topaz Gigapixel. From there we then export into Photoshop for final tweaking and editing, before exporting for use in documents, on the web, or in Figma (within which assets may be transformed again as part of user interface designs). This example shows how pipelines around generated asset production could benefit from a greater degree of integration.
In practice, post-output iteration typically combines multiple approaches. For example, a user might: (1) ask for a list of restaurants near them, sorted by distance; (2) request that the list be filtered to only include restaurants rated four stars or higher; (3) ask that the list be extended as it’s now too short; (4) identify a Chinese restaurant they like the look of, and (now feeling inspired) ask for more Asian options nearby, perhaps at a lower price point; (5) send the resulting list to a chat group for friends to review. This process of repeatedly expanding and narrowing results is sometimes referred to as “accordion editing”, and it can be observed across many uses of LLMs. For example, in the case of AI-assisted copywriting, an LLM may be asked to add some missing content, and then requested to condense the resulting output to reduce some text’s size back down to target length.
“Goal misunderstanding” may also occur when users ask for tasks to be completed which are beyond the comprehension or capabilities of a system.
Sometimes users will ask “can you do X?” – “can you help me create a recipe?”, “can you remind me when my sister’s birthday is approaching?”, “can you access the web to incorporate real-time information as well as data in your training set?”
Other times users won’t ask first, but will simply task a system with doing something it may or may not be able to do (e.g. “set a timer for 5 minutes”).
Ensuring a system is aware of its own limitations is a prerequisite to assisting users who ask for the impossible. A system may be unable to help a user because they are asking for something which cannot be outputted due to:
In all of these cases, whether a system lacks access to the available data or tools, or whether it is technically or ethically constrained, handling impossible requests without causing user frustration boils down to two key elements.
While educational material can be provided to users ahead of time, before they use a product, no assumption should be made that it has ever actually been consumed, and users may not even be aware of it.
Handling unsatisfiable requests: If a user makes a request that cannot be satisfied, communicating the reason for this is the first step in helping them move forward.
Answering questions about capability boundaries: Users may also ask models about their capabilities directly. These may be direct requests for information, or indirect questions — for example, a user asking “Can you help me come up with a series of Star Trek-themed cocktails?” probably isn't just asking the system about the kinds of tasks it can be used for, but would also actually like an answer to their question. Tell them about Captain Janeway’s Irish Coffee, and Par’Mach on the Beach, if possible.
In the event a request cannot be handled, suggesting alternative queries to a user rather than simply showing them an error message may help them continue to use a system in a way that results in some value.
Context misunderstanding occurs when AI fails to parse information in context correctly, independent of its understanding of the user’s goal. This may take many forms, for example...
These aren’t issues that better UX design of AI-enabled applications can solve, but they are problems that better information architecture of source information (which may be subject to AI inference) can help address. If you’re mainly conducting inference over your own data, it’s a good reminder to ensure that your internal docs and information are all clearly labeled and well-structured.
On the backend at HASH, a large amount of our engineering work has been dedicated to reducing the incidence of context misunderstanding. Some of the techniques we use to tackle these issues include:
Inclusion failures occur when information that is relevant to a query is not accessible, found, prioritized for inclusion in context, or otherwise used effectively.
Accessibility issues occur when existing information that would be helpful in completing a task is not accessible by an AI worker or platform. At minimum, this requires backend support, but greatly benefits from frontend interfaces for setting up and configuring solutions.
World access refers to letting AI systems connect to the outside world, obtain new information for themselves, and potentially take certain kinds of actions. For example, providing agents with network access, the ability to run search queries, or web-browsing capabilities. More exotic things like providing agents with the ability to control real-world sensor deployments or surveillance equipment (e.g. satellite tasking, or video camera movement) may also be achieved.
Personal access lets AI systems "see through the eyes" of a user, as if they're them. It may involve:
HASH solves these problems through integrations which provide one-way (read-only) or bidirectional (read and write) access to data, and plugins (such as the HASH Browser Extension), which can be installed locally on a user's device.
Facilitating effective personal access requires overcoming various authentication and authorization challenges which are discussed in more depth under the “Identity failures” section later on.
Discovery failures occur when relevant information within an accessible pool, which could be retrieved and placed into task-specific context, is not. In contrast to accessibility failures, which occur because a system cannot access information, discovery failures refer to a system's failure to effectively utilize sources that are available to it.
Discovery failures are primarily addressed on the backend, for example by adopting...
HASH automates the generation of optimized chunks for all information connected to it (ingested directly, or sourced from within integrated applications).
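As a generic sketch (not our actual pipeline), chunking ingested text with overlap for later embedding and retrieval might look something like this:

```typescript
// Sketch of a chunking step: split ingested text into overlapping, embeddable
// chunks (the sizes here are illustrative).
interface Chunk {
  sourceId: string;
  text: string;
  index: number;
}

function chunkText(sourceId: string, text: string, size = 800, overlap = 200): Chunk[] {
  const chunks: Chunk[] = [];
  // Overlap preserves sentences that would otherwise be cut at chunk boundaries.
  for (let start = 0, index = 0; start < text.length; start += size - overlap, index++) {
    chunks.push({ sourceId, text: text.slice(start, start + size), index });
  }
  return chunks;
}

// Each chunk would then be embedded and indexed for retrieval at query time.
```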
On the frontend, various controls can also be helpful in minimizing the incidence of discovery failures.
Allow users to narrow the information search space to a smaller pool of data. When specifying “Goals” for “workers” (AI agents) in HASH, we ask users to choose which data sources (‘webs’) they want to pull from. While “All public HASH webs” (knowledge graphs) are an option, and access to the “world wide web” can be provided in addition… searches can also be limited to one's own web (existing graph), or to individual others'.
Make it easy to ensure specific relevant context is included alongside initial prompts, via simple UX affordances:
Provide special emphasis, structure, or a search template to assist AI in extracting the most relevant information:
As a strongly-typed knowledge graph, within HASH we are able to let users further narrow their search for information down to entities of specific types (e.g. ‘companies’, ‘people’ or any kind of entity) which they want to pay special attention to.
We can also support filtering down to the specific attributes of types that are of interest (employment relationships, contact information, etc.)
Prioritization failures occur when a system fails to effectively judge and evaluate what information identified as potentially relevant should in fact be included in task-specific context.
Prioritization failures result in three kinds of errors:
A range of backend approaches to ranking for relevancy will be covered in a future hash.dev blog post and linked to from here.
On the design side, various UX paradigms are available.
Allowing users to “purge” unwanted information from context is known as "context cleaning". Sometimes you want to make the AI take certain things into account, and remember them… while other times you want to be able to make it forget. Approaches include offering users:
Allowing users to temporarily (or permanently) purge specific, or all, global cross-task “memories” is also important. Some AI platforms store information about users for re-use/reprovision as context later. These are typically referred to as “memories”. Memories may be provided as context if a platform deems them relevant to a given prompt/task. Typically this feature is opt-in, or users are at least given the ability to opt out, should they wish. Interfaces for managing individual memories can be exposed in a number of different forms.
Tabular interfaces: tables can be used to show the memories, facts and context that a system holds, and allow users to selectively prune it (ChatGPT-style).
Chat interfaces can even be used to allow users (with a bit less certainty and observability) to instruct systems to forget certain facts, or avoid using certain pieces of information (e.g. "My [family member] died; I don't need reminding of them").
Allowing users to “switch context” by toggling between different "profiles" (which allow users to maintain multiple separate sets of context and memory: e.g. one identity at work, and another at home) may also help address both discovery and prioritization failures.
Even when information is discovered and prioritized effectively, a model may not have a large enough context window to fit all of the important, relevant information necessary to accurately perform a task in a way that is aligned with a user’s expectations and goals. Rather than be a failure of prioritization, therefore, this can be considered a limitation of the underlying model.
The solutions to boundedness exist entirely on the backend. Assuming prioritization of available context items has already been undertaken effectively, and all of the requisite information needed to complete a task cannot be fit into context, various remedies might be attempted.
Relevancy-scoring context involves using a model to rank information in most-to-least important order, and truncate the output ultimately provided. Some information loss is expected, but by prioritizing inclusion of the most important information, as assessed, the hope is that whatever is excluded is of minimal impact.
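A minimal sketch of this approach, assuming relevance scores and token counts have already been computed for each candidate item:

```typescript
// Sketch of relevancy-scored truncation: rank candidate context items by a
// model-assigned relevance score, then pack them greedily into a token budget.
interface ContextItem {
  text: string;
  relevance: number; // 0–1, e.g. assigned by a cheap scoring model
  tokenCount: number;
}

function packContext(items: ContextItem[], tokenBudget: number): ContextItem[] {
  const selected: ContextItem[] = [];
  let used = 0;
  for (const item of [...items].sort((a, b) => b.relevance - a.relevance)) {
    if (used + item.tokenCount > tokenBudget) continue; // skip what won't fit
    selected.push(item);
    used += item.tokenCount;
  }
  return selected;
}
```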
Semantic compression of context involves restating or rephrasing information in a bid to improve its concision, increasing the density of information it conveys. By reducing the number of tokens required in order to communicate the same information, more information can be fit into a model's context window. Semantic compression requires an understanding of what information is salient, and what shorthand words/tokens communicate the exact same (or sufficiently similar) information to satisfy the desired intent. Semantic compression is only possible up to a point, and performing it both losslessly and without unintentionally altering content's meaning in subtle ways is challenging. Done incorrectly, ambiguities not present in the original content may be inadvertently introduced, resulting in lower-quality "misguided" outputs.
Decomposing tasks into multiple jobs, for example by using a sliding window approach to slice context up and divide its processing into sub-tasks, can provide another way of overcoming context window size limitations. As with semantic compression, this may introduce other issues, and in highly interlinked problem spaces with lots of relationships between entities it may not be suitable at all, resulting in incorrect answers. For certain kinds of tasks, however, (e.g. constraints-based research) a large amount of context, divided up into chunks, may each be provided to a separate model, resulting in a set of answers that can then be compared side-by-side and filtered down — e.g. to discover a common denominator.
It may be possible to use LLMs which support larger context windows in service of a particular task or goal. At launch, GPT-3 had a context length of 2k tokens. GPT-4 subsequently launched with an 8k-token context window (alongside a 32k variant). Today, OpenAI’s largest publicly available models support 200k-token context windows. But other frontier models offer significantly larger windows and are capable of supporting the provision of far more context.
Since the release of Gemini 1.5 Pro, Google has provided models with context windows of up to 2m tokens (three orders of magnitude — 1000x larger — than those of GPT-3, released just four years prior). Meta's Llama 4 also offers a 10m-token context window, although at least as of this post's last update its efficacy above 128k tokens was severely diminished. Further experimental models which are not yet generally available, such as Magic’s LTM-2-Mini, claim support for up to 100m tokens.
In general, the more context that is used, the slower and more expensive models are to run. But switching out underlying models on an “as-needed” basis may help with certain categories of context-intensive tasks (e.g. large codebase optimization or refactoring). Some models which advertise large context windows (such as Llama4) may experience degraded performance above certain token counts, or exhibit interesting biases around recall, such as being more likely to "forget" information provided in the middle of a context snippet than towards its beginning or end.
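A rough sketch of "as-needed" model selection follows; the model names and window sizes below are placeholders, not a real registry.

```typescript
// Sketch: pick a model based on estimated context size (names and window
// sizes here are hypothetical placeholders).
const models = [
  { name: "fast-small-model", contextWindow: 128_000 },
  { name: "large-context-model", contextWindow: 1_000_000 },
];

function selectModel(estimatedTokens: number): string {
  // Prefer the cheaper/faster model; fall back to larger windows only when needed.
  const candidate = models.find((m) => estimatedTokens <= m.contextWindow);
  if (!candidate) {
    throw new Error("Task exceeds all available context windows; decompose it instead.");
  }
  return candidate.name;
}
```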
Identity failures occur when AI is unable to represent a user effectively, either due to a lack of permissions, credentials, protocols, or tools.
An agent may not be permitted to act on behalf of a user. This can result in “inaccessibility” failures, wherein AI cannot view or use required information or tools.
To solve this problem in HASH, we developed an optional browser extension that users can install, which allows HASH workers to browse the web as if they are a user, utilizing the cookies in their browser. Specific websites can be whitelisted or blacklisted, allowing for granular authorization to be provided and denied. By default, agentic browsing occurs invisibly (or in a minimized browser window), but users can inspect an agent’s browsing activity at any time.
Even if authorized to act on behalf of a user, an agent may have no reasonable means of authenticating to others that this is the case, in order to identify themselves as a legitimate representative – which may be necessary in order to complete their task (e.g. make a reservation, obtain access to sensitive medical records, etc). As with other forms of identity failure, this can result in information inaccessibility.
While solved for agent-web-browsing through our browser plugin (described above), this remains a challenge in agent-to-agent and agent-to-offline (e.g. telephone) communications. Potential solutions may include allowing agents to provide a one-time, short-lived link to counterparties to verify that an agent is a particular representative of a particular person (atop a trusted, ID-verified service - e.g. Worldcoin).
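As an illustrative sketch only (not a scheme we've shipped), such a link might carry a short-lived signed token, here built with the jsonwebtoken library; the secret and scope names are assumptions.

```typescript
// Sketch of one possible approach: a short-lived, signed token that a
// counterparty can verify to confirm an agent represents a given user.
import jwt from "jsonwebtoken";

const SIGNING_SECRET = process.env.AGENT_DELEGATION_SECRET!; // held by the trusted service

function issueDelegationToken(userId: string, agentId: string, scope: string): string {
  // Expires quickly, so a leaked link can't be reused later.
  return jwt.sign({ sub: userId, agent: agentId, scope }, SIGNING_SECRET, { expiresIn: "10m" });
}

function verifyDelegationToken(token: string) {
  // The counterparty (or a verification service) checks the signature and expiry.
  return jwt.verify(token, SIGNING_SECRET);
}
```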
As LangChain’s Harrison Chase notes, “The more agentic an application is, the more an LLM decides the control flow of the application”. Flows in HASH can either be user-defined, or AI-defined:
In this section we consider forms of failure that can occur in AI-defined flows specifically. Namely, when the AI is in charge of the “control flow”. Control flow failures may result from AI’s failure to effectively plan ahead in service of a goal, adapt its behavior according to need, stay focused on its goal, or use the tools at its disposal.
"Planning" refers to the development of multi-step, or multi-action flows for satisfying user's goals (while respecting their preferences).
While largely a backend problem, we expose proposed plans to users in HASH prior to their execution, and allow users to add, remove or refine steps before AI workers begin.
"Adaptation" refers to an ability to modify existing plans in response to new information, in service of an original goal.
In HASH, depending on its nature, a proposed adaptation may or may not require user input. This means some adaptations can be solely handled on the backend, coordinated by "manager" agents, while others involve some frontend interaction. Specifically, we solicit user direction on certain adaptations through reactive post-prompt clarification questions (as previously introduced in the “Inference Failures - Goals” section above).
We sometimes find agents working on the wrong things. When it occurs, this normally takes one of two forms:
These problems are largely solvable on the backend through (i) Bayesian inference as a task is under way, regarding its likely probability of success as various avenues are explored, (ii) the introduction of cost functions for agents, and (iii) the introduction of effective “manager” agents that have oversight over the various in-execution strands of work being done in service of a particular task at any given point in time, and which are able to benchmark sub-agents and research approaches against one another, share information about successful strategies across agents, and redirect/pause the work of poorly performing sub-agents. We'll write more about this in a future hash.dev blog post.
However, one frontend innovation worth noting here is that we provide users with live views into the activity of agents, and the ability to terminate individual tasks within a workflow – effectively allowing AI-executed and overseen jobs to be actively managed by human users (when desired). For example, if an AI research worker goes down a particular tangent and is wasting a lot of resources identifying a particular fact which may not be that important, a human user of HASH may instruct the agent to ignore that particular field. This effectively allows for both early-stopping of sub-tasks, and mid-job specification refinement, without terminating an entire job.
Constraint failures occur where a system fails to infer or respect the constraints placed upon it, particularly in developing or executing a plan.
AI may correctly understand a user’s goals, and successfully develop a working plan to fulfil them – but in doing so fail to respect the constraints a user (or system designer) has attempted (or would attempt) to place upon it. Paperclip maximization is perhaps the most famous example of this.
Sensible constraints may be placed upon AI by backend system developers (e.g. limiting network access, setting maximum runtimes, etc.) But frontend UX/UI can also play a role in helping elicit not only goals but also constraints from users. For example, a tool may have access to a paid third-party API through which it can conduct research, but an overall system-imposed cap on its monthly budget. Users may unwittingly create individual research jobs that utilize all of the available usage credits in a billing period. Interfaces that proactively solicit sensible constraints from users (e.g. a “max spend” input in this case) can help rein in AI, heading off potential issues that users may not have even thought of.
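A minimal backend-side sketch of how such a user-supplied "max spend" value might be enforced around paid tool calls (the class and method names are illustrative):

```typescript
// Sketch: a per-job budget guard around paid tool calls, seeded by a
// user-supplied "max spend" input from the job-creation form.
class BudgetGuard {
  private spent = 0;

  constructor(private readonly maxSpendUsd: number) {}

  /** Wrap a paid API call; refuse to run it once the job's budget is exhausted. */
  async spend<T>(estimatedCostUsd: number, call: () => Promise<T>): Promise<T> {
    if (this.spent + estimatedCostUsd > this.maxSpendUsd) {
      throw new Error("Job budget exhausted; ask the user before continuing.");
    }
    const result = await call();
    this.spent += estimatedCostUsd;
    return result;
  }
}
```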
Queryable or invokable resources – “tools” – are extremely useful multipliers of an agent's capabilities. Properly used, tools help agents acquire information, solve problems, and answer questions they otherwise wouldn’t be able to tackle.
Agents may be given…
However, for all the promise of agentic tools, agents often misuse or underutilize tools at their disposal.
Tool misuse and tool underutilization are problems largely addressed on the backend, and are out of scope of this UX design post.
Delivery failures refer to issues that may arise when presenting AI generated outputs to users, showcasing them within a containing application, or assuring users of their integrity.
Wherever possible, AI should be able to generate outputs in the format users request, and natively render these within its own user interface.
Structured outputs allow AI to generate outputs in accordance with specific formatting expectations or requirements — for example, as JSON. By coercing generated outputs into an expected format like markdown, KaTeX, or JSON (conforming to an expected schema), a wide variety of outputs — from rich text documents through to Mermaid diagrams and math equations — can be rendered natively by an application. All it has to do is add support.
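As a rough illustration, an application might validate a model's JSON output against a schema before rendering it; the sketch below uses zod, and the Mermaid-diagram shape is a hypothetical example.

```typescript
// Sketch of coercing a model's output into an expected structure: ask for JSON
// matching a schema, then validate with zod before rendering.
import { z } from "zod";

const diagramSchema = z.object({
  title: z.string(),
  mermaid: z.string(), // Mermaid source the app can render natively
});

function parseModelOutput(raw: string) {
  // Throws if the model's JSON doesn't conform, so malformed outputs never
  // reach the renderer; callers can retry or fall back to plain text.
  return diagramSchema.parse(JSON.parse(raw));
}
```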
Artifacts allow AI to program its own renderers to allow for the visualization of rich outputs. Code generation allows for a wide array of items to be rendered: websites mocked up, data plotted, 3D renderings generated, etc.
One of the downsides of existing applications’ approaches is that generated outputs tend to be inconsistent in nature, and fail to integrate with users’ existing data, or the wider applications in which they may be embedded. Claude Artifacts are quite literally sandboxed iframes kept away from their containing application, and opinionated choices are made about style frameworks/technologies in AI-generated code (set to ‘smart defaults’ that look good by themselves, but typically clash when attempting to re-use or integrate outputted code in external codebases — e.g. through global variable pollution, dependency duplication, etc.) The same is true for "frontend code generators" such as v0 from Vercel.
Where opinionated choices are not provided to code generators by platforms, and are also not specified by users, rather than defaulting to good environment-agnostic practices, models produce wildly inconsistent outputs and adopt erratic styles.
The Block Protocol, developed by HASH, is a standardized way of defining frontend interface components, called “blocks”. HASH is architected around the protocol, which means: data is stored in accordance with its graph model; APIs can be called through a standardized approach to services; and hooks – which expose UI components such as file selectors and query constructors – are available for blocks to utilize. This allows for inline “blocks”, truly integrated UI components, to be generated in response to user inputs.
Outputs may become harder to review, especially as the chain of prompts, actions or events that were involved in their generation grows longer.
In part, this is the result of existing applications’ low information density - or inability to let users “zoom out” to see things relationally and in context.
For example, ChatGPT shows you one chat message at a time. In long threads, people regularly get lost while scrolling. This is especially true if a chat thread contains many iterations/variations on the same output (e.g. someone asking something to be rewritten slightly differently, or regenerated with one or two points changed each time, where “at a glance” outputs look substantially the same, but in fact have subtle differences).
But chat applications are not alone amongst generative AI tools in suffering from low information density. Midjourney, similarly, makes tracing derived generations difficult. Rather than situate variants of an image inline, jobs are shown in reverse-chronological order, making image lineage hard to trace.
For all iterative generative AI applications, we strongly recommend providing view options, and location indicators.
Viewport indicators can be used to help users identify their current place within a document. For example, a “minimap” can assist users in identifying their current place within a large information space, enabling them to see how far up/down a conversation thread they are, or what spatial grid of a canvas they are currently zoomed in to. The below image showcases the minimap that appears to users of HASH canvases.
Users may not trust AI-generated outputs, even when they are accurate and correct. While most work to increase the accuracy of answers typically occurs on the backend, user faith can be reinforced by “showing the process” through which answers are generated.
Source trees show exactly what webpages, documents and other information sources were used in the generation of an answer.
For example, when using OpenAI's Deep Research, sources are shown in a right-hand sidebar that accompanies the user's request.
Exposing the "chain of thought" of agentic workflows, or the internal reasoning of "thinking" models such as OpenAI o1 and Deepseek R1, can help reassure users of the common-sense integrity of generated outputs, providing a limited ability to introspect the steps taken to produce an answer. In addition, when streamed to users while in-progress, these can help keep users abreast in cases where runtime inference or multi-step agentic actions take longer than a few seconds to complete. HASH research and graph generation tasks can take many minutes or even hours, and OpenAI Deep Research tasks typically run in excess of 10 minutes apiece.
OpenAI keeps users abreast of Deep Research task progress by showing "Activity" in a sidebar (shown below). These regular progress reports are delivered alongside the source tree (previously shown).
Allowing users to view the original prompt, context provided/utilized, and other information that produced an output can help observers assess its integrity.