UX rules for building AI-powered professional tools
February 28th, 2025
The first user-facing feature we shipped at HASH powered by a large language model (LLM) was the AI "environment generator" in our browser-based simulation IDE, hCore. This was built atop GPT-3.
Users described in natural language the entities they wanted in their simulation, and behaviors they should exhibit. Our GPT-3 powered workflow would then generate the "initial state" of a HASH simulation containing those agents, complete with references to real agent behavior code (where matching behaviors could be found on our npm-inspired “package manager” for agents) or placeholders (where no code yet existed).
Even in 2025, using large language models to generate production-ready code remains imperfect, so you can imagine some of the challenges faced in early 2020, using models far less powerful than today's — and without the benefit of things like structured outputs which we now take for granted.
Since the start of this decade – from our earliest LLM experimentation and initial feature delivery – it's been obvious that users would need the ability to review generative AI's outputs, easily identify potential mistakes, accept/reject generated outputs granularly, and see for themselves what might've been missed. But how best to do this?
Since 2020, we've been clear that AI features should be design-led. This post captures a lot of what we’ve learned over the last five years about crafting good user interfaces and experiences for interacting with AI. As Jakob Nielsen writes, “AI is the first new UI paradigm in 60 years”. Batch processing (punch cards and complete workflows) gave way to command-based interaction (text-based CLIs and point-and-click GUIs). But now, “intent-based” user interfaces are required.
LLMs are a particularly interesting technology because they upend a lot of "ceteris paribus" assumptions about things that can be held constant. This is certainly true both socially and economically, in terms of their impact on the world… but it's also true in a very narrow sense, in terms of how software products are built, which may both utilize LLMs in their development, and expose LLM-enabled functionality to their users through features.
LLMs are novel because of their ability to recognize patterns and infer structure without being pre-programmed with a specific structure in mind: enabling them to both understand and generate information in a new kind of way. Historically, this is the kind of thing that computers sucked at, and almost all applications built up until ChatGPT’s breakthrough release in November 2022 were predicated on the idea of taking neatly structured data as inputs from users. Hundreds of different user interface components (dropdowns, radio buttons, checkboxes, sliders, toggle switches, etc.) exist to help capture information from users in structured form. And the operating systems, browsers and apps we use day-to-day still largely revolve around the use of these buttons, constrained inputs, and other components.
Because they operate on (and produce) unstructured inputs as well as structured ones, LLMs both unlock and require new kinds of human-computer interfaces (HCIs) that go beyond those established in the first ~50 years of graphical user interface (GUI) design. Some of these were imagined decades ago by sci-fi writers. Others are completely new.
To make the most of LLMs’ capabilities, new interfaces are required for…
Exploring these requirements, especially in combination, reveals a range of new UX patterns and considerations.
This blog post dives into these, exploring common AI failure modes, in particular as they relate to HCIs used for interacting with AI today, and emerging/hypothesized UX solutions to some of these problems.
As the marginal cost of individually deployable intelligences trends towards zero, services will be distinguished by how well they facilitate those units of intelligence coordinating with each other (a backend challenge), and with human users (a frontend/design one). Thoughtfully architected application-programming and human-computer interfaces are key: both “multi-agent systems” and “user experience” design.
The remainder of this article dives deeply into the ways in which generative AI may fall short (or be misused) when integrated into professional tools, along with specific UX solutions for handling such cases. These solutions are rooted in a set of best practices we've catalogued, which guide our thinking around the design of new AI-enabled features:
Microsoft have also published a set of "GenAI best practices" which broadly overlap with our insights above. If you're aware of any other lists of generative AI user experience best practices, or want to suggest an addition, please message us on 𝕏 or get in touch via our contact page.
These principles all apply in addition to standard UI design best-practices (e.g. Nielsen's ten heuristics and Shneiderman’s eight golden rules).
Many forms of failure can be addressed through good UX design patterns (frontend). Others are addressed by modifying approaches to calling AI endpoints, orchestrating agents differently, or rearchitecting systems (backend).
As it is currently written, this blog post primarily serves an internal audience at HASH, namely our designers, helping them understand UX paradigms and product approaches we use, have experimented with, or have observed in the wild. It also describes some of the backend solutions we use to address common modes of failure, helping make explicit where we think good design can and cannot serve as a substitute for proper technical architecture. We hope our experiences will also prove useful to others building agentic and generative AI-enabled professional tools.
Inference failures refer to mistakes that might be made when inferring information — either from a user’s goal, or from other information available to the AI.
At the heart of generative AI, as an “intent-based” technology, are users’ goals. Users describe what they want, rather than provide specific commands to be executed (a declarative rather than imperative approach).
Goal misunderstanding primarily occurs when an AI fails to infer the user’s correct intent from a given prompt, and optimizes for something other than the user’s actual goal. It may also occur when a goal as expressly stated is correctly understood, but the user’s overall set of preferences are not respected (e.g. hidden or unrevealed preferences, or unstated assumptions, are ignored). In such cases, AI may use objectionable methods to fulfil a goal, which the user would not have approved had they been given the opportunity. This is covered in more detail under the “Control Flow Failures > Constraints” section later on.
Goal misunderstanding can occur for various reasons, but generally results from a user’s goal(s) and preferences being underspecified in their interaction with a system, resulting in ambiguity that the underlying AI then needs to resolve (and may fail to do so optimally).
In the case of free-form text or audio user inputs, objective-laden prompts may take one of several forms:
For each of these, we have a number of tools at our disposal for increasing the information density within, and “resolution” of, these prompts.
Generative and agentic AI interfaces should be "multi-modal", allowing users to prompt them in a variety of ways. Different forms of input may each be better suited to different kinds of task (e.g. blue-sky thinking, directed goal capture, exploratory preference discovery, or iterating on generated outputs). Interfaces should also reflect the different accessibility requirements and interface preferences of users.
Whether through standalone chat applications like ChatGPT and Claude, or application-integrated chat interfaces, text entry through typing is by far the most prevalent means of instructing and interacting with generative AI.
Integrated into other applications, and in particular when found within new "AI native" applications, chat-like interfaces serve a variety of purposes. For example, in 3D modeling applications, they can be used to describe (and generate) textures and transformations which can then be applied to objects or meshes. In new software development environments (IDEs) such as Windsurf/Cursor/Devin, users can highlight specific files or parts of code and ask questions inline, or request targeted changes to precisely selected code (e.g. "write a test for this", "convert this to another language", or "add comments"... without having to paste code back-and-forth between a chat application and their development environment).
Audio processing (specifically facilitating voice input) is the AI interface that is perhaps fastest-growing in popularity. It may be culturally-preferred by certain kinds of users, and/or particularly well-suited to helping capture less well-articulated prompts at an earlier stage of realization. Today, audio input takes one of two primary forms:
Other "AI-enabled" uses for voice exist although are relatively underexplored (e.g. voice as a navigational interface) but these are beyond the scope of this article.
Allowing the AI to “see” provides another interesting interface for interacting with it. This can be done through a variety of inputs.
Webcam access can be requested, providing a direct view of the user as they interact with an application. This may be used to ascertain the user's emotion or mood, or take cues from a user's environment (utilizing it as additional context to go along with a prompt). For example, Hume’s expression measurement API supports inferring a user’s emotional state, which may be used to tailor a response more appropriately.
Drawings/sketches, in particular as enabled by freeform digital canvases, are another emerging interface paradigm which provide users with the means of "visually" guiding AI.
We originally integrated a freeform canvas, tldraw, into HASH in April 2023. Although still free, tldraw subsequently switched to (and is now only available under) a proprietary license, and we maintain a separate, open-source fork of it. Our initial motivation for adopting it was to support users in building dashboards using composable Block Protocol blocks (a standard we developed, and which we successfully added support for inserting onto canvases built with tldraw). This worked well, but in the context of AI UX, it is the non-block elements which actually make freeform canvases interesting for AI (e.g. drawn annotations, and simple shapes/lines).
Specifically, freeform canvases that support drawing are useful for:
We also utilize canvases in HASH to display entity and ontology graphs, AI planning processes, and multi-step "flows" to users.
Allowing users to attach arbitrary files may leverage text, audio, or visual processing. Any kind of attachment may be accepted and processed (with the appropriate model), although most commonly images, videos, audio, and documents containing text (PDF, DOC) or a mix of these (PPT) are accepted. These allow users to leverage and reference information contained within artifacts that already exist. The benefits of this are multi-fold...
The two major challenges in prompt composition are completeness and specificity. In both text-to-text and text-to-image generation, writing prompts can be time-consuming and users may take shortcuts, or give up prematurely. This tendency can be reduced by intelligently evaluating a user’s input while they are still in the process of providing it, in order to help them expand and refine their prompt prior to submission.
Autocompletion that “directly follows” a user's input is a helpful way of speeding up the text-input process.
Typically this is done once a user has begun typing a word or sentence. It can also be done on the basis of supplementary information or environmental context. For example, in an LLM-enabled email composition context, Gmail will often autosuggest a salutation to begin an email with, based on a known recipient's name.
"Tab to complete" was initially popluarized in programming IDEs by GitHub Copilot. This concept was subsequently extended further by "AI native" IDEs like Cursor, which provide multi-line autocompletions capable of not only "finishing a line" but also making accompanying changes elsewhere in a file (e.g. adding a required import
statement at the top of a file, as well as utilizing it further down).
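As a rough illustration, the sketch below (in TypeScript, assuming a hypothetical /api/complete endpoint and a showGhostText render helper, neither of which is a real product API) shows the basic debounce-then-suggest loop behind "tab to complete" interfaces.

```typescript
// Minimal sketch of debounced "tab to complete" suggestions, assuming a
// hypothetical /api/complete endpoint that returns a single continuation.
let debounceTimer: ReturnType<typeof setTimeout> | undefined;

function onEditorChange(
  textBeforeCursor: string,
  showGhostText: (suggestion: string) => void,
) {
  clearTimeout(debounceTimer);
  // Wait for a pause in typing before requesting a completion.
  debounceTimer = setTimeout(async () => {
    const response = await fetch("/api/complete", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prefix: textBeforeCursor }),
    });
    const { completion } = await response.json();
    // Render as dimmed "ghost text"; the user presses Tab to accept it.
    if (completion) showGhostText(completion);
  }, 300);
}
```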
Autosuggestion that provides users with a range of generatively created prompt modifiers, based on what the AI thinks the user is looking for, can be used to help individuals who don't already know what they want to say add specificity to their prompts.
The above video showcases an example from HASH. Once a user has entered a partial prompt, suggested additions and amendments which help clarify the user's intent are displayed above the input. Clicking these modifies their already-entered text to reflect the newly elicited information. This helps provide AI with more information ahead of runtime about a user's preferences.
This is particularly useful for capturing additional specifications that may not otherwise have occurred to a user. Does the user traveling from A to B have any accessibility requirements? When are they setting off? Is a specific mode of transport preferred? Do they have a fixed budget? Do they have any companions, and if so, how many?
Exposing settings to users can assist them in adding useful additional meaning to their prompts, helping guide the AI in exhibiting the kinds of behavior that users expect of it, in response to their requests.
In some applications, like ChatGPT, users can choose between AI models to get the best result for their needs. Ideally these are exposed simply in a way that communicates their relative trade-offs (e.g. "fastest but least accurate" vs "slowest but most accurate"), to eliminate uncertainty users may face when choosing models.
In other applications, such as image generator Midjourney, model parameters can be appended directly to a prompt to guide their model's behavior.
Parameters may be applied from an "advanced settings" menu, which users aren't required to interact with, but within which they can specify parameters in order to more granularly influence outputs.
Parameters may also be copied from previous interactions, enabling users to trivially copy across settings they've used (successfully) in the past, helping to ensure consistency between output formats and styles (which may be derived in part or whole from parameters).
When exposing parameters to users, try to ensure they have consistent scales. As Jakob Nielsen notes, Midjourney does not. For example: quality may range from 0.25 to 1, chaos from 0 to 100, stylize from 0 to 1,000, and weird from 0 to 3,000. This inconsistency makes it difficult for users to understand (never mind remember) how these parameters work.
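A minimal sketch of one way to avoid this, using the Midjourney-style ranges quoted above purely as illustration: expose every parameter on a consistent 0–1 scale, and map to each tool's native range behind the scenes.

```typescript
// Sketch: expose every parameter to users on a consistent 0–1 scale, mapping
// to each tool's native range internally (ranges are the examples cited above).
const nativeRanges: Record<string, { min: number; max: number }> = {
  quality: { min: 0.25, max: 1 },
  chaos: { min: 0, max: 100 },
  stylize: { min: 0, max: 1000 },
  weird: { min: 0, max: 3000 },
};

/** Convert a 0–1 slider value into the parameter's native scale. */
function toNativeValue(parameter: string, normalized: number): number {
  const { min, max } = nativeRanges[parameter];
  return min + normalized * (max - min);
}

// e.g. a "stylize" slider at 0.5 becomes 500 in the underlying prompt.
```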
Presets are named collections of parameters that might be saved for easy reuse, to produce outputs in a particular "style".
Modes are similar, and may combine presets with explicit prompt appendages (e.g. a system prompt) to further induce output generation of a particular kind. A prompt appendage might look something like "roleplay", instructing a model to "Imagine you're Bill Gates, and you've been tasked with providing business advice..." This prompting paradigm emerged quickly in the wake of ChatGPT’s initial launch, as users discovered that asking the model to “imagine you’re [specific person]” or “pretend you’re a world-leading [role]” often improved its ability to answer related questions in a way human users found helpful. If the model understood you wanted it to answer as if it was Bill Gates providing business advice, Vitalik Buterin theorizing about blockchains, or Margaret Thatcher discussing politics, it would be able to do a much better job aligning its answer with users’ expectations. This “specificity by proxy” became a shorthand way of telling models to act a certain way.
UX can make it even easier for users to utilize presets or switch into certain modes.
OpenAI’s “custom” GPTs (found in what is sometimes colloquially referred to as the “GPT store”) provide a way to utilize modes developed by others, additionally sometimes providing access to specific data sources unique to that GPT, rather than solely following a custom system prompt. Its downside is that it requires users to identify the correct GPT to use ahead of time, before beginning a chat.
Other startups like Delphi also allow chatting with personas of specific people (in some cases developed with them, to provide additional “context” beyond that which a general-purpose model might have available to it).
We included a "style switcher" and ability to save/share styles in our original Block Protocol chat widget, predating the existence of both OpenAI's GPT Store, and the "styles" feature offered in Claude.ai (which provides Normal
, Concise
, Explanatory
, and Formal
answer modes by default, alongside the ability to specify custom styles by providing either a writing sample, or description of the desired style).
A great time-saver is to allow users to set "default settings", or "general preferences" that automatically take effect whenever interacting with a system. These allow users to avoid having to reselect or respecify the same presets and parameters over and over.
However, allowing these to be overridden on a per-job or per-conversation basis is important to prevent user frustration. For example, while a user may ordinarily want to generate images only in a specific size or format for social media, they may sometimes want the ability to create assets for use in emails, on a blog, or in other channels, as well. Allowing them to do this at the point of creation, without having to modify their global defaults, is key.
In all cases, it should be clear to users which defaults apply in addition to any job-level settings they’ve specified, and this information should be persistently available.
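A minimal sketch of how per-job overrides might be layered on top of saved defaults; the GenerationSettings shape here is illustrative, not any particular product's schema.

```typescript
// Sketch: resolve effective settings by layering per-job overrides on top of
// a user's saved defaults, so one-off changes never touch global preferences.
interface GenerationSettings {
  imageSize?: string;
  format?: string;
  tone?: string;
}

const userDefaults: GenerationSettings = { imageSize: "1080x1080", tone: "casual" };

function resolveSettings(
  defaults: GenerationSettings,
  jobOverrides: GenerationSettings,
): GenerationSettings {
  // Job-level values win; anything unspecified falls back to the default.
  return { ...defaults, ...jobOverrides };
}

// A one-off email banner, generated without modifying the user's saved defaults:
const effective = resolveSettings(userDefaults, { imageSize: "600x200" });
```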
Oftentimes, despite the use of UX paradigms like autosuggest, prompts remain underspecified at the point of runtime. This can be handled in two different ways.
Proactively, upon prompt submission: before generating a final output, AI systems can evaluate a user's inputted prompt and proactively question whether it is ambiguous in any way.
We shipped "worker questions" significantly ahead of other chat-based AI tools (such as OpenAI Deep Research) adopting the paradigm. It effectively allows for the identification and address of ambiguities prior to work beginning, saving wasted compute and ultimately better-aligning outputs with user expectations.
Proactive questions are also useful for capturing background information about a task or user. These may involve attempts to tease out some of the same sort of information which might previously have been sought from users ahead of time as part of autosuggest interfaces, during prompt composition.
Reactively, during plan execution: new questions may also arise mid-task, and in HASH (unlike other tools today) we allow these to be viewed and resolved as they are realized. Alongside any research job, we show a checklist of outstanding questions. Questions may either be blocking or non-blocking. When blocking questions arise, users receive a notification (even if they happen to be elsewhere within the app).
Reactive/mid-flow questions are helpful for handling unexpected events or external constraints that may be encountered or discovered once a task has already begun. For example, is a particular mode of transport out of action due to a strike? Would the user be open to taking a taxi? Do they have the budget for a helicopter? A user can dream.
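As a rough illustration (not HASH's actual schema), a blocking/non-blocking question might be modeled something like this:

```typescript
// Illustrative sketch of mid-task clarification questions, distinguishing
// blocking questions (which pause work) from non-blocking ones.
interface WorkerQuestion {
  id: string;
  text: string;
  blocking: boolean; // true: work pauses until answered
  answer?: string;
}

function onNewQuestion(question: WorkerQuestion, notifyUser: (message: string) => void) {
  if (question.blocking) {
    // Blocking questions interrupt the user wherever they are in the app.
    notifyUser(`A research task is waiting on you: ${question.text}`);
  }
  // Non-blocking questions simply join the task's outstanding-question checklist.
}
```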
Once outputs have been generated, they may still be misaligned with the user's original goals and preferences. A number of solutions can help tackle this problem.
Generate multiple outputs proactively.
It is not uncommon for users to have “unrealized preferences” and ultimately not know what they want (until they see an example, that is… or more usually an example of what they don’t want). In such cases, prompting users ahead of time to make their prompts more specific and provide additional context may only help to a degree. Instead, generating multiple varied output options, each with their own distinguishing characteristics, can be a helpful step in enabling users to identify desirable and undesirable aspects of an output and articulate what it is they actually want.
In a basic version of this, a single output from a generated set of options can be selected for use or further iteration. In a more sophisticated system that supports a tight user feedback loop, users should be able to describe the aspects of each generated output which they do or don’t like, to guide further future iteration.
Oftentimes AI providers seek to minimize resource utilization (e.g. an image generator only serving up one generation at a time), but such approaches may disproportionately erode users’ ability to iterate effectively. As Gwern points out, returns to good design can be non-linear: beyond a certain point, you benefit from a “perfection premium” (exemplified in companies like Linear and Apple). On the other hand, pursuing cost savings that significantly erode user experiences should be avoided. When AI is being utilized for idea generation, forcing users to increase prompt specificity upfront may impede their ability to explore possibility spaces fully, resulting in users settling for local (rather than global) optima.
Certain common generative AI uses involve asking LLMs to transform information from one form to another. For example, a user may paste an email they’ve drafted into a chat interface and ask a model to “make this email more professional”. For short, simple outputs, it can be easy enough to review a returned message to make sure that it fully reflects the original’s intent. But for lengthier copy-editing users can benefit from diffing.
Diffing is the ability to compare some original input with another version to quickly identify the differences. In our case, one AI-generated output, or an original user input, can be compared to some generated variation of it, to quickly identify where changes have been made, supporting the efficient review of newly-generated content.
Side-by-side diffing: Gigapixel AI’s side-by-side image comparison view is shown below.
GitHub’s side-by-side code comparison view demonstrates how the same principle may be applied to text-based content, including code.
Overlaid/inline diffing: the video below showcases Gigapixel AI's overlaid image comparison tool, another way of allowing users to diff changes between content.
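For text-based content, an inline diff can be produced with an off-the-shelf library; a minimal sketch using the open-source "diff" (jsdiff) package might look like this:

```typescript
// Sketch of inline text diffing using the open-source "diff" (jsdiff) package,
// marking insertions and deletions for review before acceptance.
import { diffWords } from "diff";

function renderInlineDiff(original: string, revised: string): string {
  return diffWords(original, revised)
    .map((part) => {
      if (part.added) return `<ins>${part.value}</ins>`;
      if (part.removed) return `<del>${part.value}</del>`;
      return part.value;
    })
    .join("");
}

// e.g. comparing a drafted email with its "more professional" rewrite:
renderInlineDiff(
  "Thanks a ton for the chat!",
  "Thank you for taking the time to speak with me.",
);
```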
Support incremental reviews where more than one change is made at a time. Letting users accept specific changes or aspects of a generation without accepting all changes is an important aspect of managing user expectations, and avoiding frustration. In particular when regenerating assets (e.g. a new iteration of an image, or "round of edits" on a text), allowing users to granularly accept/reject individual changes, without requiring their agreement to all of an AI’s proposed changes, can help provide users with a sense of control.
This can be combined with direct editing to enable inline-adjustment of content, beyond simple approval/rejection. User edits can in turn be used to inform evaluation and better future generation, either on an individualized basis (remembering a user's preferences and offering outputs "personalized" to their tastes) or globally (feeding back into future model training, where privacy policies permit).
Example: Our internal AI-powered PR reviewer at HASH functions by leaving individual suggestions on GitHub Pull Requests, reviewing code changes made by codebase contributors. Because of this, its suggestions can be individually accepted or ignored, as appropriate. Suggestions are left inline on the relevant parts of code they refer to, and can be responded to during the Pull Request review process.
Conduct post-output evaluations to check with users (on the frontend) if outputs were as they expected. Answers can then be fed back into tests and evaluation suites (on the backend) and used to guide future generation (e.g. reinforcement learning from human feedback). Ultimately, the idea here is to collect information that helps identify when users receive suboptimal outputs, and improve the future generation process (either by finetuning models, or improving the UX to ensure less goal misunderstanding).
After a deliverable has been produced, post-output iteration helps users take it the final mile, making outputs usable. To do this, good UX facilitates follow-on manipulation of outputs, and intelligent exploration of potential change options.
Retry buttons are common fixtures in generative AI tools today, especially useful for quickly giving something another shot when a prompt may have been left initially vague, and a user is looking for generative assistance in exploring a possibility space, rather than looking to hone subsequent generation attempts in a more specific direction.
Apps like Claude and ChatGPT let users regenerate an answer without providing any new information, letting them flick left/right between outputs. This UX isn’t great, as it doesn’t support side-by-side comparison (diffing, covered elsewhere).
Similarly, Midjourney allows “redoing” an entire generation attempt (in addition to supporting “subtle” and “major” variant generation of individual outputs) — but in the Midjourney feed, these do not subsequently appear grouped alongside the original prompt, but separately (due to its reverse-chronological ordering). Rendering user inputs and AI-generated outputs as nodes on a canvas provides a much more traceable way of exploring these relationships (see: multi-view outputs).
Applications may also offer a retry but modify capability, allowing for regeneration attempts to be supplemented by some additional direction or information. This may take the form of a simple text/voice prompt providing unstructured comments, or more contextual feedback (inline markup, or “point to select” comments). This latter approach involves letting users leave feedback on specific parts of an output (e.g. annotations on an image or process map, or comments on specific parts of generated text) which can help direct AI’s attention to improving those parts of an output that require attention, without inadvertently modifying elements which are already good and do not need further change.
While whole-output text and voice instructions are suitable for certain kinds of things (e.g. style mapping, or blank slate generation: "now convert the scene from summer to winter"), they can be overly-blunt instruments for making more targeted changes. For example, requiring a user to describe in words changes to images such as "Remove the blemish from the person’s face" might not always result in the best outcome: with a model misidentifying a "blemish", or proceeding to solve the problem by giving the person a new face entirely. Providing specific user interface affordances (e.g. an image region selector) that allows users to highlight a specific spot, and then provide a descriptive prompt regarding the change they would like to see in that bounded region, is more likely to deliver the user's intended result without the risk of unintentionally changing other parts of an image.
Continuation is the provision of sensible next steps and “follow-on” suggestions to users.
OpenAI’s new image generation does a very good job of this. For any individual output the model will always suggest at least 3 potential relevant/targeted changes to the image that could be made (e.g. “add texture to the background”, “make the foreground elements glow”, “remove the man-made objects from this scene”).
Outside of a text-to-image context, for example if generating a list (e.g. of “business growth strategies”) for a user, having a “suggest more items” button at the bottom of the list can similarly function to allow users to continue their search. This might be clicked indefinitely, or at a certain point, when good-fit recommendations are exhausted, a system may instead prefer to “bail out” and provide users with non-AI alternatives, or suggest an alternative form of follow-up. In this example, if a user has exhausted all of the sensible suggestions a system can offer, alternative options such as a “Talk to a human expert” or “Search the web for more options” may be provided. Complementary (but alternative) tasks or topics to explore can also be suggested.
Common, recurring “quick actions” should be provided as continuation options. For example, when providing text-based answers to user’s questions, consider including commonly sought-after modifiers such as “use simpler language”, “make this more concise”, “expand upon this in more detail”, or "make the tone more [casual/formal]".
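A minimal sketch of such quick actions, with the labels and follow-up instructions below purely illustrative:

```typescript
// Sketch: common "quick actions" rendered as buttons, each mapping to a
// follow-up instruction appended to the conversation.
const quickActions: Record<string, string> = {
  "Simplify": "Rewrite the previous answer using simpler language.",
  "Condense": "Make the previous answer more concise.",
  "Expand": "Expand upon the previous answer in more detail.",
  "Casual tone": "Rewrite the previous answer in a more casual tone.",
};

function onQuickActionClick(
  label: keyof typeof quickActions,
  sendFollowUp: (instruction: string) => void,
) {
  sendFollowUp(quickActions[label]);
}
```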
Exploration (aka. elaboration) helps users discover parts of outputs in more detail. Allowing users to highlight specific parts of a previously generated output for more focused follow-up lets them deep-dive into specific areas that may be of more interest.
This can be exposed to users in a wide variety of ways:
These UX affordances allow users to conduct research without having to manually type in follow-up questions, or drag a chat thread off-topic.
Narrowing helps users sort and filter outputs into more useful forms, restructuring them and reducing their size.
Sorting by some parameter allows results to be ordered in a particular way (e.g. “by distance from my current location”). Users may provide an unstructured prompt to a model requesting results be sorted, or inline controls may be provided, such as sort buttons in a table (where results are outputted tabularly), or follow-up continuation prompts suggested (where they are not).
Filtering by some parameter is useful for discarding results that aren't of interest to a user (e.g. “only show restaurants with a rating of four stars or more”). Filters can be applied in the exact same way as sorts.
Ranking and truncating combines both sorting and filtering to force an AI model to identify the "top n" options that best fit the user's intent, given all known information, and display them in order of fit (e.g. "show me the top 3 options"). Ranking and truncating list-based outputs is a common request of users, and therefore well-suited to continuation prompt recommendation.
Editing lets users modify outputted information.
Direct editing lets users modify outputs inline without requiring a “regeneration” process. For example, if one line of text simply needs amending, a user may be allowed to edit this directly, prior to use, rather than requiring that they copy and paste it out into another application in order to edit it before sharing.
The ability to undo any AI-generated change or action, reverting to a previous state easily, is also helpful: it lets users try using AI to unstick themselves, then revert that action should it not work as expected. Undoing an action should also expunge it from context (for example, in the case of a conversational chat thread with an AI). The ability to undo changes is useful in the event an action is accidentally triggered, as well as when an AI system's output is not wholly as intended.
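One way this might be handled, sketched below with an illustrative Turn type: undoing removes both the assistant's turn and the user turn that triggered it from the conversation history, so neither re-enters context.

```typescript
// Sketch: undoing an AI action also removes it from the conversation history,
// so the reverted exchange no longer influences future context.
interface Turn {
  id: string;
  role: "user" | "assistant";
  content: string;
}

function undoLastExchange(history: Turn[]): Turn[] {
  // Drop the last assistant turn and the user turn that triggered it, so
  // neither re-enters the model's context on subsequent requests.
  const lastAssistantIndex = history.map((t) => t.role).lastIndexOf("assistant");
  if (lastAssistantIndex === -1) return history;
  return history.slice(0, Math.max(0, lastAssistantIndex - 1));
}
```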
An AI’s output may or may not be “consumed” within the application in which it is created. Allowing users to easily export outputs while preserving any metadata, formatting, or additional information associated with them, can aid users in deriving value from AI-generated work.
For example, ChatGPT allows copying as markdown. This can be pasted into Google Docs or any other markdown-supportive editing application with its original formatting and styling preserved.
In our own usage of GPT-4o’s image generation capabilities and Midjourney, we regularly find ourselves downloading images, and further-upscaling them using third-party software like Topaz Gigapixel. From there we then export into Photoshop for final tweaking and editing, before exporting for use in documents, on the web, or in Figma (within which assets may be transformed again as part of user interface designs). This example shows how pipelines around generated asset production could benefit from a greater degree of integration.
In practice, post-output iteration typically combines multiple approaches. For example, a user might: (1) ask for a list of restaurants near them, sorted by distance; (2) request that the list be filtered to only include restaurants rated four stars or higher; (3) ask that the list be extended as it’s now too short; (4) identify a Chinese restaurant they like the look of, and (now feeling inspired) ask for more Asian options nearby, perhaps at a lower price point; (5) send the resulting list to a chat group for friends to review. This process of repeatedly expanding and narrowing results is sometimes referred to as “accordion editing”, and it can be observed across many uses of LLMs. For example, in the case of AI-assisted copywriting, an LLM may be asked to add some missing content, and then requested to condense the resulting output to reduce some text’s size back down to target length.
“Goal misunderstanding” may also occur when users ask for tasks to be completed which are beyond the comprehension or capabilities of a system.
Sometimes users will ask “can you do X?” – “can you help me create a recipe?”, “can you remind me when my sister’s birthday is approaching?”, “can you access the web to incorporate real-time information as well as data in your training set?”
Other times users won’t ask first, but will simply task a system with doing something it may or may not be able to do (e.g. “set a timer for 5 minutes”).
Ensuring a system is aware of its own limitations is a prerequisite to assisting users who ask for the impossible. A system may be unable to help a user because they are asking for something which cannot be outputted due to:
In all of these cases, whether a system lacks access to the available data or tools, or whether it is technically or ethically constrained, handling impossible requests without causing user frustration boils down to two key elements.
While educational material can be provided to users ahead of time, before they use a product, no assumption should be made that it has ever actually been consumed, and users may not even be aware of it.
Handling unsatisfiable requests: If a user makes a request that cannot be satisfied, communicating the reason for this is the first step in helping them move forward.
Answering questions about capability boundaries: Users may also ask models about their capabilities directly. These may be direct requests for information, or indirect questions — for example, a user asking “Can you help me come up with a series of Star Trek-themed cocktails?” probably isn't just asking the system about the kinds of tasks it can be used for, but would also actually like an answer to their question. Tell them about Captain Janeway’s Irish Coffee, and Par’Mach on the Beach, if possible.
In the event a request cannot be handled, suggesting alternative queries to a user rather than simply showing them an error message may help them continue to use a system in a way that results in some value.
Context misunderstanding occurs when AI fails to parse information in context correctly, independent of its understanding of the user’s goal. This may take many forms, for example...
These aren’t issues that better UX design of AI-enabled applications can solve, but they are problems that better information architecture of source information (which may be subject to AI inference) can help address. If you’re mainly conducting inference over your own data, it’s a good reminder to ensure that your internal docs and information are all clearly labeled and well-structured.
On the backend at HASH, a large amount of our engineering work has been dedicated to reducing the incidence of context misunderstanding. Some of the techniques we use to tackle these issues include:
Inclusion failures occur when information that is relevant to a query is not accessible, found, prioritized for inclusion in context, or otherwise used effectively.
Accessibility issues occur when existing information that would be helpful in completing a task is not accessible by an AI worker or platform. At minimum, this requires backend support, but greatly benefits from frontend interfaces for setting up and configuring solutions.
World access refers to letting AI systems connect to the outside world, obtain new information for themselves, and potentially take certain kinds of actions. For example, providing agents with network access, the ability to run search queries, or web-browsing capabilities. More exotic things like providing agents with the ability to control real-world sensor deployments or surveillance equipment (e.g. satellite tasking, or video camera movement) may also be achieved.
Personal access lets AI systems "see through the eyes" of a user, as if they're them. It may involve:
HASH solves these problems through integrations which provide one-way (read-only) or bidirectional (read and write) access to data, and plugins (such as the HASH Browser Extension), which can be installed locally on a user's device.
Facilitating effective personal access requires overcoming various authentication and authorization challenges which are discussed in more depth under the “Identity failures” section later on.
Discovery failures occur when relevant information within an accessible pool, which could be retrieved and placed into task-specific context, is not. In contrast to accessibility failures, which occur because a system cannot access information, discovery failures refer to a system's failure to effectively utilize sources that are available to it.
Discovery failures are primarily addressed on the backend, for example by adopting...
HASH automates the generation of optimized chunks for all information connected to it (ingested directly, or sourced from within integrated applications).
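As a generic sketch (not our actual pipeline), chunking ingested text with overlap for later embedding and retrieval might look something like this:

```typescript
// Sketch of a chunking step: split ingested text into overlapping, embeddable
// chunks (the sizes here are illustrative).
interface Chunk {
  sourceId: string;
  text: string;
  index: number;
}

function chunkText(sourceId: string, text: string, size = 800, overlap = 200): Chunk[] {
  const chunks: Chunk[] = [];
  // Overlap preserves sentences that would otherwise be cut at chunk boundaries.
  for (let start = 0, index = 0; start < text.length; start += size - overlap, index++) {
    chunks.push({ sourceId, text: text.slice(start, start + size), index });
  }
  return chunks;
}

// Each chunk would then be embedded and indexed for retrieval at query time.
```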
On the frontend, various controls can also be helpful in minimizing the incidence of discovery failures.
Allow users to narrow the information search space to a smaller pool of data. When specifying “Goals” for “workers” (AI agents) in HASH, we ask users to choose which data sources (‘webs’) they want to pull from. While “All public HASH webs” (knowledge graphs) are an option, and access to the “world wide web” can be provided in addition… searches can also be limited to one's own web (existing graph), or to individual others'.
Make it easy to ensure specific relevant context is included alongside initial prompts, via simple UX affordances:
Provide special emphasis, structure, or a search template to assist AI in extracting the most relevant information:
As a strongly-typed knowledge graph, within HASH we are able to let users further narrow their search for information down to entities of specific types (e.g. ‘companies’, ‘people’ or any kind of entity) which they want to pay special attention to.
We can also support filtering down to the specific attributes of types that are of interest (employment relationships, contact information, etc.)
Prioritization failures occur when a system fails to effectively judge and evaluate what information identified as potentially relevant should in fact be included in task-specific context.
Prioritization failures result in three kinds of errors:
A range of backend approaches to ranking for relevancy will be covered in a future hash.dev blog post and linked to from here.
On the design side, various UX paradigms are available.
Allowing users to “purge” unwanted information from context is known as "context cleaning". Sometimes you want to make the AI take certain things into account, and remember them… while other times you want to be able to make it forget. Approaches include offering users:
Allowing users to temporarily (or permanently) purge specific, or all, global cross-task “memories” is also important. Some AI platforms store information about users for re-use/reprovision as context later. These are typically referred to as “memories”. Memories may be provided as context if a platform deems them relevant to a given prompt/task. Typically this feature is opt-in, or users are at least given the ability to opt out, should they wish. Interfaces for managing individual memories can be exposed in a number of different forms.
Tabular interfaces: tables can be used to show the memories, facts and context that a system holds, and allow users to selectively prune it (ChatGPT-style).
Chat interfaces can even be used to allow users (with a bit less certainty and observability) to instruct systems to forget certain facts, or avoid using certain pieces of information (e.g. "My [family member] died; I don't need reminding of them").
Allowing users to “switch context” by toggling between different "profiles" (which allow users to maintain multiple separate sets of context and memory: e.g. one identity at work, and another at home) may also help address both discovery and prioritization failures.
Even when information is discovered and prioritized effectively, a model may not have a large enough context window to fit all of the important, relevant information necessary to accurately perform a task in a way that is aligned with a user’s expectations and goals. Rather than be a failure of prioritization, therefore, this can be considered a limitation of the underlying model.
The solutions to boundedness exist entirely on the backend. Assuming prioritization of available context items has already been undertaken effectively, and all of the requisite information needed to complete a task cannot be fit into context, various remedies might be attempted.
Relevancy-scoring context involves using a model to rank information in most-to-least important order, and truncate the output ultimately provided. Some information loss is expected, but by prioritizing inclusion of the most important information, as assessed, the hope is that whatever is excluded is of minimal impact.
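A minimal sketch of this approach, assuming relevance scores and token counts have already been computed for each candidate item:

```typescript
// Sketch of relevancy-scored truncation: rank candidate context items by a
// model-assigned relevance score, then pack them greedily into a token budget.
interface ContextItem {
  text: string;
  relevance: number; // 0–1, e.g. assigned by a cheap scoring model
  tokenCount: number;
}

function packContext(items: ContextItem[], tokenBudget: number): ContextItem[] {
  const selected: ContextItem[] = [];
  let used = 0;
  for (const item of [...items].sort((a, b) => b.relevance - a.relevance)) {
    if (used + item.tokenCount > tokenBudget) continue; // skip what won't fit
    selected.push(item);
    used += item.tokenCount;
  }
  return selected;
}
```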
Semantic compression of context involves restating or rephrasing information in a bid to improve its concision, increasing the density of information it conveys. By reducing the number of tokens required in order to communicate the same information, more information can be fit into a model's context window. Semantic compression requires an understanding of what information is salient, and what shorthand words/tokens communicate the exact same (or sufficiently similar) information to satisfy the desired intent. Semantic compression is only possible up to a point, and performing it both losslessly and without unintentionally altering content's meaning in subtle ways is challenging. Done incorrectly, ambiguities not present in the original content may be inadvertently introduced, resulting in lower-quality "misguided" outputs.
Decomposing tasks into multiple jobs, for example by using a sliding window approach to slice context up and divide its processing into sub-tasks, can provide another way of overcoming context window size limitations. As with semantic compression, this may introduce other issues, and in highly interlinked problem spaces with lots of relationships between entities it may not be suitable at all, resulting in incorrect answers. For certain kinds of tasks, however, (e.g. constraints-based research) a large amount of context, divided up into chunks, may each be provided to a separate model, resulting in a set of answers that can then be compared side-by-side and filtered down — e.g. to discover a common denominator.
It may be possible to use LLMs which support larger context windows in service of a particular task or goal. At launch, GPT-3 had a context length of 2k tokens. GPT-4 subsequently launched with an 8k-token context window (alongside a 32k variant). Today, OpenAI’s largest publicly available models support 200k-token context windows. But other frontier models offer significantly larger windows and are capable of supporting the provision of far more context.
Since the release of Gemini 1.5 Pro, Google has provided models with context windows of up to 2m tokens (three orders of magnitude — 1000x larger — than those of GPT-3, released just four years prior). Meta's Llama 4 also offers a 10m-token context window, although at least as of this post's last update its efficacy above 128k tokens was severely diminished. Further experimental models which are not yet generally available, such as Magic’s LTM-2-Mini, claim support for up to 100m tokens.
In general, the more context that is used, the slower and more expensive models are to run. But switching out underlying models on an “as-needed” basis may help with certain categories of context-intensive tasks (e.g. large codebase optimization or refactoring). Some models which advertise large context windows (such as Llama4) may experience degraded performance above certain token counts, or exhibit interesting biases around recall, such as being more likely to "forget" information provided in the middle of a context snippet than towards its beginning or end.
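A rough sketch of "as-needed" model selection follows; the model names and window sizes below are placeholders, not a real registry.

```typescript
// Sketch: pick a model based on estimated context size (names and window
// sizes here are hypothetical placeholders).
const models = [
  { name: "fast-small-model", contextWindow: 128_000 },
  { name: "large-context-model", contextWindow: 1_000_000 },
];

function selectModel(estimatedTokens: number): string {
  // Prefer the cheaper/faster model; fall back to larger windows only when needed.
  const candidate = models.find((m) => estimatedTokens <= m.contextWindow);
  if (!candidate) {
    throw new Error("Task exceeds all available context windows; decompose it instead.");
  }
  return candidate.name;
}
```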
Identity failures occur when AI is unable to represent a user effectively, either due to a lack of permissions, credentials, protocols, or tools.
An agent may not be permitted to act on behalf of a user. This can result in “inaccessibility” failures, wherein AI cannot view or use required information or tools.
To solve this problem in HASH, we developed an optional browser extension that users can install, which allows HASH workers to browse the web as if they are a user, utilizing the cookies in their browser. Specific websites can be whitelisted or blacklisted, allowing for granular authorization to be provided and denied. By default, agentic browsing occurs invisibly (or in a minimized browser window), but users can inspect an agent’s browsing activity at any time.
Even if authorized to act on behalf of a user, an agent may have no reasonable means of authenticating to others that this is the case, in order to identify themselves as a legitimate representative – which may be necessary in order to complete their task (e.g. make a reservation, obtain access to sensitive medical records, etc). As with other forms of identity failure, this can result in information inaccessibility.
While solved for agent-web-browsing through our browser plugin (described above), this remains a challenge in agent-to-agent and agent-to-offline (e.g. telephone) communications. Potential solutions may include allowing agents to provide a one-time, short-lived link to counterparties to verify that an agent is a particular representative of a particular person (atop a trusted, ID-verified service - e.g. Worldcoin).
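As an illustrative sketch only (not a scheme we've shipped), such a link might carry a short-lived signed token, here built with the jsonwebtoken library; the secret and scope names are assumptions.

```typescript
// Sketch of one possible approach: a short-lived, signed token that a
// counterparty can verify to confirm an agent represents a given user.
import jwt from "jsonwebtoken";

const SIGNING_SECRET = process.env.AGENT_DELEGATION_SECRET!; // held by the trusted service

function issueDelegationToken(userId: string, agentId: string, scope: string): string {
  // Expires quickly, so a leaked link can't be reused later.
  return jwt.sign({ sub: userId, agent: agentId, scope }, SIGNING_SECRET, { expiresIn: "10m" });
}

function verifyDelegationToken(token: string) {
  // The counterparty (or a verification service) checks the signature and expiry.
  return jwt.verify(token, SIGNING_SECRET);
}
```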
As LangChain’s Harrison Chase notes, “The more agentic an application is, the more an LLM decides the control flow of the application”. Flows in HASH can either be user-defined, or AI-defined:
In this section we consider forms of failure that can occur in AI-defined flows specifically. Namely, when the AI is in charge of the “control flow”. Control flow failures may result from AI’s failure to effectively plan ahead in service of a goal, adapt its behavior according to need, stay focused on its goal, or use the tools at its disposal.
"Planning" refers to the development of multi-step, or multi-action flows for satisfying user's goals (while respecting their preferences).
While largely a backend problem, we expose proposed plans to users in HASH prior to their execution, and allow users to add, remove or refine steps before AI workers begin.
"Adaptation" refers to an ability to modify existing plans in response to new information, in service of an original goal.
In HASH, depending on its nature, a proposed adaptation may or may not require user input. This means some adaptations can be solely handled on the backend, coordinated by "manager" agents, while others involve some frontend interaction. Specifically, we solicit user direction on certain adaptations through reactive post-prompt clarification questions (as previously introduced in the “Inference Failures - Goals” section above).
We sometimes find agents working on the wrong things. When it occurs, this normally takes one of two forms:
These problems are largely solvable on the backend through (i) Bayesian inference as a task is under way, regarding its likely probability of success as various avenues are explored, (ii) the introduction of cost functions for agents, and (iii) the introduction of effective “manager” agents that have oversight over the various in-execution strands of work being done in service of a particular task at any given point in time, and which are able to benchmark sub-agents and research approaches against one another, share information about successful strategies across agents, and redirect/pause the work of poorly performing sub-agents. We'll write more about this in a future hash.dev blog post.
However, one frontend innovation worth noting here is that we provide users with live views into the activity of agents, and the ability to terminate individual tasks within a workflow – effectively allowing AI-executed and overseen jobs to be actively managed by human users (when desired). For example, if an AI research worker goes down a particular tangent and is wasting a lot of resources identifying a particular fact which may not be that important, a human user of HASH may instruct the agent to ignore that particular field. This effectively allows for both early-stopping of sub-tasks, and mid-job specification refinement, without terminating an entire job.
Constraint failures occur where a system fails to infer or respect the constraints placed upon it, particularly in developing or executing a plan.
AI may correctly understand a user’s goals, and successfully develop a working plan to fulfil them – but in doing so fail to respect the constraints a user (or system designer) has attempted (or would attempt) to place upon it. Paperclip maximization is perhaps the most famous example of this.
Sensible constraints may be placed upon AI by backend system developers (e.g. limiting network access, setting maximum runtimes, etc.) But frontend UX/UI can also play a role in helping elicit not only goals but also constraints from users. For example, a tool may have access to a paid third-party API through which it can conduct research, but an overall system-imposed cap on its monthly budget. Users may unwittingly create individual research jobs that utilize all of the available usage credits in a billing period. Interfaces that proactively solicit sensible constraints from users (e.g. a “max spend” input in this case) can help rein in AI, heading off potential issues that users may not have even thought of.
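A minimal backend-side sketch of how such a user-supplied "max spend" value might be enforced around paid tool calls (the class and method names are illustrative):

```typescript
// Sketch: a per-job budget guard around paid tool calls, seeded by a
// user-supplied "max spend" input from the job-creation form.
class BudgetGuard {
  private spent = 0;

  constructor(private readonly maxSpendUsd: number) {}

  /** Wrap a paid API call; refuse to run it once the job's budget is exhausted. */
  async spend<T>(estimatedCostUsd: number, call: () => Promise<T>): Promise<T> {
    if (this.spent + estimatedCostUsd > this.maxSpendUsd) {
      throw new Error("Job budget exhausted; ask the user before continuing.");
    }
    const result = await call();
    this.spent += estimatedCostUsd;
    return result;
  }
}
```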
Queryable or invokable resources – “tools” – are extremely useful multipliers of an agent's capabilities. Properly used, tools help agents acquire information, solve problems, and answer questions they otherwise wouldn’t be able to tackle.
Agents may be given…
However, for all the promise of agentic tools, agents often misuse or underutilize tools at their disposal.
Tool misuse and tool underutilization are problems largely addressed on the backend, and are out of scope of this UX design post.
Delivery failures refer to issues that may arise when presenting AI generated outputs to users, showcasing them within a containing application, or assuring users of their integrity.
Wherever possible, AI should be able to generate outputs in the format users request, and natively render these within its own user interface.
Structured outputs allow AI to generate outputs in accordance with specific formatting expectations or requirements — for example, as JSON. By coercing generated outputs into an expected format like markdown, KaTeX, or JSON (conforming to an expected schema), a wide variety of outputs — from rich text documents through to Mermaid diagrams and math equations — can be rendered natively by an application. All it has to do is add support.
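As a rough illustration, an application might validate a model's JSON output against a schema before rendering it; the sketch below uses zod, and the Mermaid-diagram shape is a hypothetical example.

```typescript
// Sketch of coercing a model's output into an expected structure: ask for JSON
// matching a schema, then validate with zod before rendering.
import { z } from "zod";

const diagramSchema = z.object({
  title: z.string(),
  mermaid: z.string(), // Mermaid source the app can render natively
});

function parseModelOutput(raw: string) {
  // Throws if the model's JSON doesn't conform, so malformed outputs never
  // reach the renderer; callers can retry or fall back to plain text.
  return diagramSchema.parse(JSON.parse(raw));
}
```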
Artifacts allow AI to program its own renderers to allow for the visualization of rich outputs. Code generation allows for a wide array of items to be rendered: websites mocked up, data plotted, 3D renderings generated, etc.
One of the downsides of existing applications’ approaches is that generated outputs tend to be inconsistent in nature, and fail to integrate with users’ existing data, or the wider applications in which they may be embedded. Claude Artifacts are quite literally sandboxed iframes kept away from their containing application, and opinionated choices are made about style frameworks/technologies in AI-generated code (set to ‘smart defaults’ that look good by themselves, but typically clash when attempting to re-use or integrate outputted code in external codebases — e.g. through global variable pollution, dependency duplication, etc.) The same is true for "frontend code generators" such as v0 from Vercel.
Where opinionated choices are not provided to code generators by platforms, and are also not specified by users, rather than defaulting to good environment-agnostic practices, models produce wildly inconsistent outputs and adopt erratic styles.
The Block Protocol, developed by HASH, is a standardized way of defining frontend interface components, called “blocks”. HASH is architected around the protocol, which means: data is stored in accordance with its graph model; APIs can be called through a standardized approach to services; and hooks – which expose UI components such as file selectors and query constructors – are available for blocks to utilize. This allows for inline “blocks”, truly integrated UI components, to be generated in response to user inputs.
Outputs may become harder to review, especially as the chain of prompts, actions or events that were involved in their generation grows longer.
In part, this is the result of existing applications’ low information density - or inability to let users “zoom out” to see things relationally and in context.
For example, ChatGPT shows you one chat message at a time. In long threads, people regularly get lost while scrolling. This is especially true if a chat thread contains many iterations/variations on the same output (e.g. someone asking something to be rewritten slightly differently, or regenerated with one or two points changed each time, where “at a glance” outputs look substantially the same, but in fact have subtle differences).
But chat applications are not alone amongst generative AI tools in suffering from low information density. Midjourney, similarly, makes tracing derived generations difficult. Rather than situate variants of an image inline, jobs are shown in reverse-chronological order, making image lineage hard to trace.
For all iterative generative AI applications, we strongly recommend providing view options, and location indicators.
Viewport indicators can be used to help users identify their current place within a document. For example, a “minimap” can assist users in identifying their current place within a large information space, enabling them to see how far up/down a conversation thread they are, or what spatial grid of a canvas they are currently zoomed in to. The below image showcases the minimap that appears to users of HASH canvases.
Users may not trust AI-generated outputs, even when they are accurate and correct. While most work to increase the accuracy of answers typically occurs on the backend, user faith can be reinforced by “showing the process” through which answers are generated.
Source trees show exactly what webpages, documents and other information sources were used in the generation of an answer.
For example, when using OpenAI's Deep Research, sources are shown in a right-hand sidebar that accompanies the user's request.
Exposing the "chain of thought" of agentic workflows, or the internal reasoning of "thinking" models such as OpenAI o1 and Deepseek R1, can help reassure users of the common-sense integrity of generated outputs, providing a limited ability to introspect the steps taken to produce an answer. In addition, when streamed to users while in-progress, these can help keep users abreast in cases where runtime inference or multi-step agentic actions take longer than a few seconds to complete. HASH research and graph generation tasks can take many minutes or even hours, and OpenAI Deep Research tasks typically run in excess of 10 minutes apiece.
OpenAI keeps users abreast of Deep Research task progress by showing "Activity" in a sidebar (shown below). These regular progress reports are delivered alongside the source tree (previously shown).
Allowing users to view the original prompt, context provided/utilized, and other information that produced an output can help observers assess its integrity.