Allowing users to build a shared, structured understanding of the world
February 20th, 2025
At its core, HASH is a platform that allows people to work together to generate, store and use data about the world. This blog post describes the goals and design decisions underpinning this.
When considering data about the world, we want to know what it is and what we can rely on about it:
- If data represents an Order, with a price and quantity, it should declare "I am an Order". What's more, the price should not just be a number, but declare "I am USD".
- If I look at a price property, I want to rely on the fact that it is not a negative number, and it is not text. If I am looking at an Order, I expect it to be linked to a Customer.
- People don't always agree about how to represent the real world as data, and can have different requirements, expectations and tolerances for the same concept. Perhaps you allow a price to be negative. Different people should be able to define the same concept in different ways.
But where there is agreement, we should share it. This means making every component of the system addressable and reusable wherever we can, so that we can make use of the overlap between our understandings of the world.
We also want to make the system intelligible and usable by non-technical domain experts. This means providing UI and UX affordances, some of which are touched on below.
Finally, our view of both how to model the world and the facts about it can change over time. We need to be able to capture and inspect the history of each, including any points in time where our views diverged.
The post covers these two key areas in turn: first the type system, which describes the structure of things, and then entities – the data itself.
The world as perceived by humans consists of discrete things which have properties of their own, and also relations to other things. The Hat has a color of black, and is Located In the Bedroom (it should be in the Hallway, but my dog disagrees).
Our “type system” is made up of the following components:
- Entity types, e.g. Order, Product, Dog, Room. Each entity type consists of references to property types, which describe attributes of the thing itself, and link types, which describe its possible relations to other things – both the link itself and its permitted destinations. For example, an Order may require a Placed By link which must have a Customer as its destination.
- Property types, e.g. a Price property which has a permitted data type of USD (or "USD or EUR", or simply Currency to allow any 'child' of Currency – more on this later). Property types can also be property objects which are composed of nested property types.
- Data types, e.g. Negative Integer, which is a number that is a multiple of 1 and is less than 0. Data types may also come with labels to make them more human-readable when the value is displayed, for example a label of "$" for USD. They may also be something other than single values – for example, the RGB data type is a tuple (fixed-size list) of three numbers, with labels "r", "g", and "b".

Each component of the type system has its own identity, and its own version history. An entity type 'has properties' in the sense that it has a collection of references to property types (at specific versions).
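To make these components and their relationships concrete, here is a minimal TypeScript sketch of the shapes involved. The field names are illustrative only – HASH's actual types are JSON Schema-based documents, described later in the post.

```typescript
// Illustrative shapes only; real types are JSON Schema documents identified by URL.

// A data type constrains primitive values (e.g. Negative Integer, USD).
interface DataType {
  id: string;                     // versioned URL identifying this data type
  title: string;                  // e.g. "USD"
  label?: string;                 // display label, e.g. "$"
  constraints: {
    type: "number" | "string" | "boolean";
    multipleOf?: number;
    exclusiveMaximum?: number;    // e.g. < 0 for Negative Integer
    pattern?: string;             // e.g. a URL pattern for a Text subtype
  };
}

// A property type names an attribute and the data types its values may take.
interface PropertyType {
  id: string;                     // versioned URL
  title: string;                  // e.g. "Price"
  expectedValues: string[];       // references to data type versions, e.g. USD, EUR
}

// An entity type composes property types and link types.
interface EntityType {
  id: string;                     // versioned URL
  title: string;                  // e.g. "Order"
  properties: string[];           // references to property type versions
  links: {
    linkType: string;             // e.g. "Placed By"
    destinations: string[];       // permitted destination entity types, e.g. "Customer"
  }[];
}
```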
This allows for the reuse of each component in other components. If I see the same Price property on multiple different types of entity, I know that it both has the same semantic meaning, and has the same constraints on the data it will contain. If I encounter two different properties, but they both have a data type of URL, I know what I can do with those values and what they mean.
In HASH, types belong to namespaces, which are associated with either an individual user or an organization. We call these namespaces 'webs'. This allows different users and organizations to have different definitions of Order, Price, or anything else they wish, while still building them from components defined in other namespaces. Your Order may use my Price property – and at the version you choose, in case I change it in a way which no longer suits your purposes (for example, by placing additional constraints on its value – changing the semantic meaning of a type is strongly discouraged).
Our multi-tenant agent-based simulation platform allowed simulations to be built in this composable way, with users able to import other users’ simulation behaviors and datasets in order to build their own, as well as to fork simulations and create merge requests back into the original.
We also want the system to be as open as possible – not locked in a proprietary silo. Types are identified by URL, and are retrievable from outside of HASH. Once created, a given version of a type is immutable (it will never change). This means that self-hosted instances of HASH, or indeed any other application wishing to make use of the type system, can define types which are composed of types hosted elsewhere, or simply directly use an entire collection of types from elsewhere.
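As a rough illustration of that openness – using an invented URL, since real types live in a user's or organization's web – any HTTP client can retrieve a specific, immutable version of a type:

```typescript
// Hypothetical type URL: a specific, immutable version of an "Order" entity type.
const orderTypeUrl = "https://example.com/@alice/types/entity-type/order/v/3";

async function fetchEntityType(url: string): Promise<unknown> {
  const response = await fetch(url, { headers: { Accept: "application/json" } });
  if (!response.ok) throw new Error(`Failed to fetch type: ${response.status}`);
  return response.json(); // the type's JSON Schema-based definition
}

// Because this version of the type can never change, the result is safe to cache indefinitely.
fetchEntityType(orderTypeUrl).then((orderType) => console.log(orderType));
```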
Our ultimate goal through type sharing and composability is to make as much data as possible reliably usable by different people, and to encourage standardization of data models where possible. Differences of opinion on some aspects of how the world looks needn’t mean completely siloed views into it.
Modeling data often involves defining types which are more specific versions of other types — which may be thought of as ‘subtypes’ or ‘child types’.
For example, an Animal type may have a Number Of Legs property. We then create a Dog which inherits from Animal, and adds a Breed property. Our Dog inherits both semantic information (it is an Animal) and structure (it has a Number Of Legs property).
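Continuing the illustrative TypeScript shapes from earlier (the field names, including inheritsFrom, are invented), the Dog / Animal relationship might be written as:

```typescript
// Illustrative only: inheritance expressed as references to parent entity types.
const animal = {
  id: "https://example.com/@alice/types/entity-type/animal/v/1",
  title: "Animal",
  properties: ["https://example.com/@alice/types/property-type/number-of-legs/v/1"],
};

const dog = {
  id: "https://example.com/@alice/types/entity-type/dog/v/1",
  title: "Dog",
  inheritsFrom: [animal.id], // semantic: a Dog is an Animal
  properties: ["https://example.com/@alice/types/property-type/breed/v/1"], // structural: adds Breed
  // Number Of Legs does not need re-declaring: it is inherited from Animal.
};
```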
HASH allows for both entity types and data types to inherit from multiple other entity types or data types. We may later introduce this for property types too.
We also provide the concept of abstract data types which are not meant to have values created for them directly, but instead provide semantic and value-constraining information for their children. The abstract Length data type inherits from Number, which means that its children must also have numbers for values – but it does not make sense to create a value which is a Length of 4, as it has no real-world meaning without a unit. In fact, we have two layers of abstract types between Number and usable length types (Imperial Length and Metric Length are children of Length, and only their children, e.g. Meter, can be assigned to values).
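Sketching the Length hierarchy in the same illustrative style – the abstract flag and other field names are invented for the example:

```typescript
const number = {
  id: "https://example.com/@alice/types/data-type/number/v/1",
  title: "Number",
  abstract: false,
};

const length = {
  id: "https://example.com/@alice/types/data-type/length/v/1",
  title: "Length",
  abstract: true,            // a bare "Length of 4" has no real-world meaning
  inheritsFrom: [number.id], // children must still hold numeric values
};

const metricLength = {
  id: "https://example.com/@alice/types/data-type/metric-length/v/1",
  title: "Metric Length",
  abstract: true,
  inheritsFrom: [length.id],
};

const meter = {
  id: "https://example.com/@alice/types/data-type/meter/v/1",
  title: "Meter",
  abstract: false,           // concrete: values can be assigned this type
  inheritsFrom: [metricLength.id],
  label: "m",
};
```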
Any web (user or organization) may create a type which inherits from another web’s type. This allows users to create types which are specialized versions of someone else’s types, allowing their data to take advantage of tools and visualizations made for those types – with some exceptions, covered under ‘guarantees’ immediately below – while still tweaking how they represent their data. We also plan to allow users to raise change requests against other users’ types, where they feel their modifications would be generally useful.
This can create a complex graph of types, with entity types made up of property types, links to other entity types, and inheriting from entity types. We handle this complexity in the UI with dedicated affordances for navigating and editing the type graph.
Our approach to inheritance is one of refinement – subtypes are more specialized versions of parents. For entity types, subtypes may add properties which do not exist on the parent. They may also introduce stricter or narrower conditions than those that already exist.
For example, a Dog may make the Number Of Legs property required, where it was not on Animal. A URL data type may inherit from Text but introduce constraints on what text is acceptable (in this case, a specific pattern which represents a valid URL).
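As an illustration of this narrowing – again with invented field names – the URL data type only tightens what its parent allows:

```typescript
// A subtype may only narrow its parent's constraints, never loosen them.
const text = {
  id: "https://example.com/@alice/types/data-type/text/v/1",
  constraints: { type: "string" },
};

const url = {
  id: "https://example.com/@alice/types/data-type/url/v/1",
  inheritsFrom: [text.id],
  // Still a string, but restricted to a URL-like pattern.
  constraints: { type: "string", pattern: "^https?://\\S+$" },
};

// Every valid URL value is therefore also a valid Text value. The reverse does not hold,
// which is where the behavioral-compatibility caveat discussed below comes in.
const isValidUrl = (value: string) => new RegExp(url.constraints.pattern).test(value);
```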
Subtypes must not remove or loosen the expectations of the parent – for example, make a required property no longer required, or set a maxLength on a textual data type that is higher than the parent's.
In other words, subtypes are guaranteed to be structurally compatible – they should contain valid values and relations for all the properties and links which the parent expects. They are not necessarily behaviorally compatible at the point an entity of a given type is used, meaning passing a subtype to some behavior which expects its parent may lead to Liskov substitution principle-like violations.
We do not know how users will use their data – HASH is an extensible, all-purpose data platform – and so we could only enforce behavioral compatibility by disallowing users from narrowing constraints when defining subtypes.
We believe refinement is how people most naturally think of subtypes – Dog is a more specific version of Animal, and any value on Dog for a property which appears on Animal will be valid for Animal – but the lack of guarantees over behavioral compatibility does need keeping in mind when writing code which might be given a more specific version of the thing it was written for. For example, if a URL value is passed to an editing component designed for its parent type, Text, the user may enter text that doesn't conform to the URL pattern – causing the update to fail.
When encountering data which is assigned one type, it is useful to know whether it can be understood as, or converted to, another type. If I describe my data using a completely different set of types to you, there are likely to be places where we are referring to the same underlying thing, or values which can be converted for use. I may even store comparable values as different types within my own data, and larger organizations are very likely to have a range of different representations of the same entity or value across different functions.
When saving entity data to HASH, for each property on the entity (e.g. width) we save both the absolute value (e.g. 150) and the data type (e.g. Millimeter). Where the property expects a data type which has subtypes (such as Length), users are prompted at the point of input to select the desired type – though we will introduce convenience features such as being able to select preferences for the default unit of length, currency, and so on.
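A minimal sketch of what might be stored for a single property value (the keys are invented for illustration):

```typescript
// The value and the data type it was entered as are stored together.
const widthProperty = {
  value: 150,
  dataTypeId: "https://example.com/@alice/types/data-type/millimeter/v/1",
};
```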
If I have a number of entities which have a width recorded in Millimeter, and others as Meter, Inch, and so on, I should be able to operate on the Width property as distance in the world. That means being able to say 'sort by Length' and receive a list ordered by actual distance, not the number in the database – or query for 'give me all entities with a Width of less than 100 Miles', regardless of what unit the width is stored as.
To achieve this efficiently, we choose specific data types within a logical group as the 'canonical' type for that group. For example, Meter is the canonical data type for every data type that inherits from Length (chosen because it is the base SI unit for length). Each child of Length defines a conversion to and from Meter, and when we save any such value to the database, we save the canonical value alongside it. Any sorting or filtering operation can then use that canonical or normalized value, which already exists in the database and therefore comes with no performance penalty.
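A sketch of how such conversions and save-time normalization might look – the function names and shapes are invented for the example:

```typescript
// Each child of Length defines a conversion to and from the canonical type (Meter).
const conversions = {
  millimeter: { toCanonical: (v: number) => v / 1000,   fromCanonical: (v: number) => v * 1000 },
  inch:       { toCanonical: (v: number) => v * 0.0254, fromCanonical: (v: number) => v / 0.0254 },
  meter:      { toCanonical: (v: number) => v,          fromCanonical: (v: number) => v },
};

// At save time, the canonical (meter) value is stored alongside the original,
// so sorting and filtering never need to convert on the fly.
function normalizeLength(value: number, unit: keyof typeof conversions) {
  return {
    value,
    dataType: unit,
    canonicalValue: conversions[unit].toCanonical(value),
  };
}

// 'Width of less than 100 miles' then becomes a simple comparison on canonicalValue.
console.log(normalizeLength(150, "millimeter")); // { value: 150, dataType: "millimeter", canonicalValue: 0.15 }
```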
We can provide users with all of the possible conversion targets for any given data type, i.e. we can list all data types a given data type may be converted to, both directly and transitively (what are the conversion targets of its conversion targets, and so on). This enables UI features such as ‘Display as…’ options in table column headers.
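One way to compute those targets is a simple breadth-first walk of the conversion graph – a sketch, with an invented edge list:

```typescript
// List every data type a given type can be converted to, directly or transitively.
function conversionTargets(start: string, edges: Record<string, string[]>): string[] {
  const seen = new Set<string>([start]);
  const queue = [start];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const next of edges[current] ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  seen.delete(start);
  return [...seen]; // candidates for a "Display as…" menu
}

conversionTargets("millimeter", {
  millimeter: ["meter"],
  meter: ["millimeter", "inch", "mile"],
}); // => ["meter", "inch", "mile"]
```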
Conversions among the Length group are simple, fixed mathematical expressions. Next we will introduce calculations involving dynamic values – Currency is a good example. We plan to allow conversion expressions to reference entity properties, such as USD -> EUR.rate, and have that rate property value be updated at a specific interval. This introduces an additional challenge in enabling performant querying, because our normalized value changes frequently.
The right answer depends on usage patterns, requirements, and data volume: if it is important that values are as fresh as possible, the database must perform the conversions on the fly, making the query slower. If query performance is paramount, queries can always use the latest stored normalized value – which may be out of date, depending on how long it takes to update the calculated value – and the user should be informed of this fact. We may also allow users to query for values converted using a historical rate or a hypothetical rate, although these features introduce additional complexity.
As well as converting one value to another, we can also think of mapping meaning and mapping structure.
Mapping meaning is the act of declaring that some type is semantically 'the same as' another – for example, that my Given Name property is semantically the same as your First Name. This is sometimes known as 'crosswalking' between schemas, and we talk about it more in the HASH docs.
If properties are not just semantically the same but also expect identical values, then we can map between the entire structure. For example, my Person entity type has a Given Name and Email Address property, and yours has a First Name and Email Address. If I map my Person to yours, then it could be used in any code which expects your Person (on the assumption that some utility function is transforming it into the required shape). As an alternative to mapping one type to another, we could also each declare that our Person implements some set of properties (an interface), and declare any mappings required between that interface and our properties, where they differ. This is also relevant to the discussion of structural compatibility in the context of versioning below.
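A hedged sketch of what such a mapping could look like in practice – the property and type names follow the example above, and the mapping format is invented:

```typescript
// Two structurally compatible Person types, owned by different webs.
type MyPerson = { givenName: string; emailAddress: string };
type YourPerson = { firstName: string; emailAddress: string };

// The mapping itself could be stored as data alongside the types...
const personMapping = { givenName: "firstName", emailAddress: "emailAddress" } as const;

// ...and applied by a transform wherever code expects the other shape.
function toYourPerson(person: MyPerson): YourPerson {
  return { firstName: person.givenName, emailAddress: person.emailAddress };
}
```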
Many uses of data will have expectations about what that data looks like – including actions, analysis, monitoring tools, and so on. The goal of the type system is to both describe the data and provide strong guarantees about its structure and contents.
Types themselves are validated, both for their basic structure and to ensure that they do not have parent types which are incompatible with one another.
When entities are created, they are assigned one or more entity types. These types in combination will imply a schema the entity must conform to. This schema must be satisfiable, i.e. you cannot assign two entity types with incompatible requirements to an entity.
Data types may also inherit from other data types. Child data types may narrow the constraints of their parents, and in determining an entity’s schema, all of these references must be resolved to generate the full set of requirements and narrowest constraints implied by all of the types involved. Because the result of this is immutable for any given set of entity type versions, we can cache it, and don’t have to walk the type graph each time we want to deliver the schema for a given entity.
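As a rough sketch of that resolution step – using numeric bounds only, with invented function names:

```typescript
// Resolve the narrowest numeric constraints implied by a combination of types.
// The real resolution walks the full type graph and covers many more constraint kinds.
interface NumericConstraints { minimum?: number; maximum?: number }

function mergeConstraints(a: NumericConstraints, b: NumericConstraints): NumericConstraints {
  const minimums = [a.minimum, b.minimum].filter((v): v is number => v !== undefined);
  const maximums = [a.maximum, b.maximum].filter((v): v is number => v !== undefined);
  return {
    ...(minimums.length ? { minimum: Math.max(...minimums) } : {}),
    ...(maximums.length ? { maximum: Math.min(...maximums) } : {}),
  };
}

// A combination whose merged constraints cannot be satisfied is rejected outright.
function isSatisfiable({ minimum, maximum }: NumericConstraints): boolean {
  return minimum === undefined || maximum === undefined || minimum <= maximum;
}

// Because the merged result is immutable for a given set of type versions,
// it can be cached and keyed by those versions rather than recomputed per entity.
isSatisfiable(mergeConstraints({ minimum: 0 }, { maximum: 10 })); // true
isSatisfiable(mergeConstraints({ minimum: 5 }, { maximum: 2 }));  // false
```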
The entity’s schema covers both its own properties, and any relationships with other entities it may or must have.
The key aspects validated are that each property value conforms to the constraints of its expected data types, that required properties and links are present, and that where a property holds a list of values (e.g. multiple Email properties), any minimum or maximum length is respected.

In HASH, these constraints are enforced both at the UI level – users are provided with inputs and options that do not allow them to enter invalid values – and at the API level.
The constraints for a given type may change or be corrected over time, which is why all types may have multiple versions.
Each specific version of an entity type refers to specific versions of link and property types, and property types refer to specific versions of data types as their permitted values.
This is necessary to provide consistent guarantees for the structure of entities and the graph they are a part of. It is especially important in the context of a multi-party system, where users may refer to each other’s types, and cannot have their expectations changing unexpectedly or implicitly when other types are updated.
How exactly to handle versioning is an ongoing challenge. The current approach is a simple incrementing integer, as outlined in the original RFC. There are downsides to this: because every reference is pinned to a specific version, updating a widely-used type triggers a cascade of updates – I might update the URL data type, then update each property type that refers to it, then each entity type that refers to those, and then each entity type that inherits from or links to those entity types, and so on.

It is not just types referring to other types that are affected by version changes – any code written to expect entities of specific types must also evolve in reaction to type changes, and any existing entities of those types migrated to the new schemas. Depending on how feasible this is to do quickly, and how many different users are relying on it, code may need to handle entities of a type across multiple versions.
An alternative to simple incrementing integers is to use semantic versioning, which does encode information about whether the change is ‘breaking’ or not. Types could then refer to version ranges of other types, and would not require updating to accommodate non-breaking changes. But breaking for whom, for what purpose? The RFC discusses cases where a change may be breaking for a consumer but not a producer of data, and vice versa.
We are still exploring this: one potential solution is to have a set of rules for what constitutes a breaking change, and to accept that some 'non-breaking' changes will need to be anticipated in any behavior written for types at an earlier version. For example, deciding that adding an optional field is not breaking – as it typically would not be in an API method – means that code must be written to tolerate unexpected fields (as it must in any case to account for the use of subtypes in place of their parents, as discussed above). Ideally these rules would be checkable, so that we were not reliant on users choosing the correct semantic version increment and could instead suggest it automatically.
We originally devised the type system to support the Block Protocol – an open standard that specifies an interface between frontend components (“blocks”) and applications, such that any block conforming with the protocol can be used to read and write data in any supporting application, without either having any knowledge of the other. The type system is the part of the specification that governs how to describe data – there were also a series of ‘modules’ which specified operations that could be used to retrieve and edit data, call named third-party provider APIs, and so on.
Some use cases, such as many frontend visualization blocks, may be better suited to expectations which are based on structural typing rather than nominal typing, i.e. specifying the desired structure of data rather than caring about what it is called. Former HASH employee and Block Protocol contributor Maggie Appleton wrote about the Block Protocol and structured data in this blog post.
UI components and other code using data might care only that they receive things which have a name and an email, regardless of what they are. The HASH API supports querying entities based on structure rather than assigned type in order to locate these. If we also have structural mapping in place, then the data doesn't even need to have the properties itself – only to express how it can be converted to have them.
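For illustration, a structural expectation and the corresponding query might look something like this – the query shape is invented, not the actual HASH API:

```typescript
// A structural query: "anything with a name and an email", rather than a query by named type.
const structuralQuery = {
  allOf: [
    { property: "name", valueType: "text" },
    { property: "email", valueType: "text" },
  ],
};

// A component written against this shape doesn't care whether the entity
// is a Person, Customer, or anything else.
type HasNameAndEmail = { name: string; email: string };

function renderContactCard(entity: HasNameAndEmail): string {
  return `${entity.name} <${entity.email}>`;
}
```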
A structural compatibility approach could also be extended to the expected link destinations of entity types – e.g. to say not that an Order must have a Placed By link to a Customer, but instead that it must have a Placed By link to anything that has a Name and an Email. Similarly, as an alternative to saying that a type inherits from a specific type, we could declare that it implements some set of features.
Defining data requirements as structures would be an additional option rather than a replacement for nominal typing. In many cases even frontend blocks care about the identity of the data they are querying for and working with, e.g. if the user is relying on them to only show them things of a certain type, not just of a certain shape.
We therefore might also offer users the choice of different approaches, depending on their needs:
- requiring data of a specific named type (e.g. User), and within this either:
  - a specific version (e.g. User v1.5.0)
  - a compatible version range (e.g. User ^v1)
  - any version (e.g. User *)
- requiring only that data has a given structure, whatever type it is assigned
)Whatever versioning approach is chosen, the fact that types may refer to types owned by other users or organizations complicates things further. To simplify the process of updating type graphs, when a user updates any type HASH provides them the option of updating all dependent types recursively so that everything refers to the latest version of everything else. But if there are dependent types which the user does not have permissions over, they cannot update them. If there are cycles in the type graph which cross permission boundaries, it is essentially impossible for all the types to be made to refer to the current version of each other without the system seeking agreement from multiple users to do so in a single transaction.
In our multi-tenant simulation platform, we allowed users to create requests suggesting changes to other users' simulations or agent behaviors ("merge" or "pull" requests). One potential solution to the 'updating a multi-tenant type graph' problem is to do something similar in HASH, and allow users to create change requests which span multiple namespaces, and which must then be approved by people with the appropriate permissions over each.
How exactly are types expressed and stored? Given the extensive validation requirements, and the goal for types to be universally addressable and usable outside of HASH, JSON Schema was the natural starting point, as an established platform-agnostic standard for data validation with a healthy ecosystem.
JSON Schema does not fully meet our needs, and we have extended it in places – for example:
- we add a links object to entity type schemas, which describes the relationships an entity may have but is not used in validating the entity object itself
- we add a label for data types

Despite these wrinkles, it was far easier to start with JSON Schema's validation features and adjust it to support our needs than to start with something designed for linked data (such as RDF) and add validation.
The basic JSON representation of types is described in detail in the initial RFC. The current meta-schema for an entity type is available here.
Users do not need to understand and are not exposed to the underlying representation of types.
We have cantered through the highlights of the type system, which describes the structure of things. Now some points of interest about the things themselves: entities.
Entities change over time, and we want to record and inspect what changed, and when.
‘The change happened at this time’ can mean different things: is it when the actual thing in the world changed, or is it when the values in the database changed? These are two different timelines – or temporal axes – along which we can think about changes.
In fact, there are three temporal axes used in temporal databases in various combinations:
- transaction time: when the values were changed in the database
- decision time: when the decision to change the values was made
- valid time: when the values were true in the real world
Decision time allows for a lag between when someone made a decision to change something, and the change being recorded in the database – for example, edits made while offline, or collaborative edits which are only synced to the server some time after they were made.
As well as these ‘when did the user actually click something’-style notions of decision time, it can also be used to refer to when some real-world decision to change something about the world happened, or when some real-world decision to change our view of the facts of the world occurred (e.g. when a consensus was reached that something is true that we previously thought false).
HASH is currently a bi-temporal system. Each version of an entity has a transaction time and a decision time interval. We have both of these to support proper auditing of database changes (transaction time), while allowing for features such as offline and collaborative editing where there may be a delay between the decision to change and the recording of it (decision time). We go into more detail on how we implemented bi-temporal versioning in Postgres in this blog post by HASH Rust developer Tim Diekmann.
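A sketch of what a bi-temporal entity version might carry – the field names are invented for the example:

```typescript
// ISO timestamps; null end = unbounded ("still current").
interface TemporalInterval { start: string; end: string | null }

interface EntityVersion {
  entityId: string;
  properties: Record<string, unknown>;
  transactionTime: TemporalInterval; // when this row became / stopped being current in the database
  decisionTime: TemporalInterval;    // when the change was decided (e.g. while editing offline)
}

// An edit decided offline at 09:00 but only synced and recorded at 11:30:
const version: EntityVersion = {
  entityId: "order-123",
  properties: { price: 25 },
  decisionTime: { start: "2025-02-20T09:00:00Z", end: null },
  transactionTime: { start: "2025-02-20T11:30:00Z", end: null },
};
```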
Any timestamp which is reported by a client, such as the time of changes during offline editing, cannot be trusted, as the server has no way of verifying it – though it does disallow decision times in the future, and could place some limit on how far in the past a decision is allowed to be presented as, to prevent abuses such as users pretending that they decided to change something three months ago.
Users are encouraged to represent valid time via properties on entities or links as required. For example, an Employed By link might have Start Date and End Date properties representing the period over which someone was actually employed. If different spans of valid time are required for a specific property of an entity, users can either (1) have this property be represented by a list of objects, each of which has the value itself alongside Applies From and Applies Until properties, or (2) split this property out into its own entity, and record the valid time as properties on the link to it.
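A sketch of option (1), with invented property names mirroring the Applies From / Applies Until example:

```typescript
// Valid time modelled in the data itself: each value carries the period it was true in the real world.
const person = {
  entityTypeId: "https://example.com/@alice/types/entity-type/person/v/1",
  properties: {
    jobTitle: [
      { value: "Analyst", appliesFrom: "2019-03-01", appliesUntil: "2022-06-30" },
      { value: "Manager", appliesFrom: "2022-07-01", appliesUntil: null }, // still valid
    ],
  },
};
```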
Our current approach places the burden on the user to design their data models to account for values where real-world valid time is important and does not correspond satisfactorily with when values are changed in the system.
We may yet decide that native support for valid time is required, rather than leaving it to users to deal with it in their data model. This brings with it significant implementation and user experience complexity:
Each property will have an implied time period over which it was valid in the real world. When editing a property, users need to be able to specify whether they are setting a value that is valid from now, setting a value that was or will be valid for some other time period, or changing the valid time period for an existing value.
One potential mitigation for this is to treat valid time as equivalent to decision time unless the user specifically decides otherwise (though this means that if they try and query for an entity’s properties as they were in the real world in e.g. 2005, there would be no value as the record had not yet been created).
When viewing the history of an entity, users must decide which temporal axis they are exploring. To obtain a single-dimensional list of entity versions, all except one temporal axis must be fixed to a point in time, which allows for the history along the remaining axis to be explored.
For example, fixing all except the transaction time to 'now' would show the history of changes in the database to an entity's attributes which have a valid and decision time of now – i.e. the database's changing record of what the entity's properties are in the real world now, as decided now (or at least not superseded) – but not, for example, any name that a Person had in the past in the real world. Fixing all but the valid time to 'now' would show the latest database record and latest decision of the entity's real-world properties over time (including in the future), but not the history of any updates to the database or of decisions to that understanding.
The blog post describing the current implementation of bi-temporal versioning gives an indication of the technical complexities that would come from introducing a third axis.
As well as when an entity changed, we also want to know who made the change, and ideally what information informed their decision to do so. Collectively, we call this the 'provenance' of entity versions, and of the individual properties within them.
Some of this information is determined by the system, and is therefore trustable. Some of it relies on user report, and therefore can be misrepresented.
The provenance fields we currently support are:
- createdById: the actor that created the version, which may correspond to a user account or a bot account. This is determined by the system, and is trustable.
- actorType: user, machine, or ai. Specifies whether the creator was a user, or – if a bot – whether it was a dumb machine process, or used AI inference. Currently system-determined, as AI inference only runs on trusted servers.
- origin: where the change was made from, for example from a migration script, API, web app, or browser plugin – and which server, user agent, etc. Partially trustable: users can misrepresent which external client they are using.
- sources: the sources used in generating an update (both at the record level, and for each individual property), including e.g. the source's type, title, URL, authors, and date accessed. We record these when using AI to infer entity properties from web or document sources.

A confidence score can also be assigned as metadata to any property, representing the confidence that its value is accurate.
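Putting those fields together, provenance and confidence metadata for one entity version might look roughly like this – a simplified, invented shape:

```typescript
const entityVersionMetadata = {
  provenance: {
    createdById: "user-4821",           // system-determined, trustable
    actorType: "ai" as const,           // "user" | "machine" | "ai"
    origin: { type: "browser-plugin" }, // partially trustable
  },
  propertyMetadata: {
    foundedYear: {
      confidence: 0.85, // how sure we are that the value is accurate
      sources: [
        {
          type: "webpage",
          title: "Company history",
          url: "https://example.com/about",
          dateAccessed: "2025-02-18",
        },
      ],
    },
  },
};
```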
To make this information easy to inspect, each entity in HASH has a history tab which shows each version of the entity, when and by whom it was created, and the provenance and confidence metadata described above.
These provenance fields also apply to updates to types, as well as to entities.
In a system with multiple users, where it’s possible to link to and use data anywhere in the system, we need to control who can do what. Even within a team where users completely trust each other, it can be useful and safer to limit who can do what, defining areas of ownership and avoiding accidental changes to resources.
HASH has an authorization system which allows for users to set a variety of permissions over their webs, types and entities. These can be assigned to various principal actors: to an entire web, to specific roles or teams within a web, or to specific users.
The HASH application itself relies on types and entities (e.g. User, Organization, Member Of) which are defined in the same way as user-created types and entities, and the custom permission system is therefore built with both the application and its users' needs in mind.
Available permissions include, for example, allowing the Finance team in a web to view and edit Invoice entities.

Permissions can have more complex conditions attached to them based on predicates beyond 'what type of resource is it, and what is its identity'. Consider:
- Sharing a Project with a team in another organization. Simply sharing the Project entity is not good enough, because that will just be a collection of metadata properties. We also need to be able to express that certain other types of entities linked to the Project in certain ways are also shared (e.g. Document, Timeline). The policy must have conditions which describe not just an entity but its relation to other entities.
- An HR team within an organization, who should be able to view Interview Scores, unless the linked Candidate is themselves.
- Users joining a Channel, where the goal is that they should be able to see Messages in that channel, but not those created before they joined. This brings in timestamps on top of graph relations.
- A Poll in which each user may vote only once. This involves permissions based on checking the count of entities (Votes linked to the Poll) which the user has created.

Differential user access to entities – including their very existence, the value for some of their properties, and their relations to other entities – places a number of requirements on the system.
The types provide guarantees about the data in the database, but these may then be violated by what is available to a given user.
This therefore requires consideration of the interplay between the permission model, data model and consuming code for a specific set of types. For example:
Consider a Fraud Investigation -> Concerns -> Person relationship, and a user able to create another Fraud Investigation where they cannot see one already linked to the Person – the system may need to detect the existing link and reject their attempt to create it on the basis that one already exists.

At the time of writing, we are migrating from the Google Zanzibar-inspired SpiceDB as a store and checker of authorization policies, to a largely custom implementation which makes use of policies written in AWS's Cedar. This is driven by the need for performant graph queries in a system where a user may be able to see many entities, but also not see many other entities. SpiceDB is currently only suited to two approaches to ending up with a list of only those entities a user can see in response to a query, both of which have problems.
We instead need to select the relevant authorization policies based on the user’s query (e.g. only those which apply to the user and any web and team roles they have), and then generate SQL conditions from those policies which include the relevant attributes or relationships an entity must have to be visible. This is what we are implementing, whereby Cedar provides the policy syntax and returns only the necessary policies based on the policy set and variables (e.g. actor, resource type) we provide it with, and we convert the returned policies into query conditions. Cedar is run in our Graph API through their native Rust bindings, which means we also remove the latency in querying an external authorization service.
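A heavily simplified sketch of that final step – selecting the policies relevant to an actor and translating them into SQL conditions. The policy shape below is a stand-in for illustration, not actual Cedar syntax:

```typescript
// A stand-in for a parsed authorization policy: "members of a team may view
// entities of a given type within a given web".
interface Policy {
  effect: "permit" | "forbid";
  principalTeam: string;
  action: "view" | "edit";
  resourceType: string;
  resourceWeb: string;
}

// Only the policies relevant to this actor's teams and the queried type are considered.
function relevantPolicies(policies: Policy[], actorTeams: string[], resourceType: string): Policy[] {
  return policies.filter(
    (p) => actorTeams.includes(p.principalTeam) && p.resourceType === resourceType && p.effect === "permit",
  );
}

// Each relevant policy becomes a SQL condition, so filtering happens in the database
// rather than by checking entities one at a time after the fact.
// (A real implementation would use bound parameters, not string interpolation.)
function toSqlCondition(policies: Policy[]): string {
  if (policies.length === 0) return "FALSE"; // nothing visible
  return policies
    .map((p) => `(entity_type_id = '${p.resourceType}' AND web_id = '${p.resourceWeb}')`)
    .join(" OR ");
}

const policies: Policy[] = [
  { effect: "permit", principalTeam: "finance", action: "view", resourceType: "invoice", resourceWeb: "acme" },
];
toSqlCondition(relevantPolicies(policies, ["finance"], "invoice"));
// => "(entity_type_id = 'invoice' AND web_id = 'acme')"
```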
There are other things we use in order to build a trusted, auditable data platform designed to reflect a complex, uncertain world. For example, records of ‘Claims’ (and competing counterclaims) about an entity, which form the basis for judgements about its attributes. We will cover some of these in a future post.