
Collaborative data modeling

Allowing users to build a shared, structured understanding of the world

February 20th, 2025

Ciaran Morinan, CTO, HASH

Introduction

At its core, HASH is a platform that allows people to work together to generate, store and use data about the world. This blog post describes the goals and design decisions underpinning this.

When considering data about the world, we want to know its:

  1. Meaning: we should know what the data is about – it should come with semantic information. If I see an object that has two properties, price and quantity, it should declare “I am an Order”. What’s more, the price should not just be a number, but declare “I am USD”.
  2. Structure: the shape and type of data should match our expectations. If I encounter a price property, I want to rely on the fact that it is not a negative number, and it is not text. If I am looking at an Order, I expect it to be linked to a Customer.

People don’t always agree about how to represent the real world as data, and can have different requirements, expectations and tolerances for the same concept. Perhaps you allow a price to be negative. Different people should be able to define the same concept in different ways.

But where there is agreement, we should share it. This means making every component of the system addressable and reusable wherever we can, so that we can make use of the overlap between our understandings of the world.

We also want to make the system intelligible and usable by non-technical domain experts. This means providing UI and UX affordances, some of which are touched on below.

Finally, our view of both how to model the world and the facts about it can change over time. We need to be able to capture and inspect the history of each, including any points in time where our views diverged.

The post covers these two key areas in turn:

  1. Types: how to describe the world.
  2. Entities: the facts about it.

Types

The world as perceived by humans consists of discrete things which have properties of their own, and also relations to other things. The Hat has a color of black, and is Located In the Bedroom (it should be in the Hallway, but my dog disagrees).

Our “type system” is made up of the following components (a sketch of how they fit together follows the list):

  1. Entity types: types of things, such as Order, Product, Dog, Room. Each entity type consists of references to property types, which describe attributes about the thing itself, and link types, which describe its possible relations to other things – both the link itself and its permitted destinations, e.g. an Order may require a Placed By link which must have a Customer as its destination.
  2. Property types: provide a semantic description of an attribute, and specify its permitted data types. For example, a Price property which has a permitted data type of USD (or “USD or EUR”, or simply Currency to allow any ‘child’ of Currency – more on this later). Property types can also be property objects, which are composed of nested property types.
  3. Data types: describe a piece of data, including its meaningful name, its primitive type (e.g. ‘number’) and any further constraints that it must comply with. For example, Negative Integer is a number which is a multiple of 1 and is less than 0. Data types may also come with labels to make them more human-readable when the value is displayed, for example a label of “$” for USD. They may also be something other than single values – for example, the RGB data type is a tuple (fixed-size list) of three numbers, with labels “r”, “g”, and “b”.
  4. Link types: describe a relationship between two things. Links may also have their own attributes, which provide more information about the relationship. In fact, under the hood link types are simply entity types with a special identifier, and have all the same features.
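
To make these relationships concrete, here is a minimal sketch in TypeScript of how the four components might reference one another. The field names and structure are illustrative only – they are not HASH’s actual wire format.

```typescript
// Illustrative shapes only – not HASH's actual representation of types.
const usdDataType = {
  kind: "dataType",
  title: "USD",
  type: "number",
  label: "$", // used when displaying values, e.g. "$150"
};

const pricePropertyType = {
  kind: "propertyType",
  title: "Price",
  expectedDataTypes: [usdDataType],
};

const placedByLinkType = {
  kind: "linkType", // under the hood, an entity type with a special identifier
  title: "Placed By",
};

const orderEntityType = {
  kind: "entityType",
  title: "Order",
  properties: [pricePropertyType],
  links: [{ linkType: placedByLinkType, destinations: ["Customer"] }],
};
```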

Composability

Each component of the type system has its own identity, and its own version history. An entity type ‘has properties’ in the sense that it has a collection of references to property types (at specific versions).

This allows for the reuse of each component in other components. If I see the same Price property on multiple different types of entity, I know that it both has the same semantic meaning, and has the same constraints on the data it will contain. If I encounter two different properties, but they both have a data type of URL, I know what I can do with those values and what they mean.

Multiparty type building

In HASH, types belong to namespaces, which are associated with either an individual user or an organization. We call these namespaces ‘webs’. This allows different users and organizations to have different definitions of Order, Price, or anything else they wish, while still building them from components defined in other namespaces. Your Order may use my Price property – and at the version you choose, in case I change it in a way which no longer suits your purposes (for example, by placing additional constraints on its value – changing the semantic meaning of a type is strongly discouraged).

Our multi-tenant agent-based simulation platform allowed simulations to be built in this composable way, with users able to import other users’ simulation behaviors and datasets in order to build their own, as well as to fork simulations and create merge requests back into the original.

Universally addressable

We also want the system to be as open as possible – not locked in a proprietary silo. Types are identified by URL, and are retrievable from outside of HASH. Once created, a given version of a type is immutable (it will never change). This means that self-hosted instances of HASH, or indeed any other application wishing to make use of the type system, can define types which are composed of types hosted elsewhere, or simply directly use an entire collection of types from elsewhere.
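
As a sketch of what universal addressability implies for consuming code – the URL and response shape below are hypothetical, not the actual HASH API contract:

```typescript
// A minimal sketch: fetch a property type by its (hypothetical) URL.
const priceTypeUrl = "https://example.com/@acme/types/property-type/price/v/2";

async function fetchPropertyType(url: string): Promise<{ title: string }> {
  // a given version is immutable, so the response can safely be cached forever
  const response = await fetch(url, { headers: { Accept: "application/json" } });
  return response.json();
}

fetchPropertyType(priceTypeUrl).then((schema) => console.log(schema.title)); // e.g. "Price"
```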

Our ultimate goal through type sharing and composability is to make as much data as possible reliably usable by different people, and to encourage standardization of data models where possible. Differences of opinion on some aspects of how the world looks needn’t mean completely siloed views into it.

Inheritance

Modeling data often involves defining types which are more specific versions of other types – which may be thought of as ‘subtypes’ or ‘child types’.

For example, an Animal type may have a Number Of Legs property. We then create a Dog which inherits from Animal, and adds a Breed property. Our Dog inherits both semantic information (it is an Animal) and structure (it has a Number Of Legs property).

HASH allows for both entity types and data types to inherit from multiple other entity types or data types. We may later introduce this for property types too.

We also provide the concept of abstract data types which are not meant to have values created for them directly, but instead provide semantic and value constraining information for their children. The abstract Length data type inherits from Number, which means that its children must also have numbers for values – but it does not make sense to create a value which is a Length of 4, as it has no real-world meaning without a unit. In fact, we have two layers of abstract types between Number and usable length types (Imperial Length and Metric Length are children of Length, and only their children, e.g. Meter, can be assigned to values).
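
As a rough sketch of the hierarchy just described – the `abstract` flag and field names are illustrative, not HASH’s actual representation:

```typescript
// Only non-abstract leaves such as Meter or Inch can be assigned to values.
type DataTypeSketch = {
  title: string;
  abstract?: boolean;      // abstract types provide semantics and constraints only
  inheritsFrom?: string[]; // titles of parent data types
};

const lengthHierarchy: DataTypeSketch[] = [
  { title: "Number" },
  { title: "Length", abstract: true, inheritsFrom: ["Number"] },
  { title: "Metric Length", abstract: true, inheritsFrom: ["Length"] },
  { title: "Imperial Length", abstract: true, inheritsFrom: ["Length"] },
  { title: "Meter", inheritsFrom: ["Metric Length"] },
  { title: "Inch", inheritsFrom: ["Imperial Length"] },
];
```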

Extending across webs

Any web (user or organization) may create a type which inherits from another web’s type. This allows users to create types which are specialized versions of someone else’s types, allowing their data to take advantage of tools and visualizations made for those types – with some exceptions, covered under ‘guarantees’ immediately below – while still tweaking how they represent their data. We also plan to allow users to raise change requests against other users’ types, where they feel their modifications would be generally useful.

This can create a complex graph of types, with entity types made up of property types, links to other entity types, and inheriting from entity types. We handle this in the UI by:

  • Showing all the information needed to understand a single entity type on its page (e.g. showing the expected data types for each property inline, rather than having to walk through references)
  • Indicating which properties and links on an entity type are inherited from which parent
  • Providing a separate graph visualization showing the connections between types

Guarantees

Our approach to inheritance is one of refinement – subtypes are more specialized versions of parents. For entity types, subtypes may add properties which do not exist on the parent. They may also introduce stricter or narrower conditions than those that already exist.

For example, a Dog may make the Number Of Legs property required, where it was not on Animal. A URL data type may inherit from Text but introduce constraints on what text is acceptable (in this case, a specific pattern which represents a valid URL).

Subtypes must not remove or loosen the expectations of the parent, for example make a required property no longer required, or set a maxLength on a textual data type that is higher than the parent.
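
A minimal sketch of the refinement rule, using illustrative constraint fields (this is not how HASH stores constraints):

```typescript
type TextConstraints = { maxLength?: number; pattern?: string };

// The hypothetical parent: Text allows any string up to 10,000 characters.
const text: TextConstraints = { maxLength: 10_000 };

// URL narrows Text: it adds a pattern and lowers (never raises) maxLength.
const url: TextConstraints = {
  maxLength: 2_048,
  pattern: "^https?://\\S+$",
};

// The maxLength part of the rule: a child's limit must not exceed its parent's.
const isNarrower =
  text.maxLength === undefined ||
  (url.maxLength !== undefined && url.maxLength <= text.maxLength);

console.log(isNarrower); // true – URL is a valid refinement of Text here
```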

In other words, subtypes are guaranteed to be structurally compatible – they should contain valid values and relations for all the properties and links which the parent expects. They are not necessarily behaviorally compatible at the point an entity of a given type is used, meaning passing a subtype to some behavior which expects its parent may lead to Liskov substitution principle-like violations.

We do not know how users will use their data – HASH is an extensible, all-purpose data platform – and so we could only enforce behavioral compatibility by disallowing users from narrowing constraints when defining subtypes.

We believe refinement is how people most naturally think of subtypes – Dog is a more specific version of Animal, and any value on Dog for a property which appears on Animal will be valid for Animal – but the lack of guarantees over behavioral compatibility does need keeping in mind when writing code which might be given a more specific version of the thing it was written for. For example, if a URL value is passed to an editing component designed for its parent type, Text, the user may enter text that doesn't conform to the URL pattern – causing the update to fail.

Conversion and crosswalking

When encountering data which is assigned one type, it is useful to know where it can be understood as or converted to another type. If I describe my data using a completely different set of types to you, there are likely to be places where we are referring to the same underlying thing, or values which can be converted for usage. I may even store comparable values as different types within my own data, and larger organizations are very likely to have a range of different representations of the same entity or value across different functions.

Data types

When saving entity data to HASH, for each property on the entity (e.g. width) we save both the absolute value (e.g. 150) and the data type (e.g. Millimeter). Where the property expects a data type which has subtypes (such as Length), users are prompted at the point of input to select the desired type – though we will introduce convenience features such as being able to select preferences for the default unit of length, currency and so on.

If I have a number of entities which have a width recorded in Millimeter, and others as Meter, Inch, and so on, I should be able to operate on the Width property as distance in the world. That means being able to say ‘sort by Length’ and receive a list ordered by actual distance, not the number in the database – or query for ‘give me all entities with a Width of less than 100 Miles’, regardless of what unit the width is stored as.

To achieve this efficiently, we choose specific data types within a logical group as the ‘canonical’ type for that group. For example, Meter is the canonical data type for every data type that inherits from Length (chosen because it is the base SI unit for length). Each child of Length defines a conversion to and from Meter, and when we save any such value to the database, we save the canonical value alongside it. Any sorting or filtering operation can then use that canonical or normalized value, which already exists in the database and therefore comes with no performance penalty.
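
A minimal sketch of this canonicalization, assuming hypothetical conversion functions per data type:

```typescript
// Meter is the canonical data type for everything inheriting from Length.
const toCanonicalMeters: Record<string, (value: number) => number> = {
  Meter: (v) => v,
  Millimeter: (v) => v / 1_000,
  Inch: (v) => v * 0.0254,
};

// When saving a Width of 150 Millimeter, store the canonical value alongside it…
const storedWidth = {
  value: 150,
  dataType: "Millimeter",
  canonicalValue: toCanonicalMeters["Millimeter"](150), // 0.15 (in Meter)
};

// …so sorting and filtering (e.g. "Width of less than 100 Miles") can operate on
// canonicalValue directly, with no per-row conversion at query time.
```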

We can provide users with all of the possible conversion targets for any given data type, i.e. we can list all data types a given data type may be converted to, both directly and transitively (what are the conversion targets of its conversion targets, and so on). This enables UI features such as ‘Display as…’ options in table column headers.

Dynamic conversions

Conversions among the Length group are simple, fixed mathematical expressions. Next we will introduce calculations involving dynamic values – Currency is a good example. We plan to allow conversion expressions to reference entity properties, such as USD -> EUR.rate, and have that rate property value be updated at a specific interval. This introduces an additional challenge in enabling performant querying, because our normalized value changes frequently.

The right answer depends on usage patterns, requirements, and data volume: if it is important that the values are as fresh as possible, the database must perform the conversions on the fly, making the query slower. If query performance is paramount, queries always use the latest normalized value – which may be out of date, depending on how long it takes to update the calculated value – and the user is informed of this fact. We may also allow users to query for values converted using a historical rate or a hypothetical rate, although these features introduce additional complexity.

Crosswalking

As well as converting one value to another, we can also think of mapping meaning and mapping structure.

Mapping meaning is the act of declaring that some type is semantically ‘the same as’ another. For example, that my Given Name property is semantically the same as your First Name. This is sometimes known as ‘crosswalking’ between schemas, and we talk about it more in the HASH docs.

If properties are not just semantically the same but also expect identical values, then we can map between the entire structure. For example, my Person entity type has a Given Name and Email Address property, and yours has a First Name and Email Address. If I map my Person to yours, then it could be used in any code which expects your Person (on the assumption that some utility function is transforming it into the required shape). As an alternative to mapping one type to another, we could also each declare that our Person implements some set of properties (an interface), and declare any mappings required between that interface and our properties, where they differ. This is also relevant to the discussion of structural compatibility in the context of versioning below.
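
As a rough sketch of such a structural mapping, with hypothetical property names:

```typescript
type MyPerson = { givenName: string; emailAddress: string };
type YourPerson = { firstName: string; emailAddress: string };

// Declaring "Given Name is semantically the same as First Name" lets a utility
// function reshape my Person wherever yours is expected.
const toYourPerson = (person: MyPerson): YourPerson => ({
  firstName: person.givenName,
  emailAddress: person.emailAddress,
});
```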

Versioning and validation

Validation

Many uses of data will have expectations about what that data looks like – including actions, analysis, monitoring tools, and so on. The goal of the type system is to both describe the data and provide strong guarantees about its structure and contents.

Types themselves are validated, both for their basic structure and to ensure that they do not have parent types which are incompatible with one another.

When entities are created, they are assigned one or more entity types. These types in combination will imply a schema the entity must conform to. This schema must be satisfiable, i.e. you cannot assign two entity types with incompatible requirements to an entity.

Data types may also inherit from other data types. Child data types may narrow the constraints of their parents, and in determining an entity’s schema, all of these references must be resolved to generate the full set of requirements and narrowest constraints implied by all of the types involved. Because the result of this is immutable for any given set of entity type versions, we can cache it, and don’t have to walk the type graph each time we want to deliver the schema for a given entity.

The entity’s schema covers both its own properties, and any relationships with other entities it may or must have.

Validation in detail

The key aspects validated are that:

  • entities may have only the properties listed in their schema, and must have any that are specified as required
  • each property contains either a single instance or a list of instances of values complying with that property’s schema, as specified by the entity type (e.g. an entity type may require a list of Email properties), and, if a list, that any minimum or maximum length is respected
  • each instance of a property value complies with one of the data types that the property schema specifies, which may specify
    • if a number: the minimum, maximum, whether those are exclusive, and a multiple
    • if a string: a minimum length, maximum length, and specific patterns (e.g. for URLs, timestamps, email addresses, UUIDs, etc)
    • lists or tuples (fixed-size lists) of values, each of which may have their own constraints
    • enums (a predetermined set of acceptable values)
    • constants (a single acceptable value, the same as a single-value enum)
  • an entity has only those links to other entities which are specified in its schema, and that
    • the destination entity for each link is one of those allowed by the schema (‘anything’ is also an option)
    • any minimum or maximum number of each link is respected

In HASH, these constraints are enforced both at the UI level – users are provided with inputs and options that do not allow them to enter invalid values – and at the API level.
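
To give a feel for the checks listed above, here is an illustrative sketch of the constraints an entity’s schema might carry – the field names are ours, not the schema format the Graph API actually resolves:

```typescript
const personSchema = {
  properties: {
    // between one and five email values, each matching a string pattern
    email: {
      list: { minItems: 1, maxItems: 5 },
      dataTypes: [{ type: "string", pattern: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$" }],
    },
    // a single non-negative number
    age: {
      dataTypes: [{ type: "number", minimum: 0, exclusiveMinimum: false }],
    },
  },
  required: ["email"],
  links: {
    // at most one Employed By link, whose destination must be a Company
    "Employed By": { destinations: ["Company"], maxItems: 1 },
  },
};
```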

The constraints for a given type may change or be corrected over time, which is why all types may have multiple versions.

Versioning

Each specific version of an entity type refers to specific versions of link and property types, and property types refer to specific versions of data types as their permitted values.

This is necessary to provide consistent guarantees for the structure of entities and the graph they are a part of. It is especially important in the context of a multi-party system, where users may refer to each other’s types, and cannot have their expectations changing unexpectedly or implicitly when other types are updated.

How exactly to handle versioning is an ongoing challenge. The current approach is a simple incrementing integer, as outlined in the original RFC. There are downsides to this:

  • the type version tells you nothing about the scale of the change, or what it might break if you upgrade to it – a new version might simply change the type’s description, or it might remove all properties from the previous version.
  • in a highly connected type graph, any change to a type requires a cascade of updates to other types if we want them to use its latest version – e.g. if I update the URL data type, then update each property type that refers to it, then each entity type that refers to those, and then each entity type that inherits from or links to those entity types, and so on.

It is not just types referring to other types that are affected by version changes – any code written to expect entities of specific types must also evolve in reaction to type changes, and any existing entities of those types migrated to the new schemas. Depending on how feasible this is to do quickly, and how many different users are relying on it, code may need to handle entities of a type across multiple versions.

Other approaches to versioning

An alternative to simple incrementing integers is to use semantic versioning, which does encode information about whether the change is ‘breaking’ or not. Types could then refer to version ranges of other types, and would not require updating to accommodate non-breaking changes. But breaking for whom, for what purpose? The RFC discusses cases where a change may be breaking for a consumer but not a producer of data, and vice versa.

We are still exploring this: one potential solution is to have a set of rules for what constitutes a breaking change, and to accept that some ‘non-breaking’ changes will need to be anticipated in any behavior written for types at an earlier version. For example, deciding that adding an optional field is not breaking – as it typically would not be in an API method – means that code must be written to tolerate unexpected fields (as it must in any case to account for the use of subtypes in place of their parents, as discussed above). Ideally these rules would be checkable, so that we are not reliant on users choosing the correct semantic version increment and can instead suggest it automatically.
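
A sketch of what that tolerance looks like in consuming code, assuming a hypothetical Person shape:

```typescript
// Keep only the fields this code was written for; ignore properties added by
// newer versions or by subtypes used in place of the expected type.
type KnownPerson = { name: string; email: string };

function readPerson(entity: Record<string, unknown>): KnownPerson | null {
  const { name, email } = entity;
  if (typeof name !== "string" || typeof email !== "string") {
    return null; // required fields missing or of an unexpected shape
  }
  return { name, email }; // unknown extra properties are deliberately dropped
}
```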

Structural compatibility

We originally devised the type system to support the Block Protocol – an open standard that specifies an interface between frontend components (“blocks”) and applications, such that any block conforming with the protocol can be used to read and write data in any supporting application, without either having any knowledge of the other. The type system is the part of the specification that governs how to describe data – there were also a series of ‘modules’ which specified operations that could be used to retrieve and edit data, call named third-party provider APIs, and so on.

Some use cases, such as many frontend visualization blocks, may be better suited to expectations which are based on structural typing rather than nominal typing, i.e. specifying the desired structure of data rather than caring about what it is called. Former HASH employee and Block Protocol contributor Maggie Appleton wrote about the Block Protocol and structured data in this blog post.

UI components and other code using data could care only that they receive things which have a name and an email, regardless of what they are called. The HASH API supports querying entities based on structure rather than assigned type in order to locate these. If we also have structural mapping in place, then the data doesn’t need to have the properties, but only express how it can be converted to have them.

A structural compatibility approach could also be extended to the expected link destinations of entity types, e.g. to not say that an Order must have a Placed By link to a Customer, but instead that it must have a Placed By link to anything that has a Name and an Email. Similarly, as an alternative to saying that a type inherits from a specific type, we could declare that it implements some set of features.

Defining data requirements as structures would be an additional option rather than a replacement for nominal typing. In many cases even frontend blocks care about the identity of the data they are querying for and working with, e.g. if the user is relying on them to only show them things of a certain type, not just of a certain shape.

We therefore might also offer users the choice of different approaches, depending on their needs:

  • where the identity of an entity is important, use nominal typing (User), and within this
    • refer to exact versions for the strongest guarantee over structure (User v1.5.0)
    • use version ranges to accommodate non-breaking changes (User ^v1)
    • use wildcards where only identity is important, and any structure is fine (User *)
  • where structure but not identity is important, declare the expected properties and links.

Updating a cross-web type graph

Whatever versioning approach is chosen, the fact that types may refer to types owned by other users or organizations complicates things further. To simplify the process of updating type graphs, when a user updates any type, HASH provides them with the option of updating all dependent types recursively, so that everything refers to the latest version of everything else. But if there are dependent types which the user does not have permissions over, they cannot update them. If there are cycles in the type graph which cross permission boundaries, it is essentially impossible for all the types to be made to refer to the current version of each other without the system seeking agreement from multiple users to do so in a single transaction.

In our multi-tenant simulation platform, we allowed users to create requests suggesting changes to other users’ simulation or agent behavior (“merge” or “pull” requests). One potential solution to the ‘updating a multi-tenant type graph’ problem is to do something similar in HASH, and allow users to create change requests which span multiple namespaces, and which must then be approved by people with the appropriate permissions over each.

Representation

How exactly are types expressed and stored? Given the extensive validation requirements, and the goal for types to be universally addressable and usable outside of HASH, JSON Schema was the natural starting point, as an established platform-agnostic standard for data validation with a healthy ecosystem.

JSON Schema does not fully meet our needs:

  • it is designed to describe constraints on discrete pieces of data, not relationships between them. We add a specially-cased links object to entity type schemas which describes the relationships an entity may have, but which is not used when validating the entity object itself.
  • it is not designed for modeling inheritance, and we have to make some adjustments to allow for inherited properties without allowing arbitrary additional properties.
  • we require some metadata for types, which meant adding non-standard keywords, such as label for data types.

Despite these wrinkles, it was far easier to start with JSON Schema’s validation features and adjust it to support our needs than to start with something designed for linked data (such as RDF) and add validation.

The basic JSON representation of types is described in detail in the initial RFC. The current meta-schema for an entity type is available here.
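
As an illustration only – simplified, with hypothetical URLs, and not the meta-schema itself – an entity type’s JSON Schema representation might look roughly like this:

```typescript
const orderEntityTypeSchema = {
  $id: "https://example.com/@acme/types/entity-type/order/v/3",
  kind: "entityType",
  type: "object",
  title: "Order",
  // inheritance: Order refines a hypothetical parent type
  allOf: [{ $ref: "https://example.com/@acme/types/entity-type/commercial-event/v/1" }],
  properties: {
    "https://example.com/@acme/types/property-type/price/": {
      $ref: "https://example.com/@acme/types/property-type/price/v/2",
    },
  },
  required: ["https://example.com/@acme/types/property-type/price/"],
  // the specially-cased links object described above; it is not used when
  // validating the entity object's own properties
  links: {
    "https://example.com/@acme/types/entity-type/placed-by/v/1": {
      destinations: [{ $ref: "https://example.com/@acme/types/entity-type/customer/v/4" }],
    },
  },
};
```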

Users are not exposed to, and do not need to understand, the underlying representation of types.

Entities

We have cantered through the highlights of the type system, which describes the structure of things. Now some points of interest about the things themselves: entities.

Multi-temporal versioning

Entities change over time, and we want to record and inspect what changed, and when.

‘The change happened at this time’ can mean different things: is it when the actual thing in the world changed, or is it when the values in the database changed? These are two different timelines – or temporal axes – along which we can think about changes.

In fact, there are three temporal axes used in temporal databases in various combinations:

  • valid time: the time at which something was true in the world
  • transaction time: the time at which something was recorded in the database
  • decision time: the time at which a user decided to make the change

Decision time allows for a lag between when someone made a decision to change something, and the change being recorded in the database, for example:

  • so that users can make changes offline and have a better record of their work over time (rather than 4 hours of work being recorded as happening in a single instant once they come back online)
  • where changes are managed by some intermediary process, such as a collaborative editing server resolving conflicts from multiple users, which occasionally sends updates to the database in batches

As well as these ‘when did the user actually click something’-style notions of decision time, it can also be used to refer to when some real-world decision to change something about the world happened, or when some real-world decision to change our view of the facts of the world occurred (e.g. when a consensus was reached that something is true that we previously thought false).

Temporality in HASH

HASH is currently a bi-temporal system. Each version of an entity has a transaction time and a decision time interval. We have both of these to support proper auditing of database changes (transaction time), while allowing for features such as offline and collaborative editing where there may be a delay between the decision to change and the recording of it (decision time). We go into more detail on how we implemented bi-temporal versioning in Postgres in this blog post by HASH Rust developer Tim Diekmann.

Any timestamp which is reported by a client, such as the time of changes during offline editing, cannot be trusted, as the server has no way of verifying it – though it does disallow decision times in the future, and could place some limit on how far in the past a decision may be claimed to have been made, to prevent abuses such as users pretending that they decided to change something three months ago.

Users are encouraged to represent valid time via properties on entities or links as required. For example, an Employed By link might have Start Date and End Date properties representing the period over which someone was actually employed. If different spans of valid time are required for a specific property of an entity, users can either (1) have this property be represented by a list of objects, each of which has the value itself alongside Applies From and Applies Until properties, or (2) split this property out into its own entity, and record the valid time as properties on the link to it.
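
A sketch of both options, with hypothetical property names:

```typescript
// Option 1: a property whose value is a list of objects, each carrying its own
// validity window alongside the value itself.
const salaryHistory = [
  { value: 50_000, appliesFrom: "2021-01-01", appliesUntil: "2023-06-30" },
  { value: 60_000, appliesFrom: "2023-07-01", appliesUntil: null }, // still current
];

// Option 2: split the value out into a linked entity, with the valid period
// recorded as properties on the link.
const employedByLink = {
  source: "Person:alice",
  destination: "Company:acme",
  properties: { startDate: "2021-01-01", endDate: null }, // null = ongoing
};
```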

Valid time challenges

Our current approach places the burden on the user to design their data models to account for values where real-world valid time is important and does not correspond satisfactorily with when values are changed in the system. This includes:

  • use cases that require precision over when real-world changes happen, and changes cannot be reflected in the system fast enough
  • the ability to set time periods which started a significant time in the past (e.g. 1850)
  • the ability to correct historical inaccuracies in time spans (although this could be accommodated by allowing arbitrary decision times to be submitted)
  • the ability to set a future time at which a value will apply in anticipation of a change

We may yet decide that native support for valid time is required, rather than leaving it to users to deal with it in their data model. This brings with it significant implementation and user experience complexity:

  • Each property will have an implied time period over which it was valid in the real world. When editing a property, users need to be able to specify whether they are setting a value that is valid from now, setting a value that was or will be valid for some other time period, or changing the valid time period for an existing value.

    One potential mitigation for this is to treat valid time as equivalent to decision time unless the user specifically decides otherwise (though this means that if they try and query for an entity’s properties as they were in the real world in e.g. 2005, there would be no value as the record had not yet been created).

  • When viewing the history of an entity, users must decide which temporal axis they are exploring. To obtain a single-dimensional list of entity versions, all except one temporal axis must be fixed to a point in time, which allows for the history along the remaining axis to be explored.

    For example, fixing all except the transaction time to ‘now’ would show the history of changes in the database to an entity’s attributes which have a valid and decision time of now, i.e. the database’s changing record of what the entity’s properties are in the real world now as decided now (or at least not superseded), but not, for example, any name that a Person had in the past in the real world. Fixing all but the valid time to ‘now’ would show the latest database record and latest decision of the entity’s real world properties over time (including in the future), but not the history of any updates to the database or of decisions to that understanding.

The blog post describing the current implementation of bi-temporal versioning gives an indication of the technical complexities that would come from introducing a third axis.

Provenance

As well as when an entity changed, we also want to know who made the change, and ideally what information informed their decision to do so. Collectively, we call this the ‘provenance’ of entity versions, and of the individual properties within them.

Some of this information is determined by the system, and is therefore trustable. Some of it relies on user report, and therefore can be misrepresented.

The provenance fields we currently support are:

  • createdById: the actor that created the version, which may correspond to a user account or a bot account. This is determined by the system, and is trustable.
  • actorType: user, machine, or ai. Specifies whether the creator was a user, or – if a bot – whether it was a dumb machine process, or used AI inference. Currently system-determined, as AI inference only runs on trusted servers.
  • origin: where the change was made from, for example from a migration script, API, web app, browser plugin – and which server, user agent, etc. Partially trustable: users can misrepresent which external client they are using.
  • sources: allow specifying the sources used in generating an update (both at the record level, and for each individual property), including e.g. the source’s type, title, URL, authors, and date accessed. We record these when using AI to infer entity properties from web or document sources.

A confidence score can also be assigned as metadata to any property, representing the confidence that its value is accurate.
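
Putting these together, provenance and confidence metadata might look roughly like the following sketch (the exact shapes are illustrative):

```typescript
const provenance = {
  createdById: "bot~web-crawler-01",                    // system-determined, trustable
  actorType: "ai",                                      // "user" | "machine" | "ai"
  origin: { type: "api", userAgent: "hash-agent/1.0" }, // partially trustable
  sources: [
    {
      type: "webpage",
      title: "Acme Corp – About us",
      url: "https://example.com/about",
      dateAccessed: "2025-02-01",
    },
  ],
};

// Confidence can be attached to any individual property value.
const propertyMetadata = { confidence: 0.85, provenance };
```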

To make this information easy to inspect, in HASH each entity has a history tab which shows:

  • when the entity was created, archived, or its types changed
  • which property values were added or changed, and the sources used to determine the value.

These provenance fields also apply to updates to types, as well as to entities.

Authorization

In a system with multiple users, where it’s possible to link to and use data anywhere in the system, we need to control who can do what. Even within a team where users completely trust each other, it can be useful and safer to limit who can do what, defining areas of ownership and avoiding accidental changes to resources.

HASH has an authorization system which allows for users to set a variety of permissions over their webs, types and entities. These can be assigned to various principal actors: to an entire web, to specific roles or teams within a web, or to specific users.

The HASH application itself relies on types and entities (e.g. User, Organization, Member Of) which are defined in the same way as user-created types and entities, and the custom permission system is therefore built with both the application and its users’ needs in mind.

Permissions in detail

Available permissions include:

  • Who can create, view, edit, and archive entities and types, e.g.
    • only allow web admins to create types in a web
    • only allow the Finance team in a web to view and edit Invoice entities
  • Who can view and edit specific properties on entities, e.g.
    • allow a user to see but not edit their own username
    • allow users to see each other’s name, but not their email
  • Who can manage roles, members, and teams within an organization
  • Who can grant permissions over various resources (meta-permissions)

Permissions can have more complex conditions attached to them based on predicates beyond ‘what type of resource is it, and what is its identity’. Consider:

  • an organization wishing to share a Project with a team in another organization. Simply sharing the Project entity is not good enough, because that will just be a collection of metadata properties. We also need to be able to express that certain other types of entities linked to the Project in certain ways are also shared (e.g. Document, Timeline). The policy must have conditions which describe not just an entity but its relation to other entities.
  • a member of an HR team within an organization, who should be able to view Interview Scores, unless the linked Candidate is themselves.
  • a user who joins a Channel, where the goal is that they should be able to see Messages in that channel, but not those created before they joined. This brings in timestamps on top of graph relations.
  • users who are permitted to cast a single vote in a Poll. This involves permissions based on checking the count of entities (Votes linked to the Poll) which the user has created.

Accounting for access

Differential user access to entities, including their very existence, the value for some of their properties, and their relations to other entities, requires that the system:

  1. functions correctly despite data being invisible or non-editable to users
  2. does not reveal information about the existence of data to users who are not authorized to know that it exists.

The types provide guarantees about the data in the database, but these may then be violated by what is available to a given user.

This therefore requires consideration of the interplay between the permission model, data model and consuming code for a specific set of types. For example:

  • code written for specific types should handle the absence of properties and linked entities which the data model describes but the user may not see. This is possible if the same actor controls all of the data model, applicable permissions and code, and may not be if someone is trying to reuse code written with the data model but not some stricter set of permissions in mind.
  • entity validation should either obfuscate or avoid situations which reveal secret information. For example, suppose there is an ‘at most one’ constraint on a secret Fraud Investigation -> Concerns -> Person relationship, and a user who cannot see the existing Fraud Investigation tries to create another one linked to the same Person: rejecting their attempt on the basis that one already exists would reveal that it does.

Implementation

At the time of writing, we are migrating from the Google Zanzibar-inspired SpiceDB as a store and checker of authorization policies to a largely custom implementation which makes use of policies written in AWS’s Cedar. This is driven by the need for performant graph queries in a system where a user may be able to see many entities, but also be unable to see many others. SpiceDB currently supports only two approaches to producing a list of just those entities a user can see in response to a query, both of which have problems:

  1. Pre-filtering, where a list of all resources a user can view is requested from SpiceDB and then used as part of a database query. This is unsuitable as the list may be extremely big.
  2. Post-filtering, where the results of a database query are sent to SpiceDB to check which a user can see. This is unsuitable as a request for the first e.g. 10 entities that a user can see may involve 1,000 cycles through ‘query and check’ until we find 10 that the user can access.

We instead need to select the relevant authorization policies based on the user’s query (e.g. only those which apply to the user and any web and team roles they have), and then generate SQL conditions from those policies which encode the attributes or relationships an entity must have to be visible. This is what we are implementing: Cedar provides the policy syntax and returns only the necessary policies based on the policy set and variables (e.g. actor, resource type) we provide it with, and we convert the returned policies into query conditions. Cedar runs inside our Graph API through its native Rust bindings, which means we also remove the latency of querying an external authorization service.
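
As a greatly simplified sketch of the idea – not Cedar’s actual policy format nor our real implementation – selecting an actor’s permitting policies and turning them into SQL conditions might look like this:

```typescript
// An illustrative policy shape; real policies are written in Cedar.
type SimplePolicy = {
  effect: "permit" | "forbid";
  principal: { type: "user" | "teamRole"; id: string };
  // attribute conditions an entity must satisfy for this policy to apply
  resourceConditions: { column: string; operator: "=" | "<" | ">"; value: string }[];
};

// Turn an actor's permitting policies into a WHERE-clause fragment: an entity
// is visible if any permitting policy matches it.
function policiesToSqlCondition(policies: SimplePolicy[]): string {
  const clauses = policies
    .filter((policy) => policy.effect === "permit")
    .map((policy) =>
      policy.resourceConditions
        .map((c) => `${c.column} ${c.operator} '${c.value}'`)
        .join(" AND "),
    )
    .map((clause) => `(${clause})`);
  return clauses.length > 0 ? clauses.join(" OR ") : "FALSE";
}

const where = policiesToSqlCondition([
  {
    effect: "permit",
    principal: { type: "user", id: "alice" },
    resourceConditions: [{ column: "web_id", operator: "=", value: "acme" }],
  },
  {
    effect: "permit",
    principal: { type: "user", id: "alice" },
    resourceConditions: [{ column: "created_by", operator: "=", value: "alice" }],
  },
]);
console.log(where); // (web_id = 'acme') OR (created_by = 'alice')
```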

That’s not all, folks...

There are other things we use in order to build a trusted, auditable data platform designed to reflect a complex, uncertain world. For example, records of ‘Claims’ (and competing counterclaims) about an entity, which form the basis for judgements about its attributes. We will cover some of these in a future post.
