
The long-term value of structured data

Structured data's evolving, persistent value in a world with large machine learning models

August 29th, 2023

Dei Vilkinsons, CEO & Founder, HASH

Relevancy in a new world

Advocates for a machine-readable semantic web have long argued in favour of augmenting the human-readable content on webpages with machine-parsable content, by adding structured data markup (e.g. Microformats/RDFa) or mapping (JSON-LD).

And yet we now live in a world in which 'foundation' models, such as large language (LLM) and multimodal (LMM) models, are able to thrive when faced with messy data, courtesy of their impressive ability to find patterns within unstructured soup.

With the original vision of the semantic web arguably now well on its way to being fulfilled - at least partially - by these new models, it's worth asking what the point of continuing to structure data actually is.

While structured data was always supposed to make the web "machine readable", there was always an incentives problem. Why give that away for free, if you only care about people reading your website? It was generally always more effort to add structured data markup than it was worth. Even when Google began favouring websites that provided structured data to the search engine (allowing them in turn to feature elements of pages as 'rich snippets' directly in search results pages) adoption remained limited and patchy, as did the kinds of entities that could be described using the shared structured data vocabulary hosted at schema.org.

We created the Block Protocol because we believed (and still believe) that block-based interfaces in content management systems can make capturing information in a cleanly-typed and structured fashion as easy as, or even easier than, entering it manually from scratch. For example, the Mapbox-enabled Address block, with its search and autofill capabilities, makes it quicker to enter a location, store it as structured data, and serve JSON-LD on a webpage than it would be to type out a full address by hand as unstructured text. In addition, you (optionally) get to insert a nice map! A win-win all round.
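To make the JSON-LD step concrete, here is a minimal Python sketch of turning an already-structured address into schema.org PostalAddress markup. The input field names (street, city, postal_code, country) and the function name are hypothetical, chosen for illustration; this is not the Address block's actual implementation or output.

```python
import json


def address_to_json_ld(address: dict) -> dict:
    """Map a structured address record (hypothetical field names) to
    schema.org PostalAddress JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "PostalAddress",
        "streetAddress": address["street"],
        "addressLocality": address["city"],
        "postalCode": address["postal_code"],
        "addressCountry": address["country"],
    }


# Made-up example record, as a block might store it after autofill.
example = {
    "street": "1 Example Street",
    "city": "London",
    "postal_code": "AB1 2CD",
    "country": "GB",
}

json_ld = address_to_json_ld(example)

# JSON-LD is typically embedded in a page inside a script element like this.
script_tag = '<script type="application/ld+json">' + json.dumps(json_ld) + "</script>"
```

Because the address was captured as structured data in the first place, producing the markup is a mechanical mapping rather than a parsing problem.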

Does that matter any more? For the semantic web: probably not. The semantic web's vision of an internet that is both human and machine-readable will happen now. Foundation models will structure the unstructured web for us... or at least the parts of it that we or they can access. This is what those large models excel at: ascertaining patterns and structure, without the need for markup or mapping. Whether more slowly over time, or in an accelerating fashion (the jury remains out!), these models will continue to improve, with their present tendency to hallucinate diminishing, and with improvements to their transparency through source citing, better context retrieval, provenance reflection, and multi-modal augmentation.

But the value of structured data isn't going away. That, after all, is what foundation models are doing. They're learning the structure of things, and then calculating answers to specific questions, in accordance with those understood structures. We humans rely on structure. Our thoughts, plans and actions are predicated upon it. So the real question is not "does structured data matter?", but "does having pre-structured data continue to yield benefits?"

Decreasing value

In the semantic web use-case, there will be increasingly little to be gained from publishing your own structured data, and a rapidly diminishing need to do so. While Google continues to favour those who publish structured markup in its search ranking algorithms, there will be some incentive to publish it, but publisher self-declaration will no longer be the main (or perhaps even best) way for search engines and other scrapers to classify and extract structured data from a webpage.

Foundation models are, and will increasingly be, used to classify scraped data according to existing schemas, and discover new ways to categorize data. This information will be used to produce the neatly structured datasets that power search engines' own "rich snippets", providing more information directly to users on search results pages, and eliminating the need for them to click through onto providers' websites.

This is true even in a world in which search engines don't use users' search queries as direct prompts for foundation models. Today, doing so costs more and takes longer than running a traditional search query. The average Google search result returns to the user in under half a second, while many responses to GPT-4 prompts can take upwards of 5 seconds to generate. The continued use of pre-structured data by search engines enables them to deliver results more efficiently.

But let's not be lazy. Let's imagine a future in which time-to-results and marginal costs do decrease dramatically. Compute and model runtime costs have come down, existing models have been optimized, and new models have been developed. What then? And what about other use cases?

Enduring value

Right now, one of the things structure affords is the ability to browse in an undirected, but logically coherent fashion. It aids in open-ended exploration of problem spaces and information, and helps us draw meaning from information, learning incrementally. While we may be able to explore a topic with a large language model as our aide, its answers might guide us down particular paths it (or its creators) are biased towards. This is a real risk with the kind of RLHF (reinforcement learning from human feedback) used to produce many foundation models today. Conversely, being able to see raw structured information for yourself and find your own patterns and meaning in it might take a lot longer than the 5 seconds that an OpenAI query might take to return, but the bias you bring to the information will be your own. Combining forms of analysis may prove commonplace.

While large models learn well from unstructured data, they still learn better from structured data, and in particular high-quality, curated sets of data. It's one of the reasons that high-quality corpuses (such as that of Stack Overflow) are weighted so highly in the training texts that large language models use. And as the internet becomes increasingly full of convincing-sounding machine-generated content that can be hard or impossible to distinguish from stuff written by actual human experts, it may be that training new foundation models will become more difficult in the future. Others have already likened the web's content today to "low-background steel" (steel produced prior to the first test detonations of nuclear weapons on earth), which has unique properties. And while machine-generated synthetic data has many benefits, its presence in too large a quantity, or its covert presence in datasets, may degrade the quality (precision) and diversity (recall) of generative models that are trained on it. HASH ensures that the provenance of information in a user or organization's web (their graph of data) is clear, and generated outputs are (internally at least) marked as such.

Business software, and software in general, both currently operate atop structured data. Your ERP, accounting software, CRM, CMS, knowledge-base and communications software all use it. The tools you use for unstructured note-taking rely on structured databases and schemas to store that information under the hood. Literally all consumer and business software is built atop information structures and structured data. That's just how software works. And absolutely all of these apps and services are today connected by a smörgåsbord of highly structured protocols that underpin operating systems as we know them, and the very internet itself.

Explicitly created and defined structures will continue to have relevance in a world of powerful foundation models which are capable of inferring structure on demand.

Pre-structured data will:

enable applications to interface with each other without requiring foundation models to perform a 'reconciliation' step, transforming one's outputs to another's inputs
provide safety and confidence through verification and validation: even where reconciliation may be conducted by foundation models, having example structured data will enable model-produced translations to be tested
provide a cost benefit: data does not need to be structured at the point of need, it already exists
provide a time benefit: data is instantly available, again not requiring computation on the fly
act as an always-available fallback in the event of foundation model unavailability or disconnection
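The verification point in the list above can be sketched in a few lines of Python. Here, candidate_translation stands in for a mapping proposed by a foundation model, and EXAMPLES is a small set of known-good input/output pairs; the record shapes (a CRM contact mapped to an invoicing customer) and all names are hypothetical.

```python
def candidate_translation(contact: dict) -> dict:
    """Stand-in for a model-proposed mapping between two applications'
    record formats (hypothetical shapes)."""
    return {
        "customer_name": f'{contact["first_name"]} {contact["last_name"]}',
        "email": contact["email"].lower(),
    }


# Pre-structured example data: (source record, expected target record).
EXAMPLES = [
    (
        {"first_name": "Ada", "last_name": "Lovelace", "email": "Ada@Example.com"},
        {"customer_name": "Ada Lovelace", "email": "ada@example.com"},
    ),
]


def failing_examples(translate, examples):
    """Run a translation over known pairs and collect every mismatch."""
    failures = []
    for source, expected in examples:
        actual = translate(source)
        if actual != expected:
            failures.append({"source": source, "expected": expected, "actual": actual})
    return failures


failures = failing_examples(candidate_translation, EXAMPLES)
```

An empty failures list does not prove the translation is correct in general, but a non-empty one is a cheap, model-independent signal that it is wrong.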

These features of pre-structured data, combined with foundation models, will unlock and mainstream new technologies and best practices.

Increasing value

With thanks to both foundation models and pre-structured data, we may finally see certain frontiers unlocked. One thing we're betting on at HASH is that simulation will finally come of age, as a widely-used tool for everyday decision-making.

HASH began life building agent-based modeling software. Back in 2019, we encountered two main problems in convincing folks to adopt predictive, "generative" AI and simulations in their businesses...

The shape of data

The first big blocker we encountered was the shape of folks' existing data. Of all the folks we engaged with, generally speaking none had data easily available in nice, neat agent-mappable form. The closest we encountered were enterprises who'd deployed software like Palantir's and developed "dynamic ontologies", but even these rarely represented their entire organization's set of information in a semantically meaningful way. In the last couple of years, the term "semantic layer" has become a vogue term in some corners of Silicon Valley, and is used to refer to the ideal of a business whose data is entirely neatly mapped to real-world "entities". This remains an ambition for many, as opposed to a realized state of affairs, and has taken a bit of a backburner to more hypey AI trends of late. Nevertheless, we now believe that foundation models will help organizations quickly 'type' their data and convert what they have to 'entities' suitable for use in many contexts, including simulation.

The cost of coordination

The second major blocker to mainstream development and adoption of simulation is one of coordination. Simulations have historically required handoff between:

data engineers and analysts who make the data available which is required to instantiate 'agents' in a simulation, and hydrate them with 'properties' that describe them accurately;
programmers who're able to encode subject matter expertise in code that provides 'behavioral logic' to the agents, and define the environment within which they reside;
domain experts, who understand how the target system functions, and can actually provide the subject matter expertise and model design required;
data scientists, who run the actual experiments required to derive real insight from probabilistic models; and
decision makers, who are understandably deeply skeptical of whiz-bang simulation models that promise to predict the future. It's "shit in, shit out" with these things, and with so many steps and people involved, by the time a model has been built and experiments run it's entirely fair (and a good idea) to have questions about model integrity.

The number of different stakeholders, and the amount of multi-party communication involved in each iteration loop required to make simulation models accurate and useful, often proved fatal to getting simulations off the ground. The average model has historically required many such iterations.

Now, with the picture around data structuring and availability much improved by foundation models (as outlined above), we believe such models' further potential lies in enabling domain experts to bypass programmers and iterate on simulation logic directly. Behaviors within simulations are generally relatively simple, standalone things. Natural language behavior development seems eminently achievable using current-generation models alone (e.g. GPT-4 or fine-tuned GPT-3.5).
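To illustrate how small and standalone a behavior can be, here is a minimal Python sketch of an agent with two behaviors applied once per time step. The agent shape, the behavior names (sell, restock) and the step function are all hypothetical; this is a generic agent-based modeling pattern, not the API of any particular simulation engine.

```python
def sell(agent: dict, demand: int = 3) -> dict:
    """Reduce stock by a fixed demand, never going below zero."""
    return {**agent, "stock": max(0, agent["stock"] - demand)}


def restock(agent: dict) -> dict:
    """Refill to capacity once stock drops below the reorder threshold."""
    if agent["stock"] < agent["reorder_threshold"]:
        return {**agent, "stock": agent["capacity"]}
    return agent


def step(agent: dict, behaviors) -> dict:
    """Apply each behavior in order to produce the agent's next state."""
    for behavior in behaviors:
        agent = behavior(agent)
    return agent


shop = {"stock": 10, "reorder_threshold": 5, "capacity": 10}
history = []
for _ in range(4):
    shop = step(shop, [sell, restock])
    history.append(shop["stock"])
# history is now [7, 10, 7, 10]
```

Each behavior is a short function of one agent's state, which is the kind of unit a domain expert could plausibly describe in a sentence of natural language.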

Structure and simulation

The role of structured data in all of this is quite simple.

Predictive simulations of real-world things are extremely easy to get wrong. Many interesting systems that users typically want to simulate end up being "complex systems" in some respect, demonstrating some degree of nonlinearity due to emergent or adaptive behaviors. This is both because (a) any system involving a human is inherently complex, and because (b) "closed systems" (the only ones we can fully model without doubt, where all parameters and environmental variables are known) quite simply do not exist in the real world. Simulations are simplifications of the real world: downsized, abstracted, or analogical.

Complex systems are typically extremely sensitive to their initial conditions. Ensuring an "as close to accurate" initial simulation starting point is therefore key. Pre-structured data, especially if tested and reviewed, or produced in accordance with a well-understood process or pipeline, can improve confidence in this. The more volatile a simulation target is, the greater the degree to which realtime or timely data availability matters in turn.
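Sensitivity to initial conditions can be seen in even a one-line system. The sketch below uses the logistic map, a standard textbook example of chaotic dynamics (chosen here for illustration; it is not a model from this post), and runs it from two starting points that differ by one part in a billion.

```python
def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> list:
    """Iterate the logistic map x -> r * x * (1 - x) from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


a = logistic_trajectory(0.2, 60)
b = logistic_trajectory(0.2 + 1e-9, 60)

# Absolute gap between the two runs at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]
```

The gap stays tiny for the first few steps and then grows until the two runs bear no resemblance to each other, which is why a small error in a simulation's starting state can matter so much.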

Structured data can be used for verification and validation, in addition to improving the accuracy of a simulation's initial state. Structured data can be used to backtest simulation experiments, as well as to test the accuracy and correctness of individual behaviors themselves. In a world where behaviors and whole simulation models may be generated by foundation models, having a concrete picture of reality to compare to provides humans with an improved sense of a model's likely correctness.
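A backtest in its simplest form compares what a simulation predicted against what the structured record says actually happened. The sketch below uses mean absolute error as the score; the metric choice and the numbers are made up for illustration.

```python
def mean_absolute_error(predicted, observed) -> float:
    """Average absolute difference between two equal-length series."""
    if len(predicted) != len(observed):
        raise ValueError("series must be the same length")
    if not observed:
        raise ValueError("series must not be empty")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical monthly figures: simulation output vs. recorded history.
simulated = [100, 110, 120]
recorded = [98, 113, 119]

error = mean_absolute_error(simulated, recorded)  # (2 + 3 + 1) / 3 = 2.0
```

The same comparison works for a whole model or for a single generated behavior, as long as there is recorded data to check it against.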

More for humans than machines

All this serves to highlight an interesting possibility. If to date structured data has existed for the benefit of machines, is it possible that in the future its primary use will in fact be for us humans, in verifying and validating the outputs of generative AI models: simulation, language, multimodal, or other?

If it is, user experiences for capturing structured data (and interpreting it in the most helpful new ways) will be critical, and at the moment they leave a lot to be desired.

We believe blocks of the sort found in many "all-in-one workspaces" are a sensible unit for capturing and working with structured data at a micro scale, but by themselves they're not enough. A complementary new, entity-centric type system is required. We're working on the Block Protocol, a fully open-source standard for these things, alongside our product work at HASH.

Structured data may be of a certain 'type' if it follows a set of rules established by that type. We think that a globally composable, multitenant type system (one in which any user can publish a type and any other user can consume it, with types being modularly referenceable and reusable within one another) requires a ton of things, which we've been building out. But at a high level, it contains three kinds of types:

Entity Types

Entities (semantic 'things' in a database) have entity types.

Entity types describe the properties and type(s) of any entity, including both special-cased link entities and ordinary, plain old standard entities.

A link entity corresponds to the relationship, or 'thing' that connects other entities. For example, two standard Person entities may be connected by a Married To link entity.

A standard entity is absolutely anything that isn't a link entity. You can use it to refer to physical things, concepts, ideas, events, or anything else you like.

If you're familiar with graph theory, you can think of standard entities as vertices, and link entities as edges.
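The vertex and edge analogy can be sketched with two small Python dataclasses. The field names below (entity_id, left_entity_id, right_entity_id) are assumptions made for illustration and should not be read as the Block Protocol's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A standard entity: a vertex in the graph."""
    entity_id: str
    entity_type: str
    properties: dict = field(default_factory=dict)


@dataclass
class LinkEntity(Entity):
    """A link entity: an edge that is itself an entity with a type."""
    left_entity_id: str = ""
    right_entity_id: str = ""


def neighbours(entity_id: str, links) -> set:
    """Return the ids of entities connected to entity_id by any link."""
    connected = set()
    for link in links:
        if link.left_entity_id == entity_id:
            connected.add(link.right_entity_id)
        elif link.right_entity_id == entity_id:
            connected.add(link.left_entity_id)
    return connected


alice = Entity(entity_id="person-1", entity_type="Person", properties={"name": "Alice"})
bob = Entity(entity_id="person-2", entity_type="Person", properties={"name": "Bob"})
marriage = LinkEntity(
    entity_id="link-1",
    entity_type="Married To",
    left_entity_id=alice.entity_id,
    right_entity_id=bob.entity_id,
)
```

Because the link is an entity in its own right, it can carry its own type and properties (a wedding date, say) rather than being a bare pointer.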

Property Types

Properties contain data associated with some particular aspect of an entity.

Property types describe the possible values that can be associated with a property, by way of reference to data types and other property types.

Data Types

Data types describe a space of possible valid values (e.g. string, number or boolean). They allow for validation logic to be shared amongst different kinds of property types which represent semantically different things.
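As a rough sketch of how property types and data types relate, the Python below has several semantically different property types share the same underlying data type checks. The registry layout and all names are hypothetical simplifications; real property types can also reference other property types, which this sketch omits.

```python
# Data types: each names a space of valid values.
DATA_TYPES = {
    "text": lambda value: isinstance(value, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda value: isinstance(value, (int, float))
    and not isinstance(value, bool),
    "boolean": lambda value: isinstance(value, bool),
}

# Property types: semantically distinct, but they may share a data type.
PROPERTY_TYPES = {
    "name": "text",
    "nickname": "text",
    "age": "number",
}


def validate_property(property_type: str, value) -> bool:
    """Check a value against the data type its property type refers to."""
    data_type = PROPERTY_TYPES[property_type]
    return DATA_TYPES[data_type](value)
```

Here "name" and "nickname" mean different things but reuse one validation rule, which is the sharing the paragraph above describes.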

At the moment, data types are fixed and defined in the specification, but we're working towards unlocking user-defined custom data types (e.g. numbers within a certain range, or strings matching a certain pattern) beyond those already found in the Block Protocol's graph module specification. Read the draft RFC to learn more.
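The two kinds of custom data type mentioned above (a number within a range, a string matching a pattern) can be pictured as constrained validators. The sketch below is one possible shape for the idea in plain Python, with hypothetical names; it does not reflect the draft RFC's actual design.

```python
import re


def number_in_range(minimum: float, maximum: float):
    """Build a validator for numbers within [minimum, maximum]."""
    def check(value) -> bool:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        return is_number and minimum <= value <= maximum
    return check


def string_matching(pattern: str):
    """Build a validator for strings that fully match a regular expression."""
    compiled = re.compile(pattern)

    def check(value) -> bool:
        return isinstance(value, str) and compiled.fullmatch(value) is not None
    return check


# Hypothetical custom data types built from the two constructors.
is_percentage = number_in_range(0, 100)
is_date_shaped = string_matching(r"\d{4}-\d{2}-\d{2}")
```

Note that is_date_shaped only checks the shape of the string, so "2023-99-99" would pass; that is the gap between basic plausibility and actual correctness.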

What else?

Entity types (including those that describe both standard entities and link entities) can inherit from one another, resulting in a composable type system. The HASH type editor, available as a standalone open-source library as well as embedded within HASH, makes it easy to create and edit these types, allowing them to be used internally or made publicly available for others to standardize on, extend themselves, and so on. Even where entity type definitions differ, common use of property types (which retain common semantic meaning across their use with different kinds of entities) and data types (which allow for the validation of inputs) aids interoperability and crosswalking.
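Inheritance between entity types can be sketched as a lookup that collects properties from a type and everything it extends. The registry format, the "extends" key and the example types are hypothetical, and the sketch assumes there are no inheritance cycles.

```python
ENTITY_TYPES = {
    "Person": {"extends": [], "properties": {"name"}},
    "Employee": {"extends": ["Person"], "properties": {"job_title"}},
    "Manager": {"extends": ["Employee"], "properties": {"team_size"}},
}


def resolved_properties(type_name: str, types: dict = ENTITY_TYPES) -> set:
    """Return the type's own property types plus all inherited ones."""
    definition = types[type_name]
    properties = set(definition["properties"])
    for parent in definition["extends"]:
        properties |= resolved_properties(parent, types)
    return properties


def shared_properties(first: str, second: str) -> set:
    """Property types two entity types have in common."""
    return resolved_properties(first) & resolved_properties(second)
```

The shared_properties helper hints at the crosswalking point: two differently defined entity types can still be aligned on the property types they have in common.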

Defining types in the first place, and the act of typing data, minimizes the potential for errors, improves the readability or intelligibility of data, and can enable the optimized processing of information. Information's entity or property type can be used to ascertain context and meaning, while data types are able to confirm its basic plausibility (if not actual correctness).

Building a simulation or other model from a set of strongly typed data equally improves reliability.

In all agent-to-agent communications (human-human, machine-machine, and human-machine), ontologies of types enable improved mutual understanding by formalizing and standardizing definitions of things and streamlining communication around them.

Asking for model outputs to be provided in accordance with sets of user-created types helps keep generative models focused, and improves the usability of their outputs. The fact that OpenAI raced to release function calling before making fine-tuning generally available with its GPT-3.5 and GPT-4 models was indicative of the demand for structured outputs from generative models, and of the difficulty current-generation foundation models have (absent fine-tuning) in reliably adhering to even explicit system prompts and user instructions.
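On the consuming side, typed outputs make it possible to reject malformed model responses before they reach other software. The sketch below checks a raw JSON string against an expected set of fields; the raw strings stand in for model output, no model API is called, and the type definition and field names are hypothetical.

```python
import json

# A hypothetical user-created type: field name -> required Python type.
COMPANY_TYPE = {"name": str, "founded_year": int}


def parse_typed_output(raw: str, expected: dict) -> dict:
    """Parse JSON text and raise ValueError unless it matches the expected fields."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, required_type in expected.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        # Exact type comparison, so that True is not accepted as an int.
        if type(data[key]) is not required_type:
            raise ValueError(f"wrong type for field: {key}")
    return data


# Stand-in for a well-formed model response.
good = parse_typed_output('{"name": "Example Ltd", "founded_year": 2019}', COMPANY_TYPE)
```

A response that fails the check can be retried or flagged for review, instead of silently flowing into downstream systems.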

In a world where it is easier than ever to create new blocks and simulations with natural language code generators, the potential for a proliferation of standalone software that is difficult to rely on or interconnect in predictable and verifiable ways becomes huge.

We believe that the Block Protocol type system will help avert a future in which low-quality, siloed user interfaces operating on hard-to-translate data become commonplace.

The world needs trustworthy software, and a better way to capture and use structured data is more important now than ever.
