
Embeddings as Encodings

Correctly conceptualizing and handling vectorization in knowledge graphs

January 26th, 2026

Dei Vilkinsons, CEO & Founder, HASH

Background

Embeddings are now a default building block in modern data services, powering semantic search, retrieval-augmented generation (RAG), clustering, deduplication, recommendations, anomaly detection, and more.

In fact, if you’re building AI-native products, you’re almost certainly storing vectors somewhere... but the how and where vary wildly.

The moment embeddings sit alongside a knowledge graph, a deceptively simple question becomes operationally important: are embeddings part of an entity, or are they metadata about an entity?

This arose internally at HASH in the way such questions often do: a few engineers, a few competing intuitions, and some spirited debate. We've since converged on a framework that’s less about word choice and more about building systems that remain reliable under model changes, scale, and security constraints.

The resulting best practice is straightforward: Embeddings are encodings of entities (derived representations of them). Embeddings therefore require content-grade access control, with the added consideration of metadata-grade lifecycle management. These principles resolve most downstream design debates.

“Derived” doesn’t automatically mean “metadata”

A common argument goes: embeddings are computed from entity content, therefore they’re metadata.

However, that’s not quite right. Embeddings represent the content itself: not in a way that is intelligible to you or me, but in a way that tells us far more than the created_at date of a million entities of different types.

Plenty of things are derived from “original data” or “content”, yet are simply different encodings of it:

  • a JPEG derived from RAW
  • an MP3 derived from WAV
  • a thumbnail derived from a full-resolution image
  • a translated document derived from an original

These are all different representations of information — often lossy, often dependent on an algorithm/codec/model, but fundamentally still encodings of the same underlying content.

Embeddings fit better in this bucket than as descriptive “metadata”, at least as most people (including developers) intuitively use the term (for example, to refer to file authors, timestamps, tags, etc.).

So why think about embeddings as “metadata” at all? The value isn’t, of course, in the label, but in the required systems discipline that typically comes with handling metadata: provenance, versioning, refresh policies, and explicit separation from canonical truth.

A practical taxonomy for graph/AI systems

In knowledge graphs the term “entity” usually means more than a blob of text. It’s a stable identifier plus a set of claims (attributes + relationships), ideally with provenance. In HASH, an entity is even more than this.

A durable way to structure this world consists of at least three layers:

1. Canonical claims

The graph’s explicit record of what is asserted:

  • Company: HASH
  • hasWebsite(hash.dev)
  • employs(Person: …)
  • provenance: where the claim came from, when, confidence, etc.

This is the layer you can audit, reason over, and reconcile.
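As a minimal sketch (the field names are illustrative, not HASH's actual schema), a canonical claim can be recorded as an auditable assertion with its provenance attached:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Claim:
    """One explicit, auditable assertion in the graph."""
    subject: str        # stable entity identifier
    predicate: str      # e.g. "hasWebsite", "employs"
    obj: str            # the claimed value or target entity
    source: str         # provenance: where the claim came from
    confidence: float   # provenance: how much we trust it
    asserted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

claim = Claim(
    subject="Company:HASH",
    predicate="hasWebsite",
    obj="https://hash.dev",
    source="company-registry-import",
    confidence=0.98,
)
```

Because claims are immutable records with a source and a confidence, they can be audited, reasoned over, and reconciled after the fact.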

2. Representations of those claims

Alternate encodings used for consumption or computation:

  • rendered HTML
  • translations
  • compressed media formats
  • thumbnails
  • ...and of course embeddings.

These are often lossy, often recomputable, and usually dependent on a specific model/codec. They are valuable, but they are not the canonical record.

3. Metadata about claims and representations

The lifecycle and provenance data that keeps everything sane:

  • model name/version used to compute a representation
  • which fields were embedded
  • when it was generated
  • input fingerprint
  • retention policy, ACL references, lineage

This is “data about the data”—and it’s what prevents the system from devolving into unexplainable artifacts.

Because embeddings are computed representations of canonical claims, they live in layer 2 (representations), not in layer 3 alongside the metadata about claims.

What embeddings are (and are not)

In HASH, an embedding is a machine-oriented encoding of an entity, or some projection of entity attributes, into a vector space optimized for similarity operations.

Two properties matter:

  1. Model-relative and purpose-relative: different embedding models disagree about what "similarity" should mean, because they optimize different objectives and weight different signals. Even within one model, embeddings for retrieval vs. classification vs. clustering can behave differently depending on prompt templates, truncation, and field selection. Different HASH users have different use-cases and needs, and separate models may serve specialized functions for some categories of users.
  2. Not human-legible: unlike translations (which a bilingual reader can understand), embeddings are not semantically inspectable by human beings. That doesn’t make them “outside” the content; it makes them opaque representations whose semantics are defined by the producing model and the downstream similarity metric.

For either of these reasons, let alone both, embeddings should not be treated as canonical truth about an entity. Embeddings provide powerful computational indices, not a ground-truth semantic substrate.

Treat embeddings like materialized views over entities

For operational scalability, it's vital to ensure that embeddings are derived artifacts you can regenerate, and that vectors generated by different models (occupying incompatible vector spaces) are never mixed.

In practice, this looks like storing embeddings with enough context to make them explainable and safe.

Embeddings ought to be versioned, able to be migrated without mutating entities, and their origin(s) should be tracked (in other words, the model used to compute them is known and remains accessible).
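One way to enforce both properties is to key vector storage on the model that produced each vector. A toy in-memory sketch (the class and method names are illustrative, not HASH's API):

```python
class EmbeddingStore:
    """Toy store: vectors are keyed by (entity_id, model_id) so that
    embeddings from incompatible vector spaces can never be compared
    by accident, and any one model's vectors can be dropped and
    rebuilt without mutating the entities themselves."""

    def __init__(self):
        self._vectors = {}  # (entity_id, model_id) -> list[float]

    def put(self, entity_id, model_id, vector):
        self._vectors[(entity_id, model_id)] = vector

    def get(self, entity_id, model_id):
        # Callers must name which model's vector space they want;
        # there is no bare "give me the embedding" lookup.
        return self._vectors.get((entity_id, model_id))

    def drop_model(self, model_id):
        """Migrate away from a model by deleting its vectors wholesale."""
        self._vectors = {
            k: v for k, v in self._vectors.items() if k[1] != model_id
        }

store = EmbeddingStore()
store.put("e1", "model-a@1", [0.1, 0.2])
store.put("e1", "model-b@2", [0.9, 0.8, 0.7])  # different space, different key
store.drop_model("model-a@1")                  # migration: rebuild at leisure
```

Dropping a model's vectors touches nothing canonical: entities, claims, and provenance are unaffected, and the deleted vectors can be regenerated on demand.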

Provenance to store alongside embeddings

Needs will vary between systems, and for many designs the following will be overkill, but appropriate base provenance for embeddings may include:

  • Entity/Attribute IDs: what an embedding represents (potentially an array of multiple entities/specific attributes, with version identifiers if appropriate)
  • Input fingerprints: a hash of the exact input bytes used, to ensure cross-linked entity/attribute IDs are faithful
  • Model identifiers and versions: what embedding model(s) produced the vectors
  • Usage documentation: any information required to interpret the vector mechanically, and its intended/suited purpose (retrieval, deduplication, clustering within a domain, etc.)
  • Timestamp: when the vector was generated.
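A minimal sketch of such a provenance record, assuming SHA-256 as the fingerprint function and with illustrative field names:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

def fingerprint(input_bytes: bytes) -> str:
    # Hash of the exact bytes fed to the model, so stale vectors can
    # later be detected by re-hashing the current entity content.
    return hashlib.sha256(input_bytes).hexdigest()

@dataclass(frozen=True)
class EmbeddingProvenance:
    entity_ids: tuple        # what the vector represents
    embedded_fields: tuple   # which attribute projection was embedded
    input_fingerprint: str   # hash of the exact input bytes
    model_id: str            # producing model + version
    purpose: str             # retrieval, deduplication, clustering, ...
    generated_at: str        # ISO 8601 timestamp

text = b"HASH | https://hash.dev"
prov = EmbeddingProvenance(
    entity_ids=("Company:HASH",),
    embedded_fields=("name", "website"),
    input_fingerprint=fingerprint(text),
    model_id="example-embedder@2026-01",
    purpose="retrieval",
    generated_at=datetime.now(timezone.utc).isoformat(),
)
```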

If those fields aren’t present, or can't be looked up, it can become difficult to answer questions that matter in production:

  • “Why did search results change after the model upgrade?”
  • “Which entities haven’t been re-embedded since last week?”
  • “Are we comparing vectors from different models by accident?”
  • “Can we A/B test embedding model migrations safely?”
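With input fingerprints recorded, the second of those questions reduces to a hash comparison. A sketch, with illustrative data structures:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def stale_entities(entities, provenance_by_entity):
    """Return ids of entities whose current content no longer matches
    the input fingerprint recorded when their embedding was generated
    (including entities that were never embedded at all)."""
    stale = []
    for entity_id, content in entities.items():
        prov = provenance_by_entity.get(entity_id)
        if prov is None or prov["input_fingerprint"] != fingerprint(content):
            stale.append(entity_id)
    return stale

entities = {
    "e1": b"unchanged since embedding",
    "e2": b"edited after embedding",
    "e3": b"never embedded",
}
provenance = {
    "e1": {"input_fingerprint": fingerprint(b"unchanged since embedding")},
    "e2": {"input_fingerprint": fingerprint(b"original text")},
}
```

Here `stale_entities(entities, provenance)` flags "e2" (content changed since embedding) and "e3" (no provenance at all), while "e1" is up to date.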

This is what we mean by “metadata-grade lifecycle management”: embeddings are treated as first-class derived representations with lineage, not as mysterious numbers floating in a vector database.

Protect embeddings like content

One of the most obvious conclusions that follows from this framing is the need to treat embeddings with the same level of security and protection as entities in their ordinary form.

If someone shouldn’t have access to an entity’s underlying data, they should not have access to embeddings (or any encodings) derived from it.

Embeddings carry content signal. In fact, that's the entire reason vector-driven semantic search works.

Even though embeddings aren’t human-readable, they can still leak:

  • membership (whether a record was part of an indexed corpus)
  • attribute inference (whether something resembles a sensitive category)
  • clustering and correlation information (which entities are “near” each other)

They may also pose approximate reconstruction risks in certain threat models (e.g. vec2text enabling original meaning to be roughly reverse-engineered, albeit without certainty).

The correct approach is therefore not to treat “vector store access” as a separate, looser security domain, but to gate embedding read access behind the same ACLs as the underlying entity attributes; and if permissions are field-level, bind embeddings to the specific projection of fields embedded.
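As a minimal sketch (the `acl` structure and `read_embedding` helper are illustrative, not HASH's API), field-level gating of vector reads might look like:

```python
def read_embedding(user, entity_id, projection, acl, vectors):
    """Gate embedding reads behind the SAME checks as the underlying
    fields: if the user can't read every field in the projection that
    was embedded, they can't read the vector either."""
    if not all(acl.get((user, entity_id, f), False) for f in projection):
        raise PermissionError(
            f"{user} may not read {projection} of {entity_id}"
        )
    return vectors[(entity_id, projection)]

# Vectors are bound to the specific projection of fields embedded.
vectors = {("e1", ("description",)): [0.3, 0.1]}
acl = {("alice", "e1", "description"): True}  # bob holds no grant
```

With this shape, `read_embedding("alice", ...)` returns the vector, while the same call for "bob" raises `PermissionError`, because vector access is derived from field access rather than granted separately.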

In practice, this often means storing multiple embeddings per entity. A description embedding might power semantic search, while a name-based embedding serves deduplication. Each projection carries its own permission boundary. HASH supports multiple embeddings per entity for this reason.

This is easier said than done. Many architectures separate their vector store (Pinecone, Weaviate, Qdrant, etc.) from their primary datastore, creating a sync problem: when entity permissions change, vector access must update accordingly. When embeddings are regenerated, old vectors must be invalidated atomically. When field-level permissions differ, you may need separate vector collections per permission boundary.

HASH sidesteps this by treating embeddings as first-class graph artifacts subject to the same query-time access control as entities themselves — no separate vector ACL layer to keep in sync.

Meanwhile, many RAG solutions marketed as commercially ready today do not handle this, creating the potential for data leakage and exposure of confidential information.

Even classic metadata can be sensitive depending on context (e.g. relationships, timestamps, communication graphs, and associations). Risk is measured by the ability to infer things that weren’t intended to be disclosed, not by any label. But we can say universally that embeddings have high inference potential, and should therefore be protected with the same care as the underlying data itself.

Employing embeddings within graphs

In HASH, and in general within graph-backed AI systems, we recommend letting embeddings propose and graphs decide.

This separation prevents “soft similarity” from silently being mistaken for “hard truth”.

Embeddings are excellent at generating candidates:

  • “these two entities might be duplicates”
  • “this document might relate to this project”
  • “this chunk might answer this query”

But turning candidates into durable relationships or claims should flow through graph-native checks:

  • schema constraints (such as SemType)
  • provenance and source reputation/trust
  • disambiguation logic
  • human review (where appropriate)

HASH provides all of these things natively.
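A toy sketch of propose-then-decide, using cosine similarity as the proposer and a stand-in schema check as the decider (the helpers and thresholds here are illustrative, not HASH's implementation):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def propose_duplicates(embeddings, threshold=0.95):
    """Embeddings PROPOSE: emit candidate pairs above a similarity bar."""
    ids = list(embeddings)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                yield (a, b)

def graph_accepts(pair, entity_types):
    """The graph DECIDES: here, a stand-in schema constraint saying
    entities of different types can never be merged, however similar
    their vectors look."""
    a, b = pair
    return entity_types[a] == entity_types[b]

embeddings = {
    "p1": [1.0, 0.0],
    "p2": [0.999, 0.04],   # near-duplicate of p1
    "c1": [0.998, 0.06],   # similar vector, but a different entity type
}
entity_types = {"p1": "Person", "p2": "Person", "c1": "Company"}

accepted = [
    p for p in propose_duplicates(embeddings)
    if graph_accepts(p, entity_types)
]
```

All three pairs clear the similarity bar, but only the Person/Person pair survives the schema check; in a real system, provenance, disambiguation logic, and human review would sit behind the same gate.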

How we apply this at HASH

In HASH deployments, embeddings are treated as:

  • derived representations attached to explicit entity/attribute projections
  • versioned and recomputable
  • accompanied by the provenance needed to support migration, comparison, and debugging
  • protected by the same access control rules as the underlying graph data

Hosted environments may choose to keep only one active embedding set at a time for cost reasons, but the architecture assumes embeddings are replaceable artifacts and avoids baking a single vector space into the identity of an entity.

The takeaway

Teams get stuck when they try to force embeddings into a binary: “content” or “metadata”.

The operationally correct framing is more nuanced, and more useful:

Embeddings are representations (encodings) of entity data. They are not canonical claims in the knowledge graph, but they combine a need for content-grade security (same permissions as underlying data) with metadata-grade lifecycle management (provenance, versioning, refresh). Their best role is to propose candidates, with the graph providing the auditable backbone for what becomes truth.

If you build with embeddings like this, model migrations stop being existential, retrieval becomes explainable, and security doesn’t depend on hoping vectors are “just metadata.”

With thanks to hashist Tim Diekmann and community contributor Bilal Mahmoud for helping develop the key insights within this post.
