Correctly conceptualizing and handling vectorization in knowledge graphs
January 26th, 2026
Embeddings are now a default building block in modern data services, powering semantic search, retrieval-augmented generation (RAG), clustering, deduplication, recommendations, anomaly detection, and more.
In fact, if you’re building AI-native products, you’re almost certainly storing vectors somewhere... but the how and where vary wildly.
The moment embeddings sit alongside a knowledge graph, a deceptively simple question becomes operationally important: are embeddings part of an entity, or are they metadata about an entity?
This arose internally at HASH in the way such questions often do: a few engineers, a few competing intuitions, and some spirited debate. We've since converged on a framework that’s less about word choice and more about building systems that remain reliable under model changes, scale, and security constraints.
The resulting best practice is straightforward: Embeddings are encodings of entities (derived representations of them). Embeddings therefore require content-grade access control, with the added consideration of metadata-grade lifecycle management. These principles resolve most downstream design debates.
A common argument goes: embeddings are computed from entity content, therefore they’re metadata.
However, that’s not quite right. Embeddings represent the content. Not in a way that may be intelligible to you or me, but in a way that tells us far more than the created_at date of a million entities of different types.
Plenty of things are derived from “original data” or “content”, yet are simply different encodings of it:

- a JPEG compression of a photograph;
- an MP3 encoding of a recording;
- a thumbnail rendered from a full-size image;
- a machine translation of a document.

These are all different representations of information: often lossy, often dependent on an algorithm/codec/model, but fundamentally still encodings of the same underlying content.
Embeddings fit better in this bucket than as descriptive “metadata”, at least as most people (including developers) intuitively use the term (for example, to refer to file authors, timestamps, tags, etc.).
So why think about embeddings as “metadata” at all? The value isn’t, of course, in the label, but in the required systems discipline that typically comes with handling metadata: provenance, versioning, refresh policies, and explicit separation from canonical truth.
In knowledge graphs the term “entity” usually means more than a blob of text. It’s a stable identifier plus a set of claims (attributes + relationships), ideally with provenance. In HASH, an entity is even more than this.
A durable way to structure this world consists of at least three layers:
Layer 1: canonical claims. The graph’s explicit record of what is asserted:

Company: HASH
  hasWebsite(hash.dev)
  employs(Person: …)

This is the layer you can audit, reason over, and reconcile.
Layer 2: derived representations. Alternate encodings used for consumption or computation: embeddings, search indexes, cached renderings, and the like.

These are often lossy, often recomputable, and usually dependent on a specific model/codec. They are valuable, but they are not the canonical record.
Layer 3: metadata. The lifecycle and provenance data that keeps everything sane: provenance records, version history, refresh policies and timestamps.

This is “data about the data”, and it’s what prevents the system from devolving into unexplainable artifacts.
Because embeddings are computed representations of canonical claims, they live in layer 2, not in layer 3 alongside the metadata about those claims.
In HASH, an embedding is a machine-oriented encoding of an entity, or some projection of entity attributes, into a vector space optimized for similarity operations.
Two properties matter:

- Embeddings are lossy: information in the entity is discarded in the projection down to a fixed-length vector, and cannot reliably be recovered from it.
- Embeddings are model-dependent: vectors produced by different models occupy incompatible vector spaces, and cannot meaningfully be compared.

For either of these reasons, let alone both, embeddings should not be treated as canonical truth about an entity. Embeddings provide powerful computational indices, not a ground-truth semantic substrate.
For operational scalability, it's vital to ensure that embeddings remain derived artifacts you can regenerate, and that vectors generated by different models, which occupy incompatible vector spaces, are never mixed.
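As a minimal sketch of what this can look like (the `VectorStore` interface and collection-naming scheme below are illustrative assumptions, not HASH's API), one simple discipline is to key vector collections by the model and version that produced them, so vectors from incompatible spaces can never meet in the same query:

```typescript
// Illustrative sketch: namespace vectors by the model that produced them,
// so similarity queries can never mix incompatible vector spaces.

interface VectorStore {
  upsert(collection: string, id: string, vector: number[]): Promise<void>;
  query(collection: string, vector: number[], k: number): Promise<string[]>;
}

// One collection per (model, version) pair. A model migration means writing
// to a fresh collection, not mutating vectors in place.
const collectionFor = (model: string, version: string): string =>
  `entity_embeddings__${model}__${version}`;

async function indexEntity(
  store: VectorStore,
  entityId: string,
  vector: number[],
  model: string,
  version: string,
): Promise<void> {
  await store.upsert(collectionFor(model, version), entityId, vector);
}
```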
In practice, this looks like storing embeddings with enough context to make them explainable and safe.
Embeddings ought to be versioned, migratable without mutating the entities they derive from, and traceable to their origin(s); in other words, the model used to compute them is known and remains accessible.
Needs will vary between systems, and for many designs the following will be overkill, but appropriate base provenance for embeddings may include:

- the identity and exact version of the model that produced the vector;
- the projection of entity fields that was embedded;
- a hash of the exact content embedded, for staleness checks;
- the vector's dimensionality;
- a generation timestamp.
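A minimal sketch of what such a record could look like follows; the field names here are illustrative assumptions, not HASH's actual schema:

```typescript
// Illustrative shape for a stored embedding with "metadata-grade" provenance.
interface EmbeddingRecord {
  entityId: string;         // the entity this vector was derived from
  modelId: string;          // identity of the embedding model
  modelVersion: string;     // exact version; vectors across versions don't compare
  dimensions: number;       // vector dimensionality, fixed per model
  embeddedFields: string[]; // the projection of entity fields that was embedded
  sourceHash: string;       // hash of the exact content embedded, for staleness checks
  generatedAt: string;      // ISO-8601 timestamp of generation
  vector: number[];         // the embedding itself
}
```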
If those fields aren’t present, or can't be looked up, it can become difficult to answer questions that matter in production:

- Which model produced this vector, and can we still run it?
- Is this vector stale relative to the entity it was derived from?
- Can this vector safely be compared with that one, or do they come from incompatible spaces?
This is what we mean by “metadata-grade lifecycle management”: embeddings are treated as first-class derived representations with lineage, not as mysterious numbers floating in a vector database.
Using embeddings leads to one particularly obvious conclusion: they must be treated with the same level of security and protection as entities in their ordinary form.
If someone shouldn’t have access to an entity’s underlying data, they should not have access to embeddings (or any encodings) derived from it.
Embeddings carry content signal. In fact, that's the entire reason vector-driven semantic search works.
Even though embeddings aren’t human-readable, they can still leak:

- membership: whether a record was part of an indexed corpus;
- attribute inference: whether something resembles a sensitive category;
- clustering and correlation: which entities are “near” each other;

...and, in certain threat models, they may pose approximate reconstruction risks (e.g. vec2text enabling original meaning to be roughly reverse-engineered, albeit without certainty).
The correct approach is therefore not to treat “vector store access” as a separate, looser security domain, but to gate embedding read access behind the same ACLs as the underlying entity attributes; and if permissions are field-level, bind embeddings to the specific projection of fields embedded.
In practice, this often means storing multiple embeddings per entity. A description embedding might power semantic search, while a name-based embedding serves deduplication. Each projection carries its own permission boundary. HASH supports multiple embeddings per entity for this reason.
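A sketch of what per-projection gating can look like, assuming a hypothetical `canReadFields` ACL check (none of these names are HASH's actual API):

```typescript
// Illustrative sketch: each embedding is bound to the projection of fields it
// encodes, and reads are gated by the same check that guards those fields.

type Projection = "description" | "name";

interface ProjectionEmbedding {
  entityId: string;
  projection: Projection; // which fields this vector encodes
  vector: number[];
}

// Hypothetical ACL check: can this actor read these fields of this entity?
declare function canReadFields(
  actorId: string,
  entityId: string,
  fields: Projection,
): Promise<boolean>;

async function readEmbedding(
  actorId: string,
  embedding: ProjectionEmbedding,
): Promise<number[] | null> {
  // Content-grade access control: no field access, no vector access.
  const allowed = await canReadFields(
    actorId,
    embedding.entityId,
    embedding.projection,
  );
  return allowed ? embedding.vector : null;
}
```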
This is easier said than done. Many architectures separate their vector store (Pinecone, Weaviate, Qdrant, etc.) from their primary datastore, creating a sync problem: when entity permissions change, vector access must update accordingly. When embeddings are regenerated, old vectors must be invalidated atomically. When field-level permissions differ, you may need separate vector collections per permission boundary.
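One way to make regeneration atomic, sketched here with a hypothetical versioned store, is to write the new vectors under a fresh version tag and only then flip a single "active version" pointer, retiring the old set in one step:

```typescript
// Illustrative sketch: regenerate embeddings under a new version tag, then
// atomically switch queries over, so old vectors are never half-replaced.

interface VersionedStore {
  writeAll(version: string, vectors: Map<string, number[]>): Promise<void>;
  setActiveVersion(version: string): Promise<void>; // single atomic pointer flip
  dropVersion(version: string): Promise<void>;
}

async function rollEmbeddings(
  store: VersionedStore,
  oldVersion: string,
  newVersion: string,
  vectors: Map<string, number[]>,
): Promise<void> {
  await store.writeAll(newVersion, vectors); // build the new set off to the side
  await store.setActiveVersion(newVersion); // queries cut over in one step
  await store.dropVersion(oldVersion);      // the old set is invalidated together
}
```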
HASH sidesteps this by treating embeddings as first-class graph artifacts subject to the same query-time access control as entities themselves — no separate vector ACL layer to keep in sync.
Meanwhile, many RAG solutions marketed as commercially ready today do not handle this, opening the door to data leakage and exposure of confidential information.
Even classic metadata can be sensitive depending on context (e.g. relationships, timestamps, communication graphs, and associations). Risk is measured by the ability to infer things that weren't intended to be disclosed, not by any label. But universally we can say that embeddings carry high inference potential, and should therefore be protected with the same care as the underlying data itself.
In HASH, and in general within graph-backed AI systems, we recommend letting embeddings propose and graphs decide.
This separation prevents “soft similarity” from silently being mistaken for “hard truth”.
Embeddings are excellent at generating candidates:

- likely duplicates for deduplication;
- semantically similar entities for recommendations and clustering;
- relevant records for retrieval and search.
But turning candidates into durable relationships or claims should flow through graph-native checks:

- type and schema validation of the proposed claim;
- provenance requirements, so every assertion can be traced;
- permission checks on all entities involved;
- an auditable record of what was asserted, and why.
HASH provides all of these things natively.
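As a sketch of the propose/decide split, assuming hypothetical `searchSimilar`, `getEntityType`, and `recordMergeClaim` helpers (and an illustrative similarity threshold):

```typescript
// Illustrative sketch: vectors propose candidate duplicates; the graph decides.
// searchSimilar, getEntityType, and recordMergeClaim are hypothetical helpers.

declare function searchSimilar(
  entityId: string,
  k: number,
): Promise<Array<{ entityId: string; score: number }>>;
declare function getEntityType(entityId: string): Promise<string>;
declare function recordMergeClaim(
  a: string,
  b: string,
  provenance: { method: string; score: number },
): Promise<void>;

async function proposeDuplicates(entityId: string): Promise<void> {
  // 1. Embeddings propose: soft similarity only ever generates candidates.
  const candidates = await searchSimilar(entityId, 10);
  const sourceType = await getEntityType(entityId);

  for (const candidate of candidates) {
    // 2. The graph decides: hard, auditable checks gate what becomes a claim.
    if ((await getEntityType(candidate.entityId)) !== sourceType) continue;
    if (candidate.score < 0.92) continue; // threshold is illustrative

    // Accepted matches are recorded as provenance-bearing claims, not silent
    // merges, so the decision remains auditable after the fact.
    await recordMergeClaim(entityId, candidate.entityId, {
      method: "embedding-similarity",
      score: candidate.score,
    });
  }
}
```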
In HASH deployments, embeddings are treated as:

- derived artifacts, regenerable at any time from the canonical entity data;
- versioned representations with tracked provenance;
- access-controlled objects, gated by the same permissions as the data they encode;
- non-canonical signals that propose, and never assert.
Hosted environments may choose to keep only one active embedding set at a time for cost reasons, but the architecture assumes embeddings are replaceable artifacts and avoids baking a single vector space into the identity of an entity.
Teams get stuck when they try to force embeddings into a binary: “content” or “metadata”.
The operationally correct framing is more nuanced, and more useful:
Embeddings are representations (encodings) of entity data. They are not canonical claims in the knowledge graph, but they combine a need for content-grade security (same permissions as underlying data) with metadata-grade lifecycle management (provenance, versioning, refresh). Their best role is to propose candidates, with the graph providing the auditable backbone for what becomes truth.
If you build with embeddings like this, model migrations stop being existential, retrieval becomes explainable, and security doesn’t depend on hoping vectors are “just metadata.”
With thanks to hashist Tim Diekmann and community contributor Bilal Mahmoud for helping develop the key insights within this post.