How not to build a Process Foundation Model

Why classic AI model architectures fail to handle the unique properties of process graphs

May 27th, 2026

Dei Vilkinsons, CEO & Founder, HASH

This is a post about how not to build a process foundation model — because almost every architecture you might first reach for to build one turns out to be the wrong primary tool for the job. The reasons are surprisingly specific, and they shape what we think the right architecture actually looks like.

A process foundation model (PFM) reasons effectively about real-world processes (workflows, cases, supply chains, regulations, lab protocols) across organizations and domains, including those that are new to the model and on which it has not specifically been trained. Or at least it would, had anybody succeeded in building a PFM. To date, nobody has.

The missing piece, we argue, is a representation-learning objective designed for partially-observed, typed, temporal graph state. Our solution to this problem is Dynamic Graph-JEPA ("DG-JEPA"), which functionally extends Graph-JEPA, from HASH researcher Geri Skenderi. The "maximal" version of the PFM hybrid architecture we plan on exploring incorporates a mix of transformers, SSMs, GNNs/graph transformers, Petri nets, and RL — each with their own specific job — all complementing DG-JEPA (which sits at the core of our foundation model's architecture).

What is a PFM?

A process foundation model (PFM) is an AI model capable of reasoning about processes across a wide spectrum of domains, including, crucially, in a zero-shot fashion in novel environments on which it has not been specifically trained.

The PFM we’re developing at HASH works by modeling process state: what objects exist, what obligations are open, which actions are enabled, which resources are constrained, which rules apply, which hidden decision variables likely explain a branch, and what futures are feasible.

This is in contrast to traditional symbolic and diagrammatic process representation, which seeks to “mine” the shape of processes (representing them in BPMN form, or sometimes as Petri nets) from raw event logs. However, event logs themselves do not explicitly contain this state — only observations from which state may be inferred — requiring a neural component on top to extract the true “latent” process state.

Our research aims to show that DG-JEPA is capable of unlocking this latent process state for a wide range of tasks including:

  • Predictive monitoring: forecasting next activities, remaining time, likely outcomes, and SLA breach risks.
  • Process discovery: inferring control-flow, object lifecycles, resource dependencies, and variants.
  • Conformance and anomaly detection: detecting deviations from policies, expected behavior, or discovered norms.
  • Counterfactual reasoning: estimating what changes if an approval rule, staffing level, routing policy, or system integration changes.
  • Simulation and planning: generating feasible future continuations or interventions, not just plausible-looking traces.

In addition, we believe DG-JEPA will effectively support transfer, allowing generalization from one organization’s “invoice approval” to another’s “AP authorization”, even when labels, timing, and event granularity differ.

Why are PFMs important?

Process problems — optimization, automation, resiliency, compliance — have huge economic value, both to individual organizations and at an aggregate societal level (enabling more efficient use of resources, and more robust systems).

But at a time of rapid change such as this, with the impacts of existing AI models already rippling through the economy and “ceteris paribus” assumptions no longer holding true, enabling this kind of process adaptability is valuable not only to businesses but to the full array of society’s institutions. Unless regulators, governments, and individuals can keep up, they risk being left behind, or at the very least being exposed to risks that threaten their wellbeing and security.

Why hasn’t a PFM already been built?

For all the value at stake, three things have made PFMs particularly hard to build until now.

1. The data hasn’t existed in the right shape. Foundation models learn from large bodies of well-structured training data — language for LLMs, images for vision models, code for code models. Process data, by contrast, is fragmented across enterprise systems (ERP, CRM, ITSM, electronic lab notebooks, paper trails), recorded with inconsistent vocabularies, missing the latent state that explains why events occurred, and rarely shared across organizational boundaries. Without a substrate of typed, temporal, provenance-aware graphs that span all of this, no architecture can succeed.

2. The right architecture hasn’t existed in the right shape. Existing AI model architectures — transformers, diffusion, graph neural nets, state space models, and more — each have characteristics that lend surface-level hope to claims that they may “solve” process modeling. However, digging into each, it quickly becomes apparent that none of them, by itself, has the right inductive biases, training objective, or representational substrate for the job.

3. The right symbolic infrastructure hasn’t existed in the right shape. To model processes well, even neural methods need symbolic process libraries (Petri nets, conformance checkers, process simulators) to constrain learning, verify outputs, and provide formal guarantees that a purely-learned model cannot. Without this substrate, learned representations cannot be safely composed with rules, policies, and constraints.

We’ve spent the better part of the last seven years building all three: a typed, multi-temporal knowledge graph layer (hgres); an open standard for the type system on which it depends (SemType); and an open-source symbolic process modeling environment (Petrinaut). The remaining piece is the neural architecture that sits on top — and that’s what this post is about.

We believe that "Dynamic Graph-JEPA" (DG-JEPA) holds the key. It extends Graph-JEPA — a representation-learning architecture introduced in 2023 by HASH researcher Geri Skenderi — to support the temporal, multi-object, partially-observed structure of real-world processes. And we think a hybrid approach that integrates other existing model families with DG-JEPA represents the most promising path to a useful, widely-generalizable PFM, fast.

To explain why, it’s helpful first to walk through why classical approaches are insufficient — and then to show how each one finds a productive role inside a DG-JEPA-anchored hybrid.

What’s wrong with…

In what follows, we walk through the major model families that an engineer or researcher might, as we did, reach for first when sketching out a process foundation model. For each, we explain what it offers, why it falls short on its own, and (where it can) what it might meaningfully contribute as a component of a hybrid PFM.

None of these architectures are “bad”. Each has been engineered for a particular job. But none — by themselves — have the right inductive biases, training objective, or representational substrate to learn reusable, generalizable latent process state at foundation-model scale.

We’ve grouped the thirteen architectures into four families.

Sequence-centric models

The most natural first move when modeling time-shaped data is to reach for a sequence model. But processes are not sequences — they are partially-observed, multi-object, branching, concurrent graphs that happen to unfold in time.

Transformers

Transformers in general are powerful and should not be dismissed. They may play various useful roles, for example in encoding text labels, comments, documents, and field names; modeling long-range event histories; doing in-context adaptation over support examples; serving as graph transformers over nodes rather than raw event tokens; and supporting natural-language interfaces to the process model.

However, as a backbone for a process “foundation” model, transformer-based approaches fail in various ways.

Transformers expect sequences, requiring process graphs to be serialized. However, many process dependencies are non-sequential: two branches may run in parallel; a later event may depend on a resource allocation rather than the immediately previous event; a document node may constrain several activities; an exception handler may be causally important despite being rare.

This makes serialization lossy. For example, when two activities happen in parallel, any sequence model sees two different orderings of the same underlying behavior. Transformers can learn to handle this if given enough examples and clever positional encodings, but they must infer graph relations that graph representations could otherwise encode directly.
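The ordering problem can be sketched with a toy example. The four activity names, the `PRECEDES` pairs, and both helper functions below are hypothetical, chosen only to illustrate one process with a single parallel branch.

```python
from itertools import permutations

# Hypothetical process: receive_order, then check_credit and check_stock
# in parallel, then ship. Stored as "must happen before" pairs.
PRECEDES = {
    ("receive_order", "check_credit"),
    ("receive_order", "check_stock"),
    ("check_credit", "ship"),
    ("check_stock", "ship"),
}
ACTIVITIES = ["receive_order", "check_credit", "check_stock", "ship"]


def valid_serializations(activities, precedes):
    """Return every total ordering consistent with the partial order."""
    result = []
    for order in permutations(activities):
        position = {name: i for i, name in enumerate(order)}
        if all(position[a] < position[b] for a, b in precedes):
            result.append(order)
    return result


def directly_follows(trace):
    """Adjacent activity pairs: the local signal a sequence model sees."""
    return set(zip(trace, trace[1:]))


traces = valid_serializations(ACTIVITIES, PRECEDES)
shared_pairs = directly_follows(traces[0]) & directly_follows(traces[1])
```

In this toy case the single graph admits two serializations, and those two traces share no directly-follows pair at all. A sequence model has to learn from data that they are the same behavior, whereas the partial order states it once.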

This problem worsens with object-centric processes. An event can belong simultaneously to an order, invoice, vendor, shipment, and payment run. A single sequence forces a choice of “case,” whereas an object-centric graph can represent all relevant lifecycles and their interactions.
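A small sketch of what "forcing a choice of case" costs. The event records, object identifiers, and the `flatten` helper are illustrative assumptions, not a real log format.

```python
from collections import defaultdict

# Hypothetical object-centric events: each event may reference several objects.
EVENTS = [
    {"activity": "create_order", "objects": {"order": ["o1"]}},
    {"activity": "create_invoice", "objects": {"order": ["o1"], "invoice": ["i1"]}},
    {"activity": "create_invoice", "objects": {"order": ["o1"], "invoice": ["i2"]}},
    {"activity": "pay_invoice", "objects": {"invoice": ["i1"]}},
    {"activity": "pay_invoice", "objects": {"invoice": ["i2"]}},
    {"activity": "ship_order", "objects": {"order": ["o1"]}},
]


def flatten(events, case_type):
    """Project the log onto one object type, yielding one trace per object."""
    traces = defaultdict(list)
    for event in events:
        for object_id in event["objects"].get(case_type, []):
            traces[object_id].append(event["activity"])
    return dict(traces)


by_order = flatten(EVENTS, "order")
by_invoice = flatten(EVENTS, "invoice")
```

Flattening by order drops both payments from view, while flattening by invoice drops order creation and shipping. Neither projection contains the interaction between the two lifecycles that the original events record.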

Transformers transfer poorly across process representations. Transformers trained over serialized logs may entangle semantics with local naming, system-specific event schemas, and organization-specific conventions. Berti & van der Aalst (2025) observe that process settings vary in activity vocabularies, naming conventions, and temporal scales, causing many per-log models to require retraining and revalidation. A graph-latent objective has a better chance of learning the invariant role of a substructure, independent of its local label.

In processes, next-token prediction optimizes the wrong thing. Autoregressive transformers are trained to predict the next token. For text, this objective is surprisingly rich because language encodes huge amounts of world structure. For process logs, next-event prediction can be much shallower.

Transformers reward surface predictability. A model can achieve high next-activity accuracy by learning frequent local transitions (e.g. submit → review, review → approve, and approve → notify) while failing to learn deeper invariants of much greater importance: whether a review was actually required by policy; whether a single person performed incompatible duties; whether an approval was skipped, delayed, or delegated; which object state changed; what future actions are actually enabled; and whether an unusual trace is a valid exception or a true violation, to name but a few.
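The gap between next-activity accuracy and deeper invariants can be illustrated with a deliberately tiny stand-in for a sequence model. The log, performer names, and the bigram "model" are all hypothetical. The accuracy is measured on the same three cases used for fitting, so it says nothing about generalization.

```python
from collections import Counter, defaultdict

# Hypothetical log: each case is a list of (activity, performer) pairs.
LOG = [
    [("submit", "ann"), ("review", "bob"), ("approve", "cat"), ("notify", "sys")],
    [("submit", "dan"), ("review", "eve"), ("approve", "cat"), ("notify", "sys")],
    # Same control flow, but "ann" reviews her own submission.
    [("submit", "ann"), ("review", "ann"), ("approve", "cat"), ("notify", "sys")],
]


def fit_bigram(log):
    """Map each activity to its most frequent successor."""
    counts = defaultdict(Counter)
    for case in log:
        activities = [activity for activity, _ in case]
        for prev, nxt in zip(activities, activities[1:]):
            counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}


def next_activity_accuracy(model, log):
    """Fraction of transitions whose successor the bigram table predicts."""
    hits = total = 0
    for case in log:
        activities = [activity for activity, _ in case]
        for prev, nxt in zip(activities, activities[1:]):
            hits += model.get(prev) == nxt
            total += 1
    return hits / total


def violates_segregation_of_duties(case):
    """True when the submitter also performed the review."""
    performer = dict(case)
    return performer["submit"] == performer["review"]


model = fit_bigram(LOG)
accuracy = next_activity_accuracy(model, LOG)
violations = [violates_segregation_of_duties(case) for case in LOG]
```

The bigram table predicts every transition in this log correctly, yet it carries no information that separates the third case from the first two. The violation lives in the relation between performers, which the next-activity objective never asks about.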

Claimed process “foundation” models to date generally assume stable activity vocabularies and distributions. Any genuinely reusable model requires cross-log generalization and adaptation to local vocabularies and time scales.

Transformers are weakly typed by default. A process token can represent many different things: for example approve, manager, invoice, amount > 10k, timestamp delta = 4 days, policy violation, or resource unavailable. A transformer can consume all of these as tokens or embeddings, but it has no intrinsic distinction between control-flow relation, object-state relation, resource relation, data dependency, compliance constraint, temporal interval, or causal dependency. Type embeddings, special tokens, and structure-aware attention masks can all be added… but at that point we find ourselves already moving toward a graph-aware architecture.

Long context is insufficient for graph-shaped reasoning. Transformer models with large context windows help when relevant evidence is far away in a sequence, but they do not automatically solve graph structure. Dependencies may not be “far” in token distance, but structurally indirect. For example, invoice payment may depend on invoice object state, PO object state, goods receipt object state, vendor master data, approval policy, segregation-of-duties rule, and payment run calendars. These are not simply earlier tokens, but typed relations among entities.

Decoder-only generation can produce invalid processes. Language models may generate traces that look plausible, such as create invoice → approve invoice → pay invoice, but plausibility does not imply validity. A process may also require the right approval threshold, a matching purchase order, a goods receipt, separation between requester and approver, available budget, correct object lifecycle state, no missing join token, no impossible loop, no authorization violation, or any other prerequisite you can imagine.
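To make "plausible but invalid" concrete, here is a minimal token-replay sketch over a hypothetical Petri net. The places, transitions, and function names are assumptions made for illustration and do not reflect Petrinaut's API or any real invoice process.

```python
# Hypothetical net: paying requires both an approval and a goods receipt.
TRANSITIONS = {
    "create_invoice": {
        "consume": {"start": 1},
        "produce": {"invoice_open": 1, "awaiting_receipt": 1},
    },
    "receive_goods": {
        "consume": {"awaiting_receipt": 1},
        "produce": {"goods_received": 1},
    },
    "approve_invoice": {
        "consume": {"invoice_open": 1},
        "produce": {"invoice_approved": 1},
    },
    "pay_invoice": {
        "consume": {"invoice_approved": 1, "goods_received": 1},
        "produce": {"paid": 1},
    },
}


def is_enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    needed = TRANSITIONS[transition]["consume"]
    return all(marking.get(place, 0) >= n for place, n in needed.items())


def fire(marking, transition):
    """Return the marking after consuming inputs and producing outputs."""
    new_marking = dict(marking)
    for place, n in TRANSITIONS[transition]["consume"].items():
        new_marking[place] -= n
    for place, n in TRANSITIONS[transition]["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


def replay(trace, initial_marking):
    """Return (True, None) if every step is enabled, else (False, blocked_step)."""
    marking = dict(initial_marking)
    for step in trace:
        if not is_enabled(marking, step):
            return False, step
        marking = fire(marking, step)
    return True, None


plausible = ["create_invoice", "approve_invoice", "pay_invoice"]
valid = ["create_invoice", "receive_goods", "approve_invoice", "pay_invoice"]
plausible_result = replay(plausible, {"start": 1})
valid_result = replay(valid, {"start": 1})
```

The first trace reads naturally but is blocked at `pay_invoice` because no goods receipt token exists. The second replays cleanly. The check comes entirely from the net, not from how likely the trace looks.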

Autoregressive decoding can be constrained, but then validity is coming from an external grammar, verifier, or process engine — not from the transformer objective itself. At HASH we already use LLM-based assistants to help domain experts develop Petri net-based models of real-world processes, and have invested significant energy in developing systems for accurately eliciting these real-world constraints and requirements from domain expert end-users (including through our new brunch.ai research prototype). However, manual expert modeling is not always feasible — and certainly not at the scale needed for an actual foundation model.

Where transformers still fit inside a PFM: Transformers remain valuable as feature extractors for unstructured inputs — text labels, comments, documents, policies, and free-text fields — and as in-context adapters over support examples drawn from related logs. Their outputs become typed node and edge features on the process graph, where the DG-JEPA objective can then organize them by latent process role rather than surface vocabulary. The transformer does what it is good at (language); DG-JEPA does what it is not (structured, partially-observed process state).

Recurrent and state-space models

Sequence models — such as state space models (SSMs), recurrent neural nets (RNNs), and temporal convolutional networks (TCNs) — are appealing for process modeling because process execution is fundamentally temporal.

Models such as the S4 SSM (Gu, Goel & Ré, 2022) were specifically designed for long-range sequence modeling and show strong results on long-sequence benchmarks, while Mamba (Gu & Dao, 2024) introduces selective state-space mechanisms with linear scaling in sequence length and input-dependent state updates, addressing certain transformer inefficiencies on long sequences.

These models are good candidates for various aspects of our process models: per-object lifecycle encoding; long-running cases; resource-load time series; online monitoring; event streams with millions of events; and continuous-time or irregularly sampled dynamics.

But as the primary architecture, they inherit the core weakness of sequence models: compressing a process into a single stream, with a sole hidden state required to remember which objects exist, which branches are active, which policies apply, and which resources are constrained. This is a severe bottleneck for multi-object, branching, concurrent workflows.

Where SSMs fit inside a PFM: Sequence models are useful below the graph-encoder layer. Per-entity timelines — case histories, object lifecycles, resource workloads, sensor streams — can be summarized by an SSM into a vector that becomes a feature on the corresponding graph node, so DG-JEPA does not have to reconstruct long histories from a flat event prefix. The graph encoder then handles structure across entities, while the SSM handles depth within each entity.
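The "summarize a timeline into a node feature" idea can be sketched with the simplest possible relative of an SSM: a fixed diagonal linear recurrence. This is not S4 or Mamba, which learn their parameters (and, in Mamba's case, make them input-dependent). The decay rates, entity names, and values below are illustrative assumptions.

```python
# Illustrative decay rates; a trained SSM would learn its own dynamics.
DECAYS = [0.5, 0.9]


def summarize_timeline(values, decays=DECAYS):
    """Run h[i] = decay[i] * h[i] + (1 - decay[i]) * x over the series.

    Returns one fixed-size vector regardless of the series length.
    """
    state = [0.0] * len(decays)
    for x in values:
        state = [d * h + (1.0 - d) * x for d, h in zip(decays, state)]
    return state


# Hypothetical per-entity timelines of different lengths.
timelines = {
    "case:1": [2.0, 2.0],
    "resource:bob": [4.0],
}

# Each summary becomes a feature vector attached to the matching graph node.
node_features = {node: summarize_timeline(series) for node, series in timelines.items()}
```

Each entity ends up with a vector of the same size however long its history is, which is what lets a graph encoder consume it as an ordinary node feature.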

Time-Series Foundation Models

Recent time-series foundation models (TSFMs) compellingly demonstrate that pretraining across diverse time series can yield zero-shot or few-shot forecasting behavior:

  • TimeGPT (Garza, Challu & Mergenthaler-Canseco, 2024) is framed as a foundation model for time-series forecasting;
  • Chronos (Ansari et al, 2024) tokenizes time-series values and trains transformer-family models with cross-entropy;
  • and Lag-Llama (Rasul, Ashok et al, 2024) uses a decoder-only transformer for probabilistic forecasting.

These show promise in addressing the narrow slice of process modeling which relates to duration forecasting, queue-length forecasting, workload prediction, SLA risk, seasonal demand, and resource availability. However, TSFMs do not naturally model control-flow choices, object lifecycles, approval constraints, policy violations, graph-structured dependencies, or multi-entity interactions.

A process foundation model may benefit from including time-series encoders for resource and performance signals. But a pure time-series foundation model would miss the symbolic and relational structure that makes a process a process.

How DG-JEPA might use TSFMs: Pretrained TSFMs are good candidates for the per-entity time-series features fed into DG-JEPA — duration distributions, queue lengths, resource workloads, SLA risk curves, seasonal demand. Their forecasts become numeric features attached to graph nodes (cases, resources, queues), enriching DG-JEPA’s view of the system’s continuous-time dynamics without forcing it to re-learn forecasting from scratch. DG-JEPA, in turn, provides the symbolic and relational backbone that a pure TSFM would otherwise lack.

Graph-centric models

The natural second move is to give up on serialization and use a graph-shaped model. That fixes the structural mismatch — but introduces a new problem: which objective should the graph model be trained on?

Graph Neural Nets

Unlike transformers, at least, Graph Neural Nets (GNNs) have the right substrate… but unfortunately the wrong objective.

Message passing in GNNs is local. A standard GNN updates a node by aggregating information from neighbors. After k layers, a node has information from roughly its k-hop neighborhood. But in real-world processes many important properties are not local. For example, it may be important to know whether a payment was approved by someone other than the requester, whether all of the required upstream objects have reached the correct state, whether something is the first rework loop or a repeated one, whether an activity is enabled under the process model, or whether an earlier policy exception renders a later action valid. This is where GNNs struggle.
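The k-hop limit is easy to see on a toy graph. The chain of nodes below is a hypothetical slice of a purchase-to-pay process, and `receptive_field` is an illustrative helper that tracks only which nodes could have contributed information, not any actual feature computation.

```python
# Hypothetical chain linking a payment back to the original requester.
EDGES = [
    ("requester", "purchase_request"),
    ("purchase_request", "invoice"),
    ("invoice", "payment"),
    ("payment", "approver"),
]


def build_adjacency(edges):
    """Undirected adjacency sets, as used by plain message passing."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def receptive_field(adjacency, node, layers):
    """Nodes whose features can reach `node` after `layers` rounds."""
    seen = {node}
    frontier = {node}
    for _ in range(layers):
        frontier = {n for f in frontier for n in adjacency[f]} - seen
        seen |= frontier
    return seen


adjacency = build_adjacency(EDGES)
two_layers = receptive_field(adjacency, "payment", 2)
three_layers = receptive_field(adjacency, "payment", 3)
```

With two layers, the payment node has heard from the approver but not from the requester, so it cannot compare them. Only at three layers are both in view, and real process graphs often place the relevant evidence further away than that.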

Layers can be added, but work on GNN expressivity shows that many popular GNNs are bounded by Weisfeiler-Lehman-style graph distinctions, rendering them unable to distinguish certain simple graph structures (Xu et al, 2019), while deep graph convolutional networks can lose expressive power as representations converge toward limited graph spectral information — one version of the over-smoothing problem (Oono & Suzuki, 2021).
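A standard illustration of the Weisfeiler-Lehman bound is a six-node cycle versus two disjoint triangles. The sketch below implements plain 1-WL color refinement on unlabeled graphs. The graph constants and function names are chosen for this example.

```python
from collections import Counter


def cycle(nodes):
    """Edges of a simple cycle through the given nodes."""
    return [(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))]


HEXAGON = cycle([0, 1, 2, 3, 4, 5])
TWO_TRIANGLES = cycle([0, 1, 2]) + cycle([3, 4, 5])
PATH = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]


def wl_histogram(edges, rounds=3):
    """Run 1-WL color refinement and return the final color histogram.

    Colors are kept as nested tuples so they stay comparable across graphs.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    colors = {node: 0 for node in adjacency}
    for _ in range(rounds):
        colors = {
            node: (colors[node], tuple(sorted(colors[m] for m in adjacency[node])))
            for node in adjacency
        }
    return Counter(colors.values())


same_as_triangles = wl_histogram(HEXAGON) == wl_histogram(TWO_TRIANGLES)
same_as_path = wl_histogram(HEXAGON) == wl_histogram(PATH)
```

The hexagon and the two triangles receive identical color histograms even though one is connected and the other is not, because every node in both graphs has exactly two neighbors. The path is told apart immediately because its endpoints have a different degree.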

Process graphs are heterogeneous and heterophilic. Many GNNs work best when neighboring nodes are semantically similar. Process graphs, on the other hand, are often heterophilic:

  • user → performs → activity
  • activity → updates → invoice
  • invoice → belongs to → vendor
  • activity → constrained by → policy
  • policy → violated by → event

Neighbors are not necessarily “similar”, merely in some sense complementary. The model, therefore, has to understand typed relations. A resource node, policy node, event node, and object node may all interact, but their embeddings should not collapse into the same semantic space.

A hetero-GNN or relational GNN can help, but then the pre-training objective becomes crucial. Without a good self-supervised task, the model may just learn shallow edge-type correlations.

GNNs fail to account for temporal evolution. Real-world processes are dynamic. The same object changes state over time. The same resource’s workload changes over time. The same activity can mean different things depending on phase.

Temporal Graph Networks (Rossi et al, 2020) explicitly address dynamic graphs represented as timed events by combining memory modules with graph-based operators. That is much closer to process data than a static GNN. But even a TGN trained only on next-edge or next-event prediction may still learn local dynamics rather than reusable process abstractions.

How DG-JEPA improves on GNNs: A heterogeneous or relational GNN can serve as part of DG-JEPA’s encoder stack — it has the right typed-relation substrate to combine event, object, resource, policy, and Petri-net nodes without collapsing them into a single embedding space. What DG-JEPA adds is the missing self-supervised objective: predict the latent representation of masked subgraphs from context. That objective explicitly demands learning about distant and hidden structure, sidestepping the locality and over-smoothing limits of plain message passing and shifting the model from shallow edge-type correlations toward process semantics.
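Read generically, a masked-latent objective of this kind has the following shape. This is a sketch of a JEPA-style loss in assumed notation, not the published Graph-JEPA or DG-JEPA formulation: every symbol, the choice of distance, and the stop-gradient target encoder are assumptions introduced here for illustration.

```latex
\mathcal{L}(\theta, \phi)
  = \frac{1}{|M|} \sum_{m \in M}
    d\Big( g_\phi\big(f_\theta(G_{\mathrm{ctx}}),\, p_m\big),\;
           \operatorname{sg}\big[f_{\bar{\theta}}(G_m)\big] \Big)
```

Here G_ctx is the visible context subgraph, M is the set of masked target subgraphs G_m, p_m is a positional or structural cue identifying which target to predict, f_θ is the context encoder, f_θ̄ is a target encoder whose output is treated as a constant via the stop-gradient sg, g_φ is the predictor, and d is a distance in latent space. The point of the form is that the loss compares latent vectors rather than reconstructed labels or edges.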

Graph Transformers

Graph transformers are valuable because ordinary message-passing GNNs can struggle with long-range dependencies, over-squashing, and rigid local neighborhoods. Graph transformers add global attention and structural encodings so distant but relevant nodes can interact.

Graphormer demonstrated that transformers can work well for graph representation learning when graph structure is encoded through centrality, spatial, and edge encodings (Ying et al, 2021). Similarly, GraphGPS combines positional/structural encodings, local message passing, and global attention, and addresses scalability concerns by decoupling local real-edge aggregation from global transformer attention (Rampášek et al, 2023).

This makes graph transformers useful for process graphs. A payment event may need to attend to an invoice, purchase order, goods receipt, vendor risk node, approval policy, approver workload, and Petri marking, and graph transformers are a strong candidate for propagating information across that structure. But by themselves graph transformers have no answer to the question of “what should the representation be trained to mean?”

Without the right training objective, even a plain graph transformer applied to a process graph may learn only shallow correlations: frequent labels, common paths, source-system conventions, local neighborhoods, or link plausibility. It might learn that invoice_received → approval → payment is common, but not that the middle subgraph is an approval gate that consumes a pending-review state, satisfies a policy, and enables payment.

A graph transformer is an architecture, not an objective. By itself it does not distinguish between next-event prediction, masked-node-label reconstruction, missing-edge prediction, outcome classification, future trace generation, Petri-net marking prediction, or hidden-approval-subgraph inference. Each of these objectives produces different latent spaces — and a process foundation model needs a latent space organized around process roles, state, constraints, and feasible futures, not merely graph proximity or next-event probability.

Because processes are often only partially observed, and event logs generally don’t record why a case branched, whether an event was missing, whether a path is an exception or a violation, what policies were considered, or what constraints caused a particular delay, a graph transformer trained only on event logs (i.e. process traces as they exist today) has no special reason to infer those hidden variables. It simply encodes what is present.

Without a solution like DG-JEPA, graph transformers fail as process foundation models in several ways:

  1. They overfit surface vocabulary. If the loss is next-activity prediction or masked-label reconstruction, the model is rewarded for reproducing local event names, not for learning that a subgraph is an approval gate, join, or exception handler.
  2. They learn plausibility rather than validity. A continuation can be statistically common but invalid under a Petri-net marking, policy, object lifecycle, or segregation-of-duties rule. Attention does not automatically encode enabledness, token flow, or conformance.
  3. They lack a missing-middle objective. A process foundation model should infer unobserved structure: missing approval, hidden blocker, late-arriving document, resource queue, or unobserved policy constraint.
  4. Full graph attention can be expensive on large process KGs. Scalable graph transformer work exists, but even GraphGPS frames scalability as requiring careful modular design rather than naïve all-to-all attention. DG-JEPA is compatible with this: it can train on sampled context/target subgraphs, using graph transformers where global attention is useful, without requiring every training step to attend over the entire enterprise graph.

Where graph transformers fit inside a PFM: Graph transformers are DG-JEPA’s encoders. The criticisms above are not of the architecture, but of the objectives that graph transformers are most often trained on. By pairing graph transformers with the JEPA objective — predict the latent representation of masked process subgraphs from context, optionally extended with Petri-marking, enabledness, and conformance targets — we preserve their global attention and structural-encoding power while organizing the latent space around process roles, constraints, and feasible futures rather than next-event probability or link plausibility.

Graph Embeddings

Graph embedding means mapping nodes, edges, relations, subgraphs, or whole graphs into vectors so they can be used by downstream ML systems.

Early methods such as DeepWalk (Perozzi, Al-Rfou & Skiena, 2014) and node2vec (Grover & Leskovec, 2016) learn node embeddings by sampling random walks over a graph and treating those walks like sentences; the model learns vectors that preserve graph neighborhood structure. DeepWalk explicitly treats truncated random walks as sentence-like contexts, while node2vec generalizes this with biased random walks that explore different notions of neighborhood.
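The "walks as sentences" step can be sketched as follows. The small directed graph and the helper names are hypothetical, and the sketch covers only corpus generation with uniform sampling. The skip-gram training that DeepWalk and node2vec apply to such a corpus is omitted, as are node2vec's biased transition probabilities.

```python
import random

# Hypothetical directly-follows graph over activity nodes.
GRAPH = {
    "submit": ["review"],
    "review": ["approve", "rework"],
    "rework": ["review"],
    "approve": ["notify"],
    "notify": [],
}


def random_walk(graph, start, length, rng):
    """Uniform random walk of at most `length` nodes, stopping at dead ends."""
    walk = [start]
    while len(walk) < length:
        options = graph[walk[-1]]
        if not options:
            break
        walk.append(rng.choice(options))
    return walk


def build_corpus(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` truncated walks from every node."""
    rng = random.Random(seed)
    return [
        random_walk(graph, node, length, rng)
        for _ in range(walks_per_node)
        for node in graph
    ]


corpus = build_corpus(GRAPH, walks_per_node=2, length=4)
```

Each walk is a short "sentence" of node names. An embedding trained on this corpus learns which nodes tend to appear near each other, which is a statement about co-occurrence rather than about process state.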

Other methods, such as LINE (Tang et al, 2015), embed very large information networks by preserving local and global network structure, targeting tasks like visualization, node classification, and link prediction.

The vectors that graph embeddings learn are typically able to preserve graph proximity, link plausibility, or local structural similarity.

Knowledge graph embeddings, meanwhile, focus on “typed triples”: one entity connected in some way to another entity. Models such as TransE, DistMult, ComplEx, and RotatE learn vectors for entities and relations, then score whether a triple is plausible. RotatE, for example, models relations as rotations in complex space and is designed to capture patterns such as symmetry, antisymmetry, inversion, and composition (Sun et al, 2019). ComplEx uses complex-valued embeddings to model symmetric and antisymmetric binary relations (Trouillon et al, 2016).
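As one concrete instance of triple scoring, TransE treats a relation as a translation and scores a triple by how close head + relation lands to tail. The two-dimensional vectors below are hand-picked for illustration rather than trained, and the entity and relation names are hypothetical.

```python
import math

# Hand-picked illustrative embeddings; real ones are learned from triples.
ENTITY = {
    "invoice_17": [1.0, 0.0],
    "vendor_acme": [1.0, 1.0],
    "vendor_zeta": [3.0, 2.0],
}
RELATION = {"belongs_to": [0.0, 1.0]}


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail.

    Higher (closer to zero) means the triple is considered more plausible.
    """
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


likely = transe_score("invoice_17", "belongs_to", "vendor_acme")
unlikely = transe_score("invoice_17", "belongs_to", "vendor_zeta")
```

The scoring function takes a head, a relation, and a tail, and nothing else. There is no argument for when the triple held, what preceded it, or whether a transition was enabled, which is the structural reason such scores address plausibility rather than execution validity.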

This kind of approach allows easily identifying similar nodes, predicting single missing nodes and edges, and identifying which process fragments may be near each other. Knowledge graph embeddings are additionally tempting because it is easy to represent process data in the form of triples: “event X performed by user Y”, “product A delivered to warehouse B”, or “activity 1 requires role 2”.

However, traditional knowledge graph embedding methods underrepresent a whole range of things that matter in a process-specific context, including time, execution order, branching, concurrency, quantitative duration, resource capacity, process soundness, object lifecycle state, and counterfactual interventions… to name but a few.

Graph embedding objectives are proximity- or link-oriented, while process understanding is state-, time-, and constraint-oriented. A random-walk embedding learns which nodes co-occur in graph neighborhoods. While that can be useful, graph co-occurrence is not the same as process semantics. Two approval gates in different organizations may be far apart in the graph and have different labels, but play the same latent process role. Conversely, multiple nodes may be close in the graph because they co-occur often, but one may be valid, the second a policy violation, and the third an exception path.

By themselves, graph embeddings struggle to answer important questions that a process foundation model must be able to address, such as: “What hidden process state explains this case?”, “Which transition is enabled?”, “Which object lifecycle phase are we in?”, “Was this behavior valid or merely common?”, “Which future process fragment is feasible?”, and “What intervention would change the outcome?”

Beyond knowing whether a triple is simply plausible, we require the ability to reason about whether an execution is feasible, policy-compliant, and causally coherent.

For us, some of the more interesting work in graph embeddings takes the form of models like GraphSAGE (Hamilton, Ying & Leskovec, 2018). Purely transductive methods learn vectors only for nodes seen during training, which is a poor fit for operational graphs that constantly evolve with new cases, events, documents, resources, object instances, schemas, and systems. GraphSAGE instead learns an inductive function that generates embeddings for unseen nodes from node features and sampled neighborhoods, rather than learning only per-node lookup vectors. While helpful, from a foundation-model perspective even an inductive embedding model needs to be trained to preserve the right thing: if the objective is neighborhood preservation, link prediction, or node classification, the model may learn good graph features but still fail to understand the latent process state.
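The transductive/inductive distinction can be sketched in a few lines. The aggregation below is a stripped-down stand-in for a GraphSAGE-style mean aggregator: it concatenates a node's own features with the mean of its neighbors' features and omits the learned weight matrix, nonlinearity, and neighbor sampling. All node names and feature values are hypothetical.

```python
# Transductive: one stored vector per node seen at training time.
LOOKUP = {"case_1": [0.2, 0.7], "case_2": [0.9, 0.1]}

# Raw features exist for every node, including ones that arrive later.
FEATURES = {
    "case_3": [1.0, 0.0],
    "resource_a": [0.0, 2.0],
    "policy_b": [2.0, 0.0],
}
NEIGHBORS = {"case_3": ["resource_a", "policy_b"]}


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def inductive_embed(node, features, neighbors):
    """Concatenate own features with the mean of neighbor features.

    Simplified: no learned weights, nonlinearity, or neighbor sampling.
    """
    own = features[node]
    neighborhood = mean([features[n] for n in neighbors[node]])
    return own + neighborhood


is_known = "case_3" in LOOKUP
embedding = inductive_embed("case_3", FEATURES, NEIGHBORS)
```

The lookup table has nothing to offer for a case that arrived after training, while the inductive function produces a vector from features and neighborhood alone. What that vector means still depends entirely on the objective used to train the function.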

How DG-JEPA utilizes graph embeddings: Typed, multi-temporal hgres graph structures and their embeddings (representing ontologies/types, knowledge/data, and processes/Petri nets) remain useful for candidate retrieval, nearest-neighbor search, entity resolution, schema alignment, link prediction, source-event normalization, cold-start features, fast approximate matching, process-fragment clustering, and Petri-transition candidate generation. They may also serve as initialization features inside Graph-JEPA, providing node, relation, text, temporal state, and Petri-state embeddings.

In fact, Graph-JEPA produces graph embeddings. The difference is what the embeddings are trained to mean. Ordinary graph embeddings ask whether elements of a graph can be mapped into vector spaces that preserve graph structure, while Graph-JEPA asks whether a context graph embedding can predict the latent representation of hidden process structure. This masked target can be semantically meaningful, and include formal execution semantics, encoding what the context actually implies about hidden executable process structure in our learned vector.

Generative models

A third intuition is to model the joint distribution of process traces with a generative model — diffusion, autoencoder, or flow-based. The trouble is that “what is a plausible process trace?” turns out not to be the right question for a representation we want to reuse downstream.

Diffusion

Diffusion models are excellent at generating high-dimensional samples. But a process foundation model is not just an “image generator for workflows” — it also requires the ability to reason about valid, executable, interpretable, counterfactual, and reusable process-state representations.

Diffusion models learn to denoise, but not necessarily to understand process state. A DDPM-style model gradually corrupts data and learns to reverse that corruption. This has been highly successful for image synthesis and related generative tasks. For graphs, discrete diffusion methods such as DiGress generate categorical node and edge attributes by progressively editing graphs and training a graph transformer to reverse the noising process (Vignac et al, 2023).

This is useful if the task is simply to generate a realistic-looking process graph (which may have utility in some cases). But for our purposes — training a process foundation model — we care more about whether, given partial evidence, we’re able to infer what process state, constraints, bottlenecks, and feasible interventions explain it.

Diffusion fails to account for hard global constraints. A denoising objective may spend capacity modeling surface distributions: common edges, labels, durations, and variants… but it does not necessarily force the model to learn the latent obligations or causal dependencies that matter for process intelligence.

A generated process is not valid merely because it resembles training graphs. It may also need to satisfy other properties: soundness of control flow; no dead transitions; correct AND/XOR join semantics; object lifecycle consistency; authorization rules; resource constraints; temporal constraints; data dependencies; and policy compliance.

Diffusion works through intermediate corrupted states. For images, intermediate states can be noisy but still live in a continuous space. For processes, intermediate states may be nonsensical: half an approval rule, a dangling join, an impossible object transition, or an event that updates an object before it exists.
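
To make the point about nonsensical intermediate states concrete, here is a minimal sketch. It is not DiGress or any real diffusion schedule: the "noising" step is just a swap of two events in a toy trace, and the event names and the `is_valid` check are illustrative assumptions.

```python
from itertools import combinations

# Toy trace of (action, object_id) events. Names are illustrative only.
trace = [("create", "inv1"), ("update", "inv1"), ("approve", "inv1")]

def is_valid(events):
    """Return False if any object is touched before its 'create' event."""
    existing = set()
    for action, obj in events:
        if action == "create":
            existing.add(obj)
        elif obj not in existing:
            return False
    return True

def swap(events, i, j):
    """One crude discrete corruption step: exchange two positions."""
    noisy = list(events)
    noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy

# Enumerate every single-swap corruption of the clean trace.
results = {
    (i, j): is_valid(swap(trace, i, j))
    for i, j in combinations(range(len(trace)), 2)
}
invalid = [pair for pair, ok in results.items() if not ok]
print(invalid)  # [(0, 1), (0, 2)]
```

Even in this three-event toy, two of the three possible swaps produce a trace in which an object is updated or approved before it exists, which is the kind of intermediate state a denoiser would have to learn to pass through.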

A graph diffusion model can add constraints or repair steps, but those repairs become a separate symbolic or search problem. Methods like DiGress improve discrete graph generation using marginal-preserving noise and graph-theoretic features, but process validity is not just graph-statistical validity; it is domain, object, policy, and execution validity.

Diffusion models carry impractical sampling costs. Many process-model use cases are “online”: they may score a running case, retrieve similar variants, detect an anomaly, estimate remaining time, recommend a next action, or explain a bottleneck cause. While a JEPA encoder can produce an embedding in a single forward pass, diffusion models typically require iterative denoising — or distillation — to get fast sampling. That is less attractive when the primary need is representation, retrieval, monitoring, or decision support.

Where diffusion might be useful inside a PFM: Diffusion methods may still play a role downstream of DG-JEPA — for example, as conditional generators producing candidate future continuations of a case given its current DG-JEPA latent state, with Petri-net constraints and conformance checks acting as a validity filter. The generative model proposes; the symbolic layer disposes. What we avoid is using diffusion as the primary representation-learning objective, since denoising surface graph structure is not what we want our latent space to mean.

Variational and graph autoencoders

Like JEPA, autoencoders learn latent variables — but they target reconstruction objectives that are imperfectly fit to our process modeling challenge:

Variational autoencoders (VAEs) learn latent variables by optimizing a variational objective and reconstructing data from a latent representation. In other words, they train the latent space to preserve the information needed to reconstruct/generate an observed input x under a likelihood model. This effectively optimizes for reproducing logs as observed, including label quirks, timestamps, source-system artifacts, frequent boilerplate events, and organization-specific namings.

Variational graph autoencoders (VGAEs) extend this idea to graph-structured data, learning latent node embeddings useful for reconstructing graph links, often via a simple link decoder (Kipf & Welling, 2016). While valuable for link prediction tasks, this is a poor default objective for process graphs. An edge in a process graph is not merely a missing citation link or social tie (as it may be in a knowledge graph), but represents critical information such as “event consumes token from place,” “activity updates object state,” “policy constrains transition,” “actor performed step,” “document satisfies precondition,” or “object lifecycle synchronizes with another object.” The relevant question is rarely just “should this edge exist?”, but “what process state made a transition enabled, valid, delayed, exceptional, or impossible?”

Masked graph autoencoders (MGAEs) such as GraphMAE improve graph self-supervised learning by masking and reconstructing node features — better than full reconstruction, but still overindexing on raw feature identity rather than latent process role (Hou et al, 2022). For processes, the raw masked feature is often the wrong supervisory target. Reconstructing APPR_INV_L2, approve_invoice, manager signoff, or VendorPaymentAuthorization teaches the model to recover local vocabulary, but does not necessarily teach the model that all of these may instantiate the same latent role: an approval gate that enables downstream payment under a policy constraint.

GraphMAE2’s authors themselves identify a related limitation: masked feature reconstruction depends on the discriminability of input features and can be vulnerable to feature disturbance — which is one reason they add latent representation prediction as a regularizer (Hou et al, 2023). That limitation is especially severe in process data, where feature identity is often unstable, under-specified, or semantically misleading. A node labeled “review” might mean fraud review, manager approval, clinical triage, quality assurance, compliance audit, or exception handling, depending on the surrounding graph.

How DG-JEPA solves for this: Instead of asking the model to reconstruct the visible attributes of a masked node or edge, Graph-JEPA predicts the embedding of a masked subgraph, given the embedding of its context. It was explicitly proposed for predicting latent representations of masked subgraphs from context subgraphs, and has been shown to learn semantic graph-level representations (Skenderi et al, 2025). This follows the broader JEPA principle set out in prior work such as I-JEPA: predict target representations rather than reconstruct raw observations, with masking designed to force semantic rather than pixel-level prediction (Assran et al, 2023). For a process foundation model, that difference is crucial. The target can be an entire hidden process fragment, not a feature vector, and the model can infer the latent role of the hidden structure from context. For example, given an invoice, amount, requester, goods receipt state, policy threshold, actor history, and current Petri marking, the model should learn to predict the latent representation of “high-value approval gate before payment release.” It should not merely learn to reconstruct the string APPR_INV_L2.
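
The shape of a JEPA-style objective can be sketched in a few lines. This is a generic illustration under strong assumptions, not the Graph-JEPA or DG-JEPA implementation: the encoders are single linear maps with mean pooling, the loss is a plain squared error in latent space, and all names (`W_context`, `W_target`, `W_pred`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_lat = 8, 4

# Hypothetical node features for a visible context subgraph and a masked target subgraph.
context_nodes = rng.normal(size=(5, d_in))
target_nodes = rng.normal(size=(3, d_in))

W_context = rng.normal(scale=0.1, size=(d_in, d_lat))  # context encoder (trained)
W_target = W_context.copy()                            # target encoder (slow-moving copy)
W_pred = np.eye(d_lat)                                 # predictor (trained)

def encode(nodes, W):
    """Mean-pool linearly projected node features into one subgraph embedding."""
    return (nodes @ W).mean(axis=0)

def jepa_loss(context_nodes, target_nodes):
    """Squared error between the predicted and actual target embedding."""
    z_context = encode(context_nodes, W_context)
    z_target = encode(target_nodes, W_target)  # treated as a constant during training
    z_predicted = z_context @ W_pred
    return float(np.mean((z_predicted - z_target) ** 2))

def ema_update(W_target, W_context, momentum=0.99):
    """Let the target encoder slowly track the context encoder."""
    return momentum * W_target + (1.0 - momentum) * W_context

loss = jepa_loss(context_nodes, target_nodes)
W_target = ema_update(W_target, W_context)
print(f"latent prediction loss: {loss:.4f}")
```

The point to notice is that nothing in the loss refers to raw labels such as APPR_INV_L2: the supervisory signal is the target encoder's embedding of the hidden fragment.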

Neural ODEs and normalizing flows

Neural ordinary differential equations (ODEs) and normalizing flows deal with continuous transformations of continuous states. In contrast, a process foundation model must represent: discontinuous, partially observed, graph-structured discrete events; typed objects; branching; joins; concurrency; missing observations; policies; and executable constraints.

Fundamentally, that means ODEs/flows may be useful inside a process model, but are unlikely to be the right foundation-model core.

Neural ODEs are good for smooth continuous dynamics. While normal neural networks have discrete layers, neural ODEs replace discrete layer updates with continuous dynamics. This is attractive when the thing being modeled really evolves smoothly through time: physical systems, biological signals, continuous sensor streams, latent trajectories, or irregularly sampled time series may all be examples of this. In process modeling, neural ODEs could help with things like queue pressure over time, resource load, SLA risk accumulation, continuous patient vitals, machine/asset degradation, and waiting-time dynamics. For example, after an approval request, the “risk of SLA breach” may increase continuously as time passes without a response. But it is not the whole process:

  • Processes typically consist of discrete events. For example, an invoice is received, nothing happens for several days, it is reviewed and either approved or rejected, rework may or may not be required, following which an invoice may be resubmitted and/or sent on for payment. This is not a smooth motion, but a series of discrete jumps between typed states. Process states often change abruptly, or are binary: from “pending approval” to “approved”, from “non-compliant” to “compliant”, from “case open” to “case closed”, and so on.
  • Pure Neural ODEs struggle with branching. From the same apparent state, many futures may be possible (for an invoice we might consider approval, rejection, escalation, timeout, manual override, or withdrawal). ODEs provide deterministic trajectories — unless extended with latent variables or stochastic dynamics — but process futures are often not just noisy versions of one trajectory; rather, they are discrete alternatives caused by policies, actors, missing documents, object states, or hidden decisions. Emerging proposals such as Augmented Neural ODEs explore adding extra dimensions to state, potentially accommodating merging and splitting trajectories (Dupont, Doucet & Teh, 2019).
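
The contrast between smooth accumulation and discrete jumps can be sketched with a hand-written Euler integrator. This is not a neural ODE: the dynamics dr/dt = rate * (1 - r), the reset-on-approval rule, and all parameter values are assumptions chosen for illustration.

```python
def simulate_sla_risk(approval_times, horizon, dt=0.5, rate=0.1):
    """Euler-integrate dr/dt = rate * (1 - r); reset r to 0 at each approval."""
    remaining = sorted(approval_times)
    risk, path = 0.0, []
    for k in range(1, int(round(horizon / dt)) + 1):
        t = k * dt
        risk += dt * rate * (1.0 - risk)        # smooth accumulation between events
        while remaining and remaining[0] <= t:  # discrete event(s) landing in this step
            remaining.pop(0)
            risk = 0.0                          # abrupt jump, not a smooth change
        path.append((t, risk))
    return path

path = simulate_sla_risk(approval_times=[5.0], horizon=10.0)
for t, risk in path:
    if t in (4.5, 5.0, 5.5):
        print(f"t={t:>4}: risk={risk:.3f}")
```

The continuous part is the piece an ODE-style component could plausibly own. The jump at the approval event has to come from somewhere else: an event model, a symbolic layer, or a learned discrete state.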

Normalizing flows are generative models that start with a simple distribution (usually a Gaussian), to which a sequence of invertible transformations is applied, yielding an increasingly complex distribution (Rezende & Mohamed, 2016). Because these transformations are invertible, if the model maps z to x, it can also map x back to z. That invertibility lets the model compute exact likelihoods using the change-of-variables formula, supporting efficient and exact sampling and density evaluation (Kobyzev, Prince & Brubaker, 2021). For process modeling, normalizing flows could help with narrow subproblems like duration, waiting-time and resource-load distributions; continuous risk variables; and likelihood-based numeric anomaly detection. Like neural ODEs, though, normalizing flows are a poor fit for modeling processes overall:

  • Processes are often not naturally invertible. A process often deliberately loses information. Many different histories can lead to the same process state, and if a model is required to be invertible, it must preserve enough information to reconstruct which path occurred. For many downstream process tasks, that information may be irrelevant — process abstractions will typically collapse histories into the same latent role (e.g. “payment is now validly enabled”) rather than preserving every incidental route by which that became true.
  • One current state may have many possible futures. A flow maps one latent point to one data point through a bijection. It can represent a distribution, but the structural semantics of branching are not first-class. In processes, branches are not just distributional modes; they are often governed by typed conditions: for example, amount > threshold, document missing, risk flag present, actor lacks authorization, resource overloaded, or policy exception granted. A process foundation model needs to know why a branch is possible, not merely assign probability mass to an array of outcomes.
  • Process graphs are variable-sized and heterogeneous. Flows, on the other hand, are most natural for fixed-dimensional continuous vectors. While a process graph can be embedded into a fixed vector and a flow run over it, the flow is then modeling the embedding distribution as opposed to the process structure itself — with graph semantics already compressed away by another model.
  • Finally, common ≠ valid. For something like deviation and anomaly detection, exact likelihoods sound attractive. But “low probability = bad” is not necessarily a safe assumption. A rare event can be valid and important: emergency override, safety escalation, or fraud-prevention hold. Meanwhile, frequent events like missing documentation or missing authorization may be invalid. Likelihood-based models risk learning only what is common, as opposed to valid, without a natural ability to encode the rules, markings, enabledness, object states, and policies that determine validity.
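
For readers who have not met the change-of-variables formula, the simplest possible flow shows how exact likelihoods fall out of invertibility. For an invertible map x = f(z), log p_X(x) = log p_Z(f^{-1}(x)) + log |d f^{-1}(x) / dx|. The sketch below uses a single one-dimensional affine transformation with a standard normal base; the class name and the parameter values are hypothetical.

```python
import math

def standard_normal_logpdf(z):
    """Log density of a standard normal at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

class AffineFlow:
    """One-dimensional flow x = scale * z + shift, invertible when scale != 0."""

    def __init__(self, scale, shift):
        if scale == 0:
            raise ValueError("scale must be non-zero for the map to be invertible")
        self.scale, self.shift = scale, shift

    def forward(self, z):
        return self.scale * z + self.shift

    def inverse(self, x):
        return (x - self.shift) / self.scale

    def log_prob(self, x):
        # Change of variables: base log density at f^{-1}(x), plus log |d f^{-1} / dx|.
        z = self.inverse(x)
        return standard_normal_logpdf(z) - math.log(abs(self.scale))

# Example: a waiting time (in hours) modeled as 3 + 2 * z.
flow = AffineFlow(scale=2.0, shift=3.0)
print(flow.log_prob(3.0))
```

Every quantity here is a fixed-size continuous number, which is exactly the setting in which flows are comfortable. Nothing in the formula knows about markings, enabledness, or policy, which is why a low log-likelihood cannot by itself be read as "invalid".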

How DG-JEPA might be leveraged: Flows or neural ODEs may play a useful role inside the model, for instance by modeling continuous waiting-time distributions conditioned on a Graph-JEPA process state, such as P(waiting time until approval | current process state).
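
As a rough sketch of what "conditioned on process state" could mean, the example below maps a latent state vector to the rate of an exponential waiting-time distribution. The exponential form stands in for a conditional flow or neural ODE purely for brevity, and the two-dimensional state, the weights, and the function names are all made up.

```python
import math

def softplus(x):
    """Smooth map from any real number to a positive one."""
    return math.log1p(math.exp(x))

def approval_rate(state, weights, bias):
    """Map a latent state vector to a positive hazard rate (approvals per day)."""
    return softplus(sum(w * s for w, s in zip(weights, state)) + bias)

def prob_still_waiting(t_days, state, weights, bias):
    """P(waiting time > t | state) under an exponential waiting-time model."""
    return math.exp(-approval_rate(state, weights, bias) * t_days)

# Hypothetical parameters: both state dimensions slow approvals down.
weights, bias = [-1.5, -1.0], 0.0
idle_state = [0.0, 0.0]
blocked_state = [1.0, 1.0]  # e.g. approver overloaded, document missing

print(prob_still_waiting(2.0, idle_state, weights, bias))     # about 0.25
print(prob_still_waiting(2.0, blocked_state, weights, bias))  # noticeably higher
```

The density model only handles the continuous quantity. What the state vector means, and why it takes the value it does, is left to the representation-learning core.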

Other paradigms

A final set of approaches sits outside the families above — using tabular learning, contrastive objectives, classical symbolic mining, or reinforcement learning. Each has a productive role in a PFM; none is the PFM itself.

Tabular Foundation Models

Tabular foundation models (TFMs) may be useful for local decision points or structured attributes, but they are not a sufficient substrate for a process foundation model. Berti & van der Aalst (2025) contrast event logs with plain tables: event logs are ordered event sequences with timestamps and attributes, and process models need to respect ordering, timing, and local activity vocabularies.

Many process-mining baselines used for narrow prediction flatten prefixes into rows — for example, Case ID, Last Activity, Elapsed Time, Amount, Resource, and Next Activity — but this flattening destroys sequence, graph, object, and constraint structure, which is at odds with our aim of developing a foundation model.
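
A small example shows what gets lost. The sketch below flattens two hypothetical histories of the same invoice into rows using the columns listed above (minus the prediction target). The activity names, timestamps, and the `flatten_prefix` helper are illustrative assumptions.

```python
from datetime import datetime, timedelta

def flatten_prefix(case_id, events, amount):
    """Flatten an event prefix into one row, as many narrow baselines do."""
    first, last = events[0], events[-1]
    return {
        "case_id": case_id,
        "last_activity": last["activity"],
        "elapsed_hours": (last["time"] - first["time"]).total_seconds() / 3600,
        "amount": amount,
        "resource": last["resource"],
    }

t0 = datetime(2026, 1, 5, 9, 0)

def ev(activity, hours, resource):
    """Build one event occurring `hours` after the start of the case."""
    return {"activity": activity, "time": t0 + timedelta(hours=hours), "resource": resource}

straight_through = [
    ev("receive_invoice", 0, "erp"),
    ev("match_goods_receipt", 2, "erp"),
    ev("approve_invoice", 48, "manager_1"),
]
with_rework = [
    ev("receive_invoice", 0, "erp"),
    ev("reject_invoice", 4, "manager_1"),
    ev("resubmit_invoice", 30, "vendor"),
    ev("approve_invoice", 48, "manager_1"),
]

row_a = flatten_prefix("case_17", straight_through, 1200.0)
row_b = flatten_prefix("case_17", with_rework, 1200.0)
print(row_a == row_b)  # True
```

One history includes a rejection and a rework loop and never records a goods-receipt match; the other is a clean straight-through case. After flattening, a tabular model cannot tell them apart.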

Where TFMs fit inside a PFM: TFMs are a reasonable choice for encoding structured attributes attached to graph nodes — per-case features, per-vendor features, per-document metadata, normalized prefix summaries. Their representations become node-feature inputs to DG-JEPA. The crucial distinction is that the tabular foundation model handles each row’s features, while DG-JEPA handles the process state that connects them.

Contrastive Learning

Contrastive learning trains an encoder by pulling together representations of “positive” pairs and pushing apart “negative” pairs. It learns mainly by defining what should be considered the same and what should be considered different.
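
A common way to implement this pull-together, push-apart idea is an InfoNCE-style loss. The sketch below is a minimal NumPy version with one positive per anchor and the rest of the batch as negatives; the temperature value and the one-hot toy embeddings are assumptions for illustration.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `positives` is the positive for row i of
    `anchors`; every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

embeddings = np.eye(4)                       # four mutually orthogonal toy embeddings
aligned = embeddings.copy()                  # each positive matches its anchor
mismatched = np.roll(embeddings, 1, axis=0)  # each "positive" belongs to another anchor

print(info_nce(embeddings, aligned))     # close to 0
print(info_nce(embeddings, mismatched))  # close to 10
```

Everything the loss learns is determined by which pairs are declared positive. That is the design decision the rest of this section is concerned with.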

Contrastive graph learning can learn useful representations by making augmented views of the same graph close, and different graphs far apart. GraphCL, for example, studies graph augmentations for unsupervised graph representation learning and reports strong transfer and robustness (You et al, 2021). For some kinds of data, including certain types of graphs, this approach may be tenable.

In images, where cropping, color jitter, or blur often preserves object identity, augmentation-based contrastive learning can deliver high performance with relatively little labeled data (Chen et al, 2020).

But these kinds of augmentations, in the context of processes, carry dangers. In a process graph, dropping an edge or node can change meaning completely. Removal of a resource node may make a segregation-of-duties check impossible. Dropping an approval edge may render a process inescapably non-compliant. And eliminating a join edge may result in a process becoming unsound.
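
The segregation-of-duties case is easy to reproduce. In the sketch below, a tiny process graph is stored as a set of (source, relation, target) triples, and a standard node-dropping augmentation is applied. The triples, relation names, and helper functions are hypothetical.

```python
edges = {
    ("request_payment", "performed_by", "alice"),
    ("approve_payment", "performed_by", "bob"),
    ("request_payment", "followed_by", "approve_payment"),
}

def performer(edges, activity):
    """Return who performed an activity, or None if that edge is absent."""
    for src, rel, dst in edges:
        if src == activity and rel == "performed_by":
            return dst
    return None

def segregation_of_duties(edges):
    """True or False when checkable; None when the evidence is missing."""
    requester = performer(edges, "request_payment")
    approver = performer(edges, "approve_payment")
    if requester is None or approver is None:
        return None
    return requester != approver

def drop_node(edges, node):
    """A typical 'node dropping' augmentation: remove the node and its edges."""
    return {e for e in edges if node not in (e[0], e[2])}

print(segregation_of_duties(edges))                    # True
print(segregation_of_duties(drop_node(edges, "bob")))  # None
```

A contrastive objective that treats the original and the augmented graph as a positive pair would be teaching the encoder that a verifiably compliant case and an unverifiable one are the same thing.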

Where invariances are obvious, contrastive learning is powerful. Where invariances are the problem, so is contrastive learning — requiring domain-specific augmentations, with the risk that bad augmentations not only introduce training noise, but may also teach incorrect invariances altogether. Processes, in the face of augmentation, are essentially fragile compared to images.

Why DG-JEPA is preferable: A Graph-JEPA-based approach does not require inventing positive and negative augmented pairs. It asks the model to predict masked latent targets from context, more directly aligned with process understanding: “given what I can see, what hidden process structure should be there?”

Classical Process Mining

To be clear, as with other architectures explored in this post, we think symbolic process mining is a valuable tool. That is — in fact — why we’ve spent a great deal of time working on Petrinaut, our open-source process modeling IDE and simulation engine, as well as integrations in HASH, the HASH browser extension, and the HASH desktop daemon, which all serve to collect entity and event information from which processes can be mined.

Symbolic process mining has something neural methods often lack: explicit semantics. Formalisms like Petri nets (the basis of Petrinaut) can represent soundness, reachability, alignments, deviations, and compliance — allowing us to formally verify various properties of processes represented as “nets” without having to actually run them (for example, ensuring that they are free from livelocks or deadlocks, or guaranteeing liveness). Conformance checking allows us to directly compare event logs to our nets, enabling easy detection of drift between observed and modeled behaviors.
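
The kind of check described above can be sketched directly. The code below is a toy Petri net interpreter, not Petrinaut's API: transitions consume one token from each input place and produce one in each output place, and a breadth-first search over reachable markings reports deadlocks. The net, place names, and function names are illustrative assumptions.

```python
from collections import deque

# Transition name -> (input places, output places); one token per arc.
NET = {
    "receive": ({"start"}, {"received"}),
    "approve": ({"received"}, {"approved"}),
    "pay": ({"approved"}, {"paid"}),
}

def enabled(net, marking):
    """Transitions whose input places all hold at least one token."""
    return [t for t, (pre, _) in net.items() if all(marking.get(p, 0) > 0 for p in pre)]

def fire(net, marking, transition):
    """Return the marking reached by firing an enabled transition."""
    pre, post = net[transition]
    new = dict(marking)
    for p in pre:
        new[p] -= 1
    for p in post:
        new[p] = new.get(p, 0) + 1
    return {p: n for p, n in new.items() if n > 0}

def find_deadlocks(net, initial, final, limit=10_000):
    """Explore reachable markings; a deadlock is a non-final dead marking."""
    seen, queue, deadlocks = set(), deque([initial]), []
    while queue and len(seen) < limit:
        marking = queue.popleft()
        key = frozenset(marking.items())
        if key in seen:
            continue
        seen.add(key)
        candidates = enabled(net, marking)
        if not candidates and marking != final:
            deadlocks.append(marking)
        queue.extend(fire(net, marking, t) for t in candidates)
    return deadlocks

# A variant whose AND-join waits on a place that nothing ever fills.
BROKEN = dict(NET, pay=({"approved", "goods_received"}, {"paid"}))

print(find_deadlocks(NET, {"start": 1}, {"paid": 1}))     # []
print(find_deadlocks(BROKEN, {"start": 1}, {"paid": 1}))  # [{'approved': 1}]
```

No event log is needed to find the problem in the second net: the defect is a property of the model itself, which is what makes this layer a useful validity filter for anything a learned component proposes.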

However, symbolic representations of processes are brittle. Real logs are noisy, incomplete, sometimes too coarse, other times overly granular, and full of chaotic activities. Work on hybrid process models argues that discovered process models can be formal or informal, and proposes mixing formal and informal elements when evidence is incomplete or standard constructs do not fit (van der Aalst et al, 2017). Other process-discovery work shows that chaotic activities can heavily impact discovered model quality, and that simple frequency-based filtering does not solve the problem (Tax, Sidorova & van der Aalst, 2018).

So symbolic process mining is probably not enough for a foundation model. But it is extremely valuable as a companion to other methods. Some readers will recognize this as a flavor of the “neuro-symbolic” tradition; we tend to avoid that label, in favor of being specific about which symbolic layer (Petri nets, conformance, soundness) and which learned layer (JEPA-style latent prediction on typed temporal graphs) we mean.

How DG-JEPA complements symbolic process mining: Graph-JEPA learns representations, infers missing state, and generalizes across messy logs, while the symbolic layer enforces constraints, checks soundness, explains violations, and validates generated variants. Furthermore, because Graph-JEPA learns to predict embeddings of masked subgraphs, it may effectively be made Petri-aware. The target encoder can embed not only raw event nodes, but also symbolic process semantics: pre-marking, transition fired, tokens consumed, tokens produced, enabledness, alignment quality, and conformance state. The context encoder must then predict this latent formal structure from partial evidence. That directly trains the model to infer executable process state, while VAE/VGAE/MGAE objectives mainly train it to reproduce observed graph features or links.

Reinforcement Learning

In reinforcement learning (RL), an agent observes a state, chooses an action, receives a reward, and moves to a new state. The usual goal is to learn a policy that maximizes expected cumulative reward over time, directly answering the question “What should we do?”

RL is hugely relevant because process improvement is ultimately an intervention problem: given the current state of a case, resource, queue, policy, or system, what should we do next? Escalate, reroute, automate, request a document, add capacity, change a threshold, trigger a reminder, or wait?

Sequential decision-making is commonly formalized as Markov Decision Process (MDP) optimization, and RL consists of a whole family of methods for learning policies when the dynamics or rewards are not fully known in advance. Model-based RL surveys describe this setup as the integration of learning and planning: first learn or use a model of environment dynamics, then use it to plan or improve policy behavior (Moerland et al, 2022).

The problem with RL is that, in real-world processes, the current process “state”, the intervention or “action” taken, and the corresponding impact or “reward”, are all rarely cleanly observed.

Most enterprise process data is observational and not experimental (with exceptions around the data stored in things such as Electronic Lab Notebooks). Enterprise event logs, where they exist, only record what people and systems did under historical policies, and do not record what would have happened under unchosen actions. This causes classic off-policy and confounding problems — for example, escalated cases may have worse outcomes not because escalation harms outcomes, but because only difficult cases are escalated in the first place.
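
The escalation example can be reproduced with a few synthetic counts. The numbers below are invented purely to illustrate the confounding pattern; they are not drawn from any real log.

```python
# (difficulty, escalated) -> (number of cases, number resolved on time).
# Synthetic counts: hard cases are escalated far more often than easy ones.
counts = {
    ("easy", False): (900, 810),
    ("easy", True): (100, 95),
    ("hard", False): (100, 30),
    ("hard", True): (900, 450),
}

def success_rate(escalated, difficulty=None):
    """On-time rate for escalated or non-escalated cases, optionally per stratum."""
    rows = [
        value
        for (d, e), value in counts.items()
        if e == escalated and difficulty in (None, d)
    ]
    total = sum(n for n, _ in rows)
    successes = sum(s for _, s in rows)
    return successes / total

print("overall:", success_rate(True), "vs", success_rate(False))
for difficulty in ("easy", "hard"):
    print(difficulty, success_rate(True, difficulty), "vs", success_rate(False, difficulty))
```

Pooled over all cases, escalated cases do worse (0.545 against 0.84). Within each difficulty stratum, escalation does better (0.95 against 0.9 for easy cases, 0.5 against 0.3 for hard ones). A policy learned naively from the pooled log would draw the wrong conclusion, and difficulty is exactly the kind of variable that is usually latent.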

RL assumes agents act from a given state. But in real process data, the complete relevant state is almost always only partially observed. A log may show approval_requested, reminder_sent, or approval_completed, but not directly show that an approver was overloaded, a document was missing, a policy threshold required second approval, a case was waiting on an external response, a manager was unavailable, or a risk flag changed the path. Processes are therefore better modeled as partially observable Markov decision processes (POMDPs), a framework for decision-making in which the true state is not directly observed and the agent must act under uncertainty from observations or belief states (Lauri, Hsu & Pajarinen, 2023).
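
A belief state is just a probability distribution over hidden states that gets updated as observations arrive. The sketch below runs a textbook discrete Bayes filter over three hypothetical hidden states using the log events named above. The transition and observation probabilities are invented for illustration.

```python
STATES = ["normal", "approver_overloaded", "document_missing"]

# T[s][s2]: probability of moving from hidden state s to s2 (made-up values).
T = {
    "normal": {"normal": 0.8, "approver_overloaded": 0.1, "document_missing": 0.1},
    "approver_overloaded": {"normal": 0.2, "approver_overloaded": 0.8, "document_missing": 0.0},
    "document_missing": {"normal": 0.2, "approver_overloaded": 0.0, "document_missing": 0.8},
}

# O[s][o]: probability of logging observation o while in hidden state s.
O = {
    "normal": {"approval_completed": 0.7, "reminder_sent": 0.3},
    "approver_overloaded": {"approval_completed": 0.1, "reminder_sent": 0.9},
    "document_missing": {"approval_completed": 0.05, "reminder_sent": 0.95},
}

def update(belief, observation):
    """One Bayes-filter step: predict with T, correct with O, renormalize."""
    predicted = {s2: sum(belief[s1] * T[s1][s2] for s1 in STATES) for s2 in STATES}
    unnormalized = {s: O[s][observation] * predicted[s] for s in STATES}
    total = sum(unnormalized.values())
    return {s: value / total for s, value in unnormalized.items()}

belief = {"normal": 0.8, "approver_overloaded": 0.1, "document_missing": 0.1}
for observation in ["reminder_sent", "reminder_sent"]:
    belief = update(belief, observation)
    print({s: round(p, 3) for s, p in belief.items()})
```

Here the hidden states and both probability tables are written by hand. In a real process they are unknown, which is why a learned latent state is needed before a belief-based policy becomes practical.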

Process rewards are delayed, sparse, and multi-objective. Real-world process outcomes rarely map to single clean rewards, instead combining everything from monetary cost, quality, risk, throughput, customer satisfaction, time to completion, employee workload, compliance, fairness, safety, and auditability, through to anything else an actor may conceivably wish to optimize for. Many rewards are additionally delayed, as actions may have long-term downstream effects. A bad routing decision today may result in a compliance problem only weeks, months or years later. Staffing changes may improve SLAs but result in higher employee attrition or team burnout. Automations may reduce upfront cost and improve throughput but ultimately increase exception risk, result in brand damage, or incur monetary penalties. Because of this, RL faces a reward-specification problem. Optimizing the wrong reward can produce superficially good but in reality undesirable policies.

RL must account for hard constraints, not only soft rewards. RL treats bad outcomes as negative rewards. But many process constraints should not be “discouraged”; they should be impossible. For example, shipping early may make sense from a cost and time perspective, but for safety and compliance reasons a biopharmaceutical company should never find itself in a position where an RL algorithm recommends shipping a product to customers before quality release has been completed. RL algorithms may choose unsafe actions if reward models are wrong, penalties are too small, or situations are out-of-distribution. While there are solutions to this — pioneered in domains such as energy grids, self-driving cars, and air traffic control — they all involve trade-offs.

How DG-JEPA enables RL inside a PFM: Dynamic Graph-JEPA does not solve causal identification by itself, but it gives RL three things that are otherwise hard to obtain from observational process data. First, a usable belief state for the partially-observed process — latent case complexity, blockedness, resource pressure, policy regime, missing-document state — acting as a world-model pretraining layer that can be combined with proposed actions to predict future subgraphs and turning the problem into a tractable POMDP. Second, a reward-free pre-training signal: DG-JEPA learns from the graph itself, so downstream RL modules are free to optimize different reward functions on the same latent state without re-learning what “process state” means, sidestepping the reward-specification problem. Third, a clean composition with the rest of the hybrid PFM architecture: Petri nets carve out the set of symbolically feasible actions before RL ever chooses among them. Graph-JEPA learns the state; Petri nets define what’s allowed; RL selects the policy.
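
The difference between a soft penalty and a hard feasibility filter is easy to see in code. The sketch below is a minimal illustration of action masking, assuming the set of feasible actions has already been computed by a symbolic layer. The action names, values, and penalty are hypothetical.

```python
def choose_with_mask(action_values, feasible):
    """Pick the highest-value action among those the symbolic layer allows."""
    allowed = {a: v for a, v in action_values.items() if a in feasible}
    if not allowed:
        raise ValueError("no feasible action available")
    return max(allowed, key=allowed.get)

def choose_with_penalty(action_values, feasible, penalty):
    """Soft alternative: subtract a penalty from infeasible actions."""
    adjusted = {
        a: v - (0.0 if a in feasible else penalty)
        for a, v in action_values.items()
    }
    return max(adjusted, key=adjusted.get)

# Hypothetical learned action values for a batch awaiting quality release.
action_values = {"ship_product": 9.0, "wait_for_quality_release": 4.0, "escalate": 3.5}
# Hypothetical set of symbolically enabled actions, e.g. from a Petri-net marking.
feasible = {"wait_for_quality_release", "escalate"}

print(choose_with_mask(action_values, feasible))                  # wait_for_quality_release
print(choose_with_penalty(action_values, feasible, penalty=2.0))  # ship_product
```

With a mask, the forbidden action can never be selected, however attractive its estimated value. With a penalty, safety depends on the penalty having been set large enough in advance.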

Summary

Each architecture above gets part of the job right, and another part wrong:

Architecture | Why it fails as a PFM core | Where it fits in a hybrid PFM
Transformers | Force serialization; reward surface predictability; weakly typed. | Feature extractors for text labels, comments, documents, policies; in-context adapters.
Recurrent and state-space models | Compress process into a single hidden-state stream — a bottleneck for multi-object, branching, concurrent workflows. | Per-entity timeline encoding (cases, objects, resources, sensors) fed into the graph encoder.
Time-Series Foundation Models | Miss the symbolic and relational structure that makes a process a process. | Per-entity time-series features (durations, queues, SLA risk, seasonal demand) attached to graph nodes.
Graph Neural Nets | Local message passing; over-smoothing; need the right pre-training objective. | Heterogeneous/relational GNNs inside DG-JEPA’s encoder stack.
Graph Transformers | Architecture without an objective; overfit surface vocabulary; learn plausibility, not validity. | The actual DG-JEPA encoders — paired with the JEPA objective.
Graph Embeddings | Proximity-/link-oriented; process understanding is state-, time-, and constraint-oriented. | Candidate retrieval, entity resolution, schema alignment, initialization features.
Diffusion | Denoise surface structure ≠ understand latent state; expensive online; miss hard constraints. | Conditional generators for candidate futures, with Petri-net validity filters.
Variational and graph autoencoders | Reconstruct visible attributes rather than latent process role. | Largely superseded by JEPA-style objectives in our setup.
Neural ODEs and normalizing flows | Built for smooth, continuous, invertible dynamics; processes are discrete, branching, non-invertible. | Continuous waiting-time, risk-accumulation, and density-modeling subproblems conditioned on DG-JEPA state.
Tabular Foundation Models | Flattening prefixes destroys sequence, graph, object, and constraint structure. | Encoding structured attributes attached to graph nodes.
Contrastive Learning | Augmentations that preserve image identity destroy process meaning. | Largely superseded by JEPA-style objectives in our setup.
Classical Process Mining | Brittle against noisy, chaotic, incomplete real-world logs. | Petri-net constraints, conformance checking, formal verification of generated and observed behaviour.
Reinforcement Learning | Observational data; partial state; sparse, multi-objective, delayed rewards; hard constraints. | Downstream intervention search over the DG-JEPA latent state, gated by Petri-net feasibility.

Across all of these: each architecture captures one slice of what a process foundation model needs, and gets another slice dangerously wrong. Sequence models capture time but lose structure; graph models capture structure but lack the right objective; generative models reconstruct surfaces rather than infer latent state; symbolic methods enforce validity but break on noise; RL acts on a state it cannot observe. What is missing is not another architecture, but a pre-training objective designed for partially-observed, typed, temporal graph state — one that lets each of these architectures do what it is good at, and stop being asked to do what it is not.

Why Dynamic Graph-JEPA?

We’ve previously written about Graph-JEPA in more detail in Graph-Based World Models, and about its "dynamic" temporal extension in Towards Process Foundation Models.

At a very high level, Graph-JEPA-based architectures are well-fit to processes because:

  • Exact labels are often unstable. One company says “approve invoice”, another uses the term “AP approval”, and a third might call it “vendor payment authorization”. Even within companies, different people may use different terms — or at different times the same person might use more than one of these terms interchangeably. Token-level objectives may overfit these strings, while a Graph-JEPA objective instead learns that a certain hidden subgraph plays the role of an approval gate, escalation path, reconciliation step, exception handler, or handoff, even when the labels differ.
  • Processes have missing middle structure. In real logs, you often observe events but not the latent state: why a case branched, which dependency blocked progress, what unobserved document or policy mattered, what resource constraint caused delay. JEPA-style models trained to infer masked or future subgraphs can learn compressed latent variables that explain observed behavior (much closer to “understanding a process” than generating next-event tokens).

But, to be clear, our proposed solution is not “just Graph-JEPA”.

DG-JEPA extends standard Graph-JEPA, but even then what it ultimately does is provide a process-state foundation-model objective. This can then be composed alongside several of the model architectures explored and critiqued — but not in fact dismissed — above. This hybrid architecture may ultimately combine graph transformers (or GNNs) for relational structure; sequential transformers for language, long context, and in-context support examples; state space models for long event streams and continuous dynamics; symbolic process mining (e.g. Petri nets) to supply validity, conformance, and explainability; and RL to reason about interventions once good state variables exist.

These different model families may interact through an enriched process graph and the DG-JEPA latent state:

  1. Transformers enter early, converting ambiguous labels, schemas, policy text, documents, and free text into semantic features attached to graph nodes and edges.

  2. SSMs enter before and during graph encoding, maintaining time-evolving state for each case, object, resource, queue, and system, so that DG-JEPA does not have to infer long histories from a flat event prefix.

  3. Graph transformers are the actual DG-JEPA encoders: they propagate information across events, objects, actors, resources, documents, policies, Petri-net elements, and temporal edges to produce context and target embeddings.

  4. Petri nets interact in two directions: they enrich the input graph with formal execution state and provide auxiliary supervision and constraints, while DG-JEPA helps align noisy events to transitions, infer missing middle structure, and propose repairs to discovered nets.

  5. RL sits downstream of Graph-JEPA rather than upstream of it: it treats the learned latent process state as the state representation for intervention search, proposes actions such as escalation, rerouting, automation, staffing, or policy changes, and evaluates them through simulators and Petri-net constraints before any deployment.

    In addition to helping answer “best next action” questions, useful RL-related signals may include which latent variables are decision-relevant, which bottlenecks are actionable, which process states are controllable, which interventions change outcomes, which constraints bind in practice, and which simulator predictions fail after deployment.

    These signals can be fed back into representation learning. For example, if RL repeatedly finds that staffing level affects resolution time only when a hidden rework loop is active, then “rework-loop-active” becomes an important latent state variable, and Graph-JEPA can be fine-tuned or probed to represent that variable more explicitly — making the relationship bidirectional. Graph-JEPA provides RL with its state, but RL outcomes may reveal which state dimensions matter for control.

DG-JEPA is, therefore, the representation-learning core; while transformers, SSMs, GNNs, Petri nets, simulators, and RL each occupy a specific point in the data pipeline.

What this looks like in practice

A useful way to picture all of this is from the perspective of a single running case — say, an invoice winding its way through a procure-to-pay process at a multinational enterprise.

  1. Raw events arrive from the ERP, the workflow engine, email systems, and the HASH desktop agent. They are noisy, partially typed, and use organization-specific labels — exactly the surface variability that catches out a transformer trained to predict the next event.
  2. Sequential and tabular encoders (transformers, SSMs, TFMs) convert local features — text, attachments, durations, amounts — into typed node and edge embeddings on a unified, multi-temporal process graph (the kind hgres was built to store). Each encoder is doing what it is good at: the transformer handles language; the SSM compresses long per-entity timelines; the TFM encodes structured attributes — and none is being asked to model the whole process. Semantic and structural filtering may also occur locally, on a user’s device, in a privacy-preserving way (for instance, identifying and discarding irrelevant PII).
  3. Symbolic process mining aligns observed events to Petri-net transitions, derives current markings, and surfaces conformance violations. Where alignment is ambiguous, the graph is annotated with hypotheses rather than forced into one interpretation — closing the gap a pure neural model would otherwise leave open.
  4. DG-JEPA encodes the resulting enriched graph into a latent process state — a vector that captures not just what has happened, but what role the case is currently playing: which obligations are open, which resources are constrained, which futures are reachable, and which hidden variables most likely explain the observed branch. This is the layer that would silently fail under a pure GNN (locality and over-smoothing), a graph transformer trained on next-event prediction (surface vocabulary), or a diffusion model (denoising surface structure rather than inferring latent state).
  5. Downstream heads read from this latent state for monitoring, anomaly detection, retrieval, simulation, and (when explicitly enabled) RL-based intervention recommendation under Petri-net safety constraints. RL operates on a state it could not have learned by itself, and on an action set already filtered for symbolic feasibility — addressing all four of the RL pitfalls raised earlier.
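Steps 3 and 5 lean on two standard Petri-net operations: deriving a marking by firing transitions, and listing which transitions are currently enabled. The sketch below is a minimal illustration of that feasibility filter, not the pipeline's implementation. The `Transition` type, the place names, and the two-transition procure-to-pay fragment are all hypothetical.

```python
from dataclasses import dataclass


# Hypothetical, minimal Petri-net types for illustration only.
@dataclass(frozen=True)
class Transition:
    name: str
    consumes: dict  # place -> tokens required
    produces: dict  # place -> tokens emitted


def is_enabled(marking, t):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in t.consumes.items())


def fire(marking, t):
    """Return the marking reached by firing t; the input marking is not mutated."""
    if not is_enabled(marking, t):
        raise ValueError(f"{t.name} is not enabled")
    new = dict(marking)
    for p, n in t.consumes.items():
        new[p] = new.get(p, 0) - n
    for p, n in t.produces.items():
        new[p] = new.get(p, 0) + n
    return new


def feasible_actions(marking, transitions):
    """The symbolically feasible action set: names of all enabled transitions."""
    return [t.name for t in transitions if is_enabled(marking, t)]


# Toy fragment: payment needs both an approved invoice and a goods receipt.
TRANSITIONS = [
    Transition("approve_invoice", {"invoice_received": 1}, {"invoice_approved": 1}),
    Transition("pay_invoice", {"invoice_approved": 1, "goods_received": 1}, {"paid": 1}),
]

marking = {"invoice_received": 1, "goods_received": 1}
print(feasible_actions(marking, TRANSITIONS))  # ['approve_invoice']
marking = fire(marking, TRANSITIONS[0])
print(feasible_actions(marking, TRANSITIONS))  # ['pay_invoice']
```

An RL head restricted to `feasible_actions(...)` can never propose paying an unapproved invoice, whatever its learned policy prefers.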

None of the individual ingredients are new. What’s new — we hope — is the way they fit together, anchored by a representation-learning objective that is actually aligned with what process foundation models need to learn.

Will we use all of these models and methods in our DG-JEPA-based attempt to construct a PFM? Not all at once. We intend to start with the core DG-JEPA objective and add complementary components incrementally, evolving the architecture as the work matures.

Priority areas of research

We're approaching PFM development as a stack of small, decomposable bets, each attackable with controlled experiments and each with its own success gates and fallbacks (should our first approaches not pan out). We see six priority areas of open research:

  1. Objective design — choosing masks, horizons, target subgraphs, and Petri/temporal auxiliaries that force DG-JEPA to learn a genuinely predictive latent process state.
  2. Temporal and multi-object representation — preserving valid time, transaction time, late corrections, concurrency, and long-range dependencies without leaking hindsight into training.
  3. Representation fusion — combining text/schema features, SSM embeddings, Petri markings, and graph encoders without representation collapse or modality overfitting.
  4. Scaling without losing the rare-but-important — sampling enterprise bitemporal graphs at training time while still retaining the paths (exceptions, anomalies, recoveries) that matter most.
  5. Evaluation infrastructure — building transfer-and-robustness benchmarks and ablations that demonstrate concrete advantage over sequence, GNN, graph-transformer, OCPM, and Petri baselines — and not just over conveniently chosen baselines on conveniently chosen tasks.
  6. Intervention validity — ensuring that RL- and simulation-driven recommendations reflect causal effects and hard constraints, rather than confounded historical correlations.

For each, we describe the problem, our planned approach, the success criterion, and the architectural fallback.

Objective design

Our central conjecture is that JEPA-style masked latent prediction on typed temporal graphs produces a more reusable representation than next-event, link-prediction, or feature-reconstruction objectives. Everything downstream depends on that conjecture being true, so it needs to be tested before any architectural elaboration is layered on top.

We plan to train families of DG-JEPA objectives that progressively mask events, object lifecycle phases, joins, handoffs, delay regions, exception paths, Petri firings, and future suffixes, comparing them against next-event, feature-reconstruction, link-prediction, graph-transformer-from-scratch, and Petri-only baselines on the same input substrate. Success is judged not by loss curves but by linear probes over the learned representations: do they recover known latent process state — enabled actions, missing dependencies, branch type, conformance status, bottleneck class, feasible future subgraphs — at materially higher fidelity than the baselines do? Where early masks teach shortcuts (predicting source-system identifiers, exploiting label leakage), we’ll tighten the mask distribution, strip identifier features, and weight targets toward Petri/temporal supervision rather than raw labels.
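To make "masked latent prediction" concrete, here is a toy sketch of the generic JEPA pattern: encode the visible context, encode the masked region with a separate target encoder, and score the prediction in embedding space rather than feature space. It is an assumption-laden illustration, not the DG-JEPA objective. The mean-pool-plus-linear "encoder" stands in for a real graph encoder, the squared-error loss and EMA update are one common choice among published JEPA variants, and every name here is hypothetical.

```python
def mean_pool(node_feats):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(node_feats[0])
    return [sum(f[i] for f in node_feats) / len(node_feats) for i in range(dim)]


def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def jepa_loss(nodes, masked_ids, context_w, target_w, predictor_w):
    """Mean squared error between predicted and actual embeddings of the masked region."""
    context = [f for i, f in nodes.items() if i not in masked_ids]
    target = [f for i, f in nodes.items() if i in masked_ids]
    z_context = linear(context_w, mean_pool(context))
    z_target = linear(target_w, mean_pool(target))  # treated as a constant in training
    z_pred = linear(predictor_w, z_context)
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target)) / len(z_pred)


def ema_update(target_w, context_w, decay=0.99):
    """Move the target encoder slowly toward the context encoder."""
    return [[decay * t + (1 - decay) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_w, context_w)]


# Toy graph: node id -> 2-d feature vector. Mask the final event as a "future suffix".
nodes = {"create": [1.0, 0.0], "approve": [0.0, 1.0], "archive": [0.0, 0.0], "pay": [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = jepa_loss(nodes, {"pay"}, identity, identity, identity)
print(round(loss, 3))  # 0.444
```

Changing the mask family (events, joins, exception paths, future suffixes) only changes which node ids land in `masked_ids`; the loss itself stays in latent space throughout.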

Temporal and multi-object representation

Real-world processes are bitemporal: facts arrive late, prior events get corrected, and any model that ignores the difference between “what was true at time t” and “what we now know was true at time t” will learn to peek at the future during training. Most prior work skirts this problem by treating event logs as flat sequences over a single time axis. For a foundation model, we cannot afford to.

This data gap is itself one of the central reasons no PFM exists yet — the bitemporal analogue of the broader data-shape problem set out earlier. Public process-mining datasets — BPI Challenge logs, OCEL corpora — all record a single time axis per event; none of them capture when a fact was first known versus when it was later corrected. That isn’t a minor data-cleaning issue. It is exactly the kind of "data not existing in the right shape" gap that, until very recently, made even contemplating a model like this impossible. hgres — our typed, multi-temporal, provenance-aware graph substrate — is the infrastructure we’ve spent the past seven years building specifically to close it. Now that the substrate exists, we intend to be the first through the door it has opened. Our bitemporal training and evaluation sources are accordingly hgres-recorded production data, industrial partner audit logs (which many enterprise systems maintain natively), and Petrinaut-generated synthetic logs with controlled correction patterns — complementing the public non-bitemporal corpora used elsewhere in our benchmark.

On top of those data sources, the modeling risk is whether we can actually exploit the bitemporal structure without leaking hindsight back into the model. We’ll construct explicit "as-known-at-time-t" training views and matched later-corrected target views, then compare three input regimes on the same downstream tasks: single-case flattening, object-centric views, and full bitemporal multi-object graphs. The judgment criterion is twofold: hindsight-leakage detection (does performance collapse when late corrections are hidden during training?) and head-to-head performance on tasks involving late events, concurrent branches, cross-object joins, and long-range dependencies. If full bitemporal graphs prove too noisy or too large to train on directly, we’ll fall back to typed neighborhood extraction around cases, objects, resources, policies, and Petri markings — preserving bitemporal semantics where it matters most.
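The "as-known-at-time-t" view can be sketched with two timestamps per record: when the fact held (valid time) and when the system learned it (transaction time). The example below is a minimal illustration under that assumption. The `FactVersion` type, the field names, and the integer timestamps are hypothetical, and do not reflect how hgres stores data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactVersion:
    fact_id: str      # which fact this version describes
    value: str
    valid_time: int   # when the fact held in the real world
    recorded_at: int  # transaction time: when this version entered the system


def as_known_at(versions, t):
    """Latest version of each fact, using only records with recorded_at <= t."""
    view = {}
    for v in sorted(versions, key=lambda v: v.recorded_at):
        if v.recorded_at <= t:
            view[v.fact_id] = v  # later recordings overwrite earlier ones
    return view


def training_pair(versions, t, t_later):
    """Context view (what was known at t) and target view (what was known by t_later)."""
    return as_known_at(versions, t), as_known_at(versions, t_later)


VERSIONS = [
    FactVersion("invoice_amount", "1000", valid_time=5, recorded_at=10),
    FactVersion("invoice_amount", "1200", valid_time=5, recorded_at=20),  # late correction
    FactVersion("goods_receipt", "received", valid_time=8, recorded_at=15),  # late arrival
]

context, target = training_pair(VERSIONS, t=12, t_later=25)
print({k: v.value for k, v in context.items()})  # {'invoice_amount': '1000'}
print({k: v.value for k, v in target.items()})
```

Because the context view filters on `recorded_at`, neither the corrected amount nor the late goods receipt can reach the model's input at t=12, even though both have valid times before 12.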

Representation fusion

A DG-JEPA-anchored hybrid pulls in transformer embeddings for text, SSM embeddings for long timelines, Petri markings for symbolic state, and graph encoders for relational structure. The risk is not that any one of these modalities is wrong; it is that, mixed naively, the model leans on whichever modality is easiest to overfit (schema strings, source IDs, timestamps, Petri labels), and DG-JEPA stops learning process state at all.

We’ll train and probe each modality separately before composing them, then run controlled early-, late-, and gated-fusion variants, with leave-one-modality-out ablations. The judgment criterion is whether the model can still recover latent process state when individual modalities are masked at evaluation time, and whether removing the DG-JEPA objective itself causes the representation to collapse into the cheapest available modality. If fusion is unstable, the fallback is a modular architecture: DG-JEPA owns the graph-state representation, and other modalities enter only through adapters or downstream task heads — buying robustness at the cost of some end-to-end optimization.
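One way to picture gated fusion with leave-one-modality-out evaluation is a softmax gate over per-modality embeddings, renormalized over whichever modalities remain unmasked. This is a sketch under simplifying assumptions (fixed gate logits, equal embedding sizes, hypothetical modality names), not the planned fusion architecture.

```python
import math


def gated_fusion(embeddings, gate_logits, masked=frozenset()):
    """Softmax-gated weighted sum over the modalities not in `masked`."""
    active = [m for m in embeddings if m not in masked]
    if not active:
        raise ValueError("at least one modality must remain unmasked")
    peak = max(gate_logits[m] for m in active)  # subtract the max for numerical stability
    exp = {m: math.exp(gate_logits[m] - peak) for m in active}
    total = sum(exp.values())
    weights = {m: exp[m] / total for m in active}
    dim = len(embeddings[active[0]])
    fused = [sum(weights[m] * embeddings[m][i] for m in active) for i in range(dim)]
    return fused, weights


def leave_one_out(embeddings, gate_logits):
    """Fused vector with each modality masked in turn, keyed by the masked modality."""
    return {m: gated_fusion(embeddings, gate_logits, masked={m})[0] for m in embeddings}


# Hypothetical per-modality embeddings for a single case.
embeddings = {"text": [1.0, 0.0], "timeline": [0.5, 0.5], "petri": [0.0, 1.0]}
gate_logits = {"text": 4.0, "timeline": 0.0, "petri": 0.0}

fused, weights = gated_fusion(embeddings, gate_logits)
print(weights)  # a gate weight near 1.0 for one modality is a dominance warning sign
print(leave_one_out(embeddings, gate_logits))
```

Logging the gate weights alongside leave-one-out probe accuracy gives a cheap signal for the failure mode described above: a model that still "works" only while its cheapest modality is present.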

Scaling without losing the rare-but-important

Enterprise process graphs are large, but most of their structure is repetitive. The interesting paths — anomalies, exceptions, compliance violations, recoveries — are by construction rare. Naïve graph sampling (random walks, uniform ego-graphs) systematically under-represents exactly the paths a PFM most needs to learn from.

We’ll use stratified bitemporal graph sampling: common paths for coverage, rare conformance and anomaly paths for sensitivity, and Petri-guided sampling around token deficits, disabled transitions, joins, and rework loops. Scaling experiments will vary graph size, temporal horizon, object count, mask size, and Petri-supervision density, mapping out the empirical scaling behavior — including whether rare-path recall improves disproportionately under Petri-guided sampling, and whether scaling is monotonic in graph and supervision density. If global sampling proves too expensive, we’ll move to hierarchical views — event neighborhoods, object lifecycles, case subgraphs, and process-level summaries — recomposed at inference time rather than fitted into a single training pass.
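The core of stratified sampling is a per-stratum quota, so that rare strata reach every batch regardless of their share of the graph. The sketch below assumes paths have already been assigned to strata (for example by a conformance checker); the stratum names and quotas are hypothetical.

```python
import random


def stratified_sample(paths_by_stratum, quotas, seed=0):
    """Draw a fixed quota from each stratum, oversampling strata smaller than their quota."""
    rng = random.Random(seed)
    batch = []
    for stratum, quota in quotas.items():
        pool = paths_by_stratum.get(stratum, [])
        if not pool:
            continue
        if len(pool) >= quota:
            batch.extend(rng.sample(pool, quota))     # without replacement
        else:
            batch.extend(rng.choices(pool, k=quota))  # with replacement
    rng.shuffle(batch)
    return batch


paths_by_stratum = {
    "common": [f"common_{i}" for i in range(1000)],
    "rework_loop": ["rework_0", "rework_1", "rework_2"],
}
batch = stratified_sample(paths_by_stratum, {"common": 6, "rework_loop": 2})
print(batch)
```

With 3 rework paths among 1,003, a uniform draw of 8 would contain a rework path only about 2% of the time; the quota puts two in every batch. The price is that rare paths are seen far more often than their true frequency, which any downstream frequency estimate has to correct for.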

Evaluation infrastructure

Public process-mining benchmarks largely test next-event prediction on a single log — which is precisely the wrong question for a foundation model, and rewards exactly the behavior we critiqued earlier: overfitting local vocabularies. The benchmark we’d be willing to be judged by — one that explicitly tests transfer, robustness, missing-structure inference, and constraint awareness — does not exist at scale.

We plan to build it. The benchmark will cover zero- and few-shot transfer across domains, missing-state inference, conformance-aware prediction, future-subgraph prediction, bottleneck diagnosis, fragment retrieval, and robustness to renamed labels, missing events, late corrections, and schema shifts. Every claim of advantage will be tested against sequence models, GNNs, graph transformers, OCPM-native methods, Petri-only baselines, graph autoencoders, and hybrid variants. (The specific evaluation axes we’ll measure are described under “How we’ll know if it works” below; this risk is about the benchmark infrastructure that makes those measurements meaningful.) Negative results will be treated as informative: a loss to a sequence model on cross-domain transfer would tell us next-token prediction is more useful than we currently think; a loss to a Petri-only baseline on conformance would tell us we’re under-using symbolic structure. Each negative result narrows the design space.
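Two of the robustness perturbations listed above, renamed labels and missing events, are simple to express as transformations of a trace. The following sketch assumes a trace is a plain list of activity labels; the function names and the `act_N` token scheme are hypothetical.

```python
import random


def rename_labels(trace, seed=0):
    """Consistently replace each activity label with an opaque token."""
    rng = random.Random(seed)
    labels = sorted(set(trace))
    tokens = [f"act_{i}" for i in range(len(labels))]
    rng.shuffle(tokens)
    mapping = dict(zip(labels, tokens))
    return [mapping[a] for a in trace], mapping


def drop_events(trace, rate, seed=0):
    """Independently drop each event with probability `rate`, keeping at least one."""
    rng = random.Random(seed)
    kept = [a for a in trace if rng.random() >= rate]
    return kept or trace[:1]


trace = ["create", "approve", "rework", "approve", "pay"]
renamed, mapping = rename_labels(trace)
print(renamed)
print(drop_events(trace, rate=0.3))
```

Renaming preserves the trace's structure (the two `approve` events still share a token), so a model that has learned process state should score about the same on `renamed` as on `trace`, while a model that memorized vocabulary should not.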

Intervention validity

The most consequential PFM use cases — should we escalate this case? add capacity here? change this policy? — are causal questions, not predictive ones. But most enterprise event logs are observational, confounded, and incomplete. A model that learns historical correlations and then proposes interventions on the basis of them is, in the worst case, dangerous.

RL is therefore deliberately placed downstream of DG-JEPA and Petri filtering: action proposals come from feasible-action sets defined by the Petri net, consequences are first evaluated in counterfactual simulation, and offline RL uses conservative estimators rather than aggressive exploration. We separate predictive validity (does the latent state forecast accurately?) from causal validity (would an intervention move outcomes in the predicted direction?), and stress-test the latter with causal probes, policy-change simulations, and backtests against natural interventions where they exist in the data. The judgment criterion is whether DG-JEPA-conditioned simulators predict outcomes of held-out natural interventions, and whether causal probes survive distribution shifts that ordinary predictive probes do not. If causal validity does not arrive in step with predictive validity — which is the expected case — high-risk recommendations remain human-reviewed until enough real-world outcome feedback supports deployment.
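As a deliberately simple stand-in for conservative offline estimation, the sketch below scores each feasible action by a lower confidence bound on its logged outcomes, and defers to human review unless that bound beats a baseline. It assumes higher outcomes are better and that the feasible set comes from the Petri-net filter. The action names and outcome values are invented for illustration, and a bound like this controls sampling noise only; it does nothing about confounding.

```python
import math
from statistics import mean, stdev


def lower_bound(outcomes, z=1.96):
    """Mean minus z standard errors; -inf when there is too little data to estimate spread."""
    if len(outcomes) < 2:
        return float("-inf")
    return mean(outcomes) - z * stdev(outcomes) / math.sqrt(len(outcomes))


def recommend(feasible, logged, baseline, z=1.96):
    """Best feasible action by lower bound, or 'human_review' if none clears the baseline."""
    scored = {a: lower_bound(logged.get(a, []), z) for a in feasible}
    best = max(scored, key=scored.get, default=None)
    if best is None or scored[best] <= baseline:
        return "human_review"
    return best


# Invented logged outcomes per action (for example, on-time resolution rates).
logged = {
    "escalate": [0.90, 0.80, 0.85, 0.90, 0.88],
    "add_capacity": [1.0],        # a single observation: too thin to trust
    "skip_check": [0.99] * 10,    # looks great historically, but is not feasible
}
feasible = ["escalate", "add_capacity"]

print(recommend(feasible, logged, baseline=0.70))  # escalate
print(recommend(feasible, logged, baseline=0.85))  # human_review
```

Note that `skip_check` never enters the comparison despite its strong historical record, and `add_capacity` is not chosen on the strength of one lucky observation.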

How we’ll know if it works

A foundation model is only a foundation model insofar as it transfers. We plan to evaluate DG-JEPA along several axes:

  • Cross-log zero-shot transfer (non-bitemporal). Train on a diverse mix of public event logs — including the BPI Challenge corpus across years and domains, and OCEL object-centric data — then evaluate on held-out logs from organizations and processes the model has never seen. Standard process-mining tasks (next-activity prediction, remaining-time forecasting, outcome prediction, conformance scoring) should work without fine-tuning.
  • Bitemporal robustness. Train and evaluate the bitemporal-specific capabilities on data sources where corrections actually occur and are recorded as such: hgres-recorded production data, industrial partners' audit logs, and Petrinaut-generated synthetic logs with controlled correction patterns. The absence of public bitemporal corpora is itself a feature of the landscape we discussed earlier — a gap our infrastructure was built precisely to fill.
  • Latent-state probing. Train linear probes over DG-JEPA embeddings to recover known latent process variables — Petri-net markings, enabledness flags, policy-regime indicators, blockedness — and compare against probes trained over baseline encoders.
  • Counterfactual reasoning. Hold out interventions in semi-synthetic process simulators (built with Petrinaut) and evaluate whether DG-JEPA-conditioned simulators predict intervention outcomes that survive falsification against ground truth.
  • Sample efficiency for downstream tasks. Compare DG-JEPA fine-tuning against from-scratch baselines on labelled prediction tasks at varying training-set sizes; the foundation-model promise is that less downstream data is needed.
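The latent-state probing axis above comes down to fitting a linear classifier on frozen embeddings and comparing accuracy across encoders. The sketch below does this with plain logistic regression on two hand-made toy embedding sets. The numbers are constructed for illustration and are not model outputs: in the first set the latent variable is linearly decodable, in the second it is arranged XOR-style so that no linear probe can recover it.

```python
import math


def train_probe(embeddings, labels, lr=0.5, steps=500):
    """Fit logistic regression (weights + bias) on frozen embeddings by gradient descent."""
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-score))
            for i in range(dim):
                grad_w[i] += (p - y) * x[i] / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_accuracy(w, b, embeddings, labels):
    """Fraction of examples where the probe's sign matches the binary label."""
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y)
               for x, y in zip(embeddings, labels))
    return hits / len(labels)


# Toy latent variable (say, "transition enabled") for eight cases.
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
# Hand-made embeddings where the variable is linearly decodable along dimension 0.
INFORMATIVE = [[1.0, 0.2], [0.8, -0.5], [1.2, 0.9], [0.9, -0.1],
               [-1.0, 0.3], [-0.7, -0.6], [-1.1, 0.8], [-0.9, -0.2]]
# Hand-made embeddings where the variable is present but not linearly decodable.
BASELINE = [[1, 1], [-1, -1], [1, 1], [-1, -1], [1, -1], [-1, 1], [1, -1], [-1, 1]]

for name, emb in [("informative", INFORMATIVE), ("baseline", BASELINE)]:
    w, b = train_probe(emb, LABELS)
    print(name, probe_accuracy(w, b, emb, LABELS))
```

A real evaluation would score probes on held-out cases rather than the training set. The probe is kept linear on purpose: a more expressive probe could decode the variable itself and hide the difference between encoders.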

We’ll release evaluation harnesses, reference implementations, and benchmark scripts alongside forthcoming research papers. If you’d like early access — to use, or to attack — please get in touch.

Frequently anticipated objections

Won’t a sufficiently large transformer just solve this? Probably not — at least, not efficiently. Scale buys a lot, but it does not buy the right inductive biases. A transformer trained on serialized event logs at any scale is still being asked to infer graph structure, object lifecycles, Petri-net markings, and policy semantics from token sequences. Worse, next-token prediction rewards surface predictability — exactly the wrong objective for learning latent process state. We expect transformers to be valuable inside a PFM (as discussed above), but not as the foundation-model core.

Why DG-JEPA and not standard Graph-JEPA? Graph-JEPA, in its originally-published form, targets static graph-level representations. Processes are dynamic: edges appear and expire, objects change state, resources go in and out of availability, and policies regime-switch. DG-JEPA extends Graph-JEPA to handle this temporal structure natively — treating time as a first-class part of both the input graph and the masking objective — while still inheriting Graph-JEPA’s core insight: predict the embedding of a masked subgraph from context, not its visible features.

Why not just collect more event-log data and apply existing methods? Process data is qualitatively different from text or images: it is partial, typed, multi-object, branching, concurrent, policy-governed, and rarely causally identified. More data of the wrong shape does not solve the architectural problem. What we needed first was the substrate: typed, multi-temporal, provenance-aware graphs (hgres), open type infrastructure (SemType), and rich symbolic process libraries (Petrinaut). With that substrate in hand, the neural piece is the next step.

How is this different from object-centric process mining (OCPM)? OCPM is a closely-related and complementary line of work. Where OCPM focuses on the data model (treating events as belonging to multiple objects, rather than to one “case”), DG-JEPA focuses on the learning objective (representing latent process state). The two compose: an object-centric event log is exactly the right kind of input for DG-JEPA, and DG-JEPA’s outputs can in turn make OCPM tools more robust to noise, ambiguity, and missing structure.

Is this all just “neuro-symbolic” rebadged? The label has been worn thin, but the underlying observation — that learned representations and symbolic constraints are most powerful in combination — is right, and it is the spine of what we are building. We just prefer to be specific about which symbolic layer (Petri nets, conformance, soundness) and which learned layer (JEPA-style latent prediction on typed temporal graphs).

Where we go from here

A PFM is unusual as foundation models go: it requires not just architectural innovation, but a substantial substrate of high-quality typed, temporal, provenance-aware graphs, plus rich symbolic process libraries to ground it. We’ve spent the better part of the past seven years building exactly this substrate — and DG-JEPA is the representation-learning layer we believe sits naturally on top of it.

Over the coming months we’ll be sharing more: forthcoming papers from our research team, open-source releases of evaluation harnesses and reference implementations, and case studies from early industrial partners.

Get involved

If you’re a researcher working on self-supervised graph learning, dynamic/temporal graphs, world models, or process mining, and you’d like to collaborate (or critique what we’re doing!), please get in touch — we’re actively looking for academic and industrial research partners (and are hiring).

If you’re an engineer or scientist who finds these problems exciting, check out our open roles, as we expand our team of AI engineers in both London and Berlin.

If you’re an organization wrestling with process complexity — compliance, throughput, exception handling, or end-to-end optimization — and you’d like to be among the first to put a PFM to work, please reach out.

And if you simply think we’re wrong about any of this — we’d genuinely like to hear it. The fastest way to find out what’s actually broken in a research bet is to write it down clearly enough for someone to argue back.
