What AI Models Actually Work for RTL Coverage Prediction

The naive framing of our problem is "can a model predict which tests improve coverage?" but that framing hides three genuinely hard sub-problems: how do you represent RTL structure as model input, how do you label training data when ground-truth coverage comes from long simulation runs, and how do you generalize across design blocks that share no architecture?

We spent about ten months running model experiments before landing on the approach that drives Photoniq today. This post is the honest account — what we tried, the metrics we used to evaluate, and what turned out to matter most. We're writing this because the published literature on GNN applications to EDA problems tends to skip the "why we abandoned approach X" part.

The RTL Representation Problem

Before choosing a model, you have to decide what the model sees. RTL source code is text, but it isn't prose — the semantics live in port connections, always/assign blocks, state machine transitions, and the overall module hierarchy, not in identifier names or comment tokens. Two very different representations are available: the gate-level netlist as a DAG, and the RTL source as a token sequence.

Both have real tradeoffs. Netlist DAGs are structurally precise — after synthesis, every combinational path and register element is a node with known fan-in and fan-out. But synthesis destroys a lot of information that coverage metrics actually care about. Coverage bins are defined at RTL abstraction (line, toggle, FSM state), and some of that structure doesn't survive to gate-level. You end up with a structurally clean graph that doesn't map naturally to where the coverage holes are.

RTL token sequences preserve the original abstraction but introduce all the problems of source-code modeling: variable naming inconsistencies, parametric generics that expand to wildly different hardware, and sequential scan over code that is fundamentally parallel/spatial in structure. A 10,000-line SystemVerilog module is a very long sequence with a lot of irrelevant tokens between the semantically important ones.

Attempt 1: GNN on Netlist DAGs

Our first serious attempt used a graph neural network with message-passing layers operating on post-synthesis netlist graphs. The intuition was that coverage holes cluster around specific graph substructures — deeply nested conditional logic, wide muxes, or state machines with sparse transition graphs.

We built a corpus of about 400 netlist/UCDB pairs from open-source RTL projects and ran 4-layer message-passing networks with node features derived from gate type, depth from primary inputs, and fan-out count. The model learned to predict which netlist subgraphs had low toggle/line coverage in the corresponding UCDB files.
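To make the three node features concrete, here is a minimal sketch of how gate type, depth from primary inputs, and fan-out count could be derived from a netlist edge list. It is an illustration, not the feature extractor used in these experiments: the function name, the input format, and the toy netlist are all assumptions, and it assumes the graph is acyclic.

```python
from collections import defaultdict, deque


def netlist_node_features(gates, edges):
    """gates: {name: gate_type}; edges: (driver, sink) pairs of an acyclic graph.

    Returns {name: (gate_type, depth_from_primary_inputs, fan_out)}.
    """
    fan_out = defaultdict(int)
    sinks = defaultdict(list)
    pending = {g: 0 for g in gates}  # unprocessed fan-in count per node
    for driver, sink in edges:
        fan_out[driver] += 1
        pending[sink] += 1
        sinks[driver].append(sink)

    # Topological sweep: depth is the longest path from any primary input.
    depth = {g: 0 for g in gates}
    ready = deque(g for g in gates if pending[g] == 0)
    while ready:
        node = ready.popleft()
        for sink in sinks[node]:
            depth[sink] = max(depth[sink], depth[node] + 1)
            pending[sink] -= 1
            if pending[sink] == 0:
                ready.append(sink)
    return {g: (gates[g], depth[g], fan_out[g]) for g in gates}


# Hypothetical five-node netlist, for illustration only.
toy_gates = {"a": "INPUT", "b": "INPUT", "n1": "AND2", "n2": "INV", "q": "DFF"}
toy_edges = [("a", "n1"), ("b", "n1"), ("n1", "n2"), ("n1", "q"), ("n2", "q")]
toy_features = netlist_node_features(toy_gates, toy_edges)
```

In a real netlist the gate type would be one-hot encoded and the numeric features normalized before they reach the network; that step is omitted here.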

Results: top-10 precision was around 61% on held-out netlists from the same project families, and it dropped to 38% on netlists from different design domains. The transfer problem was severe: a model trained on processor pipelines does not generalize to network-on-chip crossbars, even with the same gate-level node feature set. The graph structure carries topology but not semantics.
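For readers unfamiliar with the metric, one common reading of top-10 precision is "of the ten items the model ranks highest, what fraction are truly low-coverage". The sketch below implements that reading; the function name, the input dictionaries, and the toy scores are assumptions for illustration, and the exact definition used in the experiments may differ in detail.

```python
def precision_at_k(scores, is_low_coverage, k=10):
    """scores: {item: predicted score}; is_low_coverage: {item: bool}.

    Returns the fraction of the k highest-scored items that are true positives.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for item in ranked if is_low_coverage.get(item, False))
    return hits / len(ranked)


# Hypothetical subgraph scores and ground truth, for illustration only.
toy_scores = {"sg0": 0.9, "sg1": 0.8, "sg2": 0.3, "sg3": 0.1}
toy_truth = {"sg0": True, "sg1": False, "sg2": True, "sg3": False}
toy_p_at_2 = precision_at_k(toy_scores, toy_truth, k=2)
```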

We spent several weeks trying architectural fixes: adding edge features for logic depth, pre-training on larger open netlist corpora, varying message-passing depth. None moved held-out precision above 44% on out-of-distribution designs. We're not saying GNNs on netlists can't work for any EDA problem — they're well-suited for timing analysis and congestion prediction where topology is the signal. For functional coverage, the abstraction mismatch was too costly.

Attempt 2: Transformer Sequence Models on RTL Tokens

The second direction was a transformer-based model that operated on tokenized SystemVerilog source. We took inspiration from code models (the kind trained on GitHub repositories), but those models are trained on general programming languages where identifier naming carries real signal. In RTL, signal names like data_in_q, i_dat, and d_in_pipe all mean roughly the same thing — there's no naming convention enforced at the language level.

We fine-tuned a smaller code-language model on an RTL-specific token corpus and added a coverage-prediction head that produced per-line coverage probability predictions. The token context window was set to 2048 tokens — enough to cover most individual always blocks but not full modules.

This approach was more promising on in-distribution evaluation (top-10 precision ~69%) but had a pathological failure mode: it learned naming patterns from our training corpus rather than structural patterns. When we tested against designs with unusual naming conventions or non-English comments, precision dropped sharply. The model was doing something closer to memorization of coverage patterns in "code that looks like training data" than learning generalizable structural features.

The windowing problem also hurt. Coverage outcomes for a given always block often depend on conditions set in a different module — the sequence model can't see the structural dependency unless both are in the same context window, which is rarely the case for complex designs.
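A small sketch makes the windowing limit tangible. It assumes overlapping 2048-token windows with a stride of 1024 (the stride and the helper names are assumptions, not details from the experiments) and checks whether two token positions can ever be seen together.

```python
def window_starts(n_tokens, size=2048, stride=1024):
    """Start offsets of overlapping windows covering a token sequence."""
    if n_tokens <= size:
        return [0]
    starts = list(range(0, n_tokens - size + 1, stride))
    if starts[-1] + size < n_tokens:
        starts.append(n_tokens - size)  # final window flush with the end
    return starts


def share_a_window(pos_a, pos_b, n_tokens, size=2048, stride=1024):
    """True if some window contains both token positions."""
    lo, hi = sorted((pos_a, pos_b))
    return any(
        start <= lo and hi < start + size
        for start in window_starts(n_tokens, size, stride)
    )
```

Two tokens more than one window length apart can never share a window, whatever the stride, so a condition set thousands of tokens away (or in another file) is invisible to the per-window prediction.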

What Actually Worked: RTL-Level Graph + Coverage Bin Context

The approach that stuck uses an intermediate representation — not synthesized gate-level, not raw token sequence, but an RTL-level structural graph built directly from the SystemVerilog AST. Each node represents a meaningful RTL construct: a module instance, an always block, a concurrent assertion, an FSM state, or an I/O port. Edges encode structural relationships: connectivity, hierarchy, data dependency derived from sensitivity lists, and control flow between FSM transitions.
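As a rough illustration of the data structure this describes, the sketch below models the node and edge kinds named above as a small typed graph. The class names, enum members, and the toy FIFO example are hypothetical; this is not the Photoniq schema, and building such a graph from a real SystemVerilog AST needs a parser front end that is not shown.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    MODULE_INSTANCE = "module_instance"
    ALWAYS_BLOCK = "always_block"
    ASSERTION = "concurrent_assertion"
    FSM_STATE = "fsm_state"
    IO_PORT = "io_port"


class EdgeKind(Enum):
    CONNECTIVITY = "connectivity"
    HIERARCHY = "hierarchy"
    DATA_DEPENDENCY = "data_dependency"
    FSM_TRANSITION = "fsm_transition"


@dataclass
class RtlGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> NodeKind
    edges: list = field(default_factory=list)  # (src, dst, EdgeKind)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, src, dst, kind):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, kind))

    def neighbors(self, node_id, kind=None):
        """Successors of node_id, optionally restricted to one edge kind."""
        return [dst for src, dst, k in self.edges
                if src == node_id and (kind is None or k is kind)]


# Hypothetical fragment of a FIFO controller, for illustration only.
toy_graph = RtlGraph()
toy_graph.add_node("u_fifo", NodeKind.MODULE_INSTANCE)
toy_graph.add_node("u_fifo.wr_ctrl", NodeKind.ALWAYS_BLOCK)
toy_graph.add_node("u_fifo.IDLE", NodeKind.FSM_STATE)
toy_graph.add_node("u_fifo.FULL", NodeKind.FSM_STATE)
toy_graph.add_node("u_fifo.wr_en", NodeKind.IO_PORT)
toy_graph.add_edge("u_fifo", "u_fifo.wr_ctrl", EdgeKind.HIERARCHY)
toy_graph.add_edge("u_fifo.wr_en", "u_fifo.wr_ctrl", EdgeKind.CONNECTIVITY)
toy_graph.add_edge("u_fifo.IDLE", "u_fifo.FULL", EdgeKind.FSM_TRANSITION)
```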

The key insight was adding coverage bin context as node features. Rather than asking the model to predict raw coverage from structure alone, we give it the current coverage state from the UCDB file — which bins are open, which are closed, and what the cross-coverage matrices look like — and ask it to predict which test input classes will advance the open bins. This changes the problem from "structural prediction" to "coverage-gap completion" and dramatically improves generalization because the coverage bin schema is standardized regardless of design domain.
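One way to picture "coverage bin context as node features" is a per-node summary of open and closed bins. The sketch below assumes bins have already been exported from the UCDB as (node_id, hit_count, goal) tuples; that record layout, the function name, and the toy data are assumptions, and cross-coverage matrices are left out for brevity.

```python
def coverage_features(bins, node_ids):
    """bins: iterable of (node_id, hit_count, goal); a bin is closed once
    hit_count reaches goal.

    Returns {node_id: [open_bins, closed_bins, open_fraction]}.
    """
    counts = {n: {"open": 0, "closed": 0} for n in node_ids}
    for node_id, hits, goal in bins:
        if node_id not in counts:
            continue  # bin belongs to a construct outside this graph
        counts[node_id]["closed" if hits >= goal else "open"] += 1

    features = {}
    for node_id, c in counts.items():
        total = c["open"] + c["closed"]
        open_fraction = c["open"] / total if total else 0.0
        features[node_id] = [c["open"], c["closed"], open_fraction]
    return features


# Hypothetical bin records, for illustration only.
toy_bins = [
    ("u_fifo.IDLE", 12, 1),
    ("u_fifo.FULL", 0, 1),
    ("u_fifo.FULL", 0, 1),
    ("u_fifo.wr_ctrl", 3, 5),
]
toy_nodes = ["u_fifo.IDLE", "u_fifo.FULL", "u_fifo.wr_ctrl", "u_fifo.wr_en"]
toy_cov = coverage_features(toy_bins, toy_nodes)
```

Because the vector has the same shape for every node regardless of what the design does, it is the kind of input that can carry across design domains.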

Six hops of message passing on this RTL-level graph achieved top-10 precision of 88% on held-out designs from domains not represented in training. The graph has fewer nodes than the corresponding gate-level netlist (AST-level constructs rather than individual gates), which makes message passing cheaper, and the coverage bin features anchor the prediction to the actual verification objective rather than to structural proxies.
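To show what a hop count means in practice, here is a deliberately simplified propagation loop: each round mixes a node's vector with the mean of its neighbors' vectors. It has no learned weights and is not the production model; the function name, the 50/50 mixing rule, and the chain graph are assumptions chosen to make the receptive field visible.

```python
def propagate(adjacency, features, hops=6):
    """adjacency: {node: [neighbor, ...]}; features: {node: [float, ...]}.

    Runs `hops` rounds of mean-neighbor aggregation and returns the new state.
    """
    state = {node: list(vec) for node, vec in features.items()}
    for _ in range(hops):
        nxt = {}
        for node, vec in state.items():
            nbrs = adjacency.get(node, [])
            if nbrs:
                agg = [sum(state[m][i] for m in nbrs) / len(nbrs)
                       for i in range(len(vec))]
            else:
                agg = [0.0] * len(vec)
            # Equal mix of the node's own state and its neighborhood summary.
            nxt[node] = [0.5 * own + 0.5 * nb for own, nb in zip(vec, agg)]
        state = nxt
    return state


# Hypothetical 8-node chain n0 - n1 - ... - n7 with a signal only on n0.
chain = {f"n{i}": [f"n{j}" for j in (i - 1, i + 1) if 0 <= j < 8]
         for i in range(8)}
signal = {f"n{i}": [1.0 if i == 0 else 0.0] for i in range(8)}
after_six = propagate(chain, signal, hops=6)
```

After six rounds the signal from n0 has reached n6 but not n7, which is the same receptive-field limit a 6-hop network has on a deep module hierarchy.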

Training corpus size matters here: we needed around 1,200 RTL/UCDB pairs with meaningful coverage variance to avoid overfitting. That's part of why building the corpus was the bottleneck, not model architecture — gathering real RTL with real simulation coverage data is hard. Most open-source RTL has either no coverage data or trivial coverage from basic smoke tests.

The Constrained-Random Tie-In

One thing the model explicitly does not do: it doesn't write SystemVerilog constraint blocks for you. The output is a ranked list of input stimulus classes with predicted coverage gain, not a testbench. You take that list back to your constrained-random framework — UVM, cocotb, or straight SV — and write the actual constraints to hit those stimulus classes.

We explored generating constraint snippets directly as model output, but the error rate was too high for anything non-trivial. A model that confidently produces syntactically valid but semantically broken SV constraints is worse than no generation at all — it costs a DV engineer debugging time with no upside. The ranked-list output is more conservative but immediately actionable: "this 3-state sequence in the AXI handshake interface closes 14 open bins" is something a DV engineer can translate to constraints in under an hour.

Evaluation Methodology and What We Learned

Our evaluation framework deserves its own post, but briefly: we use a leave-one-design-out cross-validation scheme across design families (CPU, memory controller, network IP, custom accelerator). A model that achieves 90% precision on held-out designs from the same family it was trained on is not interesting — that's memorization. The signal is precision on completely unseen design classes.
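A minimal sketch of such a split follows, reading the scheme as holding out one whole design family per fold so that the test designs come from a class the model never saw. That reading, the function name, and the design names are assumptions for illustration.

```python
def leave_one_family_out(designs):
    """designs: list of (design_name, family) pairs.

    Yields (held_out_family, train_names, test_names), one fold per family.
    """
    families = sorted({family for _, family in designs})
    for held_out in families:
        train = [name for name, family in designs if family != held_out]
        test = [name for name, family in designs if family == held_out]
        yield held_out, train, test


# Hypothetical design list using the four families named above.
toy_designs = [
    ("rv_core_a", "cpu"),
    ("rv_core_b", "cpu"),
    ("ddr_ctrl", "memory_controller"),
    ("eth_mac", "network_ip"),
    ("conv_accel", "custom_accelerator"),
]
toy_folds = list(leave_one_family_out(toy_designs))
```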

We also track coverage gain per simulation hour as a secondary metric: predicting that a vector is high-priority is only valuable if running that vector is computationally feasible. Some predicted stimuli require very long simulation traces to manifest the coverage event. The model has learned to downrank those even when the predicted coverage gain is high, because the cost-per-gain is unfavorable relative to alternatives.
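The cost-per-gain idea can be written down as a one-line ratio. The sketch below ranks candidates by predicted bins closed per estimated simulation hour; it is an explicit re-ranking for illustration, not a description of how the model internalizes cost, and the field names and numbers are made up.

```python
def rank_by_gain_per_hour(candidates):
    """candidates: dicts with 'name', 'predicted_bins_closed', 'est_sim_hours'.

    Returns the candidates sorted by bins closed per simulation hour, best first.
    """
    def rate(candidate):
        hours = candidate["est_sim_hours"]
        if hours <= 0:
            raise ValueError("est_sim_hours must be positive")
        return candidate["predicted_bins_closed"] / hours

    return sorted(candidates, key=rate, reverse=True)


# Hypothetical stimulus classes, for illustration only.
toy_candidates = [
    {"name": "long_burst", "predicted_bins_closed": 14, "est_sim_hours": 7.0},
    {"name": "short_handshake", "predicted_bins_closed": 6, "est_sim_hours": 0.5},
    {"name": "mid_backpressure", "predicted_bins_closed": 9, "est_sim_hours": 3.0},
]
toy_ranking = [c["name"] for c in rank_by_gain_per_hour(toy_candidates)]
```

Here the stimulus with the largest raw gain (14 bins) ends up last, because it closes only 2 bins per hour against 12 and 3 for the alternatives.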

The current model runs in production on designs up to 800K lines of RTL via the Photoniq API. Inference time is 40-120 seconds depending on design size and current UCDB state — fast enough to re-run after each simulation regression without disrupting the workflow.

What We Still Don't Solve Well

Assertion coverage is the hardest case. Embedded SVA properties in highly parameterized IP blocks create coverage bins whose semantic meaning isn't captured by the structural graph: whether the assertion fires depends on sequences that the model hasn't seen in training. Precision on assertion coverage bins is about 20 percentage points lower than on functional coverage bins. We're exploring training on assertion-specific sub-corpora, but the data availability problem is worse there.

Cross-hierarchy coverage, where a bin's coverage depends on stimulus paths that span three or more levels of module hierarchy, also degrades precision. The 6-hop message-passing limit was chosen as a practical tradeoff; deeper propagation improved precision on deep-hierarchy designs by 4 percentage points but doubled inference time. For most designs the tradeoff favors speed.

The honest summary: the model works well for the majority of functional coverage bins in standard RTL design styles. It works less well at the edges — deep assertions, parameterized protocol IP, and anything with unusual coding patterns. Those are exactly the cases where human DV expertise still pays off most, which is consistent with how we think about the tool's role — it handles the tractable majority so engineers can focus on the hard remainder.

