
Verification at AI Chip Startups: What's Actually Hard in 2025

Hiroshi Watanabe · 10 min read

There's a specific type of conversation we keep having with teams building custom AI accelerator ASICs: they come from software or ML backgrounds, they understand their compute architecture deeply, and then they hit the verification phase and discover that the methodology playbook they've been following — primarily inherited from standard IP-block DV practices — doesn't map cleanly onto what they've built.

This post is a synthesis of those conversations and our own observations building Photoniq's coverage model on accelerator-class RTL. The verification challenges at AI chip startups are distinct, not just a harder version of the challenges at established chip companies. Understanding why requires looking at what's actually different about custom accelerator RTL compared to the SoC IP blocks that most DV methodology was designed around.

The Coverage Model Mismatch

Standard IP-block DV methodology was developed for reusable interface IP — AXI controllers, PCIe endpoints, DDR PHY front-ends. These blocks have stable interfaces, well-defined protocol specifications, and years of accumulated verification experience codified into VIP (Verification IP) packages. The coverage model for an AXI slave interface is largely standardized: protocol state coverage, byte-enable combinations, burst lengths, error injection paths. You buy or license a VIP, you run it against your implementation, you close the coverage plan that came with the VIP.

Custom accelerator datapaths don't have this ecosystem. An ML inference accelerator built around a custom systolic array with non-standard weight-stationary dataflow has no off-the-shelf coverage plan and no VIP. The DV team has to construct the coverage model from the architecture specification, which is itself a work in progress while RTL is being written. Coverage planning and RTL development are happening in parallel, which means the coverage plan is always chasing a moving target.

The practical consequence: accelerator teams often reach tape-out with a coverage plan that was finalized months after RTL freeze, so the coverage bins describe the design as built rather than as specified. That isn't necessarily wrong, and a plan finalized against the final RTL can reach 100% coverage. But a plan derived from the implementation tends to inherit the implementation's blind spots, so closing it may not catch the architectural-intent bugs that early-defined coverage plans are designed to surface.

The Datapath Width Problem

Modern AI accelerators operate on wide datapaths: 512-bit or 1024-bit systolic array operand buses, INT8 and FP16 compute lanes in parallel, SIMD-style instruction execution. For constrained-random simulation, wide datapaths are fundamentally different from the 32-bit or 64-bit buses that most DV tooling was optimized for.

The coverage explosion is the first symptom. A 512-bit data bus can carry 2^512 distinct values, a nominally astronomical coverage space, but the architecturally relevant coverage isn't uniform across that space. What actually matters is a set of specific boundary conditions: max/min representable values in the active compute format, patterns that trigger saturation logic, zero-accumulator conditions on the systolic boundary, and partial-tile edge cases where the operand matrix doesn't evenly divide the systolic array dimensions. These conditions are vanishingly rare under uniform random stimulus, yet they are where the architecturally interesting behavior concentrates.

The constrained-random testbench has to encode this architectural knowledge as constraints. Getting those constraints right — narrow enough to hit the interesting regions efficiently, broad enough to explore the space around them — is a domain judgment call that requires someone who understands both the hardware architecture and the DV methodology simultaneously. In a small team (5-8 engineers, all technical, none with deep DV specialization), that person may not exist at the right moment in the project timeline.
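One way to picture that kind of constraint is to bias each lane toward format corners instead of sampling uniformly. The sketch below is illustrative only, written in Python rather than as a real testbench. It assumes INT8 lanes on a 512-bit operand bus, and the names CORNERS, random_operand, pack, and partial_tile_shapes are hypothetical.

```python
import random

LANE_BITS = 8                  # assumed INT8 compute lanes
BUS_BITS = 512                 # operand bus width discussed above
LANES = BUS_BITS // LANE_BITS  # 64 lanes per operand

INT8_MIN, INT8_MAX = -128, 127
# Hypothetical corner set: format extremes, zero, and their neighbors.
CORNERS = [INT8_MIN, INT8_MIN + 1, -1, 0, 1, INT8_MAX - 1, INT8_MAX]


def random_lane(rng, corner_weight=0.5):
    """Pick one INT8 lane value, biased toward the corner values."""
    if rng.random() < corner_weight:
        return rng.choice(CORNERS)
    return rng.randint(INT8_MIN, INT8_MAX)


def random_operand(rng, corner_weight=0.5):
    """Build one 512-bit operand as a list of 64 signed INT8 lanes."""
    return [random_lane(rng, corner_weight) for _ in range(LANES)]


def pack(lanes):
    """Pack signed lanes into one integer, lane 0 in the low bits."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (i * LANE_BITS)
    return word


def partial_tile_shapes(rows, cols, array_dim):
    """Return (full_tiles, leftover_rows, leftover_cols) for a rows x cols matrix."""
    full_tiles = (rows // array_dim) * (cols // array_dim)
    return full_tiles, rows % array_dim, cols % array_dim
```

The corner_weight parameter is the judgment call in miniature: set it too high and the stimulus never explores the space around the corners, too low and the corners are almost never hit. A nonzero leftover from partial_tile_shapes marks a matrix shape that exercises the partial-tile path.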

The RTL Churn Problem

Early-stage chip teams move fast. An AI accelerator team at the architectural exploration stage might change the systolic array tile dimensions, dataflow strategy, or memory hierarchy two or three times before RTL freeze. Each of those changes invalidates some portion of the existing testbench and coverage plan.

The standard approach to this in larger chip teams is a formal change control process: RTL changes trigger a verification impact assessment before they're integrated. Small teams skip this step for speed, which is rational for early exploration but creates technical debt that becomes critical near tape-out. A testbench that was written for a weight-stationary dataflow architecture doesn't test an output-stationary dataflow architecture correctly, even if the interfaces look similar. The bugs that a weight-stationary testbench doesn't surface in an output-stationary design are exactly the ones that cause post-silicon failures.

What we've observed is that accelerator teams often hit this problem around 70-75% coverage on their first full regression run: the coverage curve climbs quickly on the interface-level bins (which the constrained-random testbench hits well) and then stalls on the compute-core bins where the testbench assumptions have drifted from the current RTL. Those are the 25-30% coverage gaps that require directed testing, but the directed test scenarios are hard to write without an up-to-date coverage plan that reflects the current architecture.

The Simulator Farm Access Problem

Established chip companies run simulation farms with hundreds or thousands of cores running regressions in parallel. An early-stage accelerator startup has, at best, a small cloud simulation budget and likely a shared internal cluster. Simulation time is a meaningful constraint on coverage closure in a way it isn't at companies with effectively unlimited farm access.

This constraint changes the economics of the coverage closure problem. At a large chip company with abundant simulation resources, the dominant cost of closure is the engineer time spent writing directed tests; running more seeds is cheap. At an early-stage team, running more seeds has a real budget impact: every simulation hour spent on redundant seeds that close no new coverage bins costs money, not just time.

Test efficiency, measured as new coverage bins closed per simulation hour, matters more at early-stage teams than anywhere else. This is precisely the problem Photoniq is built to address: prioritizing which test scenarios to run next based on predicted coverage gain, so simulation budget goes toward tests that actually advance closure rather than toward whatever unguided random seeds happen to generate.
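To make the bins-per-simulation-hour idea concrete, here is a minimal greedy sketch. It is not Photoniq's implementation. It assumes some predictor has already estimated which bins each candidate test would hit and how long it would run, and the names Candidate and plan_regression are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    predicted_bins: frozenset  # bins this test is predicted to hit (assumed input)
    sim_hours: float           # estimated simulation cost, must be > 0


def plan_regression(candidates, covered, budget_hours):
    """Greedily pick tests by predicted new bins per simulation hour."""
    covered = set(covered)
    remaining = list(candidates)
    plan = []
    while remaining:
        affordable = [c for c in remaining if c.sim_hours <= budget_hours]
        if not affordable:
            break
        # Score each test by bins it would newly close, per hour of simulation.
        best = max(
            affordable,
            key=lambda c: len(c.predicted_bins - covered) / c.sim_hours,
        )
        if not (best.predicted_bins - covered):
            break  # no affordable test would close a new bin
        plan.append(best.name)
        covered |= best.predicted_bins
        budget_hours -= best.sim_hours
        remaining.remove(best)
    return plan, covered
```

A greedy pass like this is only as good as the predicted_bins estimates feeding it, but even a rough predictor keeps redundant seeds, those that would close nothing new, out of the regression.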

We're not saying that having more simulation resources would solve the verification challenges described here — it wouldn't. The coverage model mismatch and RTL churn problems don't respond to more compute. But simulation budget efficiency compounds with the other challenges, and it's the one that's most directly addressable with better tooling.

The Coverage Sign-Off Definition Problem

What does 100% functional coverage mean for a custom accelerator? The question is harder than it sounds. For a protocol IP block with a defined specification, coverage closure means demonstrating that every specified protocol behavior has been exercised. For a custom datapath, "every specified behavior" includes the architectural intent of the compute engine, which may be partially implicit in the design spec and partially in the RTL engineer's head.

Many early-stage teams define their coverage sign-off criteria retrospectively — they write coverage bins to match the testbench's natural stimulus space, close those bins, and declare success. This is logically consistent but potentially meaningless from a bug-escape prevention standpoint. Coverage bins that are easy to close are often not the bins that correspond to the most likely bug sources.

A more rigorous approach is to define coverage bins from the failure modes first: what conditions would cause this accelerator to produce incorrect compute results, and are those conditions covered? For a systolic array, that means: accumulator overflow conditions, partial tile edge cases, memory hazard scenarios on the weight buffer, and the specific numerical edge cases for each supported compute format. Those bins are harder to close, but closing them actually reduces post-silicon risk.
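A failure-mode-first bin list can be written down as named predicates over observed operations, before any stimulus exists. The following is a sketch under stated assumptions: a 32-bit signed accumulator, a hypothetical TileOp record filled in from a reference model, and made-up bin names. A production flow would express these as covergroups in the testbench language.

```python
from dataclasses import dataclass

ACC_MAX = 2**31 - 1   # assumed 32-bit signed accumulator
ACC_MIN = -2**31


@dataclass
class TileOp:
    """One observed tile operation (hypothetical transaction record)."""
    rows: int
    cols: int
    array_dim: int
    acc_result: int                 # unsaturated value from a reference model
    weight_write_during_read: bool  # weight buffer written while being read


# Each bin is named after a way the accelerator could compute a wrong result.
FAILURE_MODE_BINS = {
    "acc_overflow_pos": lambda t: t.acc_result > ACC_MAX,
    "acc_overflow_neg": lambda t: t.acc_result < ACC_MIN,
    "partial_tile_rows": lambda t: t.rows % t.array_dim != 0,
    "partial_tile_cols": lambda t: t.cols % t.array_dim != 0,
    "weight_buffer_hazard": lambda t: t.weight_write_during_read,
}


def sample(hits, op):
    """Increment the hit count of every bin whose condition op satisfies."""
    for name, predicate in FAILURE_MODE_BINS.items():
        if predicate(op):
            hits[name] = hits.get(name, 0) + 1
    return hits


def open_bins(hits):
    """Return the failure-mode bins that have never been hit."""
    return sorted(n for n in FAILURE_MODE_BINS if hits.get(n, 0) == 0)
```

Because the bins exist independently of the testbench, open_bins reports what the current stimulus cannot reach, which is the list that directed tests need to target.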

The practical barrier is time: defining failure-mode-first coverage bins requires architecture and DV collaboration that is hard to sustain when both teams are operating at startup pace. The coverage plan becomes whatever the DV engineer can write in the time available, which tends toward stimulus-space coverage rather than failure-mode coverage.

What Actually Helps

Based on the patterns we've seen, a few practices make a meaningful difference for accelerator teams doing their first tape-out:

Define iteration zero of the coverage plan before RTL is written. Even if it changes substantially, starting with failure-mode-oriented coverage bins anchors the testbench to architectural risk rather than implementation structure. The coverage plan should be a living document with explicit versioning: when RTL changes, someone should update the coverage plan version and note which bins are affected.
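Versioning does not need heavy tooling. As a rough sketch, a plan can record which RTL blocks each bin depends on, so that a change produces a list of bins to review. The class name, block names, and changelog layout below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CoveragePlan:
    version: int = 0
    # bin name -> set of RTL blocks the bin depends on (hypothetical names)
    bins: dict = field(default_factory=dict)
    changelog: list = field(default_factory=list)

    def add_bin(self, name, depends_on):
        """Register a bin and the RTL blocks whose changes would affect it."""
        self.bins[name] = set(depends_on)

    def record_rtl_change(self, changed_blocks, note):
        """Bump the plan version and return the bins that need review."""
        changed = set(changed_blocks)
        affected = sorted(
            name for name, deps in self.bins.items() if deps & changed
        )
        self.version += 1
        self.changelog.append(
            {"version": self.version, "note": note, "affected_bins": affected}
        )
        return affected
```

For example, recording a dataflow change against a hypothetical "pe_array" block returns every bin that listed that block as a dependency, and the changelog keeps the history of which plan version went with which RTL change.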

Separate interface-level coverage (which constrained-random handles well) from compute-correctness coverage (which it doesn't) from day one. Run constrained-random simulation for the former and allocate directed-test engineering time specifically for the latter. Don't let an 85% interface-level closure rate mask compute-correctness coverage that is sitting at 40%.
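Reporting closure per category, rather than as one blended number, is what keeps that gap visible. A minimal sketch, assuming each bin is exported as a (category, name, hit) tuple; the function names and that tuple layout are assumptions, not any tool's actual format.

```python
def closure_by_category(bins):
    """Map each category to its percentage of closed bins."""
    totals, closed = {}, {}
    for category, _name, hit in bins:
        totals[category] = totals.get(category, 0) + 1
        closed[category] = closed.get(category, 0) + (1 if hit else 0)
    return {c: 100.0 * closed[c] / totals[c] for c in totals}


def overall_closure(bins):
    """Blended closure percentage across all bins, for comparison."""
    bins = list(bins)
    return 100.0 * sum(1 for _c, _n, hit in bins if hit) / len(bins)
```

When interface bins far outnumber compute bins, the blended number tracks the interface category almost exactly, which is how a stalled compute core can hide behind a healthy-looking total.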

Build the DV infrastructure to support partial-design simulation before full-chip integration. Accelerator cores with their own testbenches, tested in isolation, allow coverage closure work to proceed while the SoC integration is still in flux. This is standard practice at large chip companies and gets skipped at startups because it looks like overhead. It pays back when the integration schedule compresses.

The verification challenges at AI chip startups are real and specific, but they're not unsolvable. The teams that navigate them well are the ones who treat verification methodology as a first-class engineering discipline alongside the compute architecture work — not as a compliance exercise to run after RTL is done.