Development with AI Agents

How Claude / Copilot CLI was used to build Flash compiler features, FlexLSO programs, and HRW integration — from RTL analysis to passing datapath tests.
March 2026

1. Flash — Introduction

What is Flash?

Flash is a Rust-based domain-specific compiler for the DPU's programmable flow processors (SGE, PTA) and FlexLSO. It compiles .flash programs into JSON artifacts that FunOS loads to program hardware TCAM/instruction memory at runtime.

Pipeline: Source → Scanner (47 tokens) → Parser (recursive descent) → Semantic Model → Encoder → JSON + C bindings

The Flash pipeline

① Flow Processor — Structured TCAM Match
Pipeline: .flash program (key + flow + handler) → Flash Compiler (S3 target) → JSON TCAM entries → FunOS Loader (programs CSRs) → SGE Flow Processor (128 TCAM entries × 68-bit key, 256 instruction memory entries). Scope: compile-time validation of TCAM entries.

The 📐 Interactive Flash Architecture Diagram provides a detailed clickable view of the compiler internals.

2. Flow Processor TCAM — Structured Match Syntax

The first project was improving how TCAM entries are expressed in Flash. The SGE flow processor has 128 TCAM entries with a 68-bit key — but the flow syntax treated it as opaque hex.

Asking the AI to analyze the problem

Session Snippet — Analyzing TCAM Semantics
Human
Read the Flash compiler source. Look specifically at how the TCAM match semantics are defined, and suggest whether they are human-friendly or could be improved.
AI
Current Flash flow syntax:
flow {
    mask  0xF0000
    match 0x200000
    action do_stuff
}
Problems identified:
  1. Raw hex mask/match is not human-friendly — the 68-bit key is structured (queue type, opcode, error bits), but treated as opaque hex
  2. No key structure declaration — programmer must consult keygen separately to know bit positions
  3. No priority/overlap validation — overlapping patterns not detected
  4. Flow is too low-level — the handler/flow separation is clean, but flow itself lacks any abstraction
Suggested: Add key struct declaration + structured match with field-level matching + compiler validations.

The transformation

Before — Raw hex
flow {
    mask  0xf000_0000_0000_0000
    match 0x2000_0000_0000_0000
    action sq_db
}
After — Structured match
key sge_keygen_key @ message {
    msg_type  : 4
    sid       : 1
    cid       : 1
    thd_id    : 2
    queue_type: 2
    ...
}

flow {
    match sge_keygen_key {
        msg_type == MSG_TYPE_SQ_DB
    }
    action sq_db
}

The compiler resolves field names to bit positions, computes mask/value automatically, and validates overlaps and exhaustiveness at compile time.
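As a rough sketch of what this resolution involves, the snippet below computes a TCAM (mask, value) pair from a field-level match and checks two entries for overlap. The field placement, widths, and function names are illustrative assumptions, not the actual Flash internals:

```rust
// Hypothetical sketch: turn a field-level match into a TCAM (mask, value)
// pair and detect overlapping entries. Field layout and names are
// illustrative, not the real Flash model.

#[derive(Clone, Copy)]
struct Field {
    offset: u32, // LSB position of the field within the key
    width: u32,  // field width in bits
}

/// Compute the TCAM mask/value contribution of `field == wanted`.
fn field_match(field: Field, wanted: u128) -> (u128, u128) {
    let mask = ((1u128 << field.width) - 1) << field.offset;
    let value = (wanted << field.offset) & mask;
    (mask, value)
}

/// Two TCAM entries overlap if some key matches both: their values must
/// agree on every bit that both masks care about.
fn overlaps(a: (u128, u128), b: (u128, u128)) -> bool {
    let common = a.0 & b.0;
    (a.1 & common) == (b.1 & common)
}

fn main() {
    // A 4-bit msg_type field at bit 16 (illustrative placement).
    let msg_type = Field { offset: 16, width: 4 };
    let e1 = field_match(msg_type, 0x2); // msg_type == 2
    assert_eq!(e1, (0xF0000, 0x20000));

    // A fully wildcarded entry overlaps everything.
    assert!(overlaps(e1, (0, 0)));

    // msg_type == 3 cannot collide with msg_type == 2.
    assert!(!overlaps(e1, field_match(msg_type, 0x3)));
}
```

A check like `overlaps()` is the kind of validation that can surface TCAM collisions such as the one found during program conversion.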

✅ What worked well
  • AI analyzed 68-bit key structure across compiler + RTL, proposed 4-part improvement
  • Implemented end-to-end: scanner keywords → parser → model → encoder → tests
  • Converted all 17 SGE programs; found a real TCAM collision bug during conversion
⚠️ What needed human push
  • "Use constants for mask and value — it looks ugly" → AI initially generated raw hex in structured syntax
  • Design taste: AI proposes technically correct solutions; human refines ergonomics
📖 Full Session Log — Flow Processor TCAM

3. FlexLSO — From RTL Analysis to Real Programs

FlexLSO Pipeline — Compiler + Programs (New S3 Block)
Pipeline: .flash program (profiles + lines) → Flash Compiler (FlexLSO flavor: 13 opcodes, 3-bank encoding) → JSON (mem0 + mem1 + mem2) → FunOS Loader (⏳ pending) → ETP FlexLSO Engine (Profile SRAM, 8 ALUs; needs S3 refmodel). Programs: TCP + RoCEv2 (4 profiles); legacy LSO RTL used as reference only.

Inspired by the TCAM work, the next question was: can we extend Flash to handle FlexLSO — a major new S3 hardware block with no prior SW support? This section shows the deliberate approach used to develop it end-to-end.

Phase 1: Start with RTL analysis

Rather than jumping to implementation, the first step was having the AI read and understand the actual hardware — legacy LSO RTL and the new FlexLSO instruction set.

Session Snippet — RTL Deep-Dive
Human
Read the flex_lso_spec and go through the asics RTL repository. Understand ETP FlexLSO, focusing specifically on the input WU, NU operate command, LSO operate command, and the fixed/flex structure that gets fed to FlexLSO instructions. Let me know once done.
AI
ETP FlexLSO: Spec ↔ RTL Mapping

1. Input WU — wudma_nu_com_opr_t. S3 adds two new bits: flex (0=Legacy, 1=Flex) and size (0=32B, 1=64B).

2. LSO Operate Command — Legacy = fixed 128-bit format. Flex = 16B fixed portion (opc, cmd, hdr_mod_ctl, mode, mss, flex_flag, profile, template, proto_ofst) + variable flex data (16B or 48B).

3. Data to FlexLSO engine (128B) — Fixed DS (16B) + ALU State (16B) + Flex DS (up to 48B). Three separate SRAM banks: mem0 (14-bit metadata), mem1 (instructions 0-3), mem2 (instructions 4-7).

This produced a comprehensive 850-line specification extracted from RTL — covering command structures, header stack types, per-packet modification logic, and the 5-stage data pipeline.
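The three-bank layout can be sketched as a tiny encoder: metadata to mem0, the first four instructions to mem1, the remainder to mem2. The widths and types below are assumptions based only on the description above, not the real Flash encoder:

```rust
// Illustrative 3-bank split: per-profile metadata in mem0 (14-bit),
// instructions 0-3 in mem1, instructions 4-7 in mem2. Encoding widths
// and error handling are assumptions, not the real Flash encoder.

struct Banks {
    mem0: u16,      // 14-bit metadata word
    mem1: [u32; 4], // instructions 0-3
    mem2: [u32; 4], // instructions 4-7
}

fn encode_profile(metadata: u16, instrs: &[u32]) -> Result<Banks, String> {
    if metadata >= 1 << 14 {
        return Err("metadata exceeds 14 bits".into());
    }
    if instrs.len() > 8 {
        return Err(format!("{} instructions, max 8", instrs.len()));
    }
    let mut mem1 = [0u32; 4];
    let mut mem2 = [0u32; 4];
    for (i, &ins) in instrs.iter().enumerate() {
        // First four go to bank 1, the rest to bank 2.
        if i < 4 { mem1[i] = ins } else { mem2[i - 4] = ins }
    }
    Ok(Banks { mem0: metadata, mem1, mem2 })
}

fn main() {
    let b = encode_profile(0x1abc, &[1, 2, 3, 4, 5]).unwrap();
    assert_eq!(b.mem1, [1, 2, 3, 4]);
    assert_eq!(b.mem2, [5, 0, 0, 0]);
    assert!(encode_profile(0x7fff, &[]).is_err()); // 15-bit metadata rejected
}
```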

Phase 2: Design through Q&A

Before writing any code, the approach was to have the AI propose a design and then ask clarifying questions. This back-and-forth — 12 clarifying questions across multiple rounds — settled the key design decisions before a single line of code was written.

🔍 See the full design Q&A exchange — the most interesting part of this session
Session Snippet — Design Proposal
Human
Can we enhance Flash compiler to tackle FlexLSO as well? Basically, write programs that compile and generate CSRs for FlexLSO instructions. Need way to represent structs, instruction lines etc. Think deeply and come back with a proposal.
AI
After analyzing the target/flavor system, I propose a FLEXLSO flavor on a new S3 target with these constructs:
struct for header types, flags for opr_flag bits, profile with indexed lines, each line targeting a stack position with typed header binding. Compute instructions use multi-stage pipeline syntax (compute A + B -> swap hdr.field)...

[Analysis of all 13 opcodes, memory model, encoding format follows]
Session Snippet — Human-Driven Design Refinement
Human
For header dispatch, I want to pass struct like tcp.seq_num instead of constants like 16. Can you improve on that?
AI
// Before (raw offsets):
compute fix.seq_num + flex.tcp_seq_num_inc
    -> swap header[16, 4]

// After (symbolic):
line inner_l4 -> stack(0) as tcp : tcp_hdr_t {
    compute fix.seq_num + flex.tcp_seq_num_inc
        -> swap tcp.seq_num
}
The compiler resolves tcp.seq_num → offset 16, length 4 from the struct definition. Raw header[ofst, len] still available as fallback.
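A minimal sketch of this symbolic resolution, assuming a simple packed byte layout (field names, widths, and units here are illustrative; the real compiler works from the tcp_hdr_t declaration and FlexLSO's own offset units):

```rust
// Sketch of symbolic field resolution: a header struct declaration maps a
// field name to a (byte offset, length) pair, so programs can write
// tcp.seq_num instead of header[ofst, len]. Layout rules are illustrative.

struct HeaderStruct {
    fields: Vec<(String, u32)>, // (name, length in bytes), in layout order
}

impl HeaderStruct {
    /// Resolve a field name to (offset, length), both in bytes.
    fn resolve(&self, name: &str) -> Option<(u32, u32)> {
        let mut offset = 0;
        for (f, len) in &self.fields {
            if f.as_str() == name {
                return Some((offset, *len));
            }
            offset += *len;
        }
        None
    }
}

fn main() {
    // A toy TCP header prefix (real TCP byte widths, reduced field set).
    let tcp_hdr_t = HeaderStruct {
        fields: vec![
            ("src_port".into(), 2),
            ("dst_port".into(), 2),
            ("seq_num".into(), 4),
            ("ack_num".into(), 4),
        ],
    };
    // In this toy layout seq_num resolves to byte offset 4; the actual
    // FlexLSO offsets and units may differ.
    assert_eq!(tcp_hdr_t.resolve("seq_num"), Some((4, 4)));
    assert_eq!(tcp_hdr_t.resolve("bogus"), None);
}
```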
Session Snippet — Iterative Refinement of Programs
Human
when opr & 24'h000001 == 24'h000001 — why are there raw hex constants? Need descriptive constants defined and used.
AI
Done. All guards now use named constants. Also enhanced compiler so when accepts named constants (not just literals):
when opr & FIRST_OR_LAST == ZERO — making it clear that the guard checks "neither first nor last".
Added compound constants: SE_LAST, IMM_LAST, ZERO.
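The effect of the named constants can be sketched directly (the bit assignments below are invented for illustration; the real opr_flag encoding is not shown in the source):

```rust
// Named-constant guard evaluation instead of raw hex. Bit positions are
// illustrative assumptions, not the real opr_flag encoding.

const FIRST: u32 = 1 << 0;
const LAST: u32 = 1 << 1;
const FIRST_OR_LAST: u32 = FIRST | LAST; // compound constant
const ZERO: u32 = 0;

/// `when opr & FIRST_OR_LAST == ZERO`: true only for middle segments,
/// i.e. neither the first nor the last.
fn is_middle(opr: u32) -> bool {
    opr & FIRST_OR_LAST == ZERO
}

fn main() {
    assert!(is_middle(0));
    assert!(!is_middle(FIRST));
    assert!(!is_middle(LAST));
    assert!(!is_middle(FIRST | LAST));
}
```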

Phase 3: Implementation + Rapid iteration

With the design settled, the AI implemented the full compiler in one session, followed by iterative program development with constant human feedback.

Compiler implementation
Scanner (new tokens: +, |, ->) → Parser (10 new methods) → Model (profile/line/instruction resolution) → Encoder (3-bank memory output). 167 tests on first pass.
Test expansion & audit
Asked AI to review coverage → expanded 24 → 71 tests. Feature audit found 87% complete → fixed 4 gaps (multi-file imports, bounds validation). Final: 223 tests.
TCP & RoCEv2 programs
4 profiles: single/double encap for each protocol. Iterative review: immediate data placement fix, constant naming, ownership model (HW vs SW).
Code review
Asked AI to review its own code → found 3 issues: integer overflow, bounds validation, unused parse result. All fixed.

Catching real errors

Session Snippet — AI Catches Incorrect Placement
Human
line eth_terminate -> stack(3) { when opr & IMM_LAST == IMM_LAST { add_hdr flex.immediate, 12, 4 } }
This immediate is for RoCE header. Why are you appending to Ethernet header?
AI
Fixed. Immediate data (add_hdr) is now in the BTH line (stack 0) — right where it belongs in the RoCE header, before ICRC. The ETH lines are now just terminate.
  • 223 tests passing
  • 13 opcodes supported
  • 4 LSO profiles (TCP + RoCE)
  • 850 lines of RTL spec extracted
✅ What worked well
  • RTL-first approach — AI extracted comprehensive spec that served as implementation reference
  • Design Q&A before coding — clarifying questions surfaced edge cases early
  • Rapid iteration — compiler implemented in one session, features added on demand
  • AI reviewed its own code and found real bugs (overflow, bounds)
⚠️ What needed human direction
  • HW/SW ownership boundaries (fixed = HW, flex = SW) — AI treated everything as SW until corrected
  • Protocol correctness — RoCE immediate data placement in wrong header layer
  • Knowing when to push back on AI output and iterate on naming, structure, constants

Design specs

📄 FlexLSO Compiler Proposal 📄 FlexLSO Programs Design

📖 Full Session Log — FlexLSO Development

4. HRW — Closing the Loop with Real Traffic

HRW Pipeline — End-to-End with Datapath Validation
Pipeline: .flash program (actions + profiles) → Flash Compiler (HRW flavor, S2 target, 10 opcodes) → JSON + C bindings (.h with ISM + actions) → hrw_json.c (541 lines, JSON → CSR writes) → HRW Ref Model (S2-POSIX, hrw.cc: ISM + TCAM + CRB) → ✓ 13/13 datapath tests passing. Scope: full end-to-end — compiler → loader → HW ref model → real packets through the HW reference model.

FlexLSO was complete but couldn't be traffic-tested (S3 ETP refmodel not available). The next step was to apply the same approach to HRW (Header Rewrite) on S2, where existing ref model + tests could validate actual datapath behavior.

Design-first approach

Session Snippet — Thorough Design Process
Human
Resume the work on the HRW proposal. Read all the relevant software — Flash compiler, HRW ref model (hrw.cc), FunOS hrw_tests.c. Then, ask me clarifying questions on design.
AI
HRW Flash Flavor — Design Proposal

Pipeline: .flash → structured JSON → FunOS driver parses JSON at runtime → CSR writes → HW/ref model

Key constructs:
action NAME { ... } — Block of ≤8 HRW-native instructions (movd, movi, addi, etc.)
profile NAME { step ... } — Sequence of ≤32 ISM steps referencing named actions
enum hboa_idx { OL_L2=0, ... } — User-defined HBOA layer names
const CONST_GLOBAL = 0 — Named constant indices

Phase 1: Full ISM sequencer + action commands
Phase 2: TCAM dispatch
Session Snippet — Pushing for HW Constraint Enforcement
Human
Did you look at the data structures involved? There are N2P and flow state — some fields are HW-defined, the rest are SW-controlled. We need these APIs in flash programs with C-bindings. Also, did you check the HRW refmodel for HW constraints? We need flash to enforce these at compile time.
AI
Comprehensive analysis of N2P (64B common header), flow state (28B with HW/SW split), and const_block structures. Mapped all fields with HW vs SW ownership. Ref model constraints identified: max 8 instructions per action, max 32 ISM steps, nibble alignment, HBOA index bounds — all to be enforced at compile time.
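A sketch of what enforcing those constraints at compile time might look like (the limits are taken from the text above; the HBOA layer count, struct shapes, and error messages are assumptions):

```rust
// Compile-time constraint checks as described: max 8 instructions per
// action, max 32 ISM steps, HBOA index bounds. HBOA_LAYERS is an assumed
// bound; struct shapes and error strings are illustrative.

const MAX_ACTION_INSTRS: usize = 8;
const MAX_ISM_STEPS: usize = 32;
const HBOA_LAYERS: u32 = 8; // assumption, not stated in the source

struct Action { instrs: Vec<u32> }
struct Profile { steps: Vec<String> }

fn check_action(a: &Action) -> Result<(), String> {
    if a.instrs.len() > MAX_ACTION_INSTRS {
        return Err(format!("action has {} instructions, max {}",
                           a.instrs.len(), MAX_ACTION_INSTRS));
    }
    Ok(())
}

fn check_profile(p: &Profile) -> Result<(), String> {
    if p.steps.len() > MAX_ISM_STEPS {
        return Err(format!("profile has {} steps, max {}",
                           p.steps.len(), MAX_ISM_STEPS));
    }
    Ok(())
}

fn check_hboa(idx: u32) -> Result<(), String> {
    if idx >= HBOA_LAYERS {
        return Err(format!("HBOA index {} out of range", idx));
    }
    Ok(())
}

fn main() {
    assert!(check_action(&Action { instrs: vec![0; 8] }).is_ok());
    assert!(check_action(&Action { instrs: vec![0; 9] }).is_err());
    assert!(check_profile(&Profile { steps: vec![String::new(); 33] }).is_err());
    assert!(check_hboa(7).is_ok());
    assert!(check_hboa(8).is_err());
}
```

Rejecting these at compile time, rather than at load time in FunOS, is the point of the "enforce in flash" request above.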

Full pipeline: Compiler → Loader → Datapath

Unlike FlexLSO, HRW was carried all the way through to real traffic flowing through the HW reference model.

Flash compiler — HRW flavor
New S2 target, 10 opcodes, all source/dest selectors (hdr, newhdr, flowstate, const, n2p). 173 Rust tests + formal design spec.
FunOS JSON loader driver
hrw_json.c (541 lines) — parses Flash JSON output, programs ISM + action commands via CSR APIs.
Test port & datapath validation
Ported hrw_tests.c (2582 lines, 3 test functions) to use Flash programs. Created hrw_datapath.flash with 24 actions, 12 profiles.
Iterative debugging to pass all 13 tests
Multiple debugging cycles: nibble vs byte offsets, HBOA-relative addressing, flow state struct padding, ADDI 4-operand syntax, composite profile NOP actions.
  • 173 compiler tests
  • 10 HRW opcodes
  • 13/13 datapath tests passing
  • ~1000 lines design spec
✅ What worked well
  • End-to-end validation — proved Flash output works through actual HW ref model
  • AI analyzed ref model error codes to narrow down root causes
  • C-bindings generation — auto-generated header files for FunOS integration
⚠️ Key discoveries during debugging
  • All HRW offsets are nibbles (half-bytes), not bytes — AI initially used byte offsets
  • HBOA-relative addressing: hdr/newhdr vars MUST use @ 0 — HBOA provides the layer offset at runtime
  • Flow state struct had extra padding in Flash definition vs C offsetof() values
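The first two discoveries can be made concrete with a small sketch (the helper names are hypothetical, not from the actual compiler or loader):

```rust
// HRW offsets are in nibbles (4-bit units), so a byte offset must be
// doubled. Helper names are illustrative, not from the real code.

fn byte_to_nibble(byte_offset: u32) -> u32 {
    byte_offset * 2
}

/// hdr/newhdr variables are HBOA-relative: the hardware adds the layer's
/// base offset at runtime, so the program-visible offset must be 0.
fn check_hboa_relative(offset: u32) -> Result<(), String> {
    if offset != 0 {
        return Err(format!("hdr/newhdr must use @ 0, got @ {}", offset));
    }
    Ok(())
}

fn main() {
    // Byte 14 (e.g. the end of an Ethernet header) is nibble 28.
    assert_eq!(byte_to_nibble(14), 28);
    assert!(check_hboa_relative(0).is_ok());
    assert!(check_hboa_relative(4).is_err());
}
```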

Design spec

📄 HRW Flash Compiler Spec (~1000 lines)

📖 Full Session Log — HRW Development

5. Lessons Learned — What Works and What Doesn't

✅ What works well with AI agents
  • Codebase analysis at scale — Reading RTL, compiler internals, thousands of lines of C/Rust and synthesizing understanding
  • Language & compiler design — Propose syntax, iterate on feedback, implement full pipeline
  • Comprehensive test generation — Including negative tests, edge cases, coverage audits
  • Bug detection during migration — Found the csr_oper TCAM collision that humans missed
  • Rapid iteration — Feature request → implementation → tests in minutes, not days
  • Documentation — Specs, user guides, design proposals, session logs
  • Self-review — AI can review its own code and find real issues (overflow, bounds)
⚠️ What needs human direction
  • HW/SW boundary decisions — Which structs are HW-owned vs SW-owned requires domain knowledge
  • Protocol correctness — AI may put RoCE data in the wrong header layer
  • Architecture decisions — When to pivot (FlexLSO → HRW for testability), how to structure files
  • Debugging direction — AI analyzes well but human must point to the right area to investigate
❌ What doesn't work
  • Blind trust on bit-level encoding — Always verify nibble/byte, offset conventions, struct layouts against HW
  • Assuming AI understands HW semantics — Nibble offsets, HBOA-relative addressing were wrong initially
  • Long autonomous runs — Best results with frequent human checkpoints and feedback loops

The key insight

AI as a 10x amplifier, not a replacement

The most effective pattern was human sets direction, AI executes at scale, human reviews and refines. Each Flash area took 1-2 sessions (~4-6 hours) vs what would have been weeks of manual work. The AI handled the tedious parts (parsing all RTL files, writing 200+ tests, implementing compiler pipeline) while the human focused on design decisions, correctness, and ergonomics.

6. The Repeatable Workflow

The same pattern was used successfully across all three areas:

1. Analyze → 2. Design → 3. Implement → 4. Validate → 5. Refine

Step | What the human does | What the AI does
1. Analyze | Point to RTL, specs, existing code | Read thousands of lines, synthesize understanding, produce reference docs
2. Design | Ask clarifying questions, steer decisions | Propose language constructs, data models, implementation plan
3. Implement | Say "go ahead" + review output | Write scanner/parser/model/encoder, generate tests, commit
4. Validate | Run builds, point to failures | Analyze errors, propose fixes, iterate
5. Refine | Code review feedback, naming, ergonomics | Apply fixes, expand tests, update docs