Development with AI Agents

How Claude / Copilot CLI was used to build Flash compiler features, FlexLSO programs, and HRW integration — from RTL analysis to passing datapath tests.

March 2026

1. Flash — Introduction

What is Flash?

Flash is a Rust-based domain-specific compiler for the DPU's programmable flow processors (SGE, PTA) and FlexLSO. It compiles .flash programs into JSON artifacts that FunOS loads to program hardware TCAM/instruction memory at runtime.

Pipeline: Source → Scanner (47 tokens) → Parser (recursive descent) → Semantic Model → Encoder → JSON + C bindings

The Flash pipeline

① Flow Processor — Structured TCAM Match

The 📐 Interactive Flash Architecture Diagram provides a detailed clickable view of the compiler internals.

2. Flow Processor TCAM — Structured Match Syntax

The first project was improving how TCAM entries are expressed in Flash. The SGE flow processor has 128 TCAM entries with a 68-bit key — but the flow syntax treated it as opaque hex.

Asking the AI to analyze the problem

Session Snippet — Analyzing TCAM Semantics

Human

Read the Flash compiler source. Look at specifically how the TCAM match semantic is defined and suggest if it is human friendly or improvements can be made?

Current Flash flow syntax:

flow {
    mask  0xF0000
    match 0x200000
    action do_stuff
}

Problems identified:

Raw hex mask/match is not human-friendly — the 68-bit key is structured (queue type, opcode, error bits), but treated as opaque hex
No key structure declaration — programmer must consult keygen separately to know bit positions
No priority/overlap validation — overlapping patterns not detected
Flow is too low-level — clean handler/flow separation, but flow lacks any abstraction

Suggested: Add key struct declaration + structured match with field-level matching + compiler validations.

The transformation

Before — Raw hex

flow {
    mask  0xf000_0000_0000_0000
    match 0x2000_0000_0000_0000
    action sq_db
}

After — Structured match

key sge_keygen_key @ message {
    msg_type  : 4
    sid       : 1
    cid       : 1
    thd_id    : 2
    queue_type: 2
    ...
}

flow {
    match sge_keygen_key {
        msg_type == MSG_TYPE_SQ_DB
    }
    action sq_db
}

The compiler resolves field names to bit positions, computes mask/value automatically, and validates overlaps and exhaustiveness at compile time.

✅ What worked well

AI analyzed 68-bit key structure across compiler + RTL, proposed 4-part improvement
Implemented end-to-end: scanner keywords → parser → model → encoder → tests
Converted all 17 SGE programs; found a real TCAM collision bug during conversion

⚠️ What needed human push

"Use constants for mask and value — it looks ugly" → AI initially generated raw hex in structured syntax
Design taste: AI proposes technically correct solutions; human refines ergonomics

📖 Full Session Log — Flow Processor TCAM

3. FlexLSO — From RTL Analysis to Real Programs

FlexLSO Pipeline — Compiler + Programs (New S3 Block)

Inspired by the TCAM work, the next question was: can we extend Flash to handle FlexLSO — a major new S3 hardware block with no prior SW support? This section shows the deliberate approach used to develop it end-to-end.

Phase 1: Start with RTL analysis

Rather than jumping to implementation, the first step was having the AI read and understand the actual hardware — legacy LSO RTL and the new FlexLSO instruction set.

Session Snippet — RTL Deep-Dive

Human

Read the flex_lso_spec and go through the asics RTL repository. Understand ETP FlexLSO specifically focusing on the input WU, NU operate command, LSO operate command, fixed/flex structure that gets fed to FlexLSO instructions. Let me know once done.

ETP FlexLSO: Spec ↔ RTL Mapping

1. Input WU — wudma_nu_com_opr_t. S3 adds two new bits: flex (0=Legacy, 1=Flex) and size (0=32B, 1=64B).

2. LSO Operate Command — Legacy = fixed 128-bit format. Flex = 16B fixed portion (opc, cmd, hdr_mod_ctl, mode, mss, flex_flag, profile, template, proto_ofst) + variable flex data (16B or 48B).

3. Data to FlexLSO engine (128B) — Fixed DS (16B) + ALU State (16B) + Flex DS (up to 48B). Three separate SRAM banks: mem0 (14-bit metadata), mem1 (instructions 0-3), mem2 (instructions 4-7).

This produced a comprehensive 850-line specification extracted from RTL — covering command structures, header stack types, per-packet modification logic, and the 5-stage data pipeline.

Phase 2: Design through Q&A

Before writing any code, the approach was to have the AI propose and then ask clarifying questions. This back-and-forth — 12 clarifying questions across multiple rounds — iterated on key design decisions before a single line of code was written.

🔍 See the full design Q&A exchange — the most interesting part of this session

Session Snippet — Design Proposal

Human

Can we enhance Flash compiler to tackle FlexLSO as well? Basically, write programs that compile and generate CSRs for FlexLSO instructions. Need way to represent structs, instruction lines etc. Think deeply and come back with a proposal.

After analyzing the target/flavour system, I propose a FLEXLSO flavour on a new S3 target with these constructs:
struct for header types, flags for opr_flag bits, profile with indexed lines, each line targeting a stack position with typed header binding. Compute instructions use multi-stage pipeline syntax (compute A + B -> swap hdr.field)...

[Analysis of all 13 opcodes, memory model, encoding format follows]

Session Snippet — Human-Driven Design Refinement

Human

For header dispatch, I want to pass struct like tcp.seq_num instead of constants like 16. Can you improve on that?

// Before (raw offsets):
compute fix.seq_num + flex.tcp_seq_num_inc
    -> swap header[16, 4]

// After (symbolic):
line inner_l4 -> stack(0) as tcp : tcp_hdr_t {
    compute fix.seq_num + flex.tcp_seq_num_inc
        -> swap tcp.seq_num
}

The compiler resolves tcp.seq_num → offset 16, length 4 from the struct definition. Raw header[ofst, len] still available as fallback.

Session Snippet — Iterative Refinement of Programs

Human

when opr & 24'h000001 == 24'h000001 — why are there raw hex constants? Need descriptive constants defined and used.

Done. All guards now use named constants. Also enhanced compiler so when accepts named constants (not just literals):
when opr & FIRST_OR_LAST == ZERO — clear that it checks "neither first nor last".
Added compound constants: SE_LAST, IMM_LAST, ZERO.

Phase 3: Implementation + Rapid iteration

With the design settled, the AI implemented the full compiler in one session, followed by iterative program development with constant human feedback.

Compiler implementation

Scanner (new tokens: +, |, ->) → Parser (10 new methods) → Model (profile/line/instruction resolution) → Encoder (3-bank memory output). 167 tests on first pass.

Test expansion & audit

Asked AI to review coverage → expanded 24 → 71 tests. Feature audit found 87% complete → fixed 4 gaps (multi-file imports, bounds validation). Final: 223 tests.

TCP & RoCEv2 programs

4 profiles: single/double encap for each protocol. Iterative review: immediate data placement fix, constant naming, ownership model (HW vs SW).

Code review

Asked AI to review its own code → found 3 issues: integer overflow, bounds validation, unused parse result. All fixed.

Catching real errors

Session Snippet — AI Catches Incorrect Placement

Human

line eth_terminate -> stack(3) { when opr & IMM_LAST == IMM_LAST { add_hdr flex.immediate, 12, 4 } }
This immediate is for RoCE header. Why are you appending to Ethernet header?

Fixed. Immediate data (add_hdr) is now in the BTH line (stack 0) — right where it belongs in the RoCE header, before ICRC. The ETH lines are now just terminate.

223

Tests passing

Opcodes supported

LSO profiles (TCP + RoCE)

850

Lines of RTL spec extracted

✅ What worked well

RTL-first approach — AI extracted comprehensive spec that served as implementation reference
Design Q&A before coding — clarifying questions surfaced edge cases early
Rapid iteration — compiler implemented in one session, features added on demand
AI reviewed its own code and found real bugs (overflow, bounds)

⚠️ What needed human direction

HW/SW ownership boundaries (fixed = HW, flex = SW) — AI treated everything as SW until corrected
Protocol correctness — RoCE immediate data placement in wrong header layer
Knowing when to push back on AI output and iterate on naming, structure, constants

Design specs

📄 FlexLSO Compiler Proposal 📄 FlexLSO Programs Design

📖 Full Session Log — FlexLSO Development

4. HRW — Closing the Loop with Real Traffic

HRW Pipeline — End-to-End with Datapath Validation

FlexLSO was complete but couldn't be traffic-tested (S3 ETP refmodel not available). The next step was to apply the same approach to HRW (Header Rewrite) on S2, where existing ref model + tests could validate actual datapath behavior.

Design-first approach

Session Snippet — Thorough Design Process

Human

Resume the work on the HRW proposal. Read all the relevant software — Flash compiler, HRW ref model (hrw.cc), FunOS hrw_tests.c. Then, ask me clarifying questions on design.

HRW Flash Flavor — Design Proposal

Pipeline: .flash → structured JSON → FunOS driver parses JSON at runtime → CSR writes → HW/ref model

Key constructs:
• action NAME { ... } — Block of ≤8 HRW-native instructions (movd, movi, addi, etc.)
• profile NAME { step ... } — Sequence of ≤32 ISM steps referencing named actions
• enum hboa_idx { OL_L2=0, ... } — User-defined HBOA layer names
• const CONST_GLOBAL = 0 — Named constant indices

Phase 1: Full ISM sequencer + action commands
Phase 2: TCAM dispatch

Session Snippet — Pushing for HW Constraint Enforcement

Human

Did you look at the data structures involved? There are N2P, flow state — some fields are HW-defined, rest are SW-controlled. We need these APIs in flash programs with C-bindings. Also, did you check the HRW refmodel for HW constraints? We need flash to enforce these at compile time.

Comprehensive analysis of N2P (64B common header), flow state (28B with HW/SW split), and const_block structures. Mapped all fields with HW vs SW ownership. Ref model constraints identified: max 8 instructions per action, max 32 ISM steps, nibble alignment, HBOA index bounds — all to be enforced at compile time.

Full pipeline: Compiler → Loader → Datapath

Unlike FlexLSO, HRW was carried all the way through to real traffic flowing through the HW reference model.

Flash compiler — HRW flavor

New S2 target, 10 opcodes, all source/dest selectors (hdr, newhdr, flowstate, const, n2p). 173 Rust tests + formal design spec.

FunOS JSON loader driver

hrw_json.c (541 lines) — parses Flash JSON output, programs ISM + action commands via CSR APIs.

Test port & datapath validation

Ported hrw_tests.c (2582 lines, 3 test functions) to use Flash programs. Created hrw_datapath.flash with 24 actions, 12 profiles.

Iterative debugging to pass all 13 tests

Multiple debugging cycles: nibble vs byte offsets, HBOA-relative addressing, flow state struct padding, ADDI 4-operand syntax, composite profile NOP actions.

173

Compiler tests

HRW opcodes

13/13

Datapath tests passing

~1000

Lines design spec

✅ What worked well

End-to-end validation — proved Flash output works through actual HW ref model
AI analyzed ref model error codes to narrow down root causes
C-bindings generation — auto-generated header files for FunOS integration

⚠️ Key discoveries during debugging

All HRW offsets are nibbles (half-bytes), not bytes — AI initially used byte offsets
HBOA-relative addressing: hdr/newhdr vars MUST use @ 0 — HBOA provides the layer offset at runtime
Flow state struct had extra padding in Flash definition vs C offsetof() values

Design spec

📄 HRW Flash Compiler Spec (~1000 lines)

📖 Full Session Log — HRW Development

5. Lessons Learned — What Works and What Doesn't

✅ What works well with AI agents

Codebase analysis at scale — Reading RTL, compiler internals, thousands of lines of C/Rust and synthesizing understanding
Language & compiler design — Propose syntax, iterate on feedback, implement full pipeline
Comprehensive test generation — Including negative tests, edge cases, coverage audits
Bug detection during migration — Found the csr_oper TCAM collision that humans missed
Rapid iteration — Feature request → implementation → tests in minutes, not days
Documentation — Specs, user guides, design proposals, session logs
Self-review — AI can review its own code and find real issues (overflow, bounds)

⚠️ What needs human direction

HW/SW boundary decisions — Which structs are HW-owned vs SW-owned requires domain knowledge
Protocol correctness — AI may put RoCE data in the wrong header layer
Architecture decisions — When to pivot (FlexLSO → HRW for testability), how to structure files
Debugging direction — AI analyzes well but human must point to the right area to investigate

❌ What doesn't work

Blind trust on bit-level encoding — Always verify nibble/byte, offset conventions, struct layouts against HW
Assuming AI understands HW semantics — Nibble offsets, HBOA-relative addressing were wrong initially
Long autonomous runs — Best results with frequent human checkpoints and feedback loops

The key insight

AI as a 10x amplifier, not a replacement

The most effective pattern was human sets direction, AI executes at scale, human reviews and refines. Each Flash area took 1-2 sessions (~4-6 hours) vs what would have been weeks of manual work. The AI handled the tedious parts (parsing all RTL files, writing 200+ tests, implementing compiler pipeline) while the human focused on design decisions, correctness, and ergonomics.

6. The Repeatable Workflow

The same pattern was used successfully across all three areas:

Analyze

→

Design

→

Implement

→

Validate

→

Refine

Step	What the human does	What the AI does
1. Analyze	Point to RTL, specs, existing code	Read thousands of lines, synthesize understanding, produce reference docs
2. Design	Ask clarifying questions, steer decisions	Propose language constructs, data models, implementation plan
3. Implement	Say "go ahead" + review output	Write scanner/parser/model/encoder, generate tests, commit
4. Validate	Run builds, point to failures	Analyze errors, propose fixes, iterate
5. Refine	Code review feedback, naming, ergonomics	Apply fixes, expand tests, update docs

7. Session Logs & Resources

Full session histories

Area	Session Log	Key Specs
Flow Processor TCAM	flash_flow_processor.html	Flash `doc/syntax.md`
FlexLSO	flexlso_session_log.html	Compiler Proposal, Programs Design
HRW	hrw_session_history.html	HRW Compiler Spec

Code repositories

Repo	Key branches
Flash (compiler)	`suresh/flow_processor`, `suresh/flex_lso`, Flash_HRW
FunOS_Flash (programs)	`suresh/structured_tcam_match`, `suresh/flexlso`
FunOS_HRW	`hw/hrw/hrw_json.c`, `tests/hrw/`

Interactive tools

📐 Flash Architecture Diagram — clickable, searchable view of the compiler internals

Summary of deliverables

3 compiler flavors — Flow Processor TCAM, FlexLSO, HRW
~600 tests across all areas (143 + 223 + 173 + flow/integration)
Real programs — 17 SGE programs, 4 LSO profiles (TCP + RoCE), 24 HRW actions + 12 profiles
End-to-end validation — HRW datapath: 13/13 tests through HW ref model
Documentation — Design specs, user guides, RTL reference, session logs