Development with AI Agents
1. Flash — Introduction
What is Flash?
Flash is a Rust-based domain-specific compiler for the DPU's programmable flow processors (SGE, PTA) and FlexLSO. It compiles .flash programs into JSON artifacts that FunOS loads to program hardware TCAM/instruction memory at runtime.
Pipeline: Source → Scanner (47 tokens) → Parser (recursive descent) → Semantic Model → Encoder → JSON + C bindings
The Flash pipeline
The 📐 Interactive Flash Architecture Diagram provides a detailed clickable view of the compiler internals.
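The pipeline above can be sketched as a chain of stage functions. This is a toy illustration of the shape only, not the real compiler's types or API — the actual scanner distinguishes 47 token kinds and the parser is recursive descent:

```rust
// Toy sketch of the Flash pipeline shape: Source -> Scanner -> Parser ->
// Semantic Model -> Encoder -> JSON. All names/types are illustrative.
#[derive(Debug)]
struct Token(String);
#[derive(Debug)]
struct Ast(Vec<Token>);
#[derive(Debug)]
struct Model(Ast);

fn scan(src: &str) -> Vec<Token> {
    // The real scanner produces ~47 token kinds; here: whitespace split.
    src.split_whitespace().map(|s| Token(s.to_string())).collect()
}
fn parse(tokens: Vec<Token>) -> Ast { Ast(tokens) }  // recursive descent in the real thing
fn analyze(ast: Ast) -> Model { Model(ast) }         // name/bit-position resolution
fn encode(model: &Model) -> String {
    // JSON artifact that FunOS would load at runtime
    format!("{{\"tokens\": {}}}", (model.0).0.len())
}

fn main() {
    let json = encode(&analyze(parse(scan("flow { action sq_db }"))));
    println!("{json}");
}
```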
2. Flow Processor TCAM — Structured Match Syntax
The first project was improving how TCAM entries are expressed in Flash. The SGE flow processor has 128 TCAM entries with a 68-bit key — but the flow syntax treated it as opaque hex.
Asking the AI to analyze the problem
The original flow syntax:
flow {
mask 0xF0000
match 0x20000
action do_stuff
}
Problems identified:
- Raw hex mask/match is not human-friendly — the 68-bit key is structured (queue type, opcode, error bits), but treated as opaque hex
- No key structure declaration — programmer must consult keygen separately to know bit positions
- No priority/overlap validation — overlapping patterns not detected
- Flow is too low-level — clean handler/flow separation, but flow lacks any abstraction
The proposed solution: key struct declaration + structured match with field-level matching + compiler validations.
The transformation
Before (raw hex):
flow {
mask 0xf000_0000_0000_0000
match 0x2000_0000_0000_0000
action sq_db
}
After (structured match):
key sge_keygen_key @ message {
msg_type : 4
sid : 1
cid : 1
thd_id : 2
queue_type: 2
...
}
flow {
match sge_keygen_key {
msg_type == MSG_TYPE_SQ_DB
}
action sq_db
}
The compiler resolves field names to bit positions, computes mask/value automatically, and validates overlaps and exhaustiveness at compile time.
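A minimal sketch of that lowering, assuming MSB-first packing of the sge_keygen_key fields into the 68-bit key (held in a u128). The packing order and the overlap rule are illustrative assumptions, not the real encoder:

```rust
// Key fields, MSB-first: (name, width in bits). Widths follow the
// sge_keygen_key example; remaining fields are elided.
const KEY_FIELDS: &[(&str, u32)] = &[
    ("msg_type", 4),
    ("sid", 1),
    ("cid", 1),
    ("thd_id", 2),
    ("queue_type", 2),
    // ... remaining fields fill the key out to 68 bits
];
const KEY_WIDTH: u32 = 68;

/// Resolve a field name to its (shift from bit 0, width) within the key.
fn field_pos(name: &str) -> Option<(u32, u32)> {
    let mut msb = KEY_WIDTH;
    for &(f, w) in KEY_FIELDS {
        msb -= w;
        if f == name {
            return Some((msb, w));
        }
    }
    None
}

/// Lower `field == value` to a TCAM (mask, match) pair.
fn lower_match(field: &str, value: u128) -> Option<(u128, u128)> {
    let (shift, width) = field_pos(field)?;
    let mask = ((1u128 << width) - 1) << shift;
    Some((mask, (value << shift) & mask))
}

/// Two entries can match the same key iff their match values agree on
/// every bit covered by both masks (the overlap check from the text).
fn overlaps(a: (u128, u128), b: (u128, u128)) -> bool {
    let common = a.0 & b.0;
    (a.1 & common) == (b.1 & common)
}

fn main() {
    // msg_type == 2 sits in the top 4 bits of the 68-bit key
    let e1 = lower_match("msg_type", 2).unwrap();
    println!("mask = {:#x}, match = {:#x}", e1.0, e1.1);
    // A msg_type == 3 entry cannot collide with it
    println!("overlap: {}", overlaps(e1, lower_match("msg_type", 3).unwrap()));
}
```

With `KEY_WIDTH = 68`, `msg_type == 2` lowers to mask `0xF << 64` and match `2 << 64` — the same pair the raw-hex flow spelled out by hand.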
- AI analyzed 68-bit key structure across compiler + RTL, proposed 4-part improvement
- Implemented end-to-end: scanner keywords → parser → model → encoder → tests
- Converted all 17 SGE programs; found a real TCAM collision bug during conversion
- AI initially generated raw hex even in the structured syntax; human feedback — "Use constants for mask and value — it looks ugly"
- Design taste: AI proposes technically correct solutions; human refines ergonomics
3. FlexLSO — From RTL Analysis to Real Programs
Inspired by the TCAM work, the next question was: can we extend Flash to handle FlexLSO — a major new S3 hardware block with no prior SW support? This section shows the deliberate approach used to develop it end-to-end.
Phase 1: Start with RTL analysis
Rather than jumping to implementation, the first step was having the AI read and understand the actual hardware — legacy LSO RTL and the new FlexLSO instruction set.
1. Input WU — wudma_nu_com_opr_t. S3 adds two new bits: flex (0=Legacy, 1=Flex) and size (0=32B, 1=64B).
2. LSO Operate Command — Legacy = fixed 128-bit format. Flex = 16B fixed portion (opc, cmd, hdr_mod_ctl, mode, mss, flex_flag, profile, template, proto_ofst) + variable flex data (16B or 48B).
3. Data to FlexLSO engine (128B) — Fixed DS (16B) + ALU State (16B) + Flex DS (up to 48B). Three separate SRAM banks: mem0 (14-bit metadata), mem1 (instructions 0-3), mem2 (instructions 4-7).
This produced a comprehensive 850-line specification extracted from RTL — covering command structures, header stack types, per-packet modification logic, and the 5-stage data pipeline.
Phase 2: Design through Q&A
Before writing any code, the approach was to have the AI propose a design and then ask clarifying questions. This back-and-forth — 12 questions across multiple rounds — settled the key design decisions before a single line of code was written.
🔍 See the full design Q&A exchange — the most interesting part of this session

The AI's proposal: a FLEXLSO flavour on a new S3 target with these constructs — struct for header types, flags for opr_flag bits, profile with indexed lines, each line targeting a stack position with typed header binding. Compute instructions use multi-stage pipeline syntax (compute A + B -> swap hdr.field)... [Analysis of all 13 opcodes, memory model, encoding format follows]
Feedback: the programmer should write tcp.seq_num instead of constants like 16 — "Can you improve on that?"
// Before (raw offsets):
compute fix.seq_num + flex.tcp_seq_num_inc
-> swap header[16, 4]
// After (symbolic):
line inner_l4 -> stack(0) as tcp : tcp_hdr_t {
compute fix.seq_num + flex.tcp_seq_num_inc
-> swap tcp.seq_num
}
The compiler resolves tcp.seq_num → offset 16, length 4 from the struct definition. Raw header[ofst, len] still available as fallback.
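A sketch of that resolution, assuming a tcp_hdr_t whose fields preceding seq_num occupy 16 bytes (a placeholder chosen so the example reproduces the (16, 4) pair from the text; the real layout comes from the Flash struct definition):

```rust
// Hypothetical header layout: (field name, size). The prefix field stands
// in for whatever precedes seq_num in the real tcp_hdr_t definition.
const TCP_HDR_T: &[(&str, usize)] = &[
    ("hdr_prefix", 16), // placeholder for the fields before seq_num
    ("seq_num", 4),
    ("ack_num", 4),
];

/// Resolve `struct.field` to the (offset, length) pair that a raw
/// `header[ofst, len]` would spell out by hand.
fn resolve(fields: &[(&str, usize)], name: &str) -> Option<(usize, usize)> {
    let mut ofst = 0;
    for &(f, len) in fields {
        if f == name {
            return Some((ofst, len));
        }
        ofst += len;
    }
    None
}

fn main() {
    // `swap tcp.seq_num` lowers to `swap header[16, 4]`
    assert_eq!(resolve(TCP_HDR_T, "seq_num"), Some((16, 4)));
    println!("tcp.seq_num -> header[16, 4]");
}
```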
Feedback: when opr & 24'h000001 == 24'h000001 — why are there raw hex constants? Need descriptive constants defined and used.
Result: when accepts named constants (not just literals) — when opr & FIRST_OR_LAST == ZERO makes clear that it checks "neither first nor last". Added compound constants: SE_LAST, IMM_LAST, ZERO.
Phase 3: Implementation + Rapid iteration
With the design settled, the AI implemented the full compiler in one session, followed by iterative program development with constant human feedback.
Implementation: Scanner (new operators +, |, ->) → Parser (10 new methods) → Model (profile/line/instruction resolution) → Encoder (3-bank memory output). 167 tests on first pass.

Catching real errors
line eth_terminate -> stack(3) { when opr & IMM_LAST == IMM_LAST { add_hdr flex.immediate, 12, 4 } }

Feedback: "This immediate is for RoCE header. Why are you appending to Ethernet header?"

After the fix, the immediate append (add_hdr) is in the BTH line (stack 0) — right where it belongs in the RoCE header, before ICRC. The ETH lines are now just terminate.
- RTL-first approach — AI extracted comprehensive spec that served as implementation reference
- Design Q&A before coding — clarifying questions surfaced edge cases early
- Rapid iteration — compiler implemented in one session, features added on demand
- AI reviewed its own code and found real bugs (overflow, bounds)
- HW/SW ownership boundaries (fixed = HW, flex = SW) — AI treated everything as SW until corrected
- Protocol correctness — RoCE immediate data placement in wrong header layer
- Knowing when to push back on AI output and iterate on naming, structure, constants
Design specs
📄 FlexLSO Compiler Proposal 📄 FlexLSO Programs Design
📖 Full Session Log — FlexLSO Development

4. HRW — Closing the Loop with Real Traffic
FlexLSO was complete but couldn't be traffic-tested (S3 ETP refmodel not available). The next step was to apply the same approach to HRW (Header Rewrite) on S2, where existing ref model + tests could validate actual datapath behavior.
Design-first approach
The prompt: analyze the HRW ref model (hrw.cc) and FunOS hrw_tests.c. Then, ask me clarifying questions on design.
Pipeline: .flash → structured JSON → FunOS driver parses JSON at runtime → CSR writes → HW/ref model

Key constructs:
- action NAME { ... } — Block of ≤8 HRW-native instructions (movd, movi, addi, etc.)
- profile NAME { step ... } — Sequence of ≤32 ISM steps referencing named actions
- enum hboa_idx { OL_L2=0, ... } — User-defined HBOA layer names
- const CONST_GLOBAL = 0 — Named constant indices

Phase 1: Full ISM sequencer + action commands
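The ≤8-instruction and ≤32-step limits on those constructs lend themselves to simple compile-time checks; a sketch under assumed types (not the real Flash semantic model):

```rust
// Illustrative model types: an action holds HRW-native instructions
// (movd, movi, addi, ...); a profile holds ISM steps naming actions.
struct Action { name: String, instrs: Vec<String> }
struct Profile { name: String, steps: Vec<String> }

/// Enforce the ≤8 instructions-per-action hardware limit.
fn check_action(a: &Action) -> Result<(), String> {
    if a.instrs.len() > 8 {
        return Err(format!("action {}: {} instructions, max 8", a.name, a.instrs.len()));
    }
    Ok(())
}

/// Enforce the ≤32 steps-per-profile limit and that every step
/// references a declared action.
fn check_profile(p: &Profile, actions: &[Action]) -> Result<(), String> {
    if p.steps.len() > 32 {
        return Err(format!("profile {}: {} steps, max 32", p.name, p.steps.len()));
    }
    for s in &p.steps {
        if !actions.iter().any(|a| &a.name == s) {
            return Err(format!("profile {}: unknown action {}", p.name, s));
        }
    }
    Ok(())
}

fn main() {
    let a = Action { name: "set_l2".into(), instrs: vec!["movd".into(), "addi".into()] };
    let p = Profile { name: "rewrite".into(), steps: vec!["set_l2".into()] };
    assert!(check_action(&a).is_ok());
    assert!(check_profile(&p, &[a]).is_ok());
    println!("ok");
}
```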
Phase 2: TCAM dispatch
Full pipeline: Compiler → Loader → Datapath
Unlike FlexLSO, HRW was carried all the way through to real traffic flowing through the HW reference model.
- hrw_json.c (541 lines) — parses Flash JSON output, programs ISM + action commands via CSR APIs
- Converted hrw_tests.c (2582 lines, 3 test functions) to use Flash programs
- Created hrw_datapath.flash with 24 actions, 12 profiles
- End-to-end validation — proved Flash output works through actual HW ref model
- AI analyzed ref model error codes to narrow down root causes
- C-bindings generation — auto-generated header files for FunOS integration
- All HRW offsets are nibbles (half-bytes), not bytes — AI initially used byte offsets
- HBOA-relative addressing: hdr/newhdr vars MUST use @ 0 — HBOA provides the layer offset at runtime
- Flow state struct had extra padding in Flash definition vs C offsetof() values
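Both offset pitfalls reduce to arithmetic the compiler can check; a sketch with illustrative field sizes (not the actual flow-state layout):

```rust
/// HRW offsets count nibbles (4 bits): one byte = two nibbles, so a C
/// offsetof() value must be doubled before it reaches an HRW instruction.
fn byte_to_nibble(byte_ofst: usize) -> usize {
    byte_ofst * 2
}

/// Compute packed byte offsets of a Flash struct definition so they can
/// be compared against the C side's offsetof() values, catching the kind
/// of stray-padding mismatch described above.
fn packed_offsets<'a>(fields: &[(&'a str, usize)]) -> Vec<(&'a str, usize)> {
    let mut out = Vec::new();
    let mut ofst = 0;
    for &(name, size) in fields {
        out.push((name, ofst));
        ofst += size;
    }
    out
}

fn main() {
    // Illustrative flow-state struct: (field name, size in bytes)
    let flow_state = [("flags", 2), ("seq", 4), ("win", 2)];
    for (name, ofst) in packed_offsets(&flow_state) {
        // Emit nibble offsets, as HRW instructions expect
        println!("{name}: byte {ofst} -> nibble {}", byte_to_nibble(ofst));
    }
}
```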
Design spec
📄 HRW Flash Compiler Spec (~1000 lines)
📖 Full Session Log — HRW Development

5. Lessons Learned — What Works and What Doesn't
- Codebase analysis at scale — Reading RTL, compiler internals, thousands of lines of C/Rust and synthesizing understanding
- Language & compiler design — Propose syntax, iterate on feedback, implement full pipeline
- Comprehensive test generation — Including negative tests, edge cases, coverage audits
- Bug detection during migration — Found the csr_oper TCAM collision that humans missed
- Rapid iteration — Feature request → implementation → tests in minutes, not days
- Documentation — Specs, user guides, design proposals, session logs
- Self-review — AI can review its own code and find real issues (overflow, bounds)
- HW/SW boundary decisions — Which structs are HW-owned vs SW-owned requires domain knowledge
- Protocol correctness — AI may put RoCE data in the wrong header layer
- Architecture decisions — When to pivot (FlexLSO → HRW for testability), how to structure files
- Debugging direction — AI analyzes well but human must point to the right area to investigate
- Blind trust on bit-level encoding — Always verify nibble/byte, offset conventions, struct layouts against HW
- Assuming AI understands HW semantics — Nibble offsets, HBOA-relative addressing were wrong initially
- Long autonomous runs — Best results with frequent human checkpoints and feedback loops
The key insight
AI as a 10x amplifier, not a replacement
The most effective pattern was human sets direction, AI executes at scale, human reviews and refines. Each Flash area took 1-2 sessions (~4-6 hours) vs what would have been weeks of manual work. The AI handled the tedious parts (parsing all RTL files, writing 200+ tests, implementing compiler pipeline) while the human focused on design decisions, correctness, and ergonomics.
6. The Repeatable Workflow
The same pattern was used successfully across all three areas:
| Step | What the human does | What the AI does |
|---|---|---|
| 1. Analyze | Point to RTL, specs, existing code | Read thousands of lines, synthesize understanding, produce reference docs |
| 2. Design | Ask clarifying questions, steer decisions | Propose language constructs, data models, implementation plan |
| 3. Implement | Say "go ahead" + review output | Write scanner/parser/model/encoder, generate tests, commit |
| 4. Validate | Run builds, point to failures | Analyze errors, propose fixes, iterate |
| 5. Refine | Code review feedback, naming, ergonomics | Apply fixes, expand tests, update docs |
7. Session Logs & Resources
Full session histories
| Area | Session Log | Key Specs |
|---|---|---|
| Flow Processor TCAM | flash_flow_processor.html | Flash doc/syntax.md |
| FlexLSO | flexlso_session_log.html | Compiler Proposal, Programs Design |
| HRW | hrw_session_history.html | HRW Compiler Spec |
Code repositories
| Repo | Key branches |
|---|---|
| Flash (compiler) | suresh/flow_processor, suresh/flex_lso, Flash_HRW |
| FunOS_Flash (programs) | suresh/structured_tcam_match, suresh/flexlso |
| FunOS_HRW | hw/hrw/hrw_json.c, tests/hrw/ |
Interactive tools
📐 Flash Architecture Diagram — clickable, searchable view of the compiler internals
Summary of deliverables
- 3 compiler flavors — Flow Processor TCAM, FlexLSO, HRW
- ~600 tests across all areas (143 + 223 + 173 + flow/integration)
- Real programs — 17 SGE programs, 4 LSO profiles (TCP + RoCE), 24 HRW actions + 12 profiles
- End-to-end validation — HRW datapath: 13/13 tests through HW ref model
- Documentation — Design specs, user guides, RTL reference, session logs