I am a systems architect and AI researcher in Intel's Systems Architecture and Engineering team, working on inference-oriented, memory-centric system architectures — disaggregated and in-network memory, KV-cache placement across heterogeneous memory, and the serving of vision-language-action (VLA) models for robotics on dataflow and heterogeneous accelerators. My background is in physics and high-performance computing, and I have spent more than a decade at the intersection of computer systems, AI infrastructure and performance, across both industry and academia. The throughline: the FLOP/s-versus-TB/s balancing act, applied recursively at every layer of the memory hierarchy.
Technical lead for SPEC Cloud benchmark simulation and system-level performance projection, leading a team of five. Drive next-generation, inference-focused architecture research for large-memory-capacity serving — silicon photonics and rack-scale optical interconnects, dataflow vs. heterogeneous accelerators, disaggregated and in-network memory, and VLA-model serving for robotics.
Coordinated research with leading scientists on next-generation AI algorithms — GNNs, NLP and recommendation systems — for graph analytics and security.
Drove novel AI optimizations (SIMT, sparse-compute acceleration, quantization) and led AI workload characterization for the Intel PiUMA sparse accelerator. Based at the University of Edinburgh, collaborating with K. Heafield and P. Boyle.
HPC and data-science leadership across finance, cybersecurity and exascale; porting, profiling and optimization on academia–industry collaborations.
Characterized vision-language-action robotics models (π0.5, GR00T) as a two-frequency serving workload and mapped them onto candidate architectures, comparing dataflow accelerators (Groq) against heterogeneous GPUs; both mappings come down to the same FLOP/s-versus-TB/s balancing act across the memory hierarchy.
Assessed silicon-photonics and rack-scale optical interconnect technologies (e.g. Celestial AI) for memory-centric inference, comparing latency, throughput and cost of on-chip vs. off-chip solutions; explored disaggregated compute/memory and in-network memory.
Established and led a cross-organization collaboration of around seven contributors across Intel, Intel Labs and Harvard University (SEAS) on graph-learning acceleration for Intel PiUMA. Owned inception, stakeholder management, software development and performance optimization, resulting in an ISPASS 2023 paper, a TASK Quarterly journal paper (2024) and a US patent.
Member of the Intel–University of Edinburgh collaboration (K. Heafield) on CPU-efficient, low-precision (int8) Marian neural machine translation, which won the 2020 NGT efficiency task. The underlying int8 GEMM kernels were adopted in 2026 into the SPEC CPU benchmark, in which Marian machine translation is the sole NLP workload.
Carried out a system-level workload characterization of ARCHER (the UK national supercomputer) and researched data-aware, workflow-enabled HPC scheduling (SLURM-class) for Intel 3D XPoint non-volatile memory.
Designed and optimized the software stack of a custom GPU cluster that achieved the Highest LINPACK result at ISC'14 (3.38 TFLOP/s per kW, an estimated 4th place on the June 2014 Green500). Ported GADGET-3 kernels to GPU via OpenACC for a 2× speedup.