Senior Systems Engineer, KDI SSD Product Engineering
Kingston Technology · Storage for AI & LLM Inference · Fountain Valley, CA · July 2019 – Present
- Authored the MLPerf Storage v3.0 KV Cache Benchmark in the MLCommons TF_KVCache working group. PR #270 merged to main in March 2026.
- On H100 NVL with vLLM 0.18 and Qwen2.5-72B FP8, NVMe KV offload lifted serving throughput 71–82% under 128-concurrency, 12K-context stress.
- Published Kingston's PCIe Gen5 NVMe for AI whitepaper. One Gen5 SSD sustained 11.6 GB/s feeding 127 A100s at 91.6% accelerator utilization on ResNet-50.
- Led Kingston's first MLPerf Storage v2 CLOSED submission across Dell Gen4, Lenovo Gen5, and Supermicro Gen5 platforms.
- Built and operate Nebula, an on-prem LLM platform on 8 GPUs (6×A40 + 2×A100) serving 53 engineers across three continents.
- Architected end-to-end AI infrastructure on Nebula: a RAG pipeline tuned for engineering accuracy, a custom evaluation framework, and agent codebases for code review (10 languages, with verified auto-fix loops), content generation, QoS reporting, and semiconductor intelligence.
- Fine-tuned domain-specific KDI Spec Expert model variants with LoRA; they scored 4.88–4.9/5.0 against o1-mini and Claude Sonnet 4 on 23 expert NVMe questions, with zero technical hallucinations on SSD specification reasoning.
- Built Nebula's production observability stack (Loki, Promtail, Tempo, OpenTelemetry, Grafana) with Okta SSO, CrowdStrike, and Fail2ban security integration.
- Delivered enterprise-grade AI, including domain-specific fine-tunes unavailable commercially, at $330/month operating cost versus $9K+/month for equivalent commercial alternatives.
- Developed 40+ custom tools and a 227-prompt library across 19 categories; managed automated analytics, backup, and disaster-recovery infrastructure for the platform.
- Led technical knowledge transfer and documentation as the platform transitioned toward formal IT governance.
- Built an internal storage and memory benchmark for local client-based AI systems (~23,500 LOC, 18 workloads, dual-platform Windows / Linux).
- Designed, built, and automated Kingston's enterprise SSD validation ecosystem: 50+ automated test suites covering RAID certification, VMware vSAN, NVMe protocol conformance, thermal validation, power-loss protection, JEDEC JESD219A endurance, and SNIA Enterprise PTS. This ecosystem is the foundation of Kingston's ability to ship datacenter SSDs.
- Built the QoS Cloud benchmarking suite: 180+ real-workload scenarios stressing IOPS, bandwidth, and P99.99 / P99.999 tail latency across the DC SSD product line, with a SQL backend and dashboard reporting for tier-1 customer engagements.
- Built the AWS-partner SSD qualification framework (Python + Ansible + Flask UI), enabling single-command qualification across hundreds of hosts.
- Resolved 32-second write-latency tail stalls across an 8,000-device European hyperscaler fleet via blktrace replay; the firmware fix cut P99.9999 latency from 3,060 ms to 164 ms.
- Designed Kingston's Quarch power-loss qualification suite. Identified the CC.SHN graceful-shutdown sequence and used the SMART unsafe-shutdowns counter to distinguish graceful from ungraceful shutdowns.
- Established the Broadcom 94XX/95XX RAID/HBA qualification path and Red Hat Enterprise Linux 8 certification for 20+ Kingston datacenter SSDs.
Senior Systems Engineer II, Storage Architecture
Geico · Chevy Chase, MD · September 2016 – July 2019
- Co-owned 50PB of distributed storage across US datacenters on a six-engineer team backing all of Geico's financial apps, customer databases, and compliance systems.
- Led a 2,600+ volume migration across IBM SVC with zero downtime under 99.99% uptime contracts; multi-month cutover against live financial operations.
- Cut manual storage operations 80% via Python + Bash automation: NPIV zone configuration across dual SAN fabrics, volume mirroring, and fleet-scale provisioning and decommissioning.
- Served as primary escalation engineer across IBM SVC / FlashSystem / A9000R / XIV, EMC VMAX, HP 3PAR 8200, and Brocade DCX; ran zero-data-loss recovery on complete backend array outages.
- Owned cross-regional datacenter migrations and DR architecture on Site Recovery Manager for mission-critical financial workloads.