ML Engineer · Distributed Training · LLM Systems · Computer Vision
I build things that work at scale -- and try to understand why they work at all.
I work in the space between clean research ideas and the messy reality of clusters that fail, data that drifts, and models that need to stay honest in production.
Day-to-day: cloud-scale ML infrastructure at a hyperscaler, distributed training systems, fault-tolerant checkpointing, LLM safety layers, and the occasional low-level kernel when something needs to be faster or more reliable. The majority of that work lives in private repositories. What you see here are the side projects I chose to open-source because they felt worth sharing.
Things I care about technically
- Large-scale pre-training infrastructure -- MoE routing, fault-tolerant checkpointing, tensor/pipeline parallelism
- LLM safety and observability -- keeping models honest at inference time
- The hardware-software boundary: SIMD, CUDA, kernel-level optimization
- Novel architectures worth deploying, not just benchmarking
Things I care less about
- Code that impresses interviewers but breaks in week two
- Benchmarks that only win on synthetic data
- Documentation that describes the happy path and nothing else
Most of these projects exist because I needed to solve something concrete.
I’d rather have a few things that are real than many that just look good on a profile.
| Project | What it is | Status |
|---|---|---|
| moe-engine | Research-grade runtime for training large Mixture-of-Experts models at hyperscale. Features a fused Triton router, composable 4D parallelism (DP+EP+TP+PP), strict forward-pass invariants, and elastic fault tolerance with async two-tier checkpointing + automatic expert resharding on node failure. Includes chaos testing and detailed telemetry. Accompanied by a v1 preprint. | Active · Preprint. Single-process PP implemented; multi-process pipeline and large-scale multi-node benchmarks in active development. |
| GuardRail Studio | Inline LLM firewall with measured sub-10 ms p99 latency in load tests. Built with ONNX + Triton, drift detection, LoRA self-updating, and full canary deployment automation. Five documented development phases. | Active. Latency numbers from controlled tests; full production hardware validation ongoing. |
| KANX | Production-grade Kolmogorov-Arnold Networks library with PyTorch + TensorFlow backends, real ONNX export, Docker + Kubernetes support, and FastAPI serving. Includes benchmarks and a preprint. | Active · `pip install kanx` · Preprint |
| RLHF-PPO-DPO | Modular framework for full RLHF pipelines (SFT → reward modeling → PPO and DPO). Includes distributed design with ZeRO-3, async rollouts, and extensive testing. | Active. Validated end-to-end on single-GPU toy setups; large-scale distributed runs not yet public. |
| FlashSpec | Adaptive speculative decoding engine with online bandit draft selection and Triton-optimized verification. Includes throughput benchmarks on Llama-3 models. | Pre-alpha / Active development. Some CI and GPU tests are currently being refactored. |
| RAG-Multimodal-Financial-Doc-Analysis-and-Recall | Enterprise multimodal RAG system for financial documents (text + tables + charts via GPT). Strong emphasis on async processing, retries, structured observability, and type safety. | Active. Finance-domain focused; detailed load benchmarks not yet public. |
Not a comprehensive list. Just what I actually reach for.
Training & inference
PyTorch · TensorFlow · Triton · ONNX · TensorRT · FSDP2 · TorchElastic
LLM ecosystem
Transformers · PEFT / LoRA · vLLM · LangChain · FastAPI · Triton Inference Server
Distributed & infra
NCCL · Kubernetes · Helm · Terraform · Airflow · Ray
Observability
Prometheus · Grafana · OpenTelemetry · Weights & Biases
Low-level
C++ · AVX2 / SIMD · CUDA · pybind11
Data
PostgreSQL · Qdrant · MongoDB · Spark · Dask
Most of my interesting work happens in private repositories -- production systems at cloud scale where open-sourcing isn't an option. This GitHub is a public window, not the full picture.
That said, the repositories here are written to the same standard I use privately: tests, type checking, CI, real (if limited) benchmarks, and documentation that tries to admit what doesn’t work yet. When something is experimental or incomplete, the README says so.
I’m especially interested in the kinds of failures that only appear at real cluster scale, the practical trade-offs in LLM safety systems, and whether architectures like KANs will eventually find meaningful production use cases.
My path into this work wasn’t linear. I spent time in data engineering and instrumentation before moving deeper into ML systems. That background still shapes how I think about reliability and observability.
- Working on: fixing moe-engine chaos scenario A -- sudden node failure during expert resharding
- Reading: the Megatron-LM codebase and the FlexAttention paper
- Thinking about: whether MFU tracking gives you enough signal to catch silent training degradation early
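For context on that last point, here is a rough sketch of what MFU tracking can look like. It assumes the common 6 · N · tokens estimate for the forward and backward FLOPs of a dense transformer (attention FLOPs ignored). The function and class names, the rolling-baseline rule, and the example numbers are all illustrative assumptions, not code from any repository listed here.

```python
from collections import deque
from statistics import mean


def mfu(n_params: float, tokens_per_step: float, step_seconds: float,
        peak_flops_per_sec: float) -> float:
    """Model FLOPs utilization: achieved training FLOP/s over hardware peak.

    Assumes ~6 * N * tokens FLOPs per optimizer step (forward + backward),
    which is an approximation for dense transformers.
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_step / step_seconds
    return achieved_flops_per_sec / peak_flops_per_sec


class MfuDriftMonitor:
    """Flags a step whose MFU falls below a fraction of a rolling mean."""

    def __init__(self, window: int = 50, min_ratio: float = 0.9):
        self.history = deque(maxlen=window)
        self.min_ratio = min_ratio

    def update(self, value: float) -> bool:
        """Record `value`; return True if it is low versus recent history."""
        # Only judge once the window is full, so early steps never alarm.
        window_full = len(self.history) == self.history.maxlen
        degraded = window_full and value < self.min_ratio * mean(self.history)
        self.history.append(value)
        return degraded


# Hypothetical run: 7B parameters, 4M tokens per step, 20 s per step,
# 64 accelerators with an assumed peak of 312 TFLOP/s each.
baseline = mfu(7e9, 4e6, 20.0, 64 * 312e12)

monitor = MfuDriftMonitor(window=3, min_ratio=0.9)
flags = [monitor.update(v) for v in (baseline, baseline, baseline, 0.30)]
```

A throughput-only signal like this reacts to slower steps, but it says nothing about a run whose step time stays flat while the loss quietly goes wrong, which is the part that keeps the question open.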
The idea that a machine could hold memory across time -- that the past could shape the present through nothing more than a weight matrix -- was the moment I understood why this field is worth a lifetime.
The equation is simple. What it implies is not.
Outside of work I'm usually reading something I don't fully understand yet, listening to music that has no business being that good, and occasionally wondering if the model actually converged or if I just got lucky. I like working with people who say "I don't know" without embarrassment and argue about architecture in good faith.
Open to interesting conversations about distributed training, LLM infrastructure, or any hard ML systems problem worth losing sleep over.





