Min Htet Myet Mattral

ML Engineer · Distributed Training · LLM Systems · Computer Vision

I build things that work at scale -- and try to understand why they work at all.

What I actually do

I work in the space between clean research ideas and the messy reality of clusters that fail, data that drifts, and models that need to stay honest in production.

Day-to-day: cloud-scale ML infrastructure at a hyperscaler, distributed training systems, fault-tolerant checkpointing, LLM safety layers, and the occasional low-level kernel when something needs to be faster or more reliable. The majority of that work lives in private repositories. What you see here are the side projects I chose to open-source because they felt worth sharing.

Things I care about technically

Large-scale pre-training infrastructure -- MoE routing, fault-tolerant checkpointing, tensor/pipeline parallelism
LLM safety and observability -- keeping models honest at inference time
The hardware-software boundary: SIMD, CUDA, kernel-level optimization
Novel architectures worth deploying, not just benchmarking

Things I care about less technically

Code that impresses interviewers but breaks on week two
Benchmarks that only win on synthetic data
Documentation that describes the happy path and nothing else

Selected work

Most of these exist because I needed to solve something concrete.
I’d rather have a few things that are real than many that just look good on a profile.

Project	What it is	Status
moe-engine	Research-grade runtime for training large Mixture-of-Experts models at hyperscale. Features a fused Triton router, composable 4D parallelism (DP+EP+TP+PP), strict forward-pass invariants, and elastic fault tolerance with async two-tier checkpointing + automatic expert resharding on node failure. Includes chaos testing and detailed telemetry. Accompanied by a v1 preprint.	Active · Preprint. Single-process PP implemented; multi-process pipeline and large-scale multi-node benchmarks in active development.
GuardRail Studio	Inline LLM firewall with measured sub-10 ms p99 latency in load tests. Built with ONNX + Triton, drift detection, LoRA self-updating, and full canary deployment automation. Five documented development phases.	Active. Latency numbers from controlled tests; full production hardware validation ongoing.
KANX	Production-grade Kolmogorov-Arnold Networks library with PyTorch + TensorFlow backends, real ONNX export, Docker + Kubernetes support, and FastAPI serving. Includes benchmarks and a preprint.	Active · `pip install kanx` · Preprint
RLHF-PPO-DPO	Modular framework for full RLHF pipelines (SFT → reward modeling → PPO and DPO). Includes distributed design with ZeRO-3, async rollouts, and extensive testing.	Active. Validated end-to-end on single-GPU toy setups; large-scale distributed runs not yet public.
FlashSpec	Adaptive speculative decoding engine with online bandit draft selection and Triton-optimized verification. Includes throughput benchmarks on Llama-3 models.	Pre-alpha / Active development. Some CI and GPU tests currently under refactoring.
RAG-Multimodal-Financial-Doc-Analysis-and-Recall	Enterprise multimodal RAG system for financial documents (text + tables + charts via GPT). Strong emphasis on async processing, retries, structured observability, and type safety.	Active. Finance-domain focused; detailed load benchmarks not yet public.

Stack

Not a comprehensive list. Just what I actually reach for.

Training & inference: PyTorch, TensorFlow, Triton, ONNX, TensorRT, FSDP2, TorchElastic

LLM ecosystem: Transformers, PEFT / LoRA, vLLM, LangChain, FastAPI, Triton Inference Server

Distributed & infra: NCCL, Kubernetes, Helm, Terraform, Airflow, Ray

Observability: Prometheus, Grafana, OpenTelemetry, Weights & Biases

Low-level: C++, AVX2 / SIMD, CUDA, pybind11

Data: PostgreSQL, Qdrant, MongoDB, Spark, Dask

A few honest notes

Most of my interesting work happens in private repositories -- production systems at cloud scale where open-sourcing isn't an option. This GitHub is a public window, not the full picture.

That said, the repositories here are written to the same standard I use privately: tests, type checking, CI, real (if limited) benchmarks, and documentation that tries to admit what doesn’t work yet. When something is experimental or incomplete, the README says so.

I’m especially interested in the kinds of failures that only appear at real cluster scale, the practical trade-offs in LLM safety systems, and whether architectures like KANs will eventually find meaningful production use cases.

My path into this work wasn’t linear. I spent time in data engineering and instrumentation before moving deeper into ML systems. That background still shapes how I think about reliability and observability.

Currently

Working on: fixing MoE engine chaos scenario A -- sudden node failure under expert resharding
Reading: the Megatron-LM codebase and the FlexAttention paper
Thinking about: whether MFU tracking gives you enough signal to catch silent training degradation early
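For context on the MFU question, here is a minimal sketch of what that tracking usually computes. It assumes the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass (attention FLOPs ignored; for an MoE model the parameter count would be the active parameters per token). The function names, the median baseline, and the 5% threshold are all hypothetical choices for illustration, not taken from any of the projects above.

```python
def model_flops_per_token(n_params):
    """Approximate training FLOPs per token (forward + backward).

    Assumption: the widely used 6 * N estimate for dense transformers,
    which ignores attention FLOPs.
    """
    return 6.0 * n_params


def mfu(n_params, tokens_per_second, n_devices, peak_flops_per_device):
    """Model FLOPs utilization: achieved model FLOP/s over theoretical peak."""
    achieved = model_flops_per_token(n_params) * tokens_per_second
    peak = n_devices * peak_flops_per_device
    return achieved / peak


def mfu_dropped(history, current, rel_drop=0.05):
    """Flag a step whose MFU falls more than rel_drop below the median
    of recent history. Hypothetical alerting rule, for illustration only."""
    baseline = sorted(history)[len(history) // 2]
    return current < baseline * (1.0 - rel_drop)


if __name__ == "__main__":
    # Illustrative numbers only: 1B parameters, 100k tokens/s, 8 devices.
    value = mfu(1e9, 1e5, 8, 1.5e14)
    print(f"MFU: {value:.2f}")
    print("alert:", mfu_dropped([0.50, 0.51, 0.49], 0.40))
```

Note what the sketch makes visible: MFU is a throughput ratio, so it can surface stragglers or a slow interconnect, but a run that keeps its throughput while its numerics go wrong would leave this number unchanged.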

Problem-solving

Algorithms are how I warm up. Systems are where I live.

On the equation that changed everything

$$\mathbf{h}_t = \sigma\!\left(\mathbf{W}_h\,\mathbf{h}_{t-1} + \mathbf{W}_x\,\mathbf{x}_t + \mathbf{b}\right)$$

The idea that a machine could hold memory across time -- that the past could shape the present through nothing more than a weight matrix -- was the moment I understood why this field is worth a lifetime.

The equation is simple. What it implies is not.
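The recurrence can be written out in a few lines. This is a minimal pure-Python sketch, assuming σ is the logistic sigmoid and using plain lists for vectors and matrices; the helper names (`matvec`, `rnn_step`, `run_rnn`) are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, assumed here to be the sigma in the equation."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(W_h, W_x, b, h_prev, x_t):
    """One step: h_t = sigma(W_h h_{t-1} + W_x x_t + b)."""
    from_past = matvec(W_h, h_prev)
    from_input = matvec(W_x, x_t)
    return [sigmoid(p + i + bias)
            for p, i, bias in zip(from_past, from_input, b)]


def run_rnn(W_h, W_x, b, h0, xs):
    """Apply rnn_step over a sequence and return every hidden state."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(W_h, W_x, b, h, x_t)
        states.append(h)
    return states


if __name__ == "__main__":
    W_h, W_x, b, h0 = [[1.0]], [[1.0]], [0.0], [0.0]
    # Both sequences end on the same input, but start differently.
    after_plus = run_rnn(W_h, W_x, b, h0, [[1.0], [0.0]])
    after_minus = run_rnn(W_h, W_x, b, h0, [[-1.0], [0.0]])
    print(after_plus[-1], after_minus[-1])
```

The two runs receive an identical final input yet end in different states, because the only thing carrying the first input forward is the product with `W_h`.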

Outside of work I'm usually reading something I don't fully understand yet, listening to music that has no business being that good, and occasionally wondering if the model actually converged or if I just got lucky. I like working with people who say "I don't know" without embarrassment and argue about architecture in good faith.

mattralminn@gmail.com

Open to interesting conversations about distributed training, LLM infrastructure, or any hard ML systems problem worth losing sleep over.
