mattral/README.md

Mattral

ML Engineer · Distributed Training · LLM Systems · Computer Vision

I build things that work at scale -- and try to understand why they work at all.




What I actually do

I work in the space between clean research ideas and the messy reality of clusters that fail, data that drifts, and models that need to stay honest in production.

Day-to-day: cloud-scale ML infrastructure at a hyperscaler, distributed training systems, fault-tolerant checkpointing, LLM safety layers, and the occasional low-level kernel when something needs to be faster or more reliable. The majority of that work lives in private repositories. What you see here are the side projects I chose to open-source because they felt worth sharing.

Things I care about technically

  • Large-scale pre-training infrastructure -- MoE routing, fault-tolerant checkpointing, tensor/pipeline parallelism
  • LLM safety and observability -- keeping models honest at inference time
  • The hardware-software boundary: SIMD, CUDA, kernel-level optimization
  • Novel architectures worth deploying, not just benchmarking

Things I care less about

  • Code that impresses interviewers but breaks on week two
  • Benchmarks that only win on synthetic data
  • Documentation that describes the happy path and nothing else

Selected work

Most of these exist because I needed to solve something concrete.
I’d rather have a few things that are real than many that just look good on a profile.

| Project | What it is | Status |
| --- | --- | --- |
| moe-engine | Research-grade runtime for training large Mixture-of-Experts models at hyperscale. Features a fused Triton router, composable 4D parallelism (DP+EP+TP+PP), strict forward-pass invariants, and elastic fault tolerance with async two-tier checkpointing + automatic expert resharding on node failure. Includes chaos testing and detailed telemetry. Accompanied by a v1 preprint. | Active · Preprint. Single-process PP implemented; multi-process pipeline and large-scale multi-node benchmarks in active development. |
| GuardRail Studio | Inline LLM firewall with measured sub-10 ms p99 latency in load tests. Built with ONNX + Triton, drift detection, LoRA self-updating, and full canary deployment automation. Five documented development phases. | Active. Latency numbers from controlled tests; full production hardware validation ongoing. |
| KANX | Production-grade Kolmogorov-Arnold Networks library with PyTorch + TensorFlow backends, real ONNX export, Docker + Kubernetes support, and FastAPI serving. Includes benchmarks and a preprint. | Active · pip install kanx · Preprint |
| RLHF-PPO-DPO | Modular framework for full RLHF pipelines (SFT → reward modeling → PPO and DPO). Includes distributed design with ZeRO-3, async rollouts, and extensive testing. | Active. Validated end-to-end on single-GPU toy setups; large-scale distributed runs not yet public. |
| FlashSpec | Adaptive speculative decoding engine with online bandit draft selection and Triton-optimized verification. Includes throughput benchmarks on Llama-3 models. | Pre-alpha / Active development. Some CI and GPU tests currently under refactoring. |
| RAG-Multimodal-Financial-Doc-Analysis-and-Recall | Enterprise multimodal RAG system for financial documents (text + tables + charts via GPT). Strong emphasis on async processing, retries, structured observability, and type safety. | Active. Finance-domain focused; detailed load benchmarks not yet public. |

Stack

Not a comprehensive list. Just what I actually reach for.

Training & inference   PyTorch TensorFlow Triton ONNX TensorRT FSDP2 TorchElastic

LLM ecosystem   Transformers PEFT / LoRA vLLM LangChain FastAPI Triton Inference Server

Distributed & infra   NCCL Kubernetes Helm Terraform Airflow Ray

Observability   Prometheus Grafana OpenTelemetry Weights & Biases

Low-level   C++ AVX2 / SIMD CUDA pybind11

Data   PostgreSQL Qdrant MongoDB Spark Dask


A few honest notes

Most of my interesting work happens in private repositories -- production systems at cloud scale where open-sourcing isn't an option. This GitHub is a public window, not the full picture.

That said, the repositories here are written to the same standard I use privately: tests, type checking, CI, real (if limited) benchmarks, and documentation that tries to admit what doesn’t work yet. When something is experimental or incomplete, the README says so.

I’m especially interested in the kinds of failures that only appear at real cluster scale, the practical trade-offs in LLM safety systems, and whether architectures like KANs will eventually find meaningful production use cases.

My path into this work wasn’t linear. I spent time in data engineering and instrumentation before moving deeper into ML systems. That background still shapes how I think about reliability and observability.


Currently

  • Working on: fixing MoE engine chaos scenario A -- sudden node failure under expert resharding
  • Reading: the Megatron-LM codebase and the FlexAttention paper
  • Thinking about: whether MFU tracking gives you enough signal to catch silent training degradation early
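For context on that last point, here is a rough sketch of what MFU tracking usually measures. It assumes the common "about 6 FLOPs per parameter per token" estimate for dense transformer training (for MoE models the active parameter count would be the relevant one), and every name and the 10% tolerance below are hypothetical, not taken from moe-engine:

```python
def training_flops_per_token(n_params):
    # Assumption: ~6 FLOPs per parameter per token (forward + backward),
    # ignoring attention-specific terms. A rough estimate, not an exact count.
    return 6 * n_params


def mfu(n_params, tokens_per_second, n_devices, peak_flops_per_device):
    """Model FLOPs utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    achieved = training_flops_per_token(n_params) * tokens_per_second
    return achieved / (n_devices * peak_flops_per_device)


def mfu_dropped(history, current, tolerance=0.1):
    """Flag a reading that falls more than `tolerance` below the recent average."""
    baseline = sum(history) / len(history)
    return current < baseline * (1 - tolerance)


# Hypothetical numbers: a 1B-parameter model at 100k tokens/s
# on 8 devices with a peak of 1e14 FLOP/s each.
current = mfu(1e9, 1e5, 8, 1e14)
```

A check like `mfu_dropped` only sees throughput, which is why it is an open question how much of a silent degradation it would actually surface.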

Problem-solving

Algorithms are how I warm up. Systems are where I live.



On the equation that changed everything

$$\mathbf{h}_t = \sigma\!\left(\mathbf{W}_h\,\mathbf{h}_{t-1} + \mathbf{W}_x\,\mathbf{x}_t + \mathbf{b}\right)$$

The idea that a machine could hold memory across time -- that the past could shape the present through nothing more than a weight matrix -- was the moment I understood why this field is worth a lifetime.

The equation is simple. What it implies is not.
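A minimal sketch of that recurrence in plain Python, for anyone who wants to see it move. It assumes σ is tanh (the equation itself does not fix the nonlinearity), and the weights are toy values made up for illustration:

```python
import math


def rnn_step(W_h, W_x, b, h_prev, x_t, sigma=math.tanh):
    """One step of h_t = sigma(W_h h_{t-1} + W_x x_t + b), with lists as vectors."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_x[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(sigma(pre))
    return h_t


def run_rnn(W_h, W_x, b, xs, h0):
    """Apply rnn_step across a sequence and return every hidden state."""
    states = []
    h = h0
    for x_t in xs:
        h = rnn_step(W_h, W_x, b, h, x_t)
        states.append(h)
    return states


# Toy, hypothetical weights: 2 hidden units, 1 input feature.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]

# An impulse at t=0 followed by silence: the input is gone after one step,
# but the hidden state still carries a decaying trace of it.
states = run_rnn(W_h, W_x, b, xs=[[1.0], [0.0], [0.0]], h0=[0.0, 0.0])
```

The input is zero at the second and third steps, yet the hidden state is not: the only thing carrying the first input forward is the multiplication by `W_h`.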


Outside of work I'm usually reading something I don't fully understand yet, listening to music that has no business being that good, and occasionally wondering if the model actually converged or if I just got lucky. I like working with people who say "I don't know" without embarrassment and argue about architecture in good faith.

mattralminn@gmail.com


Open to interesting conversations about distributed training, LLM infrastructure, or any hard ML systems problem worth losing sleep over.

Pinned

  1. KANX (Public)

    One library, four surfaces. Production-grade Kolmogorov-Arnold Networks || TensorFlow + PyTorch + ONNX. || A small KAN beats a 10× larger MLP on smooth, separable target. One library. Two backends.…

    Python 30 8

  2. Composed-Mixture-of-Experts-Engine (Public)

    moe-engine is a research-grade infrastructure layer for training large Mixture-of-Experts language models at hyperscale. It is designed around one core constraint: at 10K+ GPUs, nodes die continuou…

    Python 10 8

  3. GuardRail-Studio (Public)

    An inline LLM firewall with a sub-10 ms p99 latency target — built in layers across five documented phases. Sits between your app and any LLM endpoint to classify, redact, or block threats in real …

    Python 10 7

  4. Improving-LLM-Models-with-RLHF-PPO-DPO (Public)

    A modular, production-grade framework for Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).

    Python 23 5

  5. FlashSpec (Public)

    Adaptive speculative-decoding inference engine with Triton-optimised verification and online bandit draft selection.

    Python 10

  6. RAG-Multimodal-Financial-Doc-Analysis-and-Recall (Public)

    Enterprise-grade multimodal Retrieval-Augmented Generation (RAG) system for financial document analysis with async processing, fault tolerance, structured observability, and scalable architecture.

    Python 63 13