Yimin Jiang (江逸敏)


I am an engineer and infrastructure lead at Anuttacon, working on the company's LLM training and inference frameworks. Anuttacon's mission is to create genuine multi-modal machine intelligence capable of seamless and organic interaction with humans.

Prior to Anuttacon, I was a training framework lead at StepFun. Our team developed and optimized the infrastructure (parallelism, data loading, computation/communication operators, fault tolerance and diagnosis, etc.) for all training workloads, including LLMs, unified multi-modal LLMs, and video/audio understanding and generation. Notable project involvement: Step-1o, Step-1.5V, Step-2, Step-3 AFD, and StepMesh.

Before StepFun, I worked as a senior research scientist at ByteDance Seed, where I was among the first to develop the engineering frameworks for its pretraining and RL. In the pre-LLM era, I also led the BytePS/ByteCCL and Sparse MoE projects at ByteDance, deploying them across thousands of training and serving GPUs.

Over the past few years, I have been fortunate to gain experience optimizing and diagnosing large-scale systems involving over 10,000 GPUs. I have also been deeply involved in training several multimodal LLMs from scratch, with cumulative training compute exceeding 1e26 FLOPs.

We are actively hiring engineers, researchers, and interns at Anuttacon. If you align with our vision, please feel free to reach out.


Papers

  • Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training.
    Xin Tan, Yuetao Chen, Yimin Jiang, Xing Chen, Kun Yan, Nan Duan, Yibo Zhu, Daxin Jiang, Hong Xu
    ASPLOS 2026 [PDF]
  • Simulating the Next Generation of LLM Inference Systems.
    Yicheng Feng, Xin Tan, Kin Hang Sew, Yimin Jiang, Yibo Zhu, Hong Xu
    SOSP Workshop on Practical Adoption Challenges of ML for Systems, 2025 [PDF]
  • DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal LLMs.
    Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Xin Jin
    SIGCOMM 2025 [PDF]
  • Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers.
    Chenchen Shou, Guyue Liu, Hao Nie, Huaiyu Meng, Yu Zhou, Yimin Jiang, Wenqing Lv, Yelong Xu, Yuanwei Lu, Zhang Chen, Yanbo Yu, Yichen Shen, Yibo Zhu, Daxin Jiang
    SIGCOMM 2025 [PDF]
  • Towards End-to-End Optimization of LLM-based Applications.
    Xin Tan, Yimin Jiang, Yitao Yang, Hong Xu
    ASPLOS 2025 [PDF]
  • Optimizing RLHF Training for Large Language Models with Stage Fusion.
    Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Yimin Jiang, Yibo Zhu, Xin Jin
    NSDI 2025 [PDF]
  • Adaptive Gating in Mixture-of-Experts based Language Models.
    Jiamin Li, Qiang Su, Yitao Yang, Yimin Jiang, Cong Wang, Hong Xu
    EMNLP 2023 [PDF]
  • Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models.
    Juncai Liu, Jessie Hui Wang, Yimin Jiang
    SIGCOMM 2023 [PDF]
  • Accelerating Distributed MoE Training and Inference with Lina.
    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, Hong Xu
    ATC 2023 [PDF]
  • BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters.
    Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo
    OSDI 2020 [PDF]

Preprints

  • StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation.
    Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang
    Preprint [PDF]
  • PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline.
    Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, Haibo Chen
    Preprint [PDF]

Tech Reports

  • [PDF] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding.*
  • [PDF] Step-Audio 2 Technical Report.*
  • [PDF] Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction.*
  • [PDF] Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model.*
  • [PDF] Step-Video-TI2V: A State-of-the-Art Text-Driven Image-to-Video Generation Model.*
  • [PDF] Step1X-Edit: A Practical Framework for General Image Editing.
  • [PDF] Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model.
  • [PDF] NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale.
Note: * indicates core contributions.