AI Infrastructure Handbook

Hi there! This is a resource for learning about the supercomputing infrastructure used to train and run LLMs.

My goal is to build a comprehensive, beginner-friendly resource that helps others. I hope you find it helpful in your own journey!

This site is pretty new, so expect it to evolve and improve over time.

What You'll Find Here

I'm documenting my notes on various aspects of AI infrastructure, including:

Understanding how AI supercomputers are built and organized
Setting up and optimizing distributed training systems
Practical tips for improving performance
Real examples and case studies I've found interesting

About Me

I work at Microsoft on the performance of our AI supercomputing infrastructure, specifically on improving GPU (A100 and H100) control plane provisioning operations.

Everything I share here is curated from publicly available information - it doesn't reflect my employer's views or contain any confidential details. You can learn more about me here.

Contributing

If you'd like to help improve this site, you can visit the GitHub repository and create a pull request with your suggestions. You can also click the "Edit This Page" button at the bottom right of the page.

Alternatively, you can reach out to me via email or Twitter.

Building Blocks