The rise of Large Language Models (LLMs) has transformed what AI systems can do, but delivering these capabilities reliably and efficiently requires advances in machine learning systems. This course covers the foundations and current practice of ML systems for LLMs, spanning training, serving, and applications. Students will learn key system design principles, how to optimize compute and memory, and how to accelerate end-to-end performance with modern hardware and software techniques. The course also examines applied LLM systems, including retrieval-augmented generation (RAG), AI agents, and ML operations, as well as alignment and safety considerations in deployment. It provides a practical and research-oriented background for students interested in building or studying production-grade machine learning systems. Students are expected to implement and demonstrate an end-to-end system by the end of the course.

Prerequisites: undergraduate machine learning, undergraduate operating systems, Python coding.
Instructor: Yao Lu
TAs: Junyi Shen, Noppanat Wadlom, Zhengyuan Su
When and where: Thu 18:30-20:30 (lecture), 20:30-21:30 (tutorial) @ COM3-MPH

Schedule:

Lecture date / Plan / Note
Jan 15 Week 1: Introduction & MLSys Foundations
[HW1 Release: AutoDiff]
Jan 22 Week 2: TinyTorch & Automatic Differentiation
Jan 29 Week 3: Hardware Acceleration
Tutorial: TinyTorch
Feb 05 Week 4: LLM Serving & Optimizations
Tutorial: LLM serving & benchmarks (project option)
HW1 due & [HW2 Release: TinyLLM]
Feb 12 Week 5: Application Systems: RAG, Agents & Deep Research
Tutorial: agentic workflows

Feb 19 Week 6: Parallelism & Training Techniques
Tutorial: building agentic applications (project option)
HW2 due & [HW3 Release: TinyRAG]
Feb 26 Recess week
Mar 05 Week 7: Efficient AI
Tutorial: on-device AI (project option)
Project proposal due
Mar 12 Week 8: LLM Alignment & Safety
Tutorial: efficient AI
HW3 due
Mar 19 Week 9: Data Systems for AI
Tutorial: LLM alignment in action
Mar 26 Week 10: ML Operations
Tutorial: data engineering
Mid-term project report due
Apr 02 Well-Being Day
Apr 09 Week 12: Cloud Systems for AI
Tutorial: ML operations
Apr 16 Week 13: Project Demos
Final project report due
Textbook:
Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems

Grading scheme:
  • Tutorials and HW1-3 are completed individually by each student.
  • The course project is done in groups of 2-3 students; individual projects are not allowed. Groups can choose among three project directions. The project should demonstrate system design and implementation that improve the system's efficiency, robustness, or generalizability.