The rise of Large Language Models (LLMs) has transformed what AI systems can do, but delivering these capabilities reliably and efficiently requires advances in machine learning systems. This course covers the foundations and current practice of ML systems for LLMs, spanning training, serving, and applications. Students will learn key system design principles, how to optimize compute and memory, and how to accelerate end-to-end performance with modern hardware and software techniques. The course also examines applied LLM systems, including retrieval-augmented generation (RAG), AI agents, and ML operations, as well as alignment and safety considerations in deployment. It provides a practical, research-oriented background for students interested in building or studying production-grade machine learning systems. By the end of the course, students are expected to implement and demonstrate end-to-end systems.

Prerequisites: undergraduate machine learning, undergraduate operating systems, Python coding.
Instructor: Yao Lu
TAs: Junyi Shen, Noppanat Wadlom, Zhengyuan Su, Wangcheng Tao
When and where: Thu 18:30-20:30 (lecture), 20:30-21:30 (tutorial) @ COM3-MPH

Schedule:

Lecture date | Plan | Note
Jan 15 Week 1: Introduction & MLSys Foundations
[slides]
[HW0 (Ungraded) & HW1: AutoDiff]
Jan 22 Week 2: TinyTorch & Automatic Differentiation
[slides]
Jan 29 Week 3: Hardware Acceleration
[slides] Tutorial: TinyTorch [tutorial]
Feb 05 Week 4: LLM Serving & Optimizations
[slides] LLM serving & benchmarks [slides] (project option)
HW1 due & [HW2 Release: TinyLLM]
Feb 12 Week 5: Application Systems: RAG, Agents & Workflows
[slides] Tutorial: agentic workflows [tutorial] [data]

Feb 19 Week 6: Parallelism, Training Techniques & Attention Optimizations
[slides] Building agentic applications [slides] & on-device AI [slides] (project options)
HW2 due & [HW3 Release: TinyRAG]
Feb 26 Recess week
[Project 1]
[Project 2]
[Project 3]
Mar 05 Week 7: LLM Alignment & Safety
Tutorial: LLM finetuning & alignment
Project proposal due
Mar 12 Week 8: Efficient AI
Tutorial: same
HW3 due
Mar 19 Week 9: Data Systems for AI
Tutorial: data engineering
Mar 26 Week 10: ML Operations
Tutorial: same
Mid-term project report due
Apr 02 Well-Being Day
Apr 09 Week 12: Cloud Systems for AI
No tutorial
Apr 16 Week 13: Project Demos
Final project report due
Textbook:
Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems

Grading scheme:
  • Tutorials and HW1-3, completed individually by each student.
  • Course project in groups of 4 students. No individual projects. You can choose among three project directions. The project should demonstrate a system design and implementation that improves the system's efficiency, robustness, or generalizability.