CS4262/5462 Machine Learning Systems

The rise of Large Language Models (LLMs) has transformed what AI systems can do, but delivering these capabilities reliably and efficiently requires advances in machine learning systems. This course covers the foundations and current practice of ML systems for LLMs, spanning training, serving, and applications. Students will learn key system design principles, how to optimize compute and memory, and how to accelerate end-to-end performance with modern hardware and software techniques. The course also examines applied LLM systems, including RAG, AI agents, and ML operations, as well as alignment and safety considerations in deployment. It provides a practical and research-oriented background for students interested in building or studying production-grade machine learning systems. By the end of the course, students are expected to implement and demonstrate end-to-end systems.

Prerequisites: UG machine learning, UG operating systems, Python coding.
Instructor: Yao Lu
TAs: Junyi Shen, Noppanat Wadlom, Zhengyuan Su, Wangcheng Tao
When and where: Thu 18:30-20:30 (lecture), 20:30-21:30 (tutorial) @ COM3-MPH

Schedule:

Lecture date	Plan	Note
Jan 15	Week 1: Introduction & MLsys Foundations [slides]	[HW0 (Ungraded) & HW1: AutoDiff]
Jan 22	Week 2: TinyTorch & Automatic Differentiation [slides]
Jan 29	Week 3: Hardware Acceleration [slides] Tutorial: TinyTorch [tutorial]
Feb 05	Week 4: LLM Serving & Optimizations [slides] LLM serving & benchmarks [slides] (project option)	HW1 due & [HW2 Release: TinyLLM]
Feb 12	Week 5: Application Systems: RAG, Agents & Workflows [slides] Tutorial: agentic workflows [tutorial] [data]
Feb 19	Week 6: Parallelism, Training Techniques & Attention Optimizations [slides] Building agentic applications [slides] & on-device AI [slides] (project options)	HW2 due & [HW3 Release: TinyRAG]
Feb 26	Recess week	[Project 1] [Project 2] [Project 3]
Mar 05	Week 7: LLM Alignment & Safety Tutorial: LLM finetuning & alignment	Project proposal due
Mar 12	Week 8: Efficient AI Tutorial: same	HW3 due
Mar 19	Week 9: Data Systems for AI Tutorial: data engineering
Mar 26	Week 10: ML Operations Tutorial: same	Mid-term project report due
Apr 02	Well-Being Day
Apr 09	Week 12: Cloud Systems for AI No tutorial
Apr 16	Week 13: Project Demos	Final project report due

Textbook:
Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems

Grading scheme:

Tutorials and HW1-3 are completed individually by each student.
The course project is done in groups of 4 students; individual projects are not allowed. Groups can choose among three project directions. The project should demonstrate a system design and implementation that improves system efficiency, robustness, or generalizability.