The rise of Large Language Models (LLMs) has transformed what AI systems can do, but delivering these capabilities reliably and efficiently requires advances in machine learning systems. This course covers the foundations and current practice of ML systems for LLMs, spanning training, serving, and applications. Students will learn key system design principles, how to optimize compute and memory, and how to accelerate end-to-end performance with modern hardware and software techniques. The course also examines applied LLM systems, including RAG, AI agents, and ML operations, as well as alignment and safety considerations in deployment. It provides a practical and research-oriented background for students interested in building or studying production-grade machine learning systems. By the end of the course, students are expected to implement and demonstrate end-to-end systems.

Prerequisites: undergraduate (UG) machine learning, UG operating systems, Python coding.

Instructor: Yao Lu

TAs: Junyi Shen, Noppanat Wadlom, Zhengyuan Su, Wangcheng Tao

When and where: Thu 18:30-20:30 (lecture), 20:30-21:30 (tutorial) @ COM3-MPH

Schedule:

| Lecture date | Plan | Note |
| --- | --- | --- |
| Jan 15 | Week 1: Introduction & MLsys Foundations [slides] | [HW0 (Ungraded) & HW1: AutoDiff] |
| Jan 22 | Week 2: TinyTorch & Automatic Differentiation [slides] | |
| Jan 29 | Week 3: Hardware Acceleration [slides]; Tutorial: TinyTorch [tutorial] | |
| Feb 05 | Week 4: LLM Serving & Optimizations [slides]; LLM serving & benchmarks [slides] (project option) | HW1 due & [HW2 Release: TinyLLM] |
| Feb 12 | Week 5: Application Systems: RAG, Agents & Workflows [slides]; Tutorial: agentic workflows [tutorial] [data] | |
| Feb 19 | Week 6: Parallelism, Training Techniques & Attention Optimizations [slides]; Building agentic applications [slides] & on-device AI [slides] (project options) | HW2 due & [HW3 Release: TinyRAG] |
| Feb 26 | Recess week | [Project 1] [Project 2] [Project 3] |
| Mar 05 | Week 7: LLM Alignment & Safety; Tutorial: LLM finetuning & alignment | Project proposal due |
| Mar 12 | Week 8: Efficient AI; Tutorial: same | HW3 due |
| Mar 19 | Week 9: Data Systems for AI; Tutorial: data engineering | |
| Mar 26 | Week 10: ML Operations; Tutorial: same | Mid-term project report due |
| Apr 02 | Well-Being Day | |
| Apr 09 | Week 12: Cloud Systems for AI; No tutorial | |
| Apr 16 | Week 13: Project Demos | Final project report due |

Textbook: Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems

Grading scheme: Tutorials and HW1-3 are completed by each student individually. The course project is done in groups of 4 students; individual projects are not allowed. Groups can choose among three project directions. The project should demonstrate systems design and implementation that improve the system's efficiency, robustness, or generalizability.