The rise of Large Language Models (LLMs) has transformed what AI systems can do, but delivering these capabilities reliably and efficiently requires advances in machine learning systems. This course covers the foundations and current practice of ML systems for LLMs, spanning training, serving, and applications. Students will learn key system design principles, how to optimize compute and memory, and how to accelerate end-to-end performance with modern hardware and software techniques. The course also examines applied LLM systems, including RAG, AI agents, and ML operations, as well as alignment and safety considerations in deployment. It provides a practical, research-oriented background for students interested in building or studying production-grade machine learning systems. Students are expected to code and demonstrate end-to-end systems as the outcome of this course.

Prerequisites: undergraduate machine learning, undergraduate operating systems, Python coding.

Instructor: Yao Lu
TAs: Junyi Shen, Noppanat Wadlom, Zhengyuan Su
When and where: Thu 18:30-20:30 (lecture), 20:30-21:30 (tutorial) @ COM3-MPH

Schedule:

Lecture date | Plan | Note
Jan 15 | Week 1: Introduction & MLsys Foundations | [HW1 Release: AutoDiff]
Jan 22 | Week 2: TinyTorch & Automatic Differentiation |
Jan 29 | Week 3: Hardware Acceleration; Tutorial: TinyTorch |
Feb 05 | Week 4: LLM Serving & Optimizations; Tutorial: LLM serving & benchmarks (project option) | HW1 due & [HW2 Release: TinyLLM]
Feb 12 | Week 5: Application Systems: RAG, Agents & Deep Research; Tutorial: agentic workflows |
Feb 19 | Week 6: Parallelism & Training Techniques; Tutorial: building agentic applications (project option) | HW2 due & [HW3 Release: TinyRAG]
Feb 26 | Recess week |
Mar 05 | Week 7: Efficient AI; Tutorial: on-device AI (project option) | Project proposal due
Mar 12 | Week 8: LLM Alignment & Safety; Tutorial: efficient AI | HW3 due
Mar 19 | Week 9: Data Systems for AI; Tutorial: LLM alignment in action |
Mar 26 | Week 10: ML Operations; Tutorial: data engineering | Mid-term project report due
Apr 02 | Well-Being Day |
Apr 09 | Week 12: Cloud Systems for AI; Tutorial: ML operations |
Apr 16 | Week 13: Project Demos | Final project report due

Textbook: Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems

Grading scheme:
- Tutorials and HW1-3: completed by each student individually.
- Course project: done in groups of 2-3 students; no individual projects. You can choose among three project directions. The project should demonstrate systems design and implementation that improve system efficiency, robustness, or generalizability.