Steve Mussmann
Assistant Professor, School of Computer Science, Georgia Tech
Research interests include data-centric ML, active labeling/learning, and data selection.
Affiliations: Foundations of AI (FoAI), ML@GT
Contact: mussmann@gatech.edu, KACB 3320
About me
Bio
Prior to starting at Georgia Tech in Fall 2024, Steve was a full-time machine learning researcher at Coactive AI. He finished a postdoc at the Paul Allen School of Computer Science and Engineering at the University of Washington with Kevin Jamieson and Ludwig Schmidt in September 2023. Steve graduated with a PhD in computer science from Stanford University in 2021, advised by Percy Liang, and a BS in math, statistics, and computer science from Purdue University in 2015.
Research
Machine learning is a tool that is being incorporated into a rapidly growing number and variety of systems and processes in society. My research is driven by making ML easier to use, more effective, and more likely to be used in beneficial ways. This often takes the form of abstracting machine learning issues (data efficiency, interpretability, robustness, etc.) from specific application areas (computer vision, NLP, computational biology, etc.) to discover insights that lead to more useful algorithms and more reliable best practices. By using a mix of theoretical and experimental techniques, my research takes a broad perspective while ensuring practical relevance.
Research on learning algorithms has seen remarkable progress over the past decade, especially with regard to text and images, and this progress has ignited interest in machine learning. While the learning algorithm is critical to an ML system, many other aspects are under-studied, including data sourcing, pre-processing, annotation, cleaning, validation, and monitoring, all of which significantly affect the reliability and usability of the system. My work often falls under the umbrella of data-centric machine learning, where the focus is on improving the quality of the data while the model architecture and optimization algorithm are held fixed.
Much of my previous work falls into one of two categories:
Active Labeling/Learning: human supervision and interaction with nature (experiments) can be expensive and slow. When collecting labels is expensive, can we design efficient algorithms that iteratively choose which data to label, so that the cost and effort of labeling decrease significantly?
Data Selection: given increasingly large and noisy data sets, training on all available data can be expensive and can yield sub-optimal performance for specific tasks. Can we efficiently select training data that yield more accurate predictors?