December 11, 2019 @ 4:00 pm - 5:00 pm
Title: Statistical Inference for Large-Scale Data with Incomplete Labels
Presenter: Hyebin Song
Abstract: Many real-world problems involve data whose labels are only partially observed or contaminated. One example is datasets from deep mutational scanning (DMS) experiments in proteomics, which typically do not contain non-functional sequences. This talk addresses statistical inference procedures for analyzing noisy, high-dimensional binary data. In the first part of the talk, I will discuss variable selection for positive-unlabeled data when the number of features p is large. I will present the PUlasso algorithm for variable selection and classification with positive and unlabeled responses, which scales to large data sets and comes with a minimax-optimal mean-squared error guarantee. In the second part of the talk, I will discuss statistical inference procedures for data with noisy labels. Building on the key observation that the noisy-labels problem belongs to a special sub-class of generalized linear models, I will present convex and non-convex approaches for inference with statistical guarantees. Finally, I will present an application of our methodology to inferring sequence-function relationships and designing highly stabilized enzymes from large-scale DMS data.