January 22 @ 4:00 pm
Title: Navigation and Evaluation in High-Dimensional Data
Presenter: Kris Sankaran
Abstract: In the modern data analysis paradigm, fitting models is easy, but knowing how to design or evaluate them is difficult. In this talk, we will adapt insights from graphical statistics and goodness-of-fit testing to modern problems. We motivate and illustrate our methodology through real-world applications in microbiome genomics and climate systems science. The data we will encounter are rich in tree, spatial, and temporal structure.
This structure complicates the three practical tasks around which this talk revolves: data exploration, model formulation, and model evaluation. For the microbiome, we show how linking complementary displays can make it easy to interactively query structure in raw data. Our implementation is available as an R package, treelapse. We then describe connections between microbiome and text data, and how those connections suggest novel modeling strategies, visual summaries, and diagnostics. The experiments leading to our recommendations can be reproduced at https://github.com/krisrs1128/microbiome_plvm. Finally, we explain how artificial intelligence can be used to accelerate climate simulations, and introduce techniques for characterizing goodness-of-fit of the resulting models, inspired by Neyman’s smooth test.
Viewed broadly, these projects provide opportunities for human interaction in the automated data-processing regime, facilitating (1) streamlined navigation of data and (2) critical evaluation of models.