2021-2022 Statistics Department Seminars

Jianhua Huang: September 15, 2021
Abstract: Estimation of distances of galaxies is one essential component to determine the age and composition of our Universe. The Triangulum Galaxy (catalogued as M33) is the third-largest member of the Local Group of galaxies, and one of the most distant permanent objects that can be viewed with the naked eye. Mira variable stars are a class of pulsating stars characterized by very red colors. In an interdisciplinary project, we used available light curve data to infer the periods and period-luminosity relations (PLR) of Miras in M33, and then infer the distance of M33 by reference to the known distance of the LMC, a satellite galaxy of the Milky Way. In this talk, I shall discuss various challenges we faced in this project, including dealing with data sparsity, conducting genuine simulation, reducing computational cost, and quantifying uncertainty. I shall also show how statistical ideas were brought into play in dealing with these challenges. The statistical concepts or methods used include Gaussian process regression, semi-parametric inference, Bayesian hierarchical models, and stochastic variational inference.

Jonas Kahn: September 22, 2021
Abstract: Quantum statistics require a mathematical framework different from that of classical statistics: if we can measure A or B, we cannot in general, even theoretically, measure A and B. We’ll start the talk by introducing the quantum statistics framework.

The key object is the state, encoded by a positive semi-definite matrix with trace one. In practice it will often be low-rank. In particular, the states usually considered in physics classes, namely the pure states, are the rank-one states.

We will show that a very simple and computationally efficient procedure allows rate-optimal estimation of a state or of a channel (quantum transformation), while being adaptive to the rank, and allowing the experimentalist to know in real-time if the rank is low enough to stop acquiring more data.

Song Gao: September 29, 2021
Abstract: To contain the COVID-19 spread, one of the nonpharmaceutical interventions is physical (social) distancing. An interactive web-based mapping platform, which provides daily human mobility information using large-scale anonymized mobile phone location data in the US, was developed by the Geospatial Data Science Lab at UW-Madison. Using such multiscale origin-to-destination (OD) travel flow data, a novel mobility-augmented epidemic model was developed to help analyze the COVID-19 spread dynamics at multiple geographical scales (e.g., state, county, and neighborhood), inform public health policy, and deepen our understanding of human behavior under the unprecedented public health crisis.

Jelena Diakonikolas: October 13, 2021
Abstract: Empirical Risk Minimization (ERM) problems are central to machine learning, and their efficient optimization has been studied from different perspectives, often taking advantage of the finite sum structure present in typical problem formulations. In particular, tight oracle complexity bounds have been obtained under fairly general assumptions about the loss functions. In this talk, I will present a rather surprising and general result that takes advantage of the separability of nonsmooth convex loss functions with efficiently computable proximal operators — such as, e.g., the hinge loss and the sum of absolute errors — to obtain an algorithm that exhibits significantly lower complexity than what is predicted by the lower bounds for general nonsmooth convex losses. The talk is based on joint work with Chaobing Song and Stephen Wright.
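For readers unfamiliar with proximal operators, the following minimal Python sketch shows the closed-form proximal maps of the two losses named in the abstract, in one dimension. It only illustrates what "efficiently computable" means in this setting and is not the algorithm presented in the talk; the function names and the step parameter `lam` are illustrative assumptions.

```python
import math


def prox_abs(v: float, lam: float) -> float:
    """Proximal operator of lam * |x| at v (soft-thresholding)."""
    # Solves argmin_x  lam * |x| + 0.5 * (x - v) ** 2
    return math.copysign(max(abs(v) - lam, 0.0), v)


def prox_hinge(v: float, lam: float) -> float:
    """Proximal operator of lam * max(0, 1 - x) at v."""
    # Solves argmin_x  lam * max(0, 1 - x) + 0.5 * (x - v) ** 2
    if v < 1.0 - lam:
        return v + lam  # hinge is active: move toward the margin by lam
    if v <= 1.0:
        return 1.0      # minimizer sits exactly on the kink at x = 1
    return v            # hinge is inactive: the loss is zero, nothing moves


if __name__ == "__main__":
    print(prox_abs(3.0, 1.0), prox_hinge(0.0, 0.5), prox_hinge(2.0, 0.5))
```

Both maps cost a constant number of operations per coordinate, which is the kind of structure a separable nonsmooth loss makes available.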

Aaron Clauset: October 20, 2021
Abstract: Predicting missing links in networks is a fundamental task in network analysis and modeling. However, current link prediction algorithms exhibit wide variations in their accuracy, and we lack a general understanding of which methods work better in which contexts. In this talk, I’ll describe a novel meta-learning solution to this problem, which makes predictions that appear to be nearly optimal by learning to combine three classes of prediction methods: community detection algorithms, structural features like degrees and triangles, and network embeddings. We evaluate 203 component methods individually and in stacked generalization on (i) synthetic data with known structure, for which we analytically calculate the optimal link prediction performance, and (ii) a large corpus of 550 structurally diverse networks from social, biological, technological, information, economic, and transportation domains. Across settings, supervised stacking nearly always performs best and produces nearly optimal performance on synthetic networks. Moreover, we show that accuracy saturates quickly, and near-optimal predictions typically require only a handful of component methods. Applied to real data, we quantify the utility of each method on different types of networks, and then show that the difficulty of predicting missing links varies considerably across domains: it is easiest in social networks and hardest in technological networks. I’ll close with forward-looking comments on the limits of predictability for missing links in complex networks and on the utility of stacked generalizations for achieving them.

Barry Nussbaum: October 27, 2021
Abstract: Statisticians have long known that success in our profession frequently depends on our ability to succinctly explain our results so decision makers may correctly integrate our efforts into their actions. However, this is no longer enough. While we still must make sure that we carefully present results and conclusions, the real difficulty is what the recipient thinks we just said. The situation becomes more challenging in the age of “big data”. This presentation will discuss what to do, and what not to do. Examples, including those used in court cases, executive documents, and material presented for the President of the United States, will illustrate the principles.

Arun Kuchibhotla: November 3, 2021
The HulC: Hull based Confidence Regions
Abstract: We develop and analyze the HulC, an intuitive and general method for constructing confidence sets using the convex hull of estimates constructed from subsets of the data. Unlike classical methods which are based on estimating the (limiting) distribution of an estimator, the HulC is often simpler to use and effectively bypasses this step. In comparison to the bootstrap, the HulC requires fewer regularity conditions and succeeds in many examples where the bootstrap provably fails. Unlike subsampling, the HulC does not require knowledge of the rate of convergence of the estimators on which it is based. The validity of the HulC requires knowledge of the (asymptotic) median-bias of the estimators. We further analyze a variant of our basic method, called the Adaptive HulC, which is fully data-driven and estimates the median-bias using subsampling. We show that the Adaptive HulC retains the aforementioned strengths of the HulC. In certain cases where the underlying estimators are pathologically asymmetric, the HulC and Adaptive HulC can fail to provide useful confidence sets. We discuss these methods in the context of several challenging inferential problems which arise in parametric, semi-parametric, and non-parametric inference. Although our focus is on validity under weak regularity conditions, we also provide some general results on the width of the HulC confidence sets, showing that in many cases the HulC confidence sets have near-optimal width.
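The hull construction can be illustrated with a minimal Python sketch for a one-dimensional parameter. It assumes the estimator has zero median bias, in which case the hull of B independent batch estimates misses the target with probability at most 2^(1-B). The full method described in the abstract also accounts for nonzero median bias, which this sketch omits. The function name `hulc_interval` and its arguments are hypothetical.

```python
import math
import random
import statistics


def hulc_interval(data, estimator, alpha=0.05, seed=0):
    """Simplified hull-based interval, ASSUMING zero median bias.

    Returns (lower, upper, number_of_batches).
    """
    # Smallest B with 2 ** (1 - B) <= alpha under the zero-median-bias assumption.
    num_batches = math.ceil(math.log2(2.0 / alpha))
    if len(data) < num_batches:
        raise ValueError("need at least one observation per batch")

    # Randomly split the data into B disjoint batches of near-equal size.
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    batches = [shuffled[i::num_batches] for i in range(num_batches)]

    # In one dimension the convex hull of the batch estimates is [min, max].
    estimates = [estimator(batch) for batch in batches]
    return min(estimates), max(estimates), num_batches


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(0.0, 1.0) for _ in range(600)]
    print(hulc_interval(sample, statistics.mean))
```

Note that no variance estimate or limiting distribution is used anywhere: the interval comes entirely from the spread of the batch estimates.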

Elizabeth Ogburn: November 10, 2021
Abstract: Nonsense associations can arise when an exposure and an outcome of interest exhibit similar patterns of dependence. Confounding is present when potential outcomes are not independent of treatment. This talk will describe how confusion about these two phenomena results in shortcomings in popular methods in two areas: causal inference with multiple treatments and unmeasured confounding, and causal and statistical inference with social network data. For each of these two areas I will demonstrate the flaws in existing methods and describe new methods that were inspired by careful consideration of dependence and confounding.

Sydeaka Watson: November 17, 2021
Abstract: In the ten years since I earned a statistics Ph.D., I’ve enjoyed a diverse set of professional experiences as a professor, biostatistician, data scientist, and consultant with jobs touching multiple industries – including government, academia, telecommunications, and pharmaceuticals. I’ve supported and led STEM diversity efforts as Chair of the Committee on Minorities in Statistics in the American Statistical Association (ASA), co-Organizer of the Dallas chapter of Blacks in Technology, and as Organizer of the Dallas chapter of R-Ladies Global. I’ve volunteered my data science expertise to tackle important social justice issues. And I have also experienced the joys and challenges of starting my own data science consulting practice.
Wow, what a ride!

In this presentation, I will reflect on my academic and professional journey, noting the skills that helped me bridge the gap between statistics and data science. I will highlight selected statistics and data science projects that I’ve worked on in my various roles, both personally and professionally. Finally, I will also share my current research interests and discuss the types of technologies and modeling strategies that are useful in my work.

Pratheepa Jeganathan: December 1, 2021
Abstract: High-throughput sequencing generates massive molecular microbial datasets that pose several statistical challenges. As a result, statistical methods have been developed to address contamination sequences from reagents, unequal sampling, strain switching (a taxon present as one strain in one set of specimens and as a close but distinct strain in the other set), sparsity, and heterogeneity.

One of the important goals in microbiome research is often to find taxonomic differences across environments or groups. In this talk, we will demonstrate differential topic analysis that facilitates inferences on latent microbial communities when strain switching can be an impediment.

First, in the presence of DNA contamination, we quantify true abundance using Bayesian reference analysis. Next, we present a data similarity matrix-based method to detect strain switching. Then, we will show how to use topic models to provide useful aggregates for differential abundance analysis based on topics rather than individual strains, using the R package diffTop, available on GitHub.

Yuan Zhang: December 8, 2021
Abstract: Network method of moments is an important tool for nonparametric inference of relational data, but a long-existing open challenge is fast and accurate approximation to the sampling distributions of network moments. In this paper, we present the first result with provable higher-order accuracy. Sharply contrasting the classical scenario of noiseless U-statistics, we discover, with surprise, that in the network setting, two typically-hated factors — sparsity and observational errors — can jointly contribute a blessing “self-smoothing” effect that reinstates the validity of Edgeworth expansions under much weaker assumptions.

For practitioners, our easy-to-implement empirical method is faster and more accurate than other state-of-the-art methods. It is also versatile, making no substantial assumption on network structure apart from exchangeability and (conditionally) independent edge generation.

We showcase several applications of our results to inference on network moments: 1. providing the first proof that some popular network bootstrap schemes have higher-order accuracy; 2. explicitly formulating Cornish-Fisher confidence intervals and one-sample tests, both with accurate level controls. If time permits, I will also discuss the application to network two-sample moment method.

Chan Park (Student talk): December 15, 2021
Abstract: For several decades, Senegal has faced inadequate water, sanitation, and hygiene (WASH) facilities in households, contributing to persistent, high levels of communicable diarrheal diseases. Unfortunately, the ideal WASH policy where every household in Senegal installs WASH facilities is impossible due to logistical and budgetary concerns. This work proposes to estimate an optimal allocation rule of WASH facilities in Senegal by combining recent advances in personalized medicine and partial interference in causal inference. Our allocation rule helps public health officials in Senegal decide what fraction of total households in a region should get WASH facilities based on block-level and household-level characteristics. We characterize the excess risk of the allocation rule and show that our rule outperforms other allocation policies in Senegal.

Hanbaek Lyu: February 23, 2022
Abstract: Stochastic majorization-minimization (SMM) is an online extension of the classical principle of majorization-minimization, which consists of sampling i.i.d. data points from a fixed data distribution and minimizing a recursively defined majorizing surrogate of an objective function. In this talk, we introduce stochastic block majorization-minimization, where the surrogates can now be only block multi-convex and a single block is optimized at a time within a diminishing radius or with a proximal regularization. Relaxing the standard strong convexity requirements for surrogates in SMM, our framework gives wider applicability, including online CANDECOMP/PARAFAC (CP) dictionary learning, and yields greater computational efficiency, especially when the problem dimension is large. We provide an extensive convergence analysis of the proposed algorithm, which we derive under possibly dependent data streams, relaxing the standard i.i.d. assumption on data samples. We show that the proposed algorithm converges almost surely to the set of stationary points of a nonconvex objective under constraints at a rate O((\log n)^{1+\eps}/n^{1/2}) for the empirical loss function and O((\log n)^{1+\eps}/n^{1/4}) for the expected loss function, where n denotes the number of data samples processed. Under some additional assumptions, the latter convergence rate can be improved to O((\log n)^{1+\eps}/n^{1/2}). Our results provide the first convergence rate bounds for various online matrix and tensor decomposition algorithms under a general Markovian data setting.

Isabel Fulcher: March 2, 2022
Abstract: The use of causal mediation analysis to evaluate the pathways by which an exposure affects an outcome is widespread in the social and biomedical sciences. Recent advances in this area have established formal conditions for identification and estimation of natural direct and indirect effects. However, these conditions typically involve stringent no-unmeasured-confounding assumptions and require that the mediator be measured without error. These assumptions may fail to hold in the practical settings where mediation methods are often applied. I will give a detailed overview of the assumptions necessary for identification of indirect effects and describe estimators of indirect effects that are robust to forms of unmeasured confounding.

Ann Lee: March 9, 2022
Abstract: Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, outside the asymptotic and low-dimensional regimes.

Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage. In this talk, I will describe our group’s recent and ongoing research on developing scalable and modular procedures for (i) constructing Neyman confidence sets with finite-sample guarantees of nominal coverage, and for (ii) computing diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can be adapted to LF2I to create valid confidence sets and diagnostics, without costly Monte Carlo samples at fixed parameter settings. In my talk, I will discuss where we stand with LF2I and challenges that still remain. (Part of these efforts are joint with Niccolo Dalmasso, Rafael Izbicki, Luca Masserano, Tommaso Dorigo, Mikael Kuusela, and David Zhao. An earlier version of this work can be found on arXiv:2107.03920)

Yosef Rinott: March 23, 2022
Abstract: Quantum supremacy refers to the ability of a quantum computer to perform any task, possibly contrived and useless, which a classical computer cannot perform in reasonable time. Various important implications follow if indeed quantum supremacy is proved, and I will discuss some of them. Recently a group in Google and two groups in China claimed to have proved quantum supremacy. Such proofs involve various statistical aspects due to the noisy nature of quantum computers, and the nature of the tasks they perform, and I will discuss these aspects. Most of the lecture will be non-technical, but given time, I will try to go into details of some of the statistical issues that arise. Spoiler: I will not pass judgement about whether quantum supremacy has indeed been demonstrated.

Maria De Arteaga: April 6, 2022
Abstract: Machine learning (ML) is increasingly being used to support decision-making in many organizational settings. However, there is currently a gap between the design and evaluation of ML algorithms and the functional role of these algorithms as tools for decision support. The first part of the talk will highlight the role of humans-in-the-loop, and the importance of evaluating decisions instead of predictions, through a study of the adoption of a risk assessment tool in child maltreatment hotline screenings. The second part of the talk will focus on the gap between the construct of interest and the proxy that the algorithm optimizes for. Using a proposed machine learning methodology that extracts knowledge from experts’ historical decisions, we show that in the context of child maltreatment hotline screenings (1) there are high-risk cases whose risk is considered by the experts but not wholly captured in the target labels used to train a deployed model, and (2) we can bridge this gap if we purposefully design with this goal in mind.

Can Le: April 13, 2022
Abstract: Network analysis has been commonly used to study the interactions between units of complex systems. One problem of particular interest is learning the network’s underlying connection pattern given a single and noisy instantiation. While many methods have been proposed to address this problem in recent years, they usually assume that the true model belongs to a known class, which is not verifiable in most real-world applications. Consequently, network modeling based on these methods either suffers from model misspecification or relies on additional model selection procedures that are not well understood in theory and can potentially be unstable in practice. To address this difficulty, we propose a mixing strategy that leverages available arbitrary models to improve their individual performances. The proposed method is computationally efficient and almost tuning-free; thus, it can be used as an off-the-shelf method for network modeling. We show that the proposed method performs as well as the oracle estimate when the true model is included among the individual candidates. More importantly, the method remains robust and outperforms all current estimates even when the models are misspecified. Extensive simulation examples are used to verify the advantage of the proposed mixing method. Evaluation of link prediction performance on 385 real-world networks from six domains also demonstrates the universal competitiveness of the mixing method across multiple domains.

Jesus Arroyo: April 20, 2022
Abstract: Graph matching is the problem of aligning the vertices of two unlabeled graphs in order to maximize the shared structure across them. This talk will highlight some recent likelihood-based approaches for this problem. The first part of the talk will focus on unipartite networks, where one of the graphs is an errorfully observed copy of the other. We present necessary and sufficient conditions for consistency of the maximum likelihood estimator, and use these to study matchability in different families of random graphs. The second part will discuss the problem of graph matching between bipartite and unipartite networks. We formulate the problem via undirected graphical models, and study the connections between graph matching and graphical model estimation. The methods are illustrated in simulated data and real brain networks.

Ryan Murray: April 27, 2022
Abstract: One fundamental geometric quantity in robust statistics is known as a depth function, which generalizes the notion of quantiles and medians to multiple dimensions. This talk will discuss recent work (in collaboration with Martin Molina-Fructuoso) which connects certain types of data depths (specifically Tukey/Halfspace depths) with Hamilton-Jacobi equations, a class of first-order partial differential equations that is fundamental to control theory. These equations provide potential new approaches for theoretical, modeling, and computational aspects of data depths. Connections to convex geometry and a number of related open problems will also be discussed.
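As background on the halfspace depth mentioned above, here is a minimal Python sketch of the empirical Tukey depth of a point in two dimensions: the smallest fraction of sample points contained in any closed halfspace whose boundary passes through the point. The sketch scans a finite grid of directions, so it can only overestimate the true depth, and it is unrelated to the Hamilton-Jacobi approach of the talk. The function name and the `num_directions` parameter are illustrative assumptions.

```python
import math


def tukey_depth_2d(point, sample, num_directions=360):
    """Approximate halfspace (Tukey) depth of `point` in a 2-D `sample`."""
    if not sample:
        raise ValueError("sample must be non-empty")
    px, py = point
    smallest = len(sample)
    for k in range(num_directions):
        angle = 2.0 * math.pi * k / num_directions
        ux, uy = math.cos(angle), math.sin(angle)
        # Count sample points in the closed halfspace through `point`
        # that lies on the side the direction (ux, uy) points to.
        count = sum(
            1 for (x, y) in sample
            if (x - px) * ux + (y - py) * uy >= 0.0
        )
        smallest = min(smallest, count)
    # Minimum over the scanned directions, as a fraction of the sample.
    return smallest / len(sample)


if __name__ == "__main__":
    cross = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
    print(tukey_depth_2d((0.0, 0.0), cross))
    print(tukey_depth_2d((10.0, 10.0), cross))
```

Points deep inside the data cloud receive high depth and outlying points receive depth near zero, which is the sense in which depth extends medians and quantiles beyond one dimension.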