High-dimensional and dependent data with additional structure
Doctor of Philosophy
The age of computing has enabled the collection of massive amounts of data. These data present numerous statistical challenges, because many data sets are high-dimensional and dependent. While statistical inference for high-dimensional and dependent data is challenging, many data come with additional structure that can be exploited to facilitate statistical inference. This thesis considers two widely used classes of models for high-dimensional and dependent data with additional structure: high-dimensional multivariate time series and exponential-family random graph models.
In the case of high-dimensional multivariate time series, there is often additional structure in the form of spatial structure, e.g., air pollution is measured by monitors whose geographical locations are known. If air pollutants cannot travel long distances, then the estimation of past-present and present-present dependencies of air pollution at monitors can be restricted to short distances. Here, a novel two-step estimation approach is proposed to estimate the range of dependence along with the parameters of multivariate time series in high-dimensional settings. Theoretical results show that the two-step estimation approach reduces statistical error in high-dimensional settings. Simulation results confirm that the two-step estimation approach reduces statistical error and computing time. An application to air pollution in the U.S. demonstrates that the two-step estimation approach yields results that are consistent with scientific knowledge, whereas estimation approaches that ignore the spatial structure report results that conflict with scientific knowledge.
In the case of exponential-family random graph models, additional structure is likewise common, e.g., many networks, such as insurgencies and terrorist networks, are known to be local in nature. Here, a novel two-step estimation approach is proposed to estimate the local structure along with the dependence pattern of networks. The proposed two-step estimation approach can be implemented in parallel and hence paves the way for massive-scale estimation of exponential-family random graph models. Theoretical results are provided along with simulation results. An application to a large Amazon product network demonstrates the usefulness of the proposed two-step estimation approach.
Dependent data; High-dimensional data; Vector autoregressive process; Exponential-family random graph model; Local dependence
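The idea of restricting the estimation of dependencies to short distances, as described in the abstract for the time series case, can be illustrated with a small simulation. The sketch below is not the two-step approach of the thesis: it assumes a first-order vector autoregressive process, treats the range of dependence (`radius`) as known rather than estimated, and uses plain row-wise least squares. All function names and parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np


def simulate_var1(A, T, rng, noise_sd=1.0):
    """Simulate T observations from x_t = A x_{t-1} + e_t, starting at zero."""
    p = A.shape[0]
    X = np.zeros((T, p))
    for t in range(1, T):
        X[t] = A @ X[t - 1] + noise_sd * rng.standard_normal(p)
    return X


def fit_var1_restricted(X, coords, radius):
    """Row-wise least squares for a VAR(1) coefficient matrix.

    For each series i, only series located within `radius` of i are used as
    predictors; all other coefficients are fixed at zero.
    """
    past, present = X[:-1], X[1:]
    p = X.shape[1]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A_hat = np.zeros((p, p))
    for i in range(p):
        nbrs = np.flatnonzero(dist[i] <= radius)  # always contains i itself
        coef, *_ = np.linalg.lstsq(past[:, nbrs], present[:, i], rcond=None)
        A_hat[i, nbrs] = coef
    return A_hat


# Illustrative setting (assumed): 30 monitors on the unit square, 60 time points.
rng = np.random.default_rng(0)
p, T, radius = 30, 60, 0.2
coords = rng.uniform(size=(p, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)

# True coefficients are nonzero only within `radius`; rows are rescaled so that
# absolute row sums equal 0.5, which keeps the process stable.
A_true = np.where(dist <= radius, 0.1, 0.0)
A_true *= 0.5 / np.abs(A_true).sum(axis=1, keepdims=True)

X = simulate_var1(A_true, T, rng)
A_local = fit_var1_restricted(X, coords, radius)   # uses the spatial structure
A_full = fit_var1_restricted(X, coords, np.inf)    # ignores the spatial structure

err_local = np.linalg.norm(A_local - A_true)
err_full = np.linalg.norm(A_full - A_true)
print(f"Frobenius error, restricted to short distances: {err_local:.3f}")
print(f"Frobenius error, unrestricted:                  {err_full:.3f}")
```

In this toy setting the restricted fit estimates only a handful of coefficients per monitor, whereas the unrestricted fit estimates 30 coefficients per monitor from 59 usable time points. This is the kind of situation in which exploiting spatial structure would be expected to reduce statistical error.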