Model-based clustering for multivariate time series of counts
Thomas, Sarah Julia
Ensor, Katherine B.; Ray, Bonnie K.
Doctor of Philosophy
This dissertation develops a modeling framework for univariate and multivariate zero-inflated time series of counts and applies the models in a clustering scheme to identify groups of count series with similar behavior. The basic modeling framework used is observation-driven Poisson regression with generalized linear model (GLM) structure. The zero-inflated Poisson (ZIP) model is employed to characterize the possibility of extra observed zeros relative to the Poisson, a common feature of count data. These two methods are combined to characterize time series of counts where the counts and the probability of extra zeros may depend on past data observations and on exogenous covariates. A key contribution of this work is a novel modeling paradigm for multivariate zero-inflated counts. The three related models considered are the jointly-inflated, the marginally-inflated, and the doubly-inflated multivariate Poisson. The doubly-inflated model encompasses both marginal inflation, which allows for additional zeros at each time epoch for each individual count series, and joint inflation, which allows for zero-inflation across all multivariate series. These models improve upon previously proposed models, which are either too rigid or too simplistic to suit a wide variety of applications. To estimate the model parameters, a new Monte Carlo Expectation-Maximization (MCEM) algorithm is developed. The Monte Carlo sampling eliminates the complex recursion formulas needed for calculating the probability function of the multivariate Poisson. The algorithm is easily adapted for different multivariate zero-inflation schemes. The new models, new estimation methods, and applications in clustering are demonstrated on simulated and real datasets.
For an application in finance, the number of trades and the number of price changes for bonds are modeled as a bivariate doubly zero-inflated Poisson time series, where observations of zero trades or zero price changes represent the liquidity risk for that bond. In an environmental science application, the new models are used in a model-based clustering scheme to study counts of high pollution events at air quality monitoring stations around Houston, Texas. Clustering reveals regions of the air monitoring network which behave similarly in terms of time dependence and response to covariates representing atmospheric conditions and physical sources of air pollution.
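As an illustration of the doubly-inflated idea described above, the following is a minimal simulation sketch, not the dissertation's own implementation. It assumes a bivariate Poisson built from a shared "common shock" component, a joint-inflation indicator that zeroes both series at once, and independent marginal-inflation indicators that zero each series separately. The function name, the parameter names (lam0, lam1, lam2, p_joint, p_marg1, p_marg2), and the constant (non-time-varying) rates are all assumptions made for illustration; the models in the dissertation let rates and zero probabilities depend on past observations and covariates.

```python
import numpy as np


def simulate_doubly_inflated_bivariate_poisson(
    n, lam0, lam1, lam2, p_joint, p_marg1, p_marg2, seed=0
):
    """Draw n observations from an assumed doubly-inflated bivariate Poisson.

    Hypothetical parameterization (not taken from the dissertation):
      lam0             rate of the shared component (induces correlation)
      lam1, lam2       rates of the series-specific components
      p_joint          probability that both series are forced to zero
      p_marg1, p_marg2 probability that series 1 / series 2 alone is forced to zero
    """
    rng = np.random.default_rng(seed)

    # Bivariate Poisson via a common-shock construction.
    common = rng.poisson(lam0, size=n)
    x1 = rng.poisson(lam1, size=n) + common
    x2 = rng.poisson(lam2, size=n) + common

    # Joint inflation: one indicator zeroes both series at the same epoch.
    joint_zero = rng.random(n) < p_joint
    # Marginal inflation: separate indicators zero each series on its own.
    marg_zero1 = rng.random(n) < p_marg1
    marg_zero2 = rng.random(n) < p_marg2

    x1 = np.where(joint_zero | marg_zero1, 0, x1)
    x2 = np.where(joint_zero | marg_zero2, 0, x2)
    return np.column_stack([x1, x2])


counts = simulate_doubly_inflated_bivariate_poisson(
    n=5000, lam0=0.5, lam1=2.0, lam2=3.0,
    p_joint=0.2, p_marg1=0.1, p_marg2=0.1,
)
# Share of epochs where both series are zero, and each series' zero share.
both_zero_share = np.mean((counts == 0).all(axis=1))
zero_share_per_series = np.mean(counts == 0, axis=0)
```

Setting p_marg1 and p_marg2 to zero in this sketch gives a jointly-inflated series, and setting p_joint to zero gives a marginally-inflated one, which mirrors the relationship among the three models named in the abstract.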