Locally-adaptive polynomial-smoothed histograms with application to massive and pre-binned data sets
Author
Papkov, Galen I.
Date
2008
Advisor
Scott, David W.
Degree
Doctor of Philosophy
Abstract
Data-driven research is often hampered by privacy restrictions that leave researchers with limited datasets or with graphical representations in place of the actual data. This dissertation develops a variety of nonparametric techniques that circumvent these issues by using local moment information. Scott and Sagae's local-moment nonparametric density estimation method, the smoothed polynomial histogram, provides the foundation for this research. The method uses binned data and their sample moments, such as the bin areas, means, and variances, to estimate the underlying distribution of the data via polynomial splines. However, its optimization problem does not account for the differing amounts of data across bins. In addition, more emphasis or trust should be placed on lower-order moments, as they tend to be more accurate than higher-order moments. Hence, to ensure fidelity to the data and its local sample moments, this research incorporates a weight matrix into the optimization problem.
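As a rough illustration of the local moment information mentioned above, the following sketch computes per-bin counts, means, and variances from raw one-dimensional data. It is not taken from the dissertation: the function name `bin_moments`, the use of NumPy, and the choice of the population (divide-by-count) variance are all assumptions made for this example.

```python
import numpy as np

def bin_moments(x, edges):
    """Return per-bin counts, means, and (population) variances of 1-D data.

    Hypothetical helper for illustration only; empty bins yield NaN
    for the mean and variance.
    """
    x = np.asarray(x, dtype=float)
    edges = np.asarray(edges, dtype=float)
    n_bins = len(edges) - 1

    # np.digitize returns 1-based indices for values in [edges[0], edges[-1])
    idx = np.digitize(x, edges) - 1
    # Assign values equal to the right-most edge to the last bin
    idx[x == edges[-1]] = n_bins - 1

    # Discard values that fall outside the binned range
    keep = (idx >= 0) & (idx < n_bins)
    idx, x = idx[keep], x[keep]

    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=x, minlength=n_bins)
    sq_sums = np.bincount(idx, weights=x**2, minlength=n_bins)

    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts
        variances = sq_sums / counts - means**2
    return counts, means, variances
```

Summaries of this kind can be shared without releasing individual observations, which is the setting the abstract describes for pre-binned data.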
An alternative to the weighted smoothed polynomial histogram is the penalized smoothed polynomial histogram, which is similar to the smoothed polynomial histogram but places a difference penalty on the coefficients. This type of penalty is simple to implement, yields results equivalent to, if not better than, those of the smoothed polynomial histogram, and can also benefit from the inclusion of a weight matrix. The smoothed polynomial histogram has also been extended to higher dimensions via tensor products of B-splines. In addition to density estimation, these nonparametric techniques can be used to conduct bump hunting and change-point analysis. Future work will explore the effects of adaptive meshes, automatic knot selection, and higher-order derivatives in the penalty on the quality of these local-moment density estimators.
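To make the idea of a difference penalty with a weight matrix concrete, here is a minimal sketch of a generic weighted, difference-penalized least-squares fit. It is not the dissertation's formulation: the objective sum_i w_i (y_i - (Bc)_i)^2 + lam * ||Dc||^2, the function names, and the parameters `B`, `lam`, `weights`, and `order` are assumptions chosen for illustration, with `D` the finite-difference operator of the given order acting on the coefficient vector `c`.

```python
import numpy as np

def difference_matrix(n, order=2):
    """Finite-difference operator of the given order, shape (n - order, n)."""
    return np.diff(np.eye(n), n=order, axis=0)

def penalized_fit(B, y, lam, weights=None, order=2):
    """Minimize sum_i w_i * (y_i - (B c)_i)**2 + lam * ||D c||**2 over c.

    B       : (m, n) basis (design) matrix, assumed given
    y       : (m,) targets, e.g. bin heights or local moments
    lam     : non-negative smoothing parameter
    weights : optional (m,) non-negative weights (diagonal weight matrix)
    """
    B = np.asarray(B, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)

    D = difference_matrix(B.shape[1], order)
    # Normal equations: (B^T W B + lam * D^T D) c = B^T W y
    lhs = B.T @ (w[:, None] * B) + lam * (D.T @ D)
    rhs = B.T @ (w * y)
    return np.linalg.solve(lhs, rhs)
```

With `lam = 0` this reduces to ordinary weighted least squares. As `lam` grows, the coefficients are pulled toward a sequence whose differences of the chosen order vanish, so a second-order penalty leaves an exactly linear coefficient sequence unpenalized.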
Keyword
Statistics