dc.contributor.advisor Hess, Kenneth
dc.creator Lecocke, Michael Louis
dc.date.accessioned 2009-06-04T08:09:34Z
dc.date.available 2009-06-04T08:09:34Z
dc.date.issued 2005
dc.identifier.uri http://hdl.handle.net/1911/18776
dc.description.abstract Motivation. Binary classification is a common problem in many types of research, including clinical applications of gene expression microarrays. This research comprises a large-scale empirical study that rigorously and systematically compares classifiers, in terms of supervised learning methods and both univariate and multivariate feature selection approaches. Other principal areas of investigation involve the use of cross-validation (CV) and how to guard against the effects of optimism and selection bias when assessing candidate classifiers via CV. This is taken into account by ensuring that the feature selection is performed during training of the classification rule at each stage of a CV process ("external CV"), which to date has not been the traditional approach to performing cross-validation. Results. A large-scale empirical comparison study is presented, in which a 10-fold CV procedure is applied internally and externally to a univariate feature selection process as well as two genetic algorithm (GA)-based feature selection processes. These procedures are used in conjunction with six supervised learning algorithms across six published two-class clinical microarray datasets. It was found that external CV generally provided more realistic and honest misclassification error rates than internal CV. Also, although the more sophisticated multivariate feature subset selection (FSS) approaches were able to select gene subsets that went undetected by combinations of genes drawn from even the top-100 univariately ranked gene list, neither of the two GA-based methods led to significantly better 10-fold internal or external CV error rates. Considering all the selection bias estimates together across all subset sizes, learning algorithms, and datasets, the average bias estimates from each of the GA-based methods were roughly 2.5 times those of the univariate-based method.
Ultimately, this research has put to the test the more traditional implementations of the statistical learning aspects of cross-validation and feature selection, and has provided a solid foundation on which these issues can and should be further investigated when performing limited-sample classification studies using high-dimensional gene expression data.
dc.format.extent 200 p.
dc.format.mimetype application/pdf
dc.language.iso eng
dc.subject Statistics
dc.title An empirical study of feature selection in binary classification with DNA microarray data
dc.type.genre Thesis
dc.type.material Text
thesis.degree.department Statistics
thesis.degree.discipline Engineering
thesis.degree.grantor Rice University
thesis.degree.level Doctoral
thesis.degree.name Doctor of Philosophy
dc.identifier.citation Lecocke, Michael Louis. "An empirical study of feature selection in binary classification with DNA microarray data." (2005) Diss., Rice University. http://hdl.handle.net/1911/18776.