Prediction Oriented Marker Selection (PROMISE) for High Dimensional Regression with Application to Personalized Medicine
Scott, David W.
Doctor of Philosophy
In personalized medicine, biomarkers are used to select the therapies with the highest likelihood of success based on a patient's individual biomarker profile. Two important goals of biomarker selection are to choose a small number of important biomarkers that are associated with treatment outcomes and to maintain a high level of prediction accuracy. These goals are challenging because the number of candidate biomarkers can be large compared to the sample size. Established variable selection methods based on penalized regression, such as the lasso and the elastic net, have yielded promising results. However, selecting the right amount of penalization is critical to maintaining the desired properties for both variable selection and prediction accuracy. Cross-validation (CV) is the most commonly used method for selecting the regularization parameter. It tends to provide high prediction accuracy and a high true positive rate, at the cost of a high false positive rate. Conversely, resampling methods such as stability selection (SS) maintain good control of the false positive rate, but at the cost of yielding too few true positives. We propose prediction oriented marker selection (PROMISE), which combines SS with CV to retain the advantages of both methods. We applied PROMISE to (1) the lasso and (2) the elastic net for individual marker selection, (3) the group lasso for pathway selection, and (4) the combination of the group lasso with the lasso for individual marker selection within the selected pathways. Data analyses show that PROMISE produces a sparser solution than CV, reducing the number of false positives while giving similar prediction accuracy and true positives. In our simulations and real data analysis, SS does not work well for variable selection and prediction. PROMISE can be applied in many fields to select regularization parameters when the goals are to minimize both type I and type II errors and to maximize prediction accuracy.