A Bayesian hierarchical model for detecting associations between haplotypes and disease using unphased SNPs
Fox, Garrett Reed
Doctor of Philosophy
This thesis addresses the use of haplotypes to detect disease-predisposing chromosomal regions, based on a Bayesian hierarchical model for case-control data. By utilizing the Stochastic Search Variable Selection (SSVS) procedure of George and McCulloch (1997), the number of parameters is not constrained by the sample size, as it is in frequentist methods. Haplotype information enters the model as estimated haplotype frequencies, which are treated as if they were the true population frequencies. A Bayesian hierarchical probit model was developed by estimating the distribution of haplotype pairs for an individual based on these estimated population frequencies and using SSVS to make decisions about model selection. To date, Bayesian models for haplotype-based case-control data assume either that the haplotypes are known or that haplotypes can be clustered such that every haplotype within a cluster has the same effect on disease status. A simulation was performed to analyze the testing properties of this Bayesian model and to compare it with a popular frequentist method (Schaid, 2002). Both real genotype data from the Dallas Heart Study (DHS) and simulated data were used to study the operating characteristics of the new model. The Bayesian method is shown to have higher power than Schaid's frequentist method when there is a limited number of common haplotypes in a region, a situation that appears to be common (Gabriel, 2002). An approach based on the maximum of the chi-squared statistics at each marker locus performed surprisingly well against both haplotype methods in various cases. These simulations contribute to the ongoing debate on the efficacy of haplotype methods. The most surprising result was the ability of the genotype methods to outperform the haplotype methods in various instances where there were cis-acting interactions. The Bayesian haplotype method performed better in comparison when dealing with low penetrance in highly conserved blocks.
Additionally, a set of simulations was based on a number of genes from the DHS data set with multiple haplotype block regions. This demonstrated the similarities of the haplotype methods and the added flexibility when analyzing posterior distributions. We also demonstrate that interactions between loci in separate blocks can be detected without having interaction terms in the regression model. Future work should focus on more efficient methods of detecting these and other complex interactions.
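
As an illustration of the step in which the distribution of haplotype pairs for an individual is estimated from population haplotype frequencies, the following is a minimal sketch, not the thesis's implementation. It assumes biallelic SNPs coded as minor-allele counts (0, 1, 2), haplotypes written as strings of 0/1 alleles, and Hardy-Weinberg equilibrium for pairing haplotypes; the function name `haplotype_pair_distribution` and the example frequencies are hypothetical.

```python
from itertools import combinations_with_replacement


def haplotype_pair_distribution(genotype, hap_freqs):
    """Return P(haplotype pair | unphased genotype) for all compatible pairs.

    genotype  : sequence of minor-allele counts (0, 1 or 2), one per SNP.
    hap_freqs : dict mapping a haplotype string such as "01" to its
                estimated population frequency (treated as known).
    Assumption (not from the source): haplotypes pair under
    Hardy-Weinberg equilibrium.
    """
    weights = {}
    # Enumerate unordered haplotype pairs, including homozygous pairs.
    for h1, h2 in combinations_with_replacement(sorted(hap_freqs), 2):
        # A pair is compatible if its allele counts match at every SNP.
        if all(int(a) + int(b) == g for a, b, g in zip(h1, h2, genotype)):
            w = hap_freqs[h1] * hap_freqs[h2]
            if h1 != h2:
                w *= 2.0  # two orderings of a heterozygous pair
            weights[(h1, h2)] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("genotype is incompatible with the given haplotypes")
    # Normalize over compatible pairs to obtain a probability distribution.
    return {pair: w / total for pair, w in weights.items()}


# Hypothetical two-SNP example: a double heterozygote is phase-ambiguous.
example_freqs = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}
example_dist = haplotype_pair_distribution((1, 1), example_freqs)
for pair, prob in example_dist.items():
    print(pair, round(prob, 4))
```

In this sketch the double heterozygote is consistent with two phasings, and the pair built from the more frequent haplotypes receives most of the probability mass; a genotype that is homozygous at every SNP resolves to a single pair with probability one.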