Statistics Applications and Innovative Bayesian Statistics Modeling for Precision Medicine in Human Cancer Genomics and Clinical Trials
Doctor of Philosophy
Human cancers are caused by aberrations of multiple biomarkers, and thus the pathogenesis is very complex and inconclusive in terms of individual variability in mutations, copy number alterations (CNAs), methylation, gene expression, 'genomic' environment (e.g., human pathways), immune systems, viruses, and lifestyle for each person. Recent developments in personalized or precision medicine in human cancers, powered by advances in Bayesian inference, statistical machine learning, and big data, have the potential to revamp the "one-size-fits-all" approach in modern medicine by identifying personalized treatment plans for groups of patients characterized by their biomarker makeup. The key behind these developments is to identify the biomarkers in the upstream analysis of cancer genomics data, and then create a biomarker-based precision medicine analysis system/protocol (BPMAP), which can automatically deliver clinical decisions based on the accruing patient data and recommend treatment plans for incoming patients. BPMAP relies on little human effort and can be extended to early-phase clinical trial designs for other complex human diseases, e.g., neurological, aging-related, and cardiovascular diseases. In this thesis, we develop a precision medicine design for early-phase clinical trials based on biomarkers identified from the integrative analysis of human genomic data. To begin with, we develop a simple tool, called the autocorrelation scanning profile, with which we perform data quality control and refine the analysis of copy number data. Data quality is a critical issue in the analysis of DNA copy number alterations obtained from microarrays. It is commonly assumed that copy number alteration data can be modeled as piecewise constant and that the measurement errors of different probes are independent. However, these assumptions do not always hold in practice.
We found that measurement errors are highly correlated between probes that interrogate nearby genomic loci, and that the piecewise constant model does not fit the data well in some published datasets. The correlated errors can cause problems in downstream analysis, leading to a large number of DNA segments falsely identified as having copy number gains and losses. We also provide R code to deal with this quality control problem and apply it to several typical datasets, illustrating its broad applicability to copy number data from different sequencing platforms. Then, we propose a spectral decomposition method, a new framework to delineate the key cancer driver genes from regions encoding a large number of genes, based on CNAs, gene expression, and clinical outcomes. Many methods have been developed to identify driver genes (tumor suppressors and oncogenes) from CNAs, yet the candidate gene sets from different studies overlap very little, even when the data are collected for the same human cancer on the same sequencing platform. After collecting the CNA data from The Cancer Genome Atlas (TCGA) project, we develop a new approach to CNA analysis based on spectral decomposition of the copy number profiles into focal and broad CNAs. The cancer driver genes identified by our method overlap significantly with existing gene sets, and we also identified a number of novel focal regions, such as focal gain of ESR1, focal loss of LASMP, a prognostic site at 3q26.2, and losses of sub-telomere regions in multiple chromosomes. Further analyses are carried out on network modularity and human signaling pathways. Tested on ovarian cancer data, the results demonstrate that spectral decomposition of CNA profiles offers a new way of understanding the role of CNAs in human cancers.
After this, we extend our method to a pan-cancer study: a high-resolution, genome-wide comparative analysis of CNAs in the blood, tumor, and tumor-adjacent tissues of 8,870 patients with 28 types of cancers. We identify genomic hotspots that harbor recurrent focal CNAs in the blood. Detected in more than 13% of the patients, these CNAs represent a unique somatic alteration pattern that is absent or under-represented in the corresponding solid tumors or tumor-adjacent tissues. The occurrence of the blood-specific CNAs was correlated with older patient age, shorter progression-free survival, and elevated immune activities within several types of primary tumors. Therefore, novel tumor-extrinsic systemic elements may influence the development and clinical progression of solid tumors by altering anti-tumor immune responses. After biomarker identification in the upstream analysis of cancer genomics data, we move on to early-phase clinical trial designs with the aim of targeting personalized medicine. First of all, we study the problem of identifying the maximum tolerated dose (MTD) contour (i.e., multiple MTDs) in phase I drug combination trials for patients with homogeneous biomarkers. We propose a new dose-finding design, the waterfall design, to find the MTD contour for drug combination trials. Taking a divide-and-conquer strategy, the waterfall design divides the task of finding the MTD contour into a sequence of one-dimensional dose-finding processes, known as subtrials. The subtrials are conducted sequentially in a certain order, such that the results of each subtrial are used to inform the design of subsequent subtrials. Such information borrowing allows the waterfall design to explore the two-dimensional dose space efficiently using a limited sample size and decreases the chance of overdosing and underdosing patients.
Because doses on the MTD contour may have very different efficacy or synergistic effects owing to drug-drug interaction, we further extend our approach to a phase I/II design with the goal of finding the MTD with the highest efficacy. Simulation studies show that the waterfall design is safer and has a higher probability of identifying the true MTD contour than some existing designs. The R package "BOIN", which implements the waterfall design, is freely available from CRAN. Furthermore, we study the problem of selecting the optimal treatments for groups of patients by considering a short-term binary efficacy outcome in phase II and a long-term continuous efficacy outcome (survival) in phase III clinical trials. Our models capture the biomarker, subgroup, and treatment information, together with their interactions, that is predictive of the observed phase IIb or II/III outcomes. In the phase IIb trial, we model binary tumor response as a function of treatment and biomarker-related subgroup, and model repeatedly measured biological responses, conditional on tumor response, as a function of measurement time, in order to select an optimal treatment for an individual patient or a set of patients (subgroup/cluster). In the phase II/III trials, we first define the acceptable treatment sets based on the short-term binary efficacy and longitudinal outcomes, and then identify the most effective treatment by monitoring patients' long-term efficacy endpoints. A two-stage treatment identification algorithm is proposed to find the personalized optimal treatment for patients with a specific biomarker pattern. Simulation studies show that the proposed design has a higher probability of identifying the true treatment strategy than some existing designs. The source R code is freely available upon request. To sum up, our proposed set of models and algorithms can be used to aid precision medicine design for oncology trials.
With proper adjustment and modeling, our framework and master protocol hold promise for extension to practical biomarker-based precision medicine trial designs for other complex diseases.