|
OR/MS Today - June 2009 O.R. & Biomedicine Computational Biology and Medical Applications Opportunities abound for O.R. in the interplay between medicine, bioengineering and medical physics. By Harvey J. Greenberg, Allen G. Holder, Ming-Ying Leung and Russell Schwartz In recent years the frontiers of biological and medical research have become increasingly dependent on sophisticated modeling, analysis and computational techniques. Operations research (O.R.) and computer science (CS) have emerged as vital tools for investigating complex biological systems. The interaction between O.R. and biomedicine has been fruitful and synergistic. In turn, the biomedical research has spawned new mathematical developments in O.R. and its interfaces with CS. In fall 2007 the INFORMS Journal on Computing initiated the Area of Computational Biology and Medical Applications to recognize the role of O.R. in this exciting frontier and to attract new researchers who can contribute to this revolution. While much could be said, this article is intended to give a brief account of how O.R. is already becoming critical to biomedical research and where we believe great opportunities exist for new contributions and contributors. Our hope is that you, the reader, will see the merit in joining us in this enterprise. We encourage those applying O.R. to biomedicine, whether established practitioners or new contributors, to consider JoC as an outlet for publishing results. Although there are dozens of journals already publishing O.R. applied to this area, not all of them recognize that they are using O.R. and will generally not bring their work to the attention of the broader O.R. community. A primary aim of the JoC Area's research is to help scientists reach more accurate conclusions more quickly and reliably. That is where the modeling and solution techniques of O.R. combine with the algorithm and interface designs of CS. Working in the O.R. tradition of a team effort, we get results. Have a look at recent publications that illustrate a broader spectrum of interests, such as the JoC special issue in 2004 (16:4) and the special issue of Annals of Operations Research in 2006 (148:1). The following sections give a brief introduction to some opportunities for O.R. in biology and medicine.
There is a rich history of applying O.R. to problems in healthcare and medicine. See Brandeau et al. [1] for a modern review and Ozcan [10] for an instructional text. The tradition within healthcare is to use O.R. to support operational and managerial decisions, and this is where the history is most steeped. However, advanced medical technologies and greater sophistication in their use has led to numerous medical procedures that have lent themselves to O.R. Many of these include the interplay between medicine, bio-engineering and medical physics. We discuss the popular topic of optimizing radiotherapy treatments to develop the general theme of applying O.R. and computing to medical procedures. See Ehrgott et al. [5] for a comprehensive review. The fundamental premise behind medical applications is to provide a quality of care within the confines of the application, and the optimal design of a radiotherapy treatment is like any other O.R. application from this overarching perspective. A brief look at the clinical practice is important to understand the multiple ways O.R. is involved. Patient images are used to locate targets and to identify nearby tissues that should not receive high amounts of radiation, called organs at risk. Tissues are delineated on a sequence of 2D images to create a 3D rendering of the segmented anatomy. An oncologist prescribes radiation goals for the targets and upper bounds for the organs at risk, as well as other prescription information. This is used to design a treatment that is typically delivered in daily fractions over a period of a few weeks. There are two places where O.R. plays a role prior to treatment design. First, the disease is commonly viewed through different images, each of which presents different aspects of the disease. Ideally, images are fused via an optimal process so that physicians can better assess the disease's state. Second, the lengthy delineation process that identifies anatomical structures is an image segmentation problem, and such problems have previously benefited from O.R. methods of optimization and modeling. This process is currently managed manually but could benefit significantly from automation, raising many opportunities in the O.R./CS interface.
Clinical practice dictates a trial-and-error technique in many cases. A designer selects beams, optimizes fluence and evaluates the computed treatment. Evaluation often indicates the possibility of improvement, and the designer re-selects beams and repeats the process until satisfaction is reached. The final treatment then undergoes the third phase, which sequences delivery. Each of the phases has been studied as an optimization problem, and the current trend is to tie them together in hopes of optimizing the totality of the design. However, this goal is challenged by many modeling and solving issues, not the least of which is the fact that clinical optimality is not well defined. The individual studies of each phase have positioned the field to address optimizing the coupled processes, and a systematic comparison of the combinations supported by the literature is needed. This is a significant computational challenge, one for which the O.R. community is exceptionally well qualified. Although our discussion has specifically addressed optimizing radiotherapy treatments, the basic themes canvass many medical procedures. Indeed, nearly any technique that is tailored to meet the needs of a specific patient and/or clinic has an underlying design to optimize. Examples outside medical physics include the design of prosthetics, robotic surgery, rehabilitation and the optimal scheduling of vaccines, to name a few. There are also examples in diagnostics, with one of the most well known being the use of separating hyperplanes to distinguish between disease and non-disease or between clinically important sub-types of common diseases. New applications in medicine often require multi-disciplinary research interests since the problems span modeling, solving and analysis. Moreover, harnessing modern clinical technologies and methodologies relies on our ability to compute solutions efficiently and to provide computer-assisted analysis of results. Genetic variations predominantly take the form of single nucleotide polymorphisms (SNPs), which are single DNA bases with two common forms in the genome, of which several million have now been identified. Other differences between our genomes consist of structural variations, in which large pieces of DNA are duplicated, deleted or rearranged in some people relative to others. It has only recently become apparent that these structural variations are far more common than was previously suspected and account for a large part of the genetic variation within our species. Several major initiatives are now underway, such as the International Haplotype Map (see www.hapmap.org) and the 1,000 Genomes Project (see www.1000genomes.org), to further catalog the millions of common human genetic variants and to determine how they are distributed within thousands of individual people and across dozens of human populations. The product of these initiatives is a vast and rapidly growing set of data that needs to be analyzed to identify new variations, to perform basic research into our history as a species, to better characterize the functions and evolution of our genes, and to apply the results to improving human health. These problems depend on advances in computational methods for analyzing large genomic data sets, an area presenting many opportunities for O.R.
Most of our DNA, though, comes in the forms of pairs of chromosomes with one copy of each inherited from the mother and from the father. Genome sequencers from the Venter Institute have only recently determined the first diploid human genome, separating the contributions of the two slightly different copies of each chromosome in a single individual. The problem of reliably reconstructing the two chromosome copies, known as haplotype assembly or diploid assembly, has likewise attracted interest in the O.R. community, but many problems i.e., opportunities remain. The large amounts of genomic variation data becoming available have also revived some "solved" problems from the much older field of phylogenetics (the inference of evolutionary trees). Even the simplest versions of phylogenetics are formally computationally intractable. While practical heuristics have done well with smaller datasets, they are poorly suited to the genome-scale datasets becoming available. Traditional O.R. optimization techniques, such as branch-and-bound methods, have a long history in this field, and more advanced methods are now being applied to solve increasingly harder problem instances. In addition, the availability of genome-scale variation data has given new importance to solving challenging variants of the phylogeny problem. For example, the mixture of densely packed variants and large genomic distances makes it essential to incorporate into phylogenetic models a process called recombination, in which chromosomes evolve in part by swapping genetic material with one another. The resulting problem of inferring phylogenetic networks with recombination, as opposed to the more traditional phylogenetic trees, remains unsolvable in practice. As more sophisticated models of molecular evolution are developed, to account for our greater knowledge of genome variation and how it evolves, there will almost surely be a need for more sophisticated optimization methods to make practical use of these models in phylogenetic applications. An emerging area for O.R. in genomics is the use of optimization and simulation methods to apply knowledge of genetic variation to the diagnosis and treatment of disease. Work in characterizing genetic variation is largely motivated by the hope of finding genetic variants that are statistically associated with disease, which in turn will help us develop diagnostic tests or identify genes that may be targets for drugs. O.R. methods have begun to be used to assist in mining genomic variation datasets for often weak evidence of association with disease, but much remains to be done. A central goal of the genetic variation studies is personalized medicine the idea that a patient's treatment is specific to his/her DNA. This is already happening, revolutionizing clinical practice by enabling doctors to shift from diagnostic and remedial to predictive and preventive medicine. One of the many important effects is bringing more drugs to patients by stratifying the clinical trials, using DNA properties to separate those who benefit from the drug from those who would be harmed. Not only does this provide better treatment to more patients, but it also lowers the cost of each drug since more enter the market. (It costs about $1.4 billion and 12-15 years to bring a drug to market, starting with 10,000 prospects. If two of the prospects get through clinical trials due to stratified testing, that cost can be split between the prices of two drugs, rather than covered by the one drug that makes it.)
While the biological knowledge needed to make personalized medicine a reality is now becoming available, the computational infrastructure does not yet exist. There are significant opportunities for O.R. practitioners familiar with similar decision analysis problems to determine how to guide the design of the clinical trials and how to productively put this decision-making power into the hands of medical doctors in clinical practice. Members of the O.R. community interested in learning more might consult any of a number of recent sources on genome variation and related computational problems. An overview from a computer science perspective of many of the classic computational problems in genome assembly and analysis is given by Gusfield [7]. Felsenstein [6] provides a more detailed summary of the specific issue of phylogenetics and some of the complications in practical phylogeny inference. More recent computational problems in genetic variation analysis were surveyed by Halldórsson et al. [8] One may also refer to Pennisi et al. [12] for a survey aimed at the general scientist on the biological basis of genetic variation studies and their medical importance, with many helpful links to other resources and seminal studies in the area.
DNA and RNA nucleotide-base sequences are most often compared to random sequences generated from a four-letter alphabet representing the four nucleotides. The amino acid sequence for proteins is from a 20-letter alphabet representing the 20 distinct amino acid residues. Sequence models based on alternative alphabets reflecting the physical and chemical properties of the nucleic acids and amino acids (e.g., strengths of the hydrogen bonds between complementary nucleotides, the electrostatic charges on the amino acids) are also used. In the simplest case, the models are represented as a sequence of independent and identically distributed random variables. In more sophisticated representations, these are Markov-type models, including higher order and hidden Markov chains. From these random sequence models, general scoring schemes and assessment methods for statistical significance of molecular sequence features are derived. In particular, the limit distribution of the maximal non-aligned two-sequence segmental score has served as the statistical foundation for the widely used BLAST software [9] for molecular sequence database searching.
Many other applications of Markov chain sequence models use information about short strings of letters, called words, in the sequences. For an alphabet of size s, there are sk distinct words of length k. The underlying Markov model provides a theoretical framework for deriving occurrence frequencies and distributions of various individual words or groups of words with a special pattern. For example, the palindrome pattern in DNA and RNA, which is involved in molecular binding and folding activities, has been studied for a variety of biological problems including gene expression and regulation, viral genome replication, etc. While specific investigations differ, we note that they often reduce to a problem of finding suitable statistical distributions, such as the Poisson and normal types of distributions, to approximate the behavior of the random variables or random processes of interest. The book by Deonier et al. [2] gives an excellent introduction to the analysis of word distribution and occurrences and their applications in DNA mapping and signal recognition. Although obtaining nucleotide and amino acid sequences has become routine, capturing the variation of the functional characteristics at different parts of the sequence is not always straightforward. For example, it is generally difficult to distinguish whether a DNA segment codes for a protein or not, or whether or not an amino acid sequence segment belongs to a DNA binding domain. These problems have motivated the use of hidden Markov models [4] in which the functional states are typically hidden, and the chain of observable emitted symbols is the nucleotide or amino acid sequence itself. Profile hidden Markov models for sequence families have also been applied successfully to obtain optimal and suboptimal sequence alignments that are useful for predicting protein structures or inferring phylogenetic relationships. In order to understand the functions of cells, tissues, organs, etc., from the analysis of the basic biomolecules like DNA, RNA and proteins, one needs to know what genes are expressed under different physiological conditions. With the development of microarray technology (see [3, Chapter 11] for a brief introduction) during the past two decades, expression levels of thousands of genes can be measured simultaneously, thus creating the possibilities of investigating and comparing entire genomes or proteomes. These studies have contributed greatly to recent biomedical advances in cancer and pharmaceutical research, and they have spurred the development of a myriad of new statistical and computational tools to deal with high-dimensional data with large numbers of variables but small sample sizes. Among these, Bayesian modeling and inference, data-mining techniques like classification and clustering, as well as optimization-based machine learning techniques, such as support vector machines, are popular approaches. Microarray technology has empowered scientists to acquire extensive information on gene activities with relative ease. However, the genes on the DNA and the proteins they code for do not work independently, but rather in highly dynamic and interactive genetic and biochemical networks. Among various stochastic models used in modeling complex genetic regulatory networks, the probabilistic Boolean network has probably received the most attention. In this model, the state (on or off) of a gene is represented by a Boolean variable and the interactions among genes are represented by Boolean functions. The biological basis for such a formulation rests in the observation that during regulation of functional states, the cell exhibits switch-like behavior. This allows coarse-grained properties of large genetic networks and interactions of genes to be studied with limited quantitative biochemical details. Steady-state behavior of probabilistic Boolean network models and the effect of perturbations can be studied as optimal control problems in the framework of Markov chains [3]. A classical text is by Waterman [13], and a more recent one is by Wilkinson [14]. Those working in stochastic processes will find familiar foundations for applications to computational biology.
Metabolic pathways use enzymes (proteins that catalyze the reaction) to convert metabolites (small organic compounds). These are like factories and come in different varieties. Science Daily [Aug 16, 2006] reports:
O.R. contributes to the analysis of extreme pathways, using linear programming for the steady-state stoichiometric equation, Sv = 0, where S is an mxn matrix whose coefficients describe the rate of each of m metabolites used or produced in each of n reactions, and v is the flux. Extreme rays of the polyhedral cone reveal how pathways are composed and combined. Further, there are different normalizations for v and several objectives. For the human mitochondria, objectives include ATP production, heme biosynthesis and mixed phospholipid biosynthesis. This field has been benefitting from O.R. models, ranging from linear programming to multiple, nonlinear objectives and binary variables. Databases and software have only recently been developed to catalog and annotate the pathways. Kinetics, from estimated reaction rates, are fundamental for modeling the behavior of complicated networks for applications such as drug design. Standard models and theory from O.R., such as continuous-time Markov models and queueing theory, are (under-) used to understand them, yet they have become central to the simulation of complicated reaction networks. Add artificial intelligence methods, such as reasoning about missing (unknown) parts of a pathway, and we have an O.R. arsenal of models, methods, and analysis techniques. Indeed, the O.R. community is ready to contribute to breakthroughs, and you can be part of that.
Allen Holder is an associate professor of Mathematics at the Rose-Hulman Institute of Technology and area editor of Computational Biology and Medical Applications for the INFORMS Journal on Computing. Ming-Ying Leung is a professor of Mathematical Sciences and director of the Bioinformatics Program at The University of Texas at El Paso and associate editor for Computational Biology and Medical Applications for the INFORMS Journal on Computing. Russell Schwartz is an associate professor of Biological Sciences at Carnegie Mellon University and associate editor for Computational Biology and Medical Applications for the INFORMS Journal on Computing. References
OR/MS Today copyright © 2009 by the Institute for Operations Research and the Management Sciences. All rights reserved. Lionheart Publishing, Inc. 506 Roswell Rd., Suite 220, Marietta, GA 30060 USA Phone: 770-431-0867 | Fax: 770-432-6969 E-mail: lpi@lionhrtpub.com URL: http://www.lionhrtpub.com Web Site © Copyright 2009 by Lionheart Publishing, Inc. All rights reserved. |