OR/MS Today - June 2009



O.R. & Biomedicine


Computational Biology and Medical Applications

Opportunities abound for O.R. in the interplay between medicine, bioengineering and medical physics.

By Harvey J. Greenberg, Allen G. Holder, Ming-Ying Leung and Russell Schwartz


In recent years the frontiers of biological and medical research have become increasingly dependent on sophisticated modeling, analysis and computational techniques. Operations research (O.R.) and computer science (CS) have emerged as vital tools for investigating complex biological systems. The interaction between O.R. and biomedicine has been fruitful and synergistic. In turn, the biomedical research has spawned new mathematical developments in O.R. and its interfaces with CS.

In fall 2007 the INFORMS Journal on Computing initiated the Area of Computational Biology and Medical Applications to recognize the role of O.R. in this exciting frontier and to attract new researchers who can contribute to this revolution. While much could be said, this article is intended to give a brief account of how O.R. is already becoming critical to biomedical research and where we believe great opportunities exist for new contributions and contributors. Our hope is that you, the reader, will see the merit in joining us in this enterprise. We encourage those applying O.R. to biomedicine, whether established practitioners or new contributors, to consider JoC as an outlet for publishing results. Although there are dozens of journals already publishing O.R. applied to this area, not all of them recognize that they are using O.R. and will generally not bring their work to the attention of the broader O.R. community.

A primary aim of the JoC Area's research is to help scientists reach more accurate conclusions more quickly and reliably. That is where the modeling and solution techniques of O.R. combine with the algorithm and interface designs of CS. Working in the O.R. tradition of a team effort, we get results. Have a look at recent publications that illustrate a broader spectrum of interests, such as the JoC special issue in 2004 (16:4) and the special issue of Annals of Operations Research in 2006 (148:1). The following sections give a brief introduction to some opportunities for O.R. in biology and medicine.

operations research and management science
Figure 1. Variations in the DNA sequences of different people can lead to altered structure or expression of proteins, influencing disease risk. Source: U.S. Department of Energy Genome Programs (http://genomics.energy.gov).

O.R. in Healthcare and Medicine


There is a rich history of applying O.R. to problems in healthcare and medicine. See Brandeau et al. [1] for a modern review and Ozcan [10] for an instructional text. The tradition within healthcare is to use O.R. to support operational and managerial decisions, and this is where the history is most steeped. However, advanced medical technologies and greater sophistication in their use has led to numerous medical procedures that have lent themselves to O.R. Many of these include the interplay between medicine, bio-engineering and medical physics. We discuss the popular topic of optimizing radiotherapy treatments to develop the general theme of applying O.R. and computing to medical procedures. See Ehrgott et al. [5] for a comprehensive review.

The fundamental premise behind medical applications is to provide a quality of care within the confines of the application, and the optimal design of a radiotherapy treatment is like any other O.R. application from this overarching perspective. A brief look at the clinical practice is important to understand the multiple ways O.R. is involved. Patient images are used to locate targets and to identify nearby tissues that should not receive high amounts of radiation, called organs at risk. Tissues are delineated on a sequence of 2D images to create a 3D rendering of the segmented anatomy. An oncologist prescribes radiation goals for the targets and upper bounds for the organs at risk, as well as other prescription information. This is used to design a treatment that is typically delivered in daily fractions over a period of a few weeks.

There are two places where O.R. plays a role prior to treatment design. First, the disease is commonly viewed through different images, each of which presents different aspects of the disease. Ideally, images are fused via an optimal process so that physicians can better assess the disease's state. Second, the lengthy delineation process that identifies anatomical structures is an image segmentation problem, and such problems have previously benefited from O.R. methods of optimization and modeling. This process is currently managed manually but could benefit significantly from automation, raising many opportunities in the O.R./CS interface.

   operations research management science

Figure 2: Modern technology, such as the radiotherapy treatment system depicted, supports increasingly complicated medical procedures that benefit from sophisticated O.R. and computing to harness the technology's capabilities. Source: Medicalphysicsweb (http://medicalphysicsweb.org).
The 3D partition of the anatomy and the prescription information are used to design a patient-specific treatment. Treatment design is divided into three phases, each of which is associated with an optimization process. The first phase is to select the pathways (beams) through the anatomy along which the radiation will be delivered. Although many commercial systems allow this to be automated, these methods are generally not trusted and are typically ignored. Instead, beams are manually selected with sophisticated imaging software. The second phase calculates the amount of radiation, called "fluence," that is to be delivered along each of the selected beams. This problem is solved by an optimization algorithm, although models and solution methods vary among commercial systems. The third phase decides how to deliver the treatment as well as possible. This is automated, but not optimized, in all commercial systems.

Clinical practice dictates a trial-and-error technique in many cases. A designer selects beams, optimizes fluence and evaluates the computed treatment. Evaluation often indicates the possibility of improvement, and the designer re-selects beams and repeats the process until satisfaction is reached. The final treatment then undergoes the third phase, which sequences delivery. Each of the phases has been studied as an optimization problem, and the current trend is to tie them together in hopes of optimizing the totality of the design. However, this goal is challenged by many modeling and solving issues, not the least of which is the fact that clinical optimality is not well defined. The individual studies of each phase have positioned the field to address optimizing the coupled processes, and a systematic comparison of the combinations supported by the literature is needed. This is a significant computational challenge, one for which the O.R. community is exceptionally well qualified.

Although our discussion has specifically addressed optimizing radiotherapy treatments, the basic themes canvass many medical procedures. Indeed, nearly any technique that is tailored to meet the needs of a specific patient and/or clinic has an underlying design to optimize. Examples outside medical physics include the design of prosthetics, robotic surgery, rehabilitation and the optimal scheduling of vaccines, to name a few. There are also examples in diagnostics, with one of the most well known being the use of separating hyperplanes to distinguish between disease and non-disease or between clinically important sub-types of common diseases. New applications in medicine often require multi-disciplinary research interests since the problems span modeling, solving and analysis. Moreover, harnessing modern clinical technologies and methodologies relies on our ability to compute solutions efficiently and to provide computer-assisted analysis of results.

O.R. in Genomics


Just as O.R. has proven its value in applied medical contexts, it is likewise emerging as a crucial component of basic research in modern biology. One especially important application area is the study of genetic variations, which are the small differences that distinguish the genome of any one person from any other. Since the landmark production of the first human genome sequences, much attention in human genomics has turned to identifying how individual people differ from these canonical (consensus) genomes across human populations.

Genetic variations predominantly take the form of single nucleotide polymorphisms (SNPs), which are single DNA bases with two common forms in the genome, of which several million have now been identified. Other differences between our genomes consist of structural variations, in which large pieces of DNA are duplicated, deleted or rearranged in some people relative to others. It has only recently become apparent that these structural variations are far more common than was previously suspected and account for a large part of the genetic variation within our species.

Several major initiatives are now underway, such as the International Haplotype Map (see www.hapmap.org) and the 1,000 Genomes Project (see www.1000genomes.org), to further catalog the millions of common human genetic variants and to determine how they are distributed within thousands of individual people and across dozens of human populations. The product of these initiatives is a vast and rapidly growing set of data that needs to be analyzed to identify new variations, to perform basic research into our history as a species, to better characterize the functions and evolution of our genes, and to apply the results to improving human health. These problems depend on advances in computational methods for analyzing large genomic data sets, an area presenting many opportunities for O.R.

   operations research management science

Figure 3: SNPs genotyped from 20 chromosomes across a region of chromosome 21, with the inset revealing their organization into a few common haplotypes across individuals. Source: Science, Vol. 294, No. 5,547, pp. 1,719-1,723.
One area in which O.R. methods have already proven valuable is in solving new computational problems arising from modern genome sequence determination. Shotgun sequencing — the technology used by Celera Genomics in generating one of the two initial consensus human genome sequences and now the preferred method for all large sequencing projects — depends on solving a large-scale graph optimization problem (a variant of the traveling salesman problem) to assemble a genome from millions of small fragments.

Most of our DNA, though, comes in the forms of pairs of chromosomes with one copy of each inherited from the mother and from the father. Genome sequencers from the Venter Institute have only recently determined the first diploid human genome, separating the contributions of the two slightly different copies of each chromosome in a single individual. The problem of reliably reconstructing the two chromosome copies, known as haplotype assembly or diploid assembly, has likewise attracted interest in the O.R. community, but many problems — i.e., opportunities — remain.

The large amounts of genomic variation data becoming available have also revived some "solved" problems from the much older field of phylogenetics (the inference of evolutionary trees). Even the simplest versions of phylogenetics are formally computationally intractable. While practical heuristics have done well with smaller datasets, they are poorly suited to the genome-scale datasets becoming available. Traditional O.R. optimization techniques, such as branch-and-bound methods, have a long history in this field, and more advanced methods are now being applied to solve increasingly harder problem instances.

In addition, the availability of genome-scale variation data has given new importance to solving challenging variants of the phylogeny problem. For example, the mixture of densely packed variants and large genomic distances makes it essential to incorporate into phylogenetic models a process called recombination, in which chromosomes evolve in part by swapping genetic material with one another. The resulting problem of inferring phylogenetic networks with recombination, as opposed to the more traditional phylogenetic trees, remains unsolvable in practice. As more sophisticated models of molecular evolution are developed, to account for our greater knowledge of genome variation and how it evolves, there will almost surely be a need for more sophisticated optimization methods to make practical use of these models in phylogenetic applications.

An emerging area for O.R. in genomics is the use of optimization and simulation methods to apply knowledge of genetic variation to the diagnosis and treatment of disease. Work in characterizing genetic variation is largely motivated by the hope of finding genetic variants that are statistically associated with disease, which in turn will help us develop diagnostic tests or identify genes that may be targets for drugs. O.R. methods have begun to be used to assist in mining genomic variation datasets for often weak evidence of association with disease, but much remains to be done. A central goal of the genetic variation studies is personalized medicine — the idea that a patient's treatment is specific to his/her DNA. This is already happening, revolutionizing clinical practice by enabling doctors to shift from diagnostic and remedial to predictive and preventive medicine.

One of the many important effects is bringing more drugs to patients by stratifying the clinical trials, using DNA properties to separate those who benefit from the drug from those who would be harmed. Not only does this provide better treatment to more patients, but it also lowers the cost of each drug since more enter the market. (It costs about $1.4 billion and 12-15 years to bring a drug to market, starting with 10,000 prospects. If two of the prospects get through clinical trials due to stratified testing, that cost can be split between the prices of two drugs, rather than covered by the one drug that makes it.)

operations research management science
Figure 4: Many drugs are effective due to a therapeutic advantage, shown in the figure as the difference between the curves. The therapeutic advantage varies between individuals and between strains of the disease. Source: Felix W. Frueh, CDER, FDA, 2005.

While the biological knowledge needed to make personalized medicine a reality is now becoming available, the computational infrastructure does not yet exist. There are significant opportunities for O.R. practitioners familiar with similar decision analysis problems to determine how to guide the design of the clinical trials and how to productively put this decision-making power into the hands of medical doctors in clinical practice.

Members of the O.R. community interested in learning more might consult any of a number of recent sources on genome variation and related computational problems. An overview from a computer science perspective of many of the classic computational problems in genome assembly and analysis is given by Gusfield [7]. Felsenstein [6] provides a more detailed summary of the specific issue of phylogenetics and some of the complications in practical phylogeny inference. More recent computational problems in genetic variation analysis were surveyed by Halldórsson et al. [8] One may also refer to Pennisi et al. [12] for a survey aimed at the general scientist on the biological basis of genetic variation studies and their medical importance, with many helpful links to other resources and seminal studies in the area.

Stochastic Models


Stochastic models, notably Markov chains, have formed one of the foundations for computational biology and related medical applications. DNA, RNA and amino acid sequences resemble juxtapositions of random and non-random letter sequences from a finite alphabet. It is a generally accepted premise that nucleotide- and amino acid-base sequences at the biologically important sites, which are associated with structural conservation, functional significance or evolutionary relations, have characteristics distinct from random letter sequences. In order to help identify unusual patterns within a single sequence or significant similarities among multiple sequences that might have biological relevance, real data sequences are often compared with random letter sequence models.

DNA and RNA nucleotide-base sequences are most often compared to random sequences generated from a four-letter alphabet representing the four nucleotides. The amino acid sequence for proteins is from a 20-letter alphabet representing the 20 distinct amino acid residues. Sequence models based on alternative alphabets reflecting the physical and chemical properties of the nucleic acids and amino acids (e.g., strengths of the hydrogen bonds between complementary nucleotides, the electrostatic charges on the amino acids) are also used. In the simplest case, the models are represented as a sequence of independent and identically distributed random variables. In more sophisticated representations, these are Markov-type models, including higher order and hidden Markov chains.

From these random sequence models, general scoring schemes and assessment methods for statistical significance of molecular sequence features are derived. In particular, the limit distribution of the maximal non-aligned two-sequence segmental score has served as the statistical foundation for the widely used BLAST software [9] for molecular sequence database searching.

operations research management science
Figure 5: Hidden Markov model for a DNA sequence motif allowing for insertions and deletions in a three-base consensus sequence. Source: H.J. Greenberg course notes, 2003, based on [4].

Many other applications of Markov chain sequence models use information about short strings of letters, called words, in the sequences. For an alphabet of size s, there are sk distinct words of length k. The underlying Markov model provides a theoretical framework for deriving occurrence frequencies and distributions of various individual words or groups of words with a special pattern. For example, the palindrome pattern in DNA and RNA, which is involved in molecular binding and folding activities, has been studied for a variety of biological problems including gene expression and regulation, viral genome replication, etc. While specific investigations differ, we note that they often reduce to a problem of finding suitable statistical distributions, such as the Poisson and normal types of distributions, to approximate the behavior of the random variables or random processes of interest. The book by Deonier et al. [2] gives an excellent introduction to the analysis of word distribution and occurrences and their applications in DNA mapping and signal recognition.

Although obtaining nucleotide and amino acid sequences has become routine, capturing the variation of the functional characteristics at different parts of the sequence is not always straightforward. For example, it is generally difficult to distinguish whether a DNA segment codes for a protein or not, or whether or not an amino acid sequence segment belongs to a DNA binding domain. These problems have motivated the use of hidden Markov models [4] in which the functional states are typically hidden, and the chain of observable emitted symbols is the nucleotide or amino acid sequence itself. Profile hidden Markov models for sequence families have also been applied successfully to obtain optimal and suboptimal sequence alignments that are useful for predicting protein structures or inferring phylogenetic relationships.

In order to understand the functions of cells, tissues, organs, etc., from the analysis of the basic biomolecules like DNA, RNA and proteins, one needs to know what genes are expressed under different physiological conditions. With the development of microarray technology (see [3, Chapter 11] for a brief introduction) during the past two decades, expression levels of thousands of genes can be measured simultaneously, thus creating the possibilities of investigating and comparing entire genomes or proteomes. These studies have contributed greatly to recent biomedical advances in cancer and pharmaceutical research, and they have spurred the development of a myriad of new statistical and computational tools to deal with high-dimensional data with large numbers of variables but small sample sizes. Among these, Bayesian modeling and inference, data-mining techniques like classification and clustering, as well as optimization-based machine learning techniques, such as support vector machines, are popular approaches.

Microarray technology has empowered scientists to acquire extensive information on gene activities with relative ease. However, the genes on the DNA and the proteins they code for do not work independently, but rather in highly dynamic and interactive genetic and biochemical networks. Among various stochastic models used in modeling complex genetic regulatory networks, the probabilistic Boolean network has probably received the most attention. In this model, the state (on or off) of a gene is represented by a Boolean variable and the interactions among genes are represented by Boolean functions. The biological basis for such a formulation rests in the observation that during regulation of functional states, the cell exhibits switch-like behavior. This allows coarse-grained properties of large genetic networks and interactions of genes to be studied with limited quantitative biochemical details. Steady-state behavior of probabilistic Boolean network models and the effect of perturbations can be studied as optimal control problems in the framework of Markov chains [3].

A classical text is by Waterman [13], and a more recent one is by Wilkinson [14]. Those working in stochastic processes will find familiar foundations for applications to computational biology.

Systems Biology


The bioscience community has started taking a systems approach to account for interactions among many components in a biological system; such interactions are often modeled as networks — see Palsson [11] for a comprehensive introduction. The systems approach is ready for new O.R. applications, bringing to bear decades of work on network flows, combinatorial optimization and dealing with uncertainty.

   operations research management science

Figure 6: Systems biology uses networks to explain the complex behavior exhibited by cellular processes. Source: U.S. Department of Energy Genome Programs (http://genomics.energy.gov).
Two primary types of systems are cell-signaling and metabolic pathways. A signaling molecule, like a hormone, sends a signal to a receptor by converting one physical or chemical form of the signal to another (called signal transduction). For example, signaling often works through cascades of phosphorylation, in which the signal is transmitted by special proteins called kinases that attach phosphate groups to their targets, altering their behavior. These targets may themselves be kinases activated by the phosphates to phosphorylate further targets, transmitting and amplifying a signal until it results in some functional change. There are many reasons for signaling, including programmed cell death. Regulation is another, and mixed-integer programming models have proven valuable to infer optimal pathway properties, such as one of minimum length to produce some particular metabolite. We may also want to block a pathway while minimizing "side effects," such as not blocking other pathways.

Metabolic pathways use enzymes (proteins that catalyze the reaction) to convert metabolites (small organic compounds). These are like factories and come in different varieties. Science Daily [Aug 16, 2006] reports:

Toxoplasma gondii is one nasty bug. A microscopic parasite, it lives in the intestinal tract of cats but can be carried by most warm-blooded animals. In humans, it can harm or even kill a developing fetus, and it can as well sicken those with compromised immune systems, such as AIDS patients. Now, for the first time, cellular biologists at the University of Georgia and the University of Pennsylvania have shown that fatty acid synthesis in T. gondii is essential for the parasite's survival. The discovery could lead to the development of new drugs to make the parasite's effects much less troublesome in both humans and animals.

O.R. contributes to the analysis of extreme pathways, using linear programming for the steady-state stoichiometric equation, Sv = 0, where S is an mxn matrix whose coefficients describe the rate of each of m metabolites used or produced in each of n reactions, and v is the flux. Extreme rays of the polyhedral cone reveal how pathways are composed and combined. Further, there are different normalizations for v and several objectives. For the human mitochondria, objectives include ATP production, heme biosynthesis and mixed phospholipid biosynthesis. This field has been benefitting from O.R. models, ranging from linear programming to multiple, nonlinear objectives and binary variables.

Databases and software have only recently been developed to catalog and annotate the pathways. Kinetics, from estimated reaction rates, are fundamental for modeling the behavior of complicated networks for applications such as drug design. Standard models and theory from O.R., such as continuous-time Markov models and queueing theory, are (under-) used to understand them, yet they have become central to the simulation of complicated reaction networks. Add artificial intelligence methods, such as reasoning about missing (unknown) parts of a pathway, and we have an O.R. arsenal of models, methods, and analysis techniques. Indeed, the O.R. community is ready to contribute to breakthroughs, and you can be part of that.



Harvey Greenberg (hjgreenberg@gmail.com) is professor emeritus of Mathematical and Statistical Sciences at the University of Colorado Denver and the founding editor of the INFORMS Journal on Computing.

Allen Holder is an associate professor of Mathematics at the Rose-Hulman Institute of Technology and area editor of Computational Biology and Medical Applications for the INFORMS Journal on Computing.

Ming-Ying Leung is a professor of Mathematical Sciences and director of the Bioinformatics Program at The University of Texas at El Paso and associate editor for Computational Biology and Medical Applications for the INFORMS Journal on Computing.

Russell Schwartz is an associate professor of Biological Sciences at Carnegie Mellon University and associate editor for Computational Biology and Medical Applications for the INFORMS Journal on Computing.


References


  1. M. Brandeau, F. Sainfort, and W. Pierskalla, editors, 2004, "Operations Research and Health Care: A Handbook of Methods and Applications," Kluwer Academic Publishers, Norwell, Mass.
  2. R. C. Deonier, S. Tavare, and M.S. Waterman, 2005, "Computational Genome Analysis: An Introduction," Springer, New York, N.Y.
  3. R. Dougherty, I. Shmulevich, J. Chen, and Z.J. Wang, 2005, "Genomic Signal Processing and Statistics," Hindawi, New York, N.Y.
  4. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, 1998, "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids," Cambridge University Press, Cambridge, Mass.
  5. M. Ehrgott, C. Güler, H. Hamacher, and L. Shao, 2008, "Mathematical optimization in intensity modulated radiation therapy, 4OR — A Quarterly Journal of Operations Research, Vol. 6, No. 3, pp., 199-262.
  6. J. Felsenstein, 2004, "Inferring Phylogenies," Sinauer and Associates, Sunderland, Mass.
  7. D. Gus?eld, 1997, "Algorithms on Strings, Trees and Sequences," Cambridge University Press, Cambridge, U.K.
  8. B. Halldórsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, and S. Istrail, 2003, "Combinatorial problems arising in SNP and haplotype analysis," in C. Calude, M. J. Dinneen, and V. Vajnovszki, editors, "Discrete Mathematics and Theoretical Computer Science, Proceedings of the 4th International Conference" (LNCS 2731), Springer-Verlag, Berlin/Heidelberg, pp. 26 — 47.
  9. S. Karlin and S. F. Altschul, 1990, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Science, Vol. 87, No. 6, pp. 2,264-2,268.
  10. Y. Ozcan, 2005, "Quantitative Methods in Health Care Management: Techniques and Applications," Jossey-Bass/Wiley, San Francisco, Calif.
  11. B. Ø. Palsson, 2006, "Systems Biology: Properties of Reconstructed Networks," Cambridge University Press, New York, N.Y.
  12. E. Pennisi, 2007, "Breakthrough of the year: Human genetic variation," Science, Vol. 318(5858), pp. 1,842 — 1,843.
  13. M. S. Waterman, 1995, "Introduction to Computational Biology: Maps, Sequences and Genomes," Chapman & Hall/CRC, Boca Raton, Fla.
  14. D. Wilkinson, 2006, "Stochastic Modeling for Systems Biology," Chapman &/CRC, Boca Raton, Fla.





  • Table of Contents
  • OR/MS Today Home Page


    OR/MS Today copyright © 2009 by the Institute for Operations Research and the Management Sciences. All rights reserved.


    Lionheart Publishing, Inc.
    506 Roswell Rd., Suite 220, Marietta, GA 30060 USA
    Phone: 770-431-0867 | Fax: 770-432-6969
    E-mail: lpi@lionhrtpub.com
    URL: http://www.lionhrtpub.com


    Web Site © Copyright 2009 by Lionheart Publishing, Inc. All rights reserved.