Evolutionary Genomics
Statistical and Computational Methods
Second Edition

Edited by
Maria Anisimova
Institute of Applied Simulations, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland
Swiss Institute of Bioinformatics, Lausanne, Switzerland

Methods in Molecular Biology 1910

Series Editor
John M. Walker
School of Life and Medical Sciences, University of Hertfordshire, Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes: http://www.springer.com/series/7651

ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-4939-9073-3
ISBN 978-1-4939-9074-0 (eBook)
https://doi.org/10.1007/978-1-4939-9074-0

This book is an open access publication.

© The Editor(s) (if applicable) and The Author(s) 2012, 2019.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made. The images or other third party material in this book are included in the book’s Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the book’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.

Preface

This volume is a thoroughly revised second edition of Evolutionary Genomics: Statistical and Computational Methods published in 2012. Like the first edition, the new volume includes comprehensive reviews of the most recent and fundamental developments in bioinformatics methods for evolutionary genomics and related challenges associated with increasing data size, heterogeneity, and its inherent complexity. Throughout the volume, prominent authors address the challenge of analyzing and understanding the dynamics of complex biological systems, and elaborate on some promising strategies that would bring us closer to the ultimate “holy grail” of biology: uncovering the relationships between genotype and phenotype.
Consequently, the presented collection of peer-reviewed articles also represents a synergy between theoretical and experimental scientists from a range of disciplines, working together towards a common goal. Once again, the revised volume reiterates the power of taking an evolutionary approach to study molecular data.

This book is intended for scientists looking for a compact overview of the cutting-edge statistical and computational methods in evolutionary genomics. The volume may serve as a comprehensive guide for both graduate and advanced undergraduate students planning to specialize in genomics and bioinformatics. Equally, the volume should be helpful for experienced researchers entering genomics from more fundamental disciplines, such as statistics, computer science, physics, and biology. In other words, the material presented here should suit both a novice in biology with strong statistics and computational skills and a molecular biologist with a good grasp of standard mathematical concepts.

To cater to differences in reader backgrounds, Part I is composed of educational primers to help with fundamental concepts in genome biology (Chapter 1), probability and statistics (Chapter 2), and molecular evolution (Chapter 3). As these concepts reappear repeatedly throughout the book, the first three chapters will help the neophyte to stay “afloat”. The exercises and questions offered at the end of each chapter serve to deepen the understanding of the material.

Part II of this volume focuses on sequence homology and alignment, from aligning whole genomes (Chapter 4) to disentangling orthologs, paralogs, and transposable elements (Chapters 5 and 6). Part III includes chapters on phylogenetic methods to study genome evolution. Chapter 7 presents multispecies coalescent methods for reconciling phylogenetic discord between gene and species trees.
However, a mathematically convenient “binary tree” model does not always stand up to scrutiny, as numerous evolutionary processes act in reticulate (network-like) fashion, complicating the statistical description of evolutionary models and increasing computational complexity, often to prohibitive levels. One simplification is to assume that some molecular sequence units (genes, gene segments) still evolve in a treelike manner. If so, Chapter 8 describes one practical approach to meaningfully summarize the binary tree distributions for a set of genomes as a “forest of trees”. Alternatively, network-like phylogenetic relationships can be represented by graphs (Chapter 9). Dating methods for genome-scale data are discussed in Chapter 10, while Chapter 11 provides more examples of non-treelike processes in a comparative review of genome evolution in different breeding systems.

By disentangling different evolutionary forces acting on genomes, we hope to understand the origins of biological innovation, which is often thought to be coupled with natural selection. After all, how do we explain that, in the words of Darwin, “from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved”? This is the main topic of Part IV, which discusses the methodology for evaluating selective pressures on genomic sequences (Chapters 12–14) and genomic evolution in light of protein domain architecture and transposable elements (Chapters 15 and 16).

Part V of this book is dedicated to population genomics and other omics, with example applications to disease. Indeed, as evolution starts in populations, there is much interest in generating and studying population genome data for a wide range of species. Chapter 17 discusses models for genetic architectures of complex disease and genome-wide association studies for finding susceptibility variants. Chapter 18 reviews approaches to study ancestral population genomics.
Chapters 19, 20, and 21 illustrate first principles of analyzing environmental sequences and applications to clinical trials and systems genetics. Finally, Part VI concludes the book by discussing current bottlenecks in handling and analyzing genomic data. Chapter 22 focuses on challenges and approaches for large and complex data representation and simultaneous querying of heterogeneous databases. Chapter 23 makes the case for using efficient high-performance computing strategies for computationally demanding phylogenetic analyses, in particular in the Bayesian framework. Solutions for scalable workflows and sharing programming resources are presented in Chapters 24 and 25.

On behalf of all authors, I hope that this book will become a source of inspiration and new ideas for our readers. I wish you pleasant reading!

Wädenswil, Switzerland
Lausanne, Switzerland
Maria Anisimova

Acknowledgements

This renewed edition of Evolutionary Genomics: Statistical and Computational Methods is a result of a dedicated effort by 94 co-authors of the book representing research institutions from nearly two dozen different countries. Special thanks go to almost 50 independent reviewers whose constructive and detailed comments have greatly contributed to improving the overall quality of the book chapters and the clarity of the presentation. As for the first edition of this book, the cover image was made by the author of Chapter 6 and a talented photography artist, Wojciech Makałowski, from the University of Münster, Germany. By mutual agreement between all authors of the book, all chapters are available Open Access. Swiss Institute of Bioinformatics (SIB) and Zurich University of Applied Sciences (ZHAW) have generously contributed to cover a part of the Open Access publication fees.
Finally, I would like to thank my colleagues at the Institute of Applied Simulations and the School of Life Sciences and Facility Management of ZHAW (Zurich University of Applied Sciences) as well as my family for their support and encouragement.

Contents

Preface
Acknowledgements
Contributors

PART I  INTRODUCTION: BIOINFORMATICIAN’S PRIMERS

1 Introduction to Genome Biology and Diversity
   Noor Youssef, Aidan Budd, and Joseph P. Bielawski
2 Probability, Statistics, and Computational Science
   Niko Beerenwinkel and Juliane Siebourg
3 A Not-So-Long Introduction to Computational Molecular Evolution
   Stéphane Aris-Brosou and Nicolas Rodrigue

PART II  GENOMIC ALIGNMENT AND HOMOLOGY INFERENCE

4 Whole-Genome Alignment
   Colin N. Dewey
5 Inferring Orthology and Paralogy
   Adrian M. Altenhoff, Natasha M. Glover, and Christophe Dessimoz
6 Transposable Elements: Classification, Identification, and Their Use As a Tool For Comparative Genomics
   Wojciech Makałowski, Valer Gotea, Amit Pande, and Izabela Makałowska

PART III  PHYLOGENOMICS AND GENOME EVOLUTION

7 Modern Phylogenomics: Building Phylogenetic Trees Using the Multispecies Coalescent Model
   Liang Liu, Christian Anderson, Dennis Pearl, and Scott V. Edwards
8 Genome-Wide Comparative Analysis of Phylogenetic Trees: The Prokaryotic Forest of Life
   Pere Puigbò, Yuri I. Wolf, and Eugene V. Koonin
9 The Methodology Behind Network Thinking: Graphs to Analyze Microbial Complexity and Evolution
   Andrew K. Watson, Romain Lannes, Jananan S. Pathmanathan, Raphaël Méheust, Slim Karkar, Philippe Colson, Eduardo Corel, Philippe Lopez, and Eric Bapteste
10 Bayesian Molecular Clock Dating Using Genome-Scale Datasets
   Mario dos Reis and Ziheng Yang
11 Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species
   Sylvain Glémin, Clémentine M. François, and Nicolas Galtier

PART IV  NATURAL SELECTION AND INNOVATION IN GENOMIC SEQUENCES

12 Selection Acting on Genomes
   Carolin Kosiol and Maria Anisimova
13 Looking for Darwin in Genomic Sequences: Validity and Success Depends on the Relationship Between Model and Data
   Christopher T. Jones, Edward Susko, and Joseph P. Bielawski
14 Evolution of Viral Genomes: Interplay Between Selection, Recombination, and Other Forces
   Stephanie J. Spielman, Steven Weaver, Stephen D. Shank, Brittany Rife Magalis, Michael Li, and Sergei L. Kosakovsky Pond
15 Evolution of Protein Domain Architectures
   Sofia K. Forslund, Mateusz Kaduk, and Erik L. L. Sonnhammer
16 New Insights on the Evolution of Genome Content: Population Dynamics of Transposable Elements in Flies and Humans
   Lain Guio and Josefa González

PART V  POPULATION GENOMICS AND OMICS IN LIGHT OF DISEASE AND EVOLUTION

17 Association Mapping and Disease: Evolutionary Perspectives
   Søren Besenbacher, Thomas Mailund, Bjarni J. Vilhjálmsson, and Mikkel H. Schierup
18 Ancestral Population Genomics
   Julien Y. Dutheil and Asger Hobolth
19 Introduction to the Analysis of Environmental Sequences: Metagenomics with MEGAN
   Caner Bağcı, Sina Beier, Anna Górska, and Daniel H. Huson
20 Multiple Data Analyses and Statistical Approaches for Analyzing Data from Metagenomic Studies and Clinical Trials
   Suparna Mitra
21 Systems Genetics for Evolutionary Studies
   Pjotr Prins, Geert Smant, Danny Arends, Megan K. Mulligan, Rob W. Williams, and Ritsert C. Jansen

PART VI  HANDLING GENOMIC DATA: RESOURCES AND COMPUTATION

22 Semantic Integration and Enrichment of Heterogeneous Biological Databases
   Ana Claudia Sima, Kurt Stockinger, Tarcisio Mendes de Farias, and Manuel Gil
23 High-Performance Computing in Bayesian Phylogenetics and Phylodynamics Using BEAGLE
   Guy Baele, Daniel L. Ayres, Andrew Rambaut, Marc A. Suchard, and Philippe Lemey
24 Scalable Workflows and Reproducible Data Analysis for Genomics
   Francesco Strozzi, Roel Janssen, Ricardo Wurmus, Michael R. Crusoe, George Githinji, Paolo Di Tommaso, Dominique Belhachemi, Steffen Möller, Geert Smant, Joep de Ligt, and Pjotr Prins
25 Sharing Programming Resources Between Bio* Projects
   Raoul J. P. Bonnal, Andrew Yates, Naohisa Goto, Laurent Gautier, Scooter Willis, Christopher Fields, Toshiaki Katayama, and Pjotr Prins

Index

Contributors

ADRIAN M. ALTENHOFF • Computer Science Department, ETH Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
CHRISTIAN ANDERSON • Advantage Testing of Boston, Newton Centre, MA, USA
MARIA ANISIMOVA • Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
DANNY ARENDS • Animal Breeding Biology and Molecular Genetics, Albrecht Daniel Thaer-Institute for Agricultural and Horticultural Sciences, Humboldt University zu Berlin, Berlin, Germany
STÉPHANE ARIS-BROSOU • Department of Biology, University of Ottawa, Ottawa, ON, Canada; Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada
DANIEL L. AYRES • Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
GUY BAELE • Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
CANER BAĞCI • Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
ERIC BAPTESTE • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
NIKO BEERENWINKEL • Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
SINA BEIER • Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
DOMINIQUE BELHACHEMI • Life Technologies, Waltham, MA, USA
SØREN BESENBACHER • Department of Clinical Medicine (MOMA), Aarhus University, Aarhus, Denmark
JOSEPH P. BIELAWSKI • Department of Biology, Dalhousie University, Halifax, NS, Canada; Department of Mathematics & Statistics, Dalhousie University, Halifax, NS, Canada
RAOUL J. P. BONNAL • Istituto Nazionale Genetica Molecolare INGM Romeo ed Enrica Invernizzi, Milan, Italy
AIDAN BUDD • Structural and Computational Biology (SCB) Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
PHILIPPE COLSON • Fondation Institut Hospitalo-Universitaire Méditerranée Infection, Pôle des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de Bactériologie-Hygiène-Virologie, Centre Hospitalo-Universitaire Tione, Assistance Publique-Hôpitaux de Marseille, Marseille, France; Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE) UM63, CNRS 7278, IRD 198, INSERM U1095, Aix-Marseille University, Marseille, France
EDUARDO COREL • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
MICHAEL R. CRUSOE • Common Workflow Language Project, Vilnius, Lithuania
TARCISIO MENDES DE FARIAS • University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
JOEP DE LIGT • Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
CHRISTOPHE DESSIMOZ • Swiss Institute of Bioinformatics, Lausanne, Switzerland; Department of Computational Biology, University of Lausanne, Lausanne, Switzerland; Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland; Department of Genetics, Evolution and Environment, University College London, London, UK; Department of Computer Science, University College London, London, UK
COLIN N. DEWEY • Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
PAOLO DI TOMMASO • Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain
MARIO DOS REIS • School of Biological and Chemical Sciences, Queen Mary University of London, London, UK
JULIEN Y. DUTHEIL • Department of Evolutionary Genetics, Max Planck Institute of Evolutionary Biology, Plön, Germany
PETER EBERT • Max Planck Institute for Informatics, Saarbrücken, Saarland, Germany
SCOTT V. EDWARDS • Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
CHRISTOPHER FIELDS • Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
SOFIA K. FORSLUND • EMBL Heidelberg, Heidelberg, Germany; Max Delbrück Centre for Molecular Medicine, Berlin, Germany
CLÉMENTINE M. FRANÇOIS • Institut des Sciences de l’Evolution, UMR5554, Université Montpellier II, Montpellier, France
NICOLAS GALTIER • Institut des Sciences de l’Evolution, UMR5554, Université Montpellier II, Montpellier, France
LAURENT GAUTIER • DMAC, Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark
MANUEL GIL • ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
GEORGE GITHINJI • KEMRI Wellcome Trust Research Programme, Kilifi, Kenya
SYLVAIN GLÉMIN • Institut des Sciences de l’Evolution, UMR5554, Université Montpellier II, Montpellier, France
NATASHA M. GLOVER • Swiss Institute of Bioinformatics, Lausanne, Switzerland; Department of Computational Biology, University of Lausanne, Lausanne, Switzerland; Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
JOSEFA GONZÁLEZ • Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), Barcelona, Spain
ANNA GÓRSKA • Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
VALER GOTEA • National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
NAOHISA GOTO • Department of Genome Informatics, Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan
LAIN GUIO • Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), Barcelona, Spain
ASGER HOBOLTH • Bioinformatics Research Center (BiRC), Aarhus University, Aarhus, Denmark
DANIEL H. HUSON • Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
RITSERT C. JANSEN • Groningen Bioinformatics Centre, GBB, University of Groningen, Groningen, Netherlands
ROEL JANSSEN • Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
CHRISTOPHER T. JONES • Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
MATEUSZ KADUK • Department of Biochemistry and Biophysics, Stockholm Bioinformatics Centre, Science for Life Laboratory, Stockholm University, Solna, Sweden
SLIM KARKAR • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France; Department of Ecology, Evolution, and Natural Resources, School of Environmental and Biological Sciences, Rutgers, The State University of NJ, New Brunswick, NJ, USA
TOSHIAKI KATAYAMA • Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Chiba, Japan
EUGENE V. KOONIN • National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
SERGEI L. KOSAKOVSKY POND • Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
CAROLIN KOSIOL • Centre of Biological Diversity, School of Biology, University of St Andrews, Fife, UK; Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria
ROMAIN LANNES • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
PHILIPPE LEMEY • Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
MICHAEL LI • Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
LIANG LIU • Department of Statistics, University of Georgia, Athens, GA, USA
PHILIPPE LOPEZ • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
BRITTANY RIFE MAGALIS • Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
THOMAS MAILUND • Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
IZABELA MAKAŁOWSKA • Institute of Anthropology, Adam Mickiewicz University, Poznań, Poland
WOJCIECH MAKAŁOWSKI • Institute of Bioinformatics, University
of Muenster, Muenster, Germany
RAPHAËL MÉHEUST • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
SUPARNA MITRA • Leeds Institute of Medical Research, University of Leeds, Microbiology, Old Medical School, Leeds General Infirmary, Leeds LS1 3EX, West Yorkshire, UK
STEFFEN MÖLLER • Institute for Biostatistics and Informatics in Medicine and Ageing Research (IBIMA), Rostock University Medical Center, Rostock, Germany
MEGAN K. MULLIGAN • Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science Center, Memphis, TN, USA
AMIT PANDE • Institute of Bioinformatics, University of Muenster, Muenster, Germany
JANANAN S. PATHMANATHAN • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
DENNIS PEARL • Department of Statistics, Pennsylvania State University, University Park, PA, USA
PJOTR PRINS • Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands; Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science Center, Memphis, TN, USA; Laboratory of Nematology, Department of Plant Science, Wageningen University, Wageningen, The Netherlands
PERE PUIGBÒ • National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA; Division of Genetics and Physiology, Department of Biology, University of Turku, Turku, Finland
ANDREW RAMBAUT • Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
NICOLAS RODRIGUE • Department of Biology, Carleton University, Ottawa, ON, Canada; Institute of Biochemistry, Carleton University, Ottawa, ON, Canada; School of Mathematics and Statistics, Carleton University, Ottawa, ON, Canada
MIKKEL H. SCHIERUP • Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
STEPHEN D. SHANK • Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
JULIANE SIEBOURG • Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
ANA CLAUDIA SIMA • ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland; University of Lausanne, Lausanne, Switzerland
GEERT SMANT • Laboratory of Nematology, Department of Plant Science, Wageningen University, Wageningen, the Netherlands
ERIK L. L. SONNHAMMER • Department of Biochemistry and Biophysics, Stockholm Bioinformatics Centre, Science for Life Laboratory, Stockholm University, Solna, Sweden
STEPHANIE J. SPIELMAN • Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
KURT STOCKINGER • ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland
FRANCESCO STROZZI • Enterome Bioscience, Paris, France
MARC A. SUCHARD • Department of Human Genetics and Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
EDWARD SUSKO • Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
BJARNI J. VILHJÁLMSSON • Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
ANDREW K. WATSON • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
STEVEN WEAVER • Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
ROB W. WILLIAMS • Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science Center, Memphis, TN, USA
SCOOTER WILLIS • Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL, USA
YURI I. WOLF • National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
RICARDO WURMUS • BIMSB Scientific Bioinformatics Platform, Max Delbrück Center for Molecular Medicine, Berlin, Germany
ZIHENG YANG • Department of Genetics, Evolution and Environment, University College London, London, UK
ANDREW YATES • European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
NOOR YOUSSEF • Department of Biology, Dalhousie University, Halifax, NS, Canada

Part I
Introduction: Bioinformatician’s Primers

Chapter 1
Introduction to Genome Biology and Diversity
Noor Youssef, Aidan Budd, and Joseph P. Bielawski

Abstract

Organisms display astonishing levels of cell and molecular diversity, including genome size, shape, and architecture. In this chapter, we review how the genome can be viewed as both a structural and an informational unit of biological diversity and explicitly define our intended meaning of genetic information. A brief overview of the characteristic features of bacterial, archaeal, and eukaryotic cell types and viruses sets the stage for a review of the differences in organization, size, and packaging strategies of their genomes. We include a detailed review of genetic elements found outside the primary chromosomal structures, as these provide insights into how genomes are sometimes viewed as incomplete informational entities. Lastly, we reassess the definition of the genome in light of recent advancements in our understanding of the diversity of genomic structures and the mechanisms by which genetic information is expressed within the cell. Collectively, these topics comprise a good introduction to genome biology for the newcomer to the field and provide a valuable reference for those developing new statistical or computational methods in genomics.
This review also prepares the reader for anticipated transformations in thinking as the field of genome biology progresses.

Key words Organism diversity, Viruses, Prokaryotes, Eukaryotes, Organelles, DNA, RNA, Protein, Regulatory DNA, Epigenetics, Plasmids, Transcription, Translation, DNA replication, Chromatin, Gene structure

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910, https://doi.org/10.1007/978-1-4939-9074-0_1, © The Author(s) 2019

1 Introduction

Following the introduction of the concept of the genome in 1920 [1], the field of genome science has grown to encompass a vast range of interconnected topics (e.g., nucleic acid chemistry, molecular structure, replication and expression biochemistry, mutational processes, evolutionary dynamics, and interactions with cellular processes). Although the notion of the genome as a fundamental biological unit has been with us for nearly a century, it is only within the last decade that genomics has emerged as a transformative discipline within biology and the health sciences [2]. Its rapid development was in large part due to advances in massively parallel next-generation sequencing [3], which yielded unprecedented levels of genomic data. Those data revealed extensive natural variation in the way that genomes are structured and processed. This led modern biologists to reevaluate the fundamental definition of the genome.

The typical definition of the genome is often dualistic, referencing both structural features and its function to store and transmit biological information [4]. For example, the US National Institutes of Health (NIH) uses the following definition: “A genome is an organism’s complete set of DNA, including all of its genes. Each genome contains all of the information needed to build and maintain that organism.
In humans, a copy of the entire genome, more than three billion DNA base pairs, is contained in all cells that have a nucleus.” This conception, as with many others, is structural with regard to physical features (viz., genes and DNA base pairs) and informational with regard to its role in carrying out cellular functions (viz., to build and maintain the organism). Through increased knowledge of genome diversity, the field has come to realize that both conceptions of the genome are sometimes insufficient [4]. We now understand that the physical structures of the genome can be transient and that the expression of information contained within a genome is often conditioned on non-genomic factors. The science of genome biology is entering a new era based on a deeper understanding of the relationship between genotype and phenotype [5].

The purpose of this review is to provide a condensed overview of genome biology and to anticipate transformations in thinking that will occur as the field progresses. The remainder of this article is structured into four parts, with the next section providing a brief overview of the diversity of organismal cell types. The two subsequent sections introduce the structural and informational aspects of genomes, respectively. In the final section, we reassess the definition of the genome through selected biological examples and conclude with an updated perspective on the nature of the genome as an informational entity.

2 Organism Diversity and Cell Types

Cells are the smallest living unit of an organism. All cells have three attributes in common: cell membrane, cytoplasm, and genome. Structurally, cells can be divided into two basic types: prokaryotic and eukaryotic cells. Eukaryotic cells tend to be more complex. They possess a nucleus and other membrane-bound organelles, which are specialized components in the cell that perform unique functions (e.g., nucleus, mitochondria, plastids).
Conversely, prokaryotic cells lack membrane-bound organelles. Although similar in cell structure, prokaryotes include two fundamentally distinct domains: the eubacteria (true bacteria, often referred to simply as bacteria) and the archaea.

Noor Youssef et al.

Cellular life is detected in almost every environment on Earth. As life has colonized and adapted to the vast number of niches, cells have evolved an incredible amount of diversity in regard to size [6], form [7], lifestyle [8, 9], and complexity [10]. Understanding the basis of such diversity remains one of the central aims of biology. Readers interested in the latest understanding of Earth’s biodiversity, the unique characteristics of its organisms, and how both extant and extinct forms are related to each other are encouraged to explore the following resources: the University of California Museum of Paleontology “History of life through time” exhibit [11], the Tree of Life Web Project [12], and the Encyclopedia of Life [13].

2.1 Viruses

Viruses are infectious agents of living cells that are unable to reproduce in the absence of a host. Viruses are not considered cellular entities since they lack two of the essential attributes that define a cell: they possess neither a cell membrane nor cytoplasm. The discovery of virophages, viruses that parasitize other viruses, resurrected the debate on their classification as living organisms [14]. Some consider viruses to be living entities since they can be hosts to other viruses, with a virophage infection leading to the eventual death of the host virus, implying an initial “living” state [15]. The opposing view asserts that the inability of viruses to reproduce outside of a cellular host makes them nonliving entities [16, 17]. Irrespective of their delineation as living or nonliving, viruses are relevant to this review as they possess genomes and are the most abundant biological replicators in the biosphere [18].
Outside of their host, viruses exist as viral particles (virions) consisting of a protein capsule that protects and encloses their genome. Once a virion has entered a host cell, it “hijacks” the host’s cellular structures and processes to carry out the metabolically active phase of the viral life cycle. At this stage, the virus exhibits physiological properties reminiscent of living cells: it metabolizes, grows, and reproduces. There is a wide range of viral lifestyles, with corresponding diversity in viral forms, sizes, hosts, and genomes [16]. The largest known virus, the mimivirus, was originally identified as an infectious agent of an amoeba [19] and can itself become a host for virophages [14]. To put this in context, the virion of a mimivirus can be larger than some prokaryotic cells [16]. At the other end of the scale are viruses such as the circoviruses, some of which have small genomes made up of fewer than 2000 nucleotides [20]. A more detailed account of viral diversity can be found at the ViralZone website [21].

2.2 Bacteria

The bacterial cell is prokaryotic, and it is relatively simple as compared to eukaryotic cells. It has no membrane-bound organelles, and the chromosome (usually one) is not separated from the other components of the cell. While predominantly unicellular, bacteria often live in biofilms, communities of cells bound together by a secreted polymer matrix [22], and display a range of cooperative behaviors [23]. They can also exhibit regulated differentiation into different cell types, where two cells with the same genome have different morphology and function [22, 24].

Introduction to Genome Biology and Diversity

Only a very small fraction of bacterial diversity (less than 1%) can be cultured and grown in the laboratory [25]. The problem of uncultivable bacteria is a consequence of our limited knowledge of their physiological diversity and the interactions necessary for their growth [26].
To this end, efforts are being made to study bacteria in nature [27–29], but with limited progress given the immense metabolic diversity of bacteria. Even within the incomplete sampling of cultivable bacteria, there is considerable diversity in cell shape [30], mode of reproduction [9], and cell cycle regulation [31].

The bacterial cell cycle involves the coordination of genome replication and segregation of replicated copies into daughter cells, followed by cell division. In this way, the transmission of genetic material is “vertical” from one cell generation to the next. Under certain conditions, some bacteria, such as E. coli, can initiate a new round of genome replication prior to completion of cell division [32, 33], thereby resulting in an increase in the number of gene copies near the origin of replication as compared to loci replicated later [31]. Other bacteria, such as Caulobacter, maintain a tightly regulated cell cycle to ensure a single replication event per division [34]. Under optimal conditions, some species can complete their cell cycle every 20 min, implying that a single cell could produce more than a billion descendants in a mere 10 h. In addition to vertical transfer, genetic information can be transferred “horizontally” between unrelated cells via the processes of transformation, conjugation, or transduction [35]. An event that transfers gene(s) between different species (or cells) by any of these three processes is referred to as a horizontal gene transfer (HGT) event.

2.3 Archaea

Archaea are single-celled organisms that appear strikingly similar to bacteria under light and electron microscopes. Like bacteria, they often have a single circular chromosome and lack a nucleus, and for a long period of time the archaea were wrongly categorized as bacteria. The first indication that the archaea might be a separate domain of life was obtained from phylogenetic analyses of the 16S rRNA gene [36].
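As a quick check of the doubling-time figure quoted in Section 2.2 (a 20 min cell cycle yielding more than a billion descendants in 10 h), the sketch below counts cells under idealized binary fission. It assumes unlimited exponential growth with no cell death or nutrient limitation, which is a simplification rather than a description of real cultures. The function name `descendants` and its parameters are illustrative only and do not come from the original text.

```python
def descendants(total_minutes: int, cycle_minutes: int, founders: int = 1) -> int:
    """Cell count after idealized binary fission (assumes no death, no nutrient limits)."""
    if cycle_minutes <= 0:
        raise ValueError("cycle_minutes must be positive")
    # Only fully completed cell cycles contribute a doubling.
    doublings = total_minutes // cycle_minutes
    return founders * 2 ** doublings


if __name__ == "__main__":
    # 10 h = 600 min; 600 / 20 = 30 doublings; 2**30 = 1,073,741,824
    print(descendants(total_minutes=10 * 60, cycle_minutes=20))
```

Thirty doublings give 2^30 = 1,073,741,824 cells, which is consistent with the "more than a billion" figure under these idealized assumptions.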
Advancements in genome sequencing and analysis yielded further evidence of the evolutionary distinction between the bacterial and archaeal domains [37]. Despite their superficial cellular similarity to bacteria, the archaea have many molecular-level similarities to eukaryotes, leading researchers to hypothesize that the ancestor of the eukaryotes arose within the archaea [38]. Previously, archaea were assumed to be a mino