Sergei L Kosakovsky Pond (spond@temple.edu / @sergeilkp / lab.hyphy.org)
Galaxy-ELIXIR webinar series
Evolution of SARS-CoV-2
covid19.galaxyproject.org
github.com/veg/SARS-CoV-2
This presentation: bit.ly/sarscov2-selection

Natural Selection
• Mutation, recombination and other processes introduce variation into genomes of organisms
• The fitness of an organism describes how well it can survive/grow/function/replicate in a given environment, or how well it can pass on its genetic material to future generations
• Any particular mutation can be
  • Neutral: no or little change in fitness (the majority of genetic variation falls into this class according to the neutral theory)
  • Deleterious: reduced fitness
  • Adaptive: increased fitness
• The same mutation can have different fitness costs in different environments (fitness landscape), and different genetic backgrounds (epistasis)

What does selection in viruses look like?
• Necessary conditions
  • Selective pressure (immune, drug, other host factors)
  • Time
• Have we had those in SARS-CoV-2?
  • No clear evidence so far, which is not unexpected.
  • ~6 months?

What does selection in viruses look like?
• To detect selection we need
  • Sufficiently many sequences
  • Divergence/diversity
  • Repeated substitutions (diversifying)
  • Change in frequency (directional)

40 years of evolution in HA of H3N2, showing branches with repeated selective events at a canonical antigenic site
Note the scale
[Figure: phylogenetic tree of H3N2 HA isolates (e.g. A_ENGLAND_42_1972, A_HONG_KONG_33_1973, A_ALBANY_1_1976, A_TEXAS_1_1977), annotated with the codon (AAC, AAT or GAC) and amino acid (N or D) at the site]
Fixation after a few years in human hosts for H3N2 HA

For SARS-CoV-2 we have...
  Sufficiently many sequences
• Divergence/diversity
• Repeated substitutions (diversifying)
• Change in frequency (directional)
https://observablehq.com/@stevenweaver/case-vs-sequence-count

For SARS-CoV-2 we have... Little divergence...
  Sufficiently many sequences
  Divergence/diversity
• Repeated substitutions (diversifying)
• Change in frequency (directional)
Mean of ~7 genome-wide differences from reference
https://observablehq.com/@spond/current-state-of-sars-cov-2-evolution

For SARS-CoV-2 we have... Little diversity...
Mean of ~9 genome-wide pairwise differences in contemporaneous strains
https://observablehq.com/@spond/current-state-of-sars-cov-2-evolution

BUT... There is EXTENSIVE apparent genome variation at population level
[Figure panels: Shared a/a variants; Any variants]
https://observablehq.com/@spond/summary-of-sars-cov-2-genomic-diversity

MOST VARIANTS ARE RARE
• [Some/extensive] Sequencing error
• [Some] RNA editing
• [Majority] Neutral/slightly deleterious intra-host variants
• [Minority] Important variation
[Figure legend: ≤1%, >1%]
https://observablehq.com/@spond/summary-of-sars-cov-2-genomic-diversity

Conservation
Measles, rinderpest, and peste des petits ruminants viruses nucleoprotein.
[Figure panels: Nucleotides; Amino acids]

Diversification
An antigenic site in H3N2 IAV hemagglutinin
[Figure panels: Nucleotides; Amino acids]

Molecular signatures of selection
• Because synonymous substitutions do not alter the protein, we often assume that they are neutral.
• The rate of accumulation of synonymous substitutions (dS) can serve as the neutral background evolutionary rate.
• We can compare the rate of accumulation of non-synonymous substitutions (dN), which alter the protein sequence, to dS and use their ratio to classify the nature of the evolutionary process.

$$ dS \sim \frac{\text{number of fixed synonymous mutations}}{\text{proportion of random mutations that are synonymous}} $$

$$ dN \sim \frac{\text{number of fixed non-synonymous mutations}}{\text{proportion of random mutations that are non-synonymous}} $$

Molecular signatures of selection
• Over the last 15 years, my lab and collaborators have been developing a collection of methods for estimating dN/dS and interpreting evidence for selection.
• Methods are implemented in the HyPhy (hyphy.org) software package and are also available in Galaxy.

Molecular signatures of selection
• Also available on a standalone server (datamonkey.org), which people have been using to study SARS-CoV-2
https://observablehq.com/@stevenweaver/datamonkey-and-sars-cov-2-related-analyses

Standard dN/dS analyses
Won't work too well at the moment...
• Will be influenced by "noise" and variation at the tips
  • Focus only on internal branches.
• Has very little signal to operate on
  • Even with 15,000+ sequences, the total branch length (~power) is minuscule: ~0.1 substitutions per site across the entire tree.
  • Power to infer selection on internal branches is low
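
The dS and dN definitions under "Molecular signatures of selection" can be illustrated with a minimal counting sketch in Python. This is not how HyPhy estimates dN/dS (its methods are model-based); it only shows the normalization idea. The function names `synonymous_fraction` and `dn_ds` are hypothetical, the standard genetic code is assumed, and changes to or from stop codons are counted as non-synonymous for simplicity.

```python
from fractions import Fraction
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_fraction(codon):
    """Fraction of the 9 single-nucleotide changes to `codon` that keep the amino acid."""
    same = 0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                same += CODON_TABLE[mutant] == CODON_TABLE[codon]
    return Fraction(same, 9)


def dn_ds(fixed_syn, fixed_nonsyn, codons):
    """Ratio of dN to dS, each a fixed-mutation count divided by the matching
    proportion of random mutations (averaged over the reference codons)."""
    p_syn = sum(synonymous_fraction(c) for c in codons) / len(codons)
    p_nonsyn = 1 - p_syn
    if fixed_syn == 0 or p_syn == 0 or p_nonsyn == 0:
        raise ValueError("dN/dS is undefined for these inputs")
    d_s = fixed_syn / p_syn          # dS ~ fixed synonymous / proportion synonymous
    d_n = fixed_nonsyn / p_nonsyn    # dN ~ fixed non-synonymous / proportion non-synonymous
    return d_n / d_s


# AAC (Asn) has one synonymous neighbor (AAT) out of nine; AAC -> GAC is N -> D.
print(synonymous_fraction("AAC"))
print(dn_ds(fixed_syn=1, fixed_nonsyn=16, codons=["AAC"]))
```

With only one in nine random changes to AAC being synonymous, eight non-synonymous changes per synonymous one is the neutral expectation. A ratio above 1 therefore requires more than that, which is why raw counts alone are not compared.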