Forecast biases related to Rossby wave properties and their impact on temperature extreme predictability

Onno Doensen
MSc Thesis

Supervisors:
Dr. Georgios Fragkoulidis (JGU Mainz)
Prof. Dr. Volkmar Wirth (JGU Mainz)
Prof. Dr. Jordi Vila-Guerau de Arellano (Wageningen UR)
Dr. Chris Weijenborg (Wageningen UR)

Institute for Atmospheric Physics - Johannes Gutenberg University Mainz
Meteorology and Air Quality Group - Wageningen University and Research

Wageningen, June 2021

Abstract

Rossby wave packets (RWPs) are a key large-scale feature of the upper-tropospheric midlatitude flow. Past studies have shown that temperature extremes are often associated with the presence of RWPs. Therefore, high-quality forecasts of temperature extremes require an adequate representation of characteristic RWP features like amplitude and phase speed in numerical weather prediction (NWP) models. This work assesses the skill of two reforecast models, one based on the ECMWF (ERA5RF) and the other on the NCEP (GEFSRF) model, in forecasting RWP amplitude and phase speed, as well as their skill in forecasting the magnitude and duration of temperature extremes. The skill is computed with respect to the ERA5 reanalysis data for the time period 1984–2019. The overarching goal of this work is to investigate whether there is a relation between forecast biases in RWP properties and temperature extremes. It is found that both models exhibit an RWP amplitude underestimation that grows with forecast time. However, this underestimation is stronger and more systematic in GEFSRF, whereas the bias sign varies seasonally in ERA5RF. RWP phase speed forecast biases also grow with forecast time, but the sign of the bias varies in space. ERA5RF shows a higher predictive skill for temperature extremes, but differences between seasons and temperature extreme types are present in the two models.
Furthermore, bad temperature extreme forecasts are often associated with an underestimation of RWP amplitude and an overestimation of RWP phase speed. Finally, an indication is found that good forecasts of temperature extremes are associated with stronger RWP amplitude. The aforementioned results give an overview of the areas in which NWP models should be improved regarding RWP properties and of how these biases can affect the forecast quality of temperature extremes.

Contents

1 Introduction
  1.1 Rossby wave packets
  1.2 Temperature Extremes
  1.3 Research objectives
2 Data and Methods
  2.1 Data
  2.2 Forecast biases: reforecast vs. reanalysis
  2.3 Temperature extreme forecast skill
3 Results
  3.1 Forecast biases in Rossby wave packet properties
    3.1.1 GEFSRF forecast biases in the N. Hemisphere
    3.1.2 GEFSRF forecast biases in the S. Hemisphere
    3.1.3 ERA5RF forecast biases in the N. Hemisphere
    3.1.4 North Atlantic biases evolution with forecast time
  3.2 Classifying temperature extremes and their predictability
    3.2.1 General atmospheric conditions during temperature extremes
    3.2.2 Temperature extreme predictability
  3.3 Skill of temperature extreme forecasts and their relation to biases in Rossby wave packet properties
    3.3.1 Typical Rossby wave packet properties during European temperature extremes
    3.3.2 Rossby wave packet amplitude in good and bad temperature extreme forecasts
    3.3.3 Rossby wave packet property biases in good and bad temperature extreme forecasts
4 Concluding Remarks
  4.1 Summary
  4.2 Discussion
  4.3 Outlook
5 Acknowledgements
6 References

1 Introduction

In the mid-latitude upper troposphere, the dominant flow is from west to east. This westerly flow is not purely zonal but typically characterized by large-scale meanders that owe their existence to the latitudinal variation of the Coriolis parameter (“beta” effect) due to the rotation and spherical shape of the Earth. These so-called Rossby waves take the form of large-scale undulations along the jet stream, which acts as the background flow upon which they evolve and propagate. They play a major role in the transfer of moisture, momentum and energy and can bring unusually warm or cold air masses into the mid-latitude areas. Therefore, Rossby waves play an essential role in the global atmospheric circulation (Rhines, 2002).

Rossby waves were first described by the Swedish-born meteorologist Carl-Gustaf Rossby (Rossby and Collaborators, 1939; Haurwitz, 1940). For barotropic Rossby waves on the $\beta$-plane and with a constant, purely zonal flow $u_0$, the Rossby wave dispersion relation can be written as in Eq. 1 (for the full derivation see chapter 5 of Holton and Hakim (2012)),

$$\omega = u_0 k - \frac{k \beta}{k^2 + l^2} \tag{1}$$

where $k$ and $l$ are the zonal and meridional wavenumbers, respectively, and $\beta$ represents the northward gradient of planetary vorticity.
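For reference, the zonal phase speed and zonal group speed implied by Eq. 1 can be written out explicitly. The short derivation below uses the same barotropic $\beta$-plane assumptions; the shorthand $K^2 = k^2 + l^2$ is introduced here only for compactness.

```latex
\[
\begin{aligned}
\omega &= u_0 k - \frac{\beta k}{K^2}, \qquad K^2 = k^2 + l^2, \\
c_p &= \frac{\omega}{k} = u_0 - \frac{\beta}{K^2}, \\
c_g &= \frac{\partial \omega}{\partial k} = u_0 + \frac{\beta \left(k^2 - l^2\right)}{K^4}, \\
c_g - c_p &= \frac{2 \beta k^2}{K^4} > 0 \quad \text{for } \beta > 0,\; k \neq 0 .
\end{aligned}
\]
```

The last line shows that, for this idealized dispersion relation, the zonal group speed exceeds the zonal phase speed for every wave with a non-zero zonal wavenumber.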
Due to their dispersive nature, Rossby waves have different group speeds ($c_g = \partial\omega / \partial k$) and phase speeds ($c_p = \omega / k$), where $c_g$ is always larger than $c_p$ due to the positive value of $\beta$. In the real atmosphere, $c_p$ represents the speed at which individual troughs and ridges within a Rossby wave travel and $c_g$ represents the speed at which the Rossby wave propagates as a whole.

1.1 Rossby wave packets

In the most idealized scenario, a Rossby wave has the form of a perfect sinusoidal wave that propagates along the hemisphere with a constant amplitude. An example of this on the N. Hemisphere is shown by the black dashed line in Fig. 1. However, this kind of idealized circumglobal Rossby wave is extremely rare; instead, real-atmosphere Rossby waves tend to organize into so-called Rossby wave packets (RWPs) (Wirth et al., 2018). An example of an RWP is represented by the blue line in Fig. 1. The RWP has a clear maximum amplitude at the prime meridian, and its amplitude decays symmetrically to the west and east from there. The red line in Fig. 1 is a function that encloses the RWP and can therefore serve as a measure of the RWP amplitude at any given location (Zimin et al., 2003). This function will be referred to as the RWP envelope ($E$) and will be used to approximate RWP amplitude throughout this work.

A real-world example of an RWP on the N. Hemisphere can be seen in Fig. 2 for 7 Aug 2002 00 UTC. In Fig. 2a, the jet streams are clearly visible over the N. Hemisphere as “channels” of high wind speed. The jet stream over N. America is much wavier than over the Eurasian continent. This can also be observed in the 300 hPa meridional wind, which has a much clearer north-south component over N. America (Fig. 2b). The calculated envelope of the 300 hPa meridional wind is shown in Fig. 2c, which shows the largest RWP amplitudes over N.
America and the absence of RWP activity over Eurasia as well. Figure 2 is a very clear example that RWPs are not circumglobal entities. Assuming that they always are does not properly represent the general atmospheric flow and can obscure processes that RWPs affect (e.g. see Fragkoulidis et al. (2018) for a practical example).

Figure 1: An example of an idealized Rossby wave (black dashed line) and Rossby wave packet (RWP) (blue line) on the N. Hemisphere. The RWP is enveloped by a function that approximates the RWP amplitude (red line).

In general, RWP activity is much higher in winter due to the larger pole-to-equator temperature gradient. However, there are notable differences between the N. Hemisphere and the S. Hemisphere. Overall, RWP activity is higher and more zonally symmetric on the S. Hemisphere, whereas seasonal differences in RWP activity are much larger on the N. Hemisphere (Trenberth, 1991; Souders et al., 2014; Fragkoulidis and Wirth, 2020).

In the upper parts of the troposphere, at the boundaries of warm, equatorial and cold, polar air masses, strong gradients of potential vorticity (PV) are often found (e.g. see Fig. 2d) (Schwierz et al., 2004). These strong PV gradients act as preferred pathways for RWPs to propagate along and are therefore often referred to as Rossby waveguides (Hoskins and Ambrizzi, 1993; Wirth, 2020). Therefore, it is crucial that the characteristics of Rossby waveguides are properly represented in numerical weather prediction (NWP) models. However, it has recently been found that NWP models exhibit an unrealistic, systematic decrease of the tropopause PV gradient that characterizes Rossby waveguides during the first two days of the forecast (Gray et al., 2014; Giannakaki and Martius, 2016).
A possible reason for this spurious decrease of tropopause sharpness is the poor representation of diabatic processes such as moisture transport, latent heat release and radiative cooling (Chagnon et al., 2013; Saffin et al., 2017). Furthermore, numerical diffusion, used to keep NWP model numerics stable, also causes a decrease in tropopause sharpness (Harvey et al., 2018). These limitations can lead to a poor representation of the Rossby waveguide and, as a result, of the evolution of RWPs.

Figure 2: An example of a real-world RWP on 7 Aug 2002 00 UTC taken from Wirth et al. (2018). (a) Magnitude of the horizontal wind ($V$) at 300 hPa (color shading, m/s). (b) Magnitude of the meridional wind ($v$) at 300 hPa (m/s) and isolines of 300-hPa geopotential height ($Z$) (black contour lines, every 150 m). (c) The envelope ($E$) of meridional wind at 300 hPa. (d) Ertel potential vorticity ($PV$) on the 330 K isentrope (color shading in PV units).

First, Harvey et al. (2016) have shown in an idealized setup that broadening the Rossby waveguide causes a decrease in both RWP phase speed and jet stream speed. These two effects cancel each other out at first order; at second order, however, the error in jet speed starts to dominate, which means that RWP phase speed is underestimated on a broader Rossby waveguide. Secondly, Gray et al. (2014) have shown that the decrease in tropopause sharpness causes a decrease in ridge area during the N. Hemisphere winter on the order of 10% for several NWP models. Harvey et al. (2018) showed that the combined effect of numerical diffusion on the tropopause sharpness and on the rearrangement of PV in the model leads to an underestimation of Rossby wave amplitude as well. Martínez-Alvarado et al. (2016) also found an underestimation in ridge area of about 10% for several NWP models. However, they also showed that increasing model resolution and improving the model’s dynamical core are both effective ways to improve
the Rossby waveguide representation and decrease the error related to ridge area in the forecast. Finally, Saffin et al. (2017) suggested that improving parametrizations of diabatic processes may be a relatively cheap way of improving the tropopause sharpness as well.

Another typical property of RWPs is that small local forecast errors in RWPs can grow into large errors on the synoptic scale (Zhang et al., 2003; Baumgart et al., 2019) and tend to travel at speeds similar to the RWP group velocity (Langland et al., 2002). Rodwell et al. (2013) have shown that forecast errors related to convection over N. America can affect RWP propagation and cause forecast busts over Europe. Also, RWPs are sensitive to the exact phasing when interacting with extratropically transitioning tropical cyclones, and an erroneous forecast of the cyclone position can have large impacts on the RWP development in the forecast (Riemer and Jones, 2010).

Although the mechanisms discussed in the previous paragraph mainly highlight some aspects that limit the practical predictability of the midlatitude jet stream and RWPs, it has also been hypothesized that coherent RWPs are related to increased predictability of the general atmospheric flow due to their large temporal and spatial scale. This was first suggested by Lee and Held (1993), who stated: “Because the packet can remain coherent despite chaotic internal dynamics, the packet envelope should be more predictable than the individual weather systems”. More recently, Glatt and Wirth (2014) found that RWPs are better represented in the forecast when they are already present at initialization. Grazzini and Vitart (2015) found that the presence of long-lived, coherent RWPs is associated with a decreased ensemble spread and increased medium-range forecast skill.
Whether the presence of RWPs contributes to a more predictable forecast or makes the forecast more uncertain, due to the previously discussed uncertainties, remains an open question (Wirth et al., 2018).

1.2 Temperature Extremes

The presence of RWPs has often been associated with the occurrence of extreme weather. These extremes include strong surface cyclones (Wirth and Eichhorn, 2014), heavy precipitation (Barton et al., 2016) and extreme winds (Wiegand et al., 2011), among others. Temperature extremes are also often associated with RWP activity due to the stronger north-south component in the main flow and therefore the larger possibility that anomalous air masses are advected into the mid-latitudes (Screen and Simmonds, 2014; Fragkoulidis et al., 2018; Röthlisberger et al., 2019). Fragkoulidis et al. (2018) found a direct link between RWP presence and both the 2003 W. European summer heatwave and the 2010 Russian summer heatwave. More recent examples of links between Rossby wave activity and temperature extremes include the late-summer 2016 (Zschenderlein et al., 2018) and early-summer 2018 (Kornhuber et al., 2019) heatwaves over Europe and the 2014 coldwave over N. America (Shi et al., 2017). Chapter 7 of Wirth et al. (2018) provides a more extensive list of extreme (temperature) events that were related to the presence of RWPs.

Temperature extremes can have large societal impacts, and heatwaves in particular are becoming a bigger problem due to global warming (IPCC, 2013). Therefore, it is extremely important that society is well prepared for these events (Bittner et al., 2013), and predicting these events well in advance plays an essential role in that. However, this is still difficult (White et al., 2017). Despite the difficulty in predicting temperature extremes, Lavaysse et al.
(2019) found that forecasts of temperature extremes may be skilful up to 3 weeks ahead and that coldwaves in winter have higher predictability than heatwaves in summer. Wulff and Domeisen (2019) found that summer heatwaves may have increased predictability on the sub-seasonal timescale due to persistent flow patterns and decreased soil moisture content, and that this increased predictability is not found for summer cold extremes. Zschenderlein et al. (2019) showed that heatwaves in summer are often affected more strongly by the upper-tropospheric flow in NW Europe. Fragkoulidis et al. (2018) found that large temperature anomalies are often associated with large values of RWP amplitude ($E$) in general. In addition to that, Fragkoulidis and Wirth (2020) found a shift towards higher values of $E$ and lower values of $c_p$ during persistent temperature extremes compared to short-lived extremes, and that this difference is statistically significant.

Since temperature extremes and RWP properties can be strongly related, forecast errors related to RWP properties may have a large impact on the predictability of temperature extremes, especially when these errors are systematic. This effect may be even stronger for long-lived temperature extremes, since $c_p$ values related to long-lived temperature extremes are often very close to zero (Kyselý, 2008) and a minor error may have a large impact on extreme event duration in the forecast.

There have been plenty of studies on the predictability of temperature extremes and on forecast errors related to RWPs separately. Far fewer studies, if any, have investigated possible linkages between these two aspects. Given that the presence of RWPs is often associated with temperature extremes, it is likely that forecast errors in the evolution of RWPs will have an impact on temperature extreme predictability as well. Reducing the forecast biases related to RWPs as presented by Gray et al.
(2014) and Martínez-Alvarado et al. (2016) may also increase the forecast skill related to temperature extremes. Furthermore, the aforementioned work related to RWP forecast biases focused on winter-time values on the N. Hemisphere only. Whether these forecast biases are also present during summer and on the S. Hemisphere is unclear.

1.3 Research objectives

To fill this research gap and shed light on the above considerations, this work aims to address the following questions by comparing two reforecast models with reanalysis data:

1. Are there forecast biases regarding Rossby wave properties in summer and winter? If so, how do they vary with the season, the region and the forecast time?
2. How well are temperature extremes predicted? Does the predictability of temperature extremes change for different temperature extreme types and for different seasons?
3. Are good forecasts of temperature extremes related to smaller forecast biases regarding RWP properties? And are good forecasts of temperature extremes more likely during specific conditions related to RWPs?

Although the RWP forecast biases will be discussed at a global level, the bulk of this work is focused on conditions over the N. Atlantic-Europe region.

This report is structured as follows. First, in section 2, the data and methods used to tackle the research questions are presented. In section 3, the forecast biases regarding RWP properties in the reforecast models, the skill of the reforecast models in predicting temperature extremes, and the relation between temperature extreme forecast quality and forecast biases related to RWPs are presented. Finally, in section 4, a summary, a discussion and the context of the results within existing literature are provided.

2 Data and Methods

In this section, the tools used in this work are described. First, the data used in this work are presented. Secondly, a description is given of how RWP properties are computed.
Lastly, the strategy to address the three main objectives is presented.

2.1 Data

In this work, reforecast data from two NWP models are used. These are the Global Ensemble Forecast System Reforecast, Version 2 (GEFSRF) (Hamill et al., 2013) and the ERA5 Reforecast (ERA5RF) (Hersbach et al., 2020). GEFSRF is a product from the American National Centers for Environmental Prediction (NCEP), whereas ERA5RF is a product from the European Centre for Medium-Range Weather Forecasts (ECMWF). The reforecast models are compared to ERA5 reanalysis data (Hersbach et al., 2020) to assess their skill. For more details about the employed reforecast and reanalysis datasets, see Table 1.

Table 1: Data specifications of the reforecast (GEFSRF and ERA5RF) and reanalysis (ERA5) datasets

                            GEFSRF                  ERA5RF                                            ERA5
Spatial resolution          2° x 2°                 2° x 2°                                           2° x 2°
Spatial data availability   Whole globe             Northern Hemisphere only                          Whole globe
Time span                   Dec. 1984 - Nov. 2019   Jan. 1979 - Dec. 2019                             Jan. 1979 - Dec. 2019
Temporal resolution         24 hourly               12 hourly                                         6 hourly
Forecast times              0 - 252 h               0 - 240 h                                         -
Forecast time resolution    6 hourly                6 hourly (0 - 120 h), 12 hourly (120 - 240 h)     -
Year of development         2012                    2016                                              2016

Although ERA5 and ERA5RF data are available from 1979 onward, only the dates for which GEFSRF is available (Dec. 1984 - Nov. 2019) are considered in order to facilitate a fair comparison.

Variables like temperature, geopotential height and zonal/meridional wind are readily available output from the models. However, the RWP envelope ($E$) and RWP phase speed ($c_p$) are diagnostic fields computed from the meridional wind field at 300 hPa (V300). The V300 field is well suited for RWP diagnosis, due to the clear perceptibility of northerlies and southerlies within the jet stream (Chang, 1993). As introduced earlier, an RWP can be assumed to be “enveloped” by an always positive function that defines the amplitude of the RWP at a given location (Fig. 1).
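As a rough illustration of such an envelope, the sketch below computes the modulus of the analytic signal for a single latitude circle, assuming the Hilbert-transform approach of Zimin et al. (2003) in spectral space. The function name `rwp_envelope` and the test signal are hypothetical. The sketch omits the spatial smoothing and any wavenumber restriction, and it removes the zonal mean by zeroing wavenumber 0, which is an assumption made here.

```python
import numpy as np


def rwp_envelope(v_lat_circle):
    """Envelope of a periodic 1-D signal (e.g. V300 along one latitude circle)."""
    v = np.asarray(v_lat_circle, dtype=float)
    n = v.size
    spectrum = np.fft.fft(v)

    # Spectral weights of the analytic signal: drop the zonal mean and all
    # negative wavenumbers, double the positive ones.
    weights = np.zeros(n)
    if n % 2 == 0:
        weights[1:n // 2] = 2.0
        weights[n // 2] = 1.0  # Nyquist wavenumber is shared, keep it once
    else:
        weights[1:n // 2 + 1] = 2.0

    analytic = np.fft.ifft(spectrum * weights)
    return np.abs(analytic)  # modulus of the analytic signal


# Hypothetical test signal on a 2-degree grid: a wavenumber-7 wave whose
# amplitude is modulated by a Gaussian packet centred on the prime meridian.
lon = np.deg2rad(np.arange(-180.0, 180.0, 2.0))
amplitude = 20.0 * np.exp(-(lon / np.deg2rad(40.0)) ** 2)
v300 = amplitude * np.cos(7 * lon)
envelope = rwp_envelope(v300)
```

For this synthetic packet the returned envelope closely follows the prescribed Gaussian amplitude, which mirrors the role of the red line in Fig. 1.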
In this work, $E$ is computed by applying a Hilbert transform to latitude circles of the V300 field, as originally proposed by Zimin et al. (2003). In practice, this is done in spectral space by zeroing out the negative wavenumbers, doubling the positive ones and, finally, performing an inverse Fourier transform on the spatially-smoothed V300 anomaly field (Fragkoulidis and Wirth, 2020). Taking the modulus of the resulting analytic signal yields an approximation of the RWP amplitude that is local in space and time.

The zonal RWP phase speed ($c_p$) denotes the speed of individual embedded troughs and ridges. The computation of local $c_p$ is based on the spatiotemporal evolution of the phase field, which is again derived from the Hilbert transform as the argument of the analytic signal. Essentially, $c_p$ is defined as the ratio of the temporal phase derivative (i.e., angular frequency) to the zonal phase derivative (i.e., angular zonal wavenumber). This method yields meaningful results in areas of clearly defined wave propagation; large deviations from the almost-plane wave paradigm may yield unphysical phase values. Therefore, values of $c_p$ are only defined at grid points where $E$ exceeds 15 m/s. To further exclude any unrealistic values, cases where the $c_p$ magnitude exceeds 100 m/s are discarded. More details on the aforementioned steps can be found in Fragkoulidis and Wirth (2020). Given that the $c_p$ diagnosis requires a high temporal resolution (ideally higher than 6 hours), it is only performed up to forecast day 4 in the case of ERA5RF.

2.2 Forecast biases: reforecast vs. reanalysis

To determine the forecast biases of the two models, reforecast data are compared to ERA5 reanalysis data, assuming that ERA5 represents the true state of the atmosphere. Each forecast is evaluated against the ERA5 data at the date for which the forecast is valid. A hypothetical example for January 1st of a particular year can be seen in Fig. 3.
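The valid-time matching described above can be sketched as follows: each forecast is paired with the reanalysis at its valid time, the error is taken as forecast minus reanalysis, and the errors are averaged over all initializations. This is a minimal sketch under simplifying assumptions: forecasts and reanalysis share one regular time axis, lead times are given in steps of that axis, and the function name `forecast_bias` and the array layout are hypothetical.

```python
import numpy as np


def forecast_bias(forecast, reanalysis, lead_steps):
    """Mean error (forecast minus reanalysis) per lead time.

    forecast   : array (n_init, n_lead, ...); forecast[i, j] was issued at
                 time index i and is valid at time index i + lead_steps[j]
    reanalysis : array (n_time, ...) on the same time axis as the
                 initialization dates
    lead_steps : lead times expressed in steps of that time axis
    """
    forecast = np.asarray(forecast, dtype=float)
    reanalysis = np.asarray(reanalysis, dtype=float)
    n_init = forecast.shape[0]
    n_time = reanalysis.shape[0]

    bias = np.full(forecast.shape[1:], np.nan)
    for j, lead in enumerate(lead_steps):
        # Keep only initializations whose valid time lies inside the record
        n_valid = min(n_init, n_time - lead)
        if n_valid <= 0:
            continue
        error = forecast[:n_valid, j] - reanalysis[lead:lead + n_valid]
        bias[j] = error.mean(axis=0)
    return bias
```

A seasonal bias would be obtained in the same way after restricting the initializations to the season of interest.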
Figure 3: Hypothetical example of a forecast evaluation on 1 January. The 24-hour forecast issued on 1 January is compared to the reanalysis data on 2 January, the 48-hour forecast is compared to the reanalysis data on 3 January, and so on.

This method allows calculating an error (forecast minus reanalysis) for each issued forecast at each valid time and every grid point. To calculate the bias in a given RWP property, the errors are averaged over all issued forecasts. Since the large-scale flow varies significantly between seasons, the biases are computed separately for each season.

2.3 Temperature extreme forecast skill

In this analysis, the predictive skill of the two reforecast models for temperature extremes is investigated. To get a good overview of the entire European continent, eight regions over Europe are defined for the analysis. These are shown in Fig. 4. Every region represents an area of 8° x 8°.

Figure 4: All predefined European regions where the presence of temperature extremes is investigated. The coordinates of the regions are: FI (60-68° N, 22-30° E), NO (58-66° N, 6-14° E), RU (52-60° N, 46-54° E), UK (50-58° N, 0-8° W), DE (46-54° N, 6-14° E), UA (46-54° N, 26-34° E), ES (36-44° N, 0-8° W) and TR (34-42° N, 28-36° E).

The extreme temperature definition is based on the 850 hPa temperature (T850). This isobaric level is chosen because near-surface temperatures are exposed to boundary layer processes, so that the effect of the upper-tropospheric large-scale environment can be overshadowed (Fragkoulidis et al., 2018). On the other hand, by focusing on the temperature at a level that is too high (e.g. 500 hPa), one could lose a significant connection to the actual conditions on the ground. Besides, the focus is on the type of extremes that affect a large area, rather than local episodes.

An extreme day is defined as a day where T850 exceeds the 90th percentile (falls below the 10th percentile) of the daily climatology for warm (cold) temperature extremes.
The 90th and 10th percentile daily T850 climatologies are computed using ERA5 data between December 1984 and November 2019. First, daily T850 averages are calculated for each grid point and subsequently, a latitude-weighted area average is applied (Eq. 2) to get one annual cycle for each box in Fig. 4,

$$\bar{X} = \frac{\sum_{i=1}^{N} \cos\phi_i \, X_i}{\sum_{i=1}^{N} \cos\phi_i} \tag{2}$$

where $\phi_i$ stands for the latitude of grid cell $i$, $X_i$ stands for the variable that is averaged and $N$ stands for the total number of grid cells. To compute a smooth climatology, a fast Fourier transform is applied to the daily climatologies and only the lowest four frequencies (1–4 year$^{-1}$) are kept. The latter technique is also used by Wulff and Domeisen (2019) and Fragkoulidis and Wirth (2020). The smoothed 90th (10th) T850 percentile climatologies for each European grid box constitute the threshold for the definition of warm (cold) extreme days.

Since persistent and short-lived temperature extremes are often related to RWPs with different properties (Fragkoulidis and Wirth, 2020), they are hereby split into those two categories. Persistent extremes are defined as events where four or more consecutive days are extreme in terms of T850. Short-lived extremes are defined as events where only one or two consecutive days are extreme. To make a clear distinction between persistent and short-lived extremes, extreme events with a length of exactly 3 consecutive days are discarded. This procedure is summarized in Table 2. An issue that could arise is that a persistent extreme event gets split up into multiple short-lived extremes because one or more non-extreme days break the series of consecutive extreme days, even though these events may be part of one persistent RWP. To account for this, all short-lived events occurring within 3 days of each other are discarded.

Table 2: Definition of temperature extreme event types.
                                     1 or 2 consecutive days     4 or more consecutive days
$T_{day} > T_{day-clim-p90}$         Short-lived warm extreme    Persistent warm extreme
$T_{day} < T_{day-clim-p10}$         Short-lived cold extreme    Persistent cold extreme

The next step is to assess the skill of GEFSRF and ERA5RF in predicting these extreme temperature events. First, daily T850 averages are calculated from the forecasts that were issued 3 and 5 days before the extreme event onset. Because only data up to forecast day 10 are available, the last days of an extreme event that stretch beyond the 10-day forecast horizon are discarded. The quality of the forecast is quantified by the Gilbert skill score (GSS) (Wilks, 2011) using Eq. 3,

$$GSS = \frac{hits - hits_c}{hits + misses + false\ alarms - hits_c}, \qquad hits_c = \frac{(hits + misses)(hits + false\ alarms)}{total} \tag{3}$$

In this equation, $hits$ stands for the number of extreme days that are forecasted correctly, $misses$ stands for the extreme days that are missed by the forecast and $false\ alarms$ stands for the days where the forecast raises an extreme day that does not occur in the reanalysis. The $hits_c$ term accounts for the forecast hits that might arise due to chance, where $total$ stands for the total number of forecast days, which is 10 for every event. The GSS is suitable for extreme events because it does not reward non-extreme days that were correctly forecasted. By definition, such days are far more numerous than extreme days, so they can artificially improve a metric that takes well-forecasted non-event days into account. The GSS ranges from -1/3 to 1, where a score of 1 stands for a perfect forecast and 0 stands for a forecast with a skill that is equal to making predictions based on climatology only.

The GSS is calculated for every individual event and every grid box in Fig. 4. The next step is to classify whether a forecast is “good” or “bad”. Two methods are used to assess the forecast quality.
• First, a method based on the error in forecast intensity is introduced. This is done by computing the mean absolute error (MAE) between the T850 values in the forecast and the reanalysis and averaging over all daily MAE values to obtain one mean daily MAE. Unlike the GSS, the mean daily MAE is affected by days where neither the forecast nor the reanalysis is extreme. To account for this, the MAE is only calculated on days where T850 in the reanalysis was extreme and belonged to the same event. The closer the mean daily MAE of the forecast is to 0, the better the intensity of the event is forecasted. To classify the forecasts for every region, the events are first grouped by season according to their onset date. Secondly, the forecasts are split into the half with the lowest mean daily MAE values (good) and the half with the highest mean daily MAE values (bad). If the number of events in a season is odd, the median event is left out.

• Secondly, a method based on the error in event duration is introduced. This is done by calculating the event duration for each event in the reanalysis data, and subsequently calculating the longest consecutive streak of extreme days in the forecast, where it is assumed that this represents the length of the extreme event in the forecast. However, forecast data are only available up to forecast day 10, which limits the ability to track overestimated event durations in the forecast. Therefore, the analysis is limited to finding underestimations of 2 days or more. Consequently, this analysis can only be done for persistent extremes, because short-lived extremes last only 1 or 2 days by definition. The forecast is labelled as good when the event duration in the forecast reproduces the exact same event length or is off by only 1 day (either too short or too long). The forecast is labelled as bad when the event duration in the forecast is underestimated by 2 days or more.
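The GSS of Eq. 3 and the two classification methods above can be sketched in a few lines. This is a minimal illustration rather than the code used in this work: the function names, the warm-extreme threshold of 12 °C and the two 10-day series are hypothetical, and the duration label assumes that the 10-day window contains a single event.

```python
import numpy as np


def gilbert_skill_score(fc_extreme, an_extreme):
    """GSS (Eq. 3) from boolean extreme-day flags of forecast and reanalysis."""
    fc = np.asarray(fc_extreme, dtype=bool)
    an = np.asarray(an_extreme, dtype=bool)
    hits = np.sum(fc & an)
    misses = np.sum(~fc & an)
    false_alarms = np.sum(fc & ~an)
    hits_chance = (hits + misses) * (hits + false_alarms) / fc.size
    denominator = hits + misses + false_alarms - hits_chance
    if denominator == 0:
        return float("nan")  # score undefined, e.g. no extreme day at all
    return float((hits - hits_chance) / denominator)


def mean_daily_mae(fc_t850, an_t850, an_extreme):
    """Mean absolute error over the days that are extreme in the reanalysis."""
    mask = np.asarray(an_extreme, dtype=bool)
    error = np.abs(np.asarray(fc_t850, dtype=float) - np.asarray(an_t850, dtype=float))
    return float(error[mask].mean())


def longest_run(flags):
    """Length of the longest streak of consecutive True values."""
    longest = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def duration_label(fc_extreme, an_extreme):
    """'good' if the duration is off by at most 1 day, 'bad' if it is
    underestimated by 2 days or more, None (not classified) otherwise."""
    difference = longest_run(fc_extreme) - longest_run(an_extreme)
    if abs(difference) <= 1:
        return "good"
    if difference <= -2:
        return "bad"
    return None


# Hypothetical 10-day series of regional T850 (deg C), warm-extreme threshold 12
threshold = 12.0
reanalysis = np.array([10.0, 11.0, 12.5, 13.0, 14.0, 12.2, 11.0, 10.0, 9.0, 10.0])
forecast = np.array([10.0, 12.5, 13.5, 13.0, 12.0, 11.5, 11.0, 10.0, 9.0, 10.0])
an_flags = reanalysis >= threshold
fc_flags = forecast >= threshold

gss = gilbert_skill_score(fc_flags, an_flags)
mae = mean_daily_mae(forecast, reanalysis, an_flags)
label = duration_label(fc_flags, an_flags)
```

In this made-up series the forecast event starts one day early but has the same length as in the reanalysis, so it counts 3 hits, 1 miss and 1 false alarm and is labelled "good" by the duration method.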
In contrast to the MAE method, the number of forecasts labelled as good or bad is not equal.

In Fig. 5 the mean daily MAE and GSS are calculated for an imaginary event. Using the methods described above, the GSS is 0.25 and the mean daily MAE is 1.5 °C (note that only the days that are extreme in the reanalysis are used to calculate the MAE). Although the forecast is too early in producing the event, the event duration is the same as in the reanalysis, and therefore the forecast would be labelled “good” using the duration method.

Figure 5: An imaginary forecast and reanalysis time series for T850 (°C) where the extreme threshold is set at 12 °C and a day is defined as extreme when it is equal to or exceeds this threshold. The third row shows where the threshold is exceeded in the reanalysis: “Y” means that the day is extreme and “N” means the opposite. The GSS is calculated using the entire time series and the mean daily MAE is calculated using only the days where the reanalysis is equal to or exceeds the extreme threshold (values in red). The fourth row shows the hits (H), misses (M) and false alarms (F) of the forecast.

3 Results

3.1 Forecast biases in Rossby wave packet properties

In this section, the biases in GEFSRF and ERA5RF reforecast data with respect to the ERA5 reanalysis for several RWP properties in the N. and S. Hemisphere are presented. First, the forecast biases in GEFSRF will be presented, and after that, the focus will be on forecast biases in ERA5RF. The section concludes with a comparison between the two reforecast datasets for an area over the N. Atlantic Ocean.

3.1.1 GEFSRF forecast biases in the N. Hemisphere

Fig. 6 shows the forecast biases in GEFSRF for the months December, January and February (DJF) in the time period from December 1984 to November 2019 over the N. Hemisphere. The first column of panels corresponds to the ERA5 reanalysis means of the respective fields.
For the sake of brevity, only the biases at forecast days 2, 4 and 8 are shown. Because the biases grow gradually, focusing on these forecast days neither obscures information nor undermines the conclusions drawn.

Figure 6: ERA5 reanalysis and GEFSRF forecast biases in DJF for RWP amplitude (E) (a–d), RWP phase speed (c_p) (e–h) and zonal wind speed (u) (i–l) in the N. Hemisphere. Panels a, e and i show the seasonal means in the ERA5 reanalysis for the aforementioned variables in the time period December 1984 – November 2019. Forecast biases are shown for the day 2 (b, f, j), day 4 (c, g, k) and day 8 (d, h, l) forecasts in GEFSRF.

In Figs. 6a, e, i, the climatologies of E, c_p and u, respectively, are computed by averaging the ERA5 reanalysis data in DJF over the given time period. On average, the largest E values are found in the midlatitudes over the eastern part of the N. Pacific and the N. Atlantic, that is, over the so-called storm track regions (Fig. 6a). Smaller E values in the midlatitudes are generally found over continental areas, especially over E. Asia. Smaller E values are also found in the polar regions and over the (sub-)tropics, because RWPs propagate over these areas less often (Fragkoulidis and Wirth, 2020). In general, GEFSRF has difficulty in predicting the E values, as can be observed in Figs. 6b–d. Overall, GEFSRF shows a negative bias in E, and this bias increases in magnitude over forecast time, meaning that GEFSRF becomes less wavy than the reanalysis over forecast time. This is in agreement with findings by Gray et al. (2014) and Martínez-Alvarado et al. (2016). However, some areas near the 60°N parallel exhibit far weaker biases over forecast time, and some areas even exhibit positive biases, meaning that the forecast becomes wavier than the reanalysis. Notable regions in this regard are E. Canada, W. Russia and the W. Pacific.
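The sign convention behind these statements can be made explicit with a minimal sketch. It assumes that the bias at a given forecast day is the mean of forecast minus reanalysis over all paired samples at one grid point, so that a negative E bias corresponds to a forecast that is less wavy than the reanalysis. The function names are hypothetical and the sketch is not the processing code used for Fig. 6.

```python
def mean_bias(forecast_values, reanalysis_values):
    """Mean of (forecast - reanalysis) over paired samples at one grid point."""
    if len(forecast_values) != len(reanalysis_values) or not forecast_values:
        raise ValueError("need two equally long, non-empty series")
    differences = [f - r for f, r in zip(forecast_values, reanalysis_values)]
    return sum(differences) / len(differences)


def bias_by_forecast_day(forecasts_by_day, reanalysis_by_day):
    """Map each forecast day (for example 2, 4 and 8) to its mean bias.

    Both arguments are dicts keyed by forecast day; each value holds the
    samples valid at the same dates in the forecast and in the reanalysis.
    """
    return {
        day: mean_bias(forecasts_by_day[day], reanalysis_by_day[day])
        for day in forecasts_by_day
    }
```

A bias that becomes more negative from one forecast day to the next then corresponds to the growing amplitude underestimation described above.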
Figure 6e shows c_p maxima over the Pacific, E. Asia and the N. Atlantic, meaning that the individual ridges and troughs embedded in RWPs move faster in these areas than elsewhere. Lower values in the midlatitudes are found over the Rocky Mountains and Europe/W. Asia. Since the eddy-driven jet acts as a waveguide for RWPs (Hoskins and Ambrizzi, 1993), c_p values are highest in the midlatitudes. As soon as RWPs exit the jet, they slow down and tend to become quasi-stationary (Chang and Yu, 1999). This is reflected by the lower c_p values in the areas downstream, to the north and to the south of the jet exit regions. In Figs. 6f–h, large areas with positive biases can be observed over N. America and Europe, whereas negative biases mainly occur over E. Asia, the W. Pacific and the N. Atlantic. Compared with E, the c_p bias pattern in the N. Hemisphere is less uniform. However, the biases that can be observed do grow in magnitude over forecast time: positive biases become stronger over Europe, N. America and some regions in Asia, while negative biases grow in magnitude over the N. Atlantic, S. Asia and most of the Pacific.

In Fig. 6i the u climatology is shown. The patterns that emerge are in close agreement with the average position of the jet streams (Koch et al., 2006). The highest u values generally occur in a band ranging from N. Africa to the eastern N. Pacific, which constitutes the subtropical jet, and in another patch over the eastern part of N. America and the western part of the N. Atlantic, which forms the eddy-driven jet. Again, no clear pattern emerges in the u biases, with some areas showing positive biases that grow over forecast time and other areas showing growing negative biases (Figs. 6j–l). However, in some areas a tripole pattern emerges, in which positive and negative biases alternate. One such pattern can be observed over the N.
Atlantic, where there is an overestimation of u with an underestimation to the north and south of it. A comparable pattern can be seen over the Pacific. This could imply that the jet stream in the forecast is too zonal, which translates to a higher u (positive bias) at the centre of the jet. It would also mean that the jet does not penetrate as far into the areas to the south and north, causing a negative zonal wind speed bias there. Such a jet would be too “straight”, which would agree with the finding that E is generally underestimated (Figs. 6b–d).

In Fig. 7 the same seasonal analysis is shown for GEFSRF as in Fig. 6, but now for the N. Hemisphere summer months June, July and August (JJA). The overall jet stream is much weaker due to the smaller equator-to-pole temperature gradient in summer. These changes are also reflected in the climatologies of E, c_p and u, which are much lower in JJA. The areas with the largest E values are still the Atlantic and E. Pacific sectors (Fig. 7a). Although RWPs exhibit approximately the same amplitude in these two areas, their c_p climatologies are distinctly different, with larger values over the Atlantic Ocean (Fig. 7e). This means that stationary troughs and ridges occur more frequently over the E. Pacific. In general, u is much lower than in DJF, but two maxima are still visible over the E. Pacific and the Atlantic (Fig. 7i).

Figure 7: Same as Fig. 6, but now for JJA.

In JJA, a general decrease in E over forecast time can again be observed, indicating that this decrease is independent of the time of year in GEFSRF. However, there are again some exceptions to this pattern. Poleward of 60°N the bias in E is close to zero. Another noteworthy patch of near-zero E bias is present over the W. Pacific. Compared to Figs. 6b–d, however, the overall E biases appear to be larger in magnitude. Biases in c_p show a slight general decrease over forecast time in the N.
Hemisphere except for N. Europe and the Arctic Ocean, where c_p is overestimated. The observed pattern appears to be slightly less heterogeneous than in Figs. 6f–h. Very strong signals can be observed near the Equator, but these should be interpreted with caution, since RWPs over these latitudes are fairly rare in JJA. Although u biases are in general weaker in summer, the biases that are observed in Fig. 7 appear to be similar to the u biases in DJF (Figs. 6j–l).

3.1.2 GEFSRF forecast biases in the S. Hemisphere

In contrast to the N. Hemisphere, the S. Hemisphere exhibits a more uniform, zonally symmetric pattern in RWP properties during the S. Hemisphere winter (JJA) (Fig. 8). The climatology of E (Fig. 8a) shows the largest E values in an area roughly between 40°S and 60°S in the E. Hemisphere. In the W. Hemisphere, however, a double E maximum emerges, where one maximum can be related to the eddy-driven jet stream and the other to the subtropical jet stream (Nakamura and Shimpo, 2004). The E biases in Figs. 8b–d also contrast strongly with those of the N. Hemisphere winter. Over the Antarctic region, a positive bias can be observed which is already quite substantial at forecast day 2 and sl