Evaluating the RELM Test Results

We consider implications of the Regional Earthquake Likelihood Models (RELM) test results with regard to earthquake forecasting. Prospective forecasts were solicited for M ≥ 4.95 earthquakes in California during the period 2006–2010. During this period 31 earthquakes occurred in the test region with M ≥ 4.95. We consider five forecasts that were submitted for the test. We compare the forecasts utilizing forecast verification methodology developed in the atmospheric sciences, specifically for tornadoes. We utilize a “skill score” based on the forecast scores $\lambda^f_i$ of occurrence of the test earthquakes. A perfect forecast would have $\lambda^f_i = 1$, and a random (no skill) forecast would have $\lambda^f_i = 2.86 \times 10^{-3}$. The best forecasts (largest value of $\lambda^f_i$) for the 31 earthquakes had values of $\lambda^f_i = 1.24 \times 10^{-1}$ to $\lambda^f_i = 5.49 \times 10^{-3}$. The best mean forecast for all earthquakes was $\lambda^f = 2.84 \times 10^{-2}$. The best forecasts are about an order of magnitude better than random forecasts. We discuss the earthquakes, the forecasts, and alternative methods of evaluation of the performance of RELM forecasts. We also discuss the relative merits of alarm-based versus probability-based forecasts.


Introduction
Earthquakes do not occur randomly in space. Large earthquakes occur preferentially in regions where small earthquakes occur. Earthquakes are complex phenomena, but they do obey several scaling laws. One example is Gutenberg-Richter frequency-magnitude scaling. The cumulative number of earthquakes N with magnitudes greater than M in a region over a specified period of time is well approximated by the relation

$$\log_{10} N = a - bM, \tag{1}$$

where b is a near universal constant in the range 0.8 < b < 1.1 and a is a measure of the level of seismicity. Small earthquakes can be used to determine a, and (1) can then be used to determine the probability of occurrence of large earthquakes. Kossobokov et al. [1] utilized the number of M ≥ 4 earthquakes in 1° × 1° areas to map the global seismic hazard.
A question that has been studied by many groups is whether there are temporal variations in seismicity that can be used to forecast the occurrence of future earthquakes. Earthquakes on major faults (say the San Andreas in California) occur quasiperiodically. A reasonable hypothesis would be that the rate of regional seismicity would accelerate during the period between the major earthquakes. There is no evidence that this occurs systematically. Background seismicity in California appears to be stationary. With the exception of years with large aftershock sequences, Rundle et al. [2] (Figure 1) showed that seismic activity in Southern California in the magnitude range 1.5 < M < 4 for the period 1983 to 2000 was well represented on a yearly basis by (1) taking a = 5.4 and b = 1.0.
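As a rough numerical illustration of the Gutenberg-Richter scaling with the values quoted above, the following Python sketch assumes the relation has its usual form log10 N = a − bM. The function name `gr_cumulative_count` is hypothetical, and extrapolating the fit to M ≥ 5, beyond the 1.5 < M < 4 range for which it was reported, is an assumption made only for illustration.

```python
def gr_cumulative_count(magnitude, a, b):
    """Cumulative number of earthquakes with magnitude above `magnitude`,
    assuming the Gutenberg-Richter form log10(N) = a - b * M."""
    return 10.0 ** (a - b * magnitude)


# Yearly values quoted in the text for Southern California.
a, b = 5.4, 1.0

n_m4 = gr_cumulative_count(4.0, a, b)  # roughly 25 events per year
n_m5 = gr_cumulative_count(5.0, a, b)  # roughly 2.5 events per year (extrapolated)

print(f"N(M>4) per year: {n_m4:.1f}")
print(f"N(M>5) per year: {n_m5:.2f}")
```

With b = 1.0, each unit increase in magnitude reduces the expected count by a factor of ten, which is why small earthquakes constrain a far better than large ones do.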
Intermediate-term earthquake forecasting algorithms based on pattern recognition of variations in regional seismicity were developed by Keilis-Borok and colleagues [3]. These forecasts were alarm based: when a threshold of anomalous behavior was reached, a warning of a time of increasing probability (TIP) of an earthquake was issued. A relatively high success rate was found, including the 1988 Armenian earthquake and the 1989 Loma Prieta earthquake [4], but there were also notable false alarms and failures to predict.

Figure 1: Map of the test region, the coast of California, major faults, and the 31 earthquakes with M ≥ 4.95 that occurred in the test region. The earthquakes are given in Table 1.
The focus of this paper is to study the implications of the RELM test of earthquake forecasts in California. This was a prospective test of forecasts for M ≥ 5 earthquakes during the period 2006-2010. Forecast submission was required prior to the starting date. In our study of the RELM test results we will utilize the methodology developed in the atmospheric sciences [5], specifically for tornadoes. Tornado forecasts are alarm based. Two levels of alarms are issued: (1) a tornado watch is issued for a specified area and time if atmospheric conditions appear conducive to tornadoes, and (2) a tornado warning is issued if one or more tornadoes have been observed. The evaluation of tornado forecasts is based on the number of failures to predict and on the number of false alarms. A quantitative measure of success is the skill score, which is unity for a perfect forecast and zero for a random (no skill) forecast. RELM forecasts were probabilistic rather than alarm based; that is, a continuous range of forecast probabilities was required. In an alarm-based forecast an area of high risk is specified. We will discuss the implications of the two alternative approaches.
The forecasts submitted to the RELM test were primarily based on precursory seismic activity. There are a variety of approaches to the quantification of this activity. In Section 2 of this paper we will discuss the relative intensity (RI) and pattern informatics (PI) approaches. The RI approach extrapolates the occurrence of small earthquakes during a specified precursory time window. High activity (activation) indicates high risk. The PI approach is related but includes both activation and quiescence. In Section 3 the problems with retrospective forecasts are discussed. In Section 4 the RELM test is discussed and the test earthquakes are described in Section 5. The submitted forecasts are discussed in Section 6 and are evaluated in Section 7.
An objective of this paper is to understand the relationship of the forecasts to the distribution of seismicity during the test period. We discuss what we believe is a well-defined precursory activation.

PI and RI
A pattern informatics (PI) approach to earthquake forecasting was proposed by Rundle et al. [2,6] and Tiampo et al. [7]. In forecasting M ≥ 5 earthquakes a region is divided into a grid of 0.1° × 0.1° cells. The rates of seismicity in the cells are studied to quantify anomalous behavior. Precursory changes that include either increases or decreases in seismicity are identified during a prescribed time interval. If the changes exceed a prescribed threshold, hotspots are defined. The forecast is that future M ≥ 5 earthquakes will occur in the hotspot regions in a 10-year time window. Thus, the PI method is alarm based. Utilizing the PI method, Rundle et al. [8] made a forecast of California hotspots valid for the period 2000-2010. Holliday et al. [9] reported that 16 of the 18 earthquakes that occurred during the period 2000-2005 occurred in hotspot regions. The PI forecast is time dependent because it is based on temporal changes in background seismicity.
A closely related forecasting technique is the relative intensity (RI) approach. The RI forecast is based on the direct extrapolation of the rate of occurrence of small earthquakes using (1). The RI forecast can be time dependent if the time span of the background seismicity is relatively short. The success of the PI method described above led to a discussion as to whether the PI method is significantly better than the RI method. Comparisons of these approaches have come to different conclusions regarding their validity [10,11]. These comparisons emphasize the difficulties in evaluating the performance of seismicity forecasts.
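To make the RI idea concrete, the following Python sketch shows one simple way a relative-intensity map could be built: count past small earthquakes in 0.1° × 0.1° cells and normalize the counts. This is an illustration under stated assumptions, not the implementation used in any of the cited studies; the function names `cell_of` and `relative_intensity` are hypothetical, and the epicenters are made-up values rather than catalog data.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.1  # cell dimension used throughout the text


def cell_of(lat, lon):
    """Integer (row, column) index of the 0.1-degree cell containing a point."""
    return (math.floor(lat / CELL_SIZE_DEG), math.floor(lon / CELL_SIZE_DEG))


def relative_intensity(epicenters):
    """Fraction of past small earthquakes falling in each occupied cell."""
    counts = Counter(cell_of(lat, lon) for lat, lon in epicenters)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


# Hypothetical epicenters (latitude, longitude) of small earthquakes.
epicenters = [
    (32.43, -115.27),
    (32.48, -115.21),
    (32.41, -115.29),
    (40.35, -124.63),
]

ri = relative_intensity(epicenters)
for cell, value in sorted(ri.items(), key=lambda item: -item[1]):
    print(cell, value)
```

In this toy example three of the four events fall in one cell, so that cell carries three quarters of the relative intensity. A PI-style analysis would additionally look at changes in these rates over time, which the sketch does not attempt.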

Prospective versus Retrospective Forecasts
A prospective forecast is a true forecast of future earthquakes. No knowledge of these earthquakes exists. A retrospective forecast is a forecast of earthquakes that have occurred in the past (say 2000-2010) based on data available before the start of the period. The existence of the forecast earthquakes is known. In principle a retrospective forecast can be carried out fairly; however, in many cases these forecasts are biased by knowledge of the forecast earthquakes.
The PI forecast by Rundle et al. [8] was prospective. However, the successful forecast of 16 out of 18 earthquakes in California led to a retrospective challenge of the results [11].
It became clear that it would be desirable to sponsor a contest in which research groups would provide prospective forecasts of earthquakes under well-defined conditions. This was the origin of the RELM test, which will be described in the next section. Some of the rules were based on the prospective forecast made by Rundle et al. [8]. The test region was California. Forecasts were made for M ≥ 5 earthquakes on a grid of 0.1° × 0.1° forecast cells. The forecast period was 1 January 2006 to 31 December 2010. The results will also be summarized in this paper.

RELM Test
In order to test methods for forecasting future earthquakes the Southern California Earthquake Center (SCEC) formed the working group for Regional Earthquake Likelihood Models (RELM) in 2000 [12]. For the first time a competitive test of prospective earthquake forecasts was to be carried out. Research groups were encouraged to submit forecasts of future earthquakes in California. At the end of the test period, the forecasts would be compared with the actual earthquakes that occurred.
The ground rules for the RELM test were as follows.
(1) The test region to be studied was the state of California; however the selected region extended somewhat beyond the boundaries of the state as shown in Figure 1.
(2) The objective was to forecast the largest earthquakes for which a reasonable number could be expected to occur in a reasonable time period. A five-year time period for the test was selected extending from 1 January 2006 to 31 December 2010. Earthquakes with M ≥ 5 were to be forecast. This magnitude cutoff was chosen because at least 20 M ≥ 5 earthquakes could be expected in this period. For M ≥ 6, only about 2 would be expected so the 5-year period would be much too short. The applicable magnitudes were taken from the Advanced National Seismic System (ANSS) online catalog (http://www.ncedc.org/anss/anss-detail.html).
(3) Participants were required to submit the number of earthquakes expected to occur in specified spatial cells and magnitude bins during the test period. In order to do this, the test region was subdivided into $N_c = 7682$ spatial cells with dimensions 0.1° × 0.1° (approximately 10 km × 10 km). These spatial cells were further divided into 41 magnitude bins: 4.95 ≤ M < 5.05, 5.05 ≤ M < 5.15, 5.15 ≤ M < 5.25, ..., 8.85 ≤ M < 8.95, and 8.95 ≤ M < ∞. The participants were required to specify the forecast number of earthquakes $N^f_{mi}$ in magnitude bin m (m − 0.05 ≤ M < m + 0.05) that would occur during the test period in cell i.
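The magnitude binning in rule (3) can be sketched in Python as follows. The function name `magnitude_bin_index` and the zero-based indexing are hypothetical choices, and the sketch assumes that catalog magnitudes are reported to 0.01 units, so that working in integer hundredths avoids floating-point trouble at bin edges.

```python
N_MAG_BINS = 41          # 4.95 <= M < 5.05, ..., 8.85 <= M < 8.95, M >= 8.95
LOWEST_EDGE_HUNDREDTHS = 495
BIN_WIDTH_HUNDREDTHS = 10


def magnitude_bin_index(magnitude):
    """Zero-based index of the RELM magnitude bin, or None if M < 4.95.

    Assumes magnitudes are reported to 0.01 units.
    """
    hundredths = round(magnitude * 100)
    if hundredths < LOWEST_EDGE_HUNDREDTHS:
        return None
    index = (hundredths - LOWEST_EDGE_HUNDREDTHS) // BIN_WIDTH_HUNDREDTHS
    # The last bin is open ended, so anything at or above 8.95 lands in it.
    return min(index, N_MAG_BINS - 1)


for m in (4.94, 4.95, 5.19, 7.2, 9.5):
    print(m, magnitude_bin_index(m))
```

With 41 magnitude bins per cell and 7682 cells, a complete forecast specifies 41 × 7682 = 314,962 numbers.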
It is important to note that the RELM forecasts were continuous (probabilistic) rather than alarm based. The number of earthquakes expected to occur in each spatial cell and each magnitude bin was required. Continuous and alarm-based forecasts each have advantages and disadvantages. Continuous forecasts are useful for setting insurance premiums, but the numbers of predicted earthquakes are so small that they have little meaning to the general public. Alarm-based forecasts specify where earthquakes are most likely to occur.
Nineteen forecasts were submitted by eight groups. Before discussing these forecasts in some detail we will discuss the earthquakes that occurred in the test region during the test period with M ≥ 4.95.

The Earthquakes
During the test period 1 January 2006 to 31 December 2010, there were $N_e = 31$ earthquakes in the test region with M ≥ 4.95. The times of occurrence, locations, and magnitudes of these earthquakes are given in Table 1. The locations of the test earthquakes are also shown in Figure 1.
The 31 earthquakes occurred in $N_{ce} = 22$ cells. The association of earthquakes with cells is given in Table 2. Five of the 22 cells had multiple earthquakes. The occurrence of five test earthquakes in cell A is not surprising since this is in the Cerro Prieto geothermal area that is recognized as having a high level of seismicity. Earthquakes occurred in 22 of the 7682 0.1° × 0.1° test cells in the test area.
The major earthquake that occurred during the test period was the M = 7.2 El Mayor-Cucapah earthquake on 4 April 2010 (event 22 in Table 1). This earthquake was on the plate boundary between the North American and Pacific plates. The epicenter was about 50 km south of the Mexico-United States border but lay within the test region as shown in Figure 1. Events 23, 24, 25, 26, 27, 28, 29, and 31 are well-defined aftershocks of the El Mayor-Cucapah earthquake. Events 1, 7, 8, 9, 10, 14, 16, and 19 constitute a precursory swarm of eight test earthquakes in this region in the magnitude range 4.97 to 5.80, including four in the 10-day period between 9 February and 19 February 2008 (events 7-10). These events were located some 5 km to 20 km north of the subsequent epicenter of the El Mayor-Cucapah earthquake and lie outside the primary aftershock region of that event. This swarm of earthquakes certainly cannot be considered foreshocks, due to their relatively small magnitudes and early occurrence, but may represent a seismic activation. We will discuss this activation in terms of accelerating moment release (AMR) later in this paper.
Another swarm of earthquakes occurred in the northwest corner of the test region adjacent to Cape Mendocino. This sequence (events 2, 3, 4, 5, 20, and 21) had magnitudes in the range 5.0 to 6.5. This is a region of high seismicity, and this concentration of events is expected. Event 21 may or may not be an aftershock of event 20. The pair of earthquakes 17 and 18 is interesting. It is very likely that the M = 5.0 earthquake on 1 October 2009 was a foreshock of the M = 5.19 earthquake on 3 October 2009.

Submitted Forecasts
The submitted forecasts have been discussed in some detail [13]. The nineteen forecasts submitted by eight groups are available on the RELM website (http://relm.cseptesting.org/). In order to have a common basis for comparison, we will only consider forecasts that cover the entire test region. Seven forecasts were submitted that gave the predicted number, $N^f_{mi}$, for M ≥ 4.95 earthquakes in 0.1 magnitude bins during the five-year test period for all $N_c = 7682$ 0.1° × 0.1° cells.

The submitted forecasts are based on a variety of approaches. The Bird and Liu forecast [14] was based on a kinematic model of neotectonics. The Ebel et al. forecast [15] was based on the average rate of M ≥ 5 earthquakes in 3° × 3° cells for the period 1932 to 2004. The Helmstetter et al. forecast [16] was based on the extrapolation of past seismicity. The Holliday et al. forecast [17] was based on the extrapolation of past seismicity using a modification of the pattern informatics (PI) technique. The Wiemer and Schorlemmer forecast [18] was based on the asperity-based likelihood model (ALM).
We will now discuss the Holliday et al. forecast in somewhat greater detail. The basis of this RELM forecast followed the format introduced in the PI forecast methodology [7,8]. The magnitude range M ≥ 5 and the cell dimensions 0.1° × 0.1° were the same. However, the PI method was alarm based. Earthquakes were forecast to either occur or not occur in specified regions (hotspots) in a specified time period. In the PI-based RELM forecast, all hotspot cells are given equal probabilities of an earthquake. For the values in Table 2, $\lambda^f_i = 3.32 \times 10^{-2}$. Instead of being alarm based, the RELM test was based on probabilities of occurrence of an earthquake in each cell in the test region. This required a continuous assessment of risk rather than a binary, alarm-based assessment. To do this, the Holliday et al. [17] forecast introduced a uniform probability of occurrence for hotspot regions and added smaller probabilities for non-hotspot regions based on the relative intensity (RI) of seismicity in the region. A map of the Holliday et al. [17] forecast is given in Figure 2.
As stated in our description of the RELM test, each participant submitted a forecast for the number of earthquakes $N^f_{mi}$ in magnitude bin m that would occur in cell i. Thus 41 × 7682 = 314,962 values of $N^f_{mi}$ were submitted in each forecast. In order to better understand the implications of the forecasts we sum the forecast numbers in the magnitude bins for each spatial cell to give the number of forecast earthquakes $N^f_i$ in cell i with magnitude M ≥ 4.95:

$$N^f_i = \sum_{m} N^f_{mi}, \tag{2}$$

where the sum is taken over the 41 magnitude bins. The reason we carry out this sum is so that we can directly apply the "skill score" methodology developed in the atmospheric sciences. In terms of forecasting tornadoes, the question is whether a tornado occurs, not its strength. Since the RELM test was for earthquakes with M ≥ 4.95, our scoring is whether such an earthquake occurs or does not occur in a spatial cell. The sum of the $N^f_i$ over all cells is the total number of earthquakes $N^f$ with M ≥ 4.95 forecast to occur during the test period:

$$N^f = \sum_{i=1}^{N_c} N^f_i, \tag{3}$$

where $N_c$ is the total number of cells. Our objective is to separate the forecast of the total number of earthquakes from the forecast of their locations. In order to do this we introduce a cell score $\lambda^f_i$ defined by

$$\lambda^f_i = \frac{N_{ce}\, N^f_i}{N^f}, \tag{4}$$

where $N_{ce}$ is the number of cells in which an earthquake occurred during the test period. Note that from (3) and (4) we have

$$\sum_{i=1}^{N_c} \lambda^f_i = N_{ce}. \tag{5}$$

Thus, the sum of $\lambda^f_i$ over all cells is the same for each submitted forecast. The cell score $\lambda^f_i$ is a direct measure of the probability of occurrence of a test earthquake in cell i. A perfect forecast (a perfect skill score) would have $\lambda^f_i \geq 1$ for the cells in which earthquakes occur and $\lambda^f_i = 0$ for all other cells. In principle $\lambda^f_i$ can be as large as $N_{ce}$. However, because we are only concerned with whether an earthquake occurs in a cell, not how many occur (a point we discuss in the next paragraph), all values $\lambda^f_i > 1$ are treated as 1 for that particular cell. In practice this does not occur, due to the small values of $N^f_{mi}$ provided by the RELM forecasts.

Since the forecasts are for specific 0.1° × 0.1° cells, it is necessary to consider how to handle the forecasts when more than one earthquake occurs in a cell. As stated above, in our analysis a cell in which more than one earthquake occurred is treated the same as a cell in which only one earthquake occurred. This follows the practice used in tornado forecasting. How many tornadoes occur in a region during the forecast period is not considered, only whether one or more occur. For the test earthquakes given in Table 1, events 1, 7, 8, 16, and 24 occurred in the same cell; the same is true of events 9 and 10, events 17 and 18, events 22, 25, and 28, and events 23 and 26. This multiplicity is shown in Table 2. Thus, we will consider forecasts made for 22 cells.
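The scoring procedure described above can be sketched in a few lines of Python. The sketch assumes the cell score has the form $\lambda^f_i = N_{ce} N^f_i / N^f$, which is the form implied by the statements that the scores sum to $N_{ce}$ and can be at most $N_{ce}$. The function name `cell_scores` and the toy forecast (four cells, two magnitude bins) are hypothetical and are not taken from any RELM submission.

```python
def cell_scores(forecast, n_event_cells):
    """Collapse a forecast over magnitude bins and normalize it to cell scores.

    forecast      : list of cells, each a list of forecast numbers per magnitude bin
    n_event_cells : number of cells in which at least one earthquake occurred
    """
    # Forecast number per cell: sum over magnitude bins.
    per_cell = [sum(bins) for bins in forecast]
    # Total forecast number: sum over cells.
    total = sum(per_cell)
    if total <= 0:
        raise ValueError("forecast must contain a positive total number of events")
    # Assumed cell score, capped at 1 because only occurrence matters, not multiplicity.
    scores = [min(1.0, n_event_cells * n / total) for n in per_cell]
    return per_cell, total, scores


# Toy forecast: 4 cells x 2 magnitude bins (illustrative values only).
toy_forecast = [
    [0.30, 0.10],
    [0.10, 0.00],
    [0.05, 0.05],
    [0.40, 0.00],
]

per_cell, total, scores = cell_scores(toy_forecast, n_event_cells=2)
print(per_cell, total, scores)
```

Because the normalization divides by the total forecast number, scaling every entry of a forecast by a constant leaves the scores unchanged. This is how the approach separates the forecast of locations from the forecast of the total number of earthquakes. When no score is capped, the scores sum to the number of cells with earthquakes.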

Taking the actual number of cells in which earthquakes occurred to be $N_{ce} = 22$ and the total number of earthquakes forecast in each submission $N^f$ using (3), we obtained the forecast scores $\lambda^f_i$ using (4).
The seven submitted forecasts included two submissions with separate forecasts with and without aftershocks. Different numbers of events were forecast but the relative scores of locations were the same. Thus, we consider five submissions. The forecast scores $\lambda^f_i$ for each of the five submissions are given in Table 2 for the $N_{ce} = 22$ cells in which an earthquake occurred. A perfect forecast in which only the 22 cells were forecast to have earthquakes would have $\lambda^f_i = 1$ in each of the 22 cells. A random forecast in which all $N_c = 7682$ cells were given the same $N^f_i = a$ would yield

$$\lambda^{f,\mathrm{random}}_i = \frac{N_{ce}}{N_c} = \frac{22}{7682} = 2.86 \times 10^{-3}. \tag{6}$$

The submitted forecast scores in Table 2 have a wide range of values from $\lambda^f_i = 1.58 \times 10^{-7}$ to $\lambda^f_i = 1.24 \times 10^{-1}$.
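The random (no skill) baseline can be checked numerically. The following Python sketch assumes the cell score is the per-cell forecast number multiplied by the number of cells with earthquakes and divided by the total forecast number; the function name `uniform_cell_score` is hypothetical.

```python
N_C = 7682   # cells in the test region
N_CE = 22    # cells in which a test earthquake occurred


def uniform_cell_score(rate_per_cell, n_cells=N_C, n_event_cells=N_CE):
    """Cell score when every cell is given the same forecast number."""
    total = rate_per_cell * n_cells
    return n_event_cells * rate_per_cell / total


# Two different uniform rates give the same score.
score_a = uniform_cell_score(0.004)
score_b = uniform_cell_score(0.01)
print(f"{score_a:.3e} {score_b:.3e}")
```

The per-cell rate cancels, so the baseline depends only on the ratio 22/7682, in agreement with the value 2.86 × 10^-3 quoted in the text.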

Evaluation of Results
During the formulation of the RELM project a comprehensive testing strategy was also developed [19]. A suite of likelihood tests was proposed, which would be implemented through a testing center [20]. The approach utilized an L-test, an N-test, and an R-test. These tests were applied to the raw submitted data. This approach was applied to the first 2.5 years of RELM results by Schorlemmer et al. [13]. Zechar et al. [21] recognized a problem with the original proposed likelihood tests and proposed a modification. This is certainly one approach to the evaluation of results; the primary purpose of this paper is to present a complementary approach. Our approach has the advantage that the evaluation of the numbers of earthquakes forecast can be separated from the forecast of their locations.
Lee et al. [22] proposed the modified approach to the evaluation of RELM test results that is used in this paper. In their short paper they compared the forecasts that had been submitted for all of California. In this paper we consider a subset of those forecasts and relate the results to the concept of alarm-based forecasts.
The results given in Table 2 can be used to compare the forecast scores for each of the cells in which earthquakes occurred. The highest scores between the models are shown in bold. Clearly there are many ways in which to evaluate the results of the forecasts. There is a tradeoff between good forecasts with large $\lambda^f_i$ and poor forecasts with small $\lambda^f_i$. We first consider the forecasts that had the highest forecast scores. The Holliday et al. [17] forecast had the largest $\lambda^f_i$ for 8 of the 22 cells in which (target) earthquakes occurred. The Wiemer and Schorlemmer [18] forecast had 6 of the largest $\lambda^f_i$. The Helmstetter et al. [16] forecast had 4 of the largest $\lambda^f_i$. Finally, the Bird and Liu [14] forecast had 3 of the largest $\lambda^f_i$. These values are also given in Table 3. The range of the highest cell scores was from $\lambda^f_i = 1.24 \times 10^{-1}$ for event 1 to $\lambda^f_i = 5.49 \times 10^{-3}$ for event 11.
It is also of interest to compare the mean cell forecast scores for the 22 cells in which earthquakes occurred. These values $\lambda^f$ are given in Table 3. The Helmstetter et al. [16] forecast had the highest $\lambda^f = 2.84 \times 10^{-2}$, the Wiemer and Schorlemmer [18] forecast had $\lambda^f = 2.66 \times 10^{-2}$, and the Holliday et al. [17] forecast had $\lambda^f = 2.45 \times 10^{-2}$. The Helmstetter et al. [16] forecast did the best in an average sense but did relatively poorly in providing the best cell forecasts. It should be noted that the best average forecast $\lambda^f = 2.84 \times 10^{-2}$ is one order of magnitude better than the random (no skill) forecast $\lambda^{f,\mathrm{random}}_i = 2.86 \times 10^{-3}$.

As noted above, the Holliday et al. [17] forecast is primarily an alarm-based (hotspot) forecast. The PI method was used to determine the cells in which earthquakes were most likely to occur (hotspots). In the cell forecasts given in Table 2, these cells had forecast scores $\lambda^f_i = 3.32 \times 10^{-2}$ and made up 8.3% of the total area of the test region (637 of the 7682 cells). Of the 22 cells in which earthquakes occurred, 17 were hotspot cells. In 8 of these 17 cells, the forecast cell scores given by the Holliday et al. [17] forecast were the highest.
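The ratios behind these comparisons can be reproduced from the numbers quoted in the text. In the Python sketch below, the variable names and the "gain" (hit fraction divided by alarm-area fraction) are illustrative choices that do not appear in the original analysis; only the input numbers come from the text.

```python
# Mean-score comparison, using values quoted in the text.
best_mean_score = 2.84e-2        # Helmstetter et al. mean cell score
random_score = 22 / 7682         # no-skill baseline

improvement_over_random = best_mean_score / random_score  # close to 10
shortfall_from_perfect = 1.0 / best_mean_score            # about 35

# Alarm-style summary of the hotspot forecast.
hotspot_cells, total_cells = 637, 7682
event_cells, event_cells_in_hotspots = 22, 17

area_fraction = hotspot_cells / total_cells           # about 0.083
hit_fraction = event_cells_in_hotspots / event_cells  # about 0.77
gain = hit_fraction / area_fraction                   # about 9.3

print(f"improvement over random: {improvement_over_random:.2f}")
print(f"shortfall from perfect:  {shortfall_from_perfect:.1f}")
print(f"hotspot area fraction:   {area_fraction:.3f}")
print(f"hit fraction:            {hit_fraction:.3f}")
print(f"gain over random alarm:  {gain:.2f}")
```

Viewed as an alarm, the hotspot map covers about 8% of the test region and contains about 77% of the cells in which earthquakes occurred, which again corresponds to roughly an order of magnitude improvement over a random choice of cells.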

Discussion
The RELM test provides a well-defined set of prospective earthquake forecasts and a well-defined set of test earthquakes. In this paper we present a method for evaluating the RELM forecasts. We believe our approach has significant advantages but look forward to comparing our results with those obtained by other authors.
RELM forecasts provide the numbers $N^f_{mi}$ of earthquakes expected to occur in magnitude bins m and spatial cells i. The basis of our approach is, first, to use (2) to determine the forecast number $N^f_i$ of earthquakes with M ≥ 4.95 expected to occur in spatial cell i; second, to use (3) to determine the total forecast number $N^f$ of earthquakes; and third, to use (4) to determine the cell score $\lambda^f_i$.
We first compared the actual number of earthquakes that occurred during the test period, 31, with the forecast values. The closest forecast value was that of Holliday et al. [17] with $N^f = 30$, as shown in Table 3.
We next compared the forecast scores $\lambda^f_i$ of an earthquake with M ≥ 4.95 occurring in cell i. We noted that the values of $\lambda^f_i$ were the same for the two submissions that provided separate forecasts for main shocks alone and for main shocks plus aftershocks. These forecasts gave different values for the numbers $N^f_{mi}$, $N^f_i$, and $N^f$ of earthquakes but the forecast distributions in space were identical.
In a perfect forecast the forecast score would have been $\lambda^f_i = 1$ for each of the 22 cells in which one or more earthquakes occurred and $\lambda^f_i = 0$ in the other 7660 cells. The mean forecast scores for the 22 cells in which earthquakes occurred for the five forecasts ranged from a high value of $\lambda^f = 2.84 \times 10^{-2}$ to a low value of $\lambda^f = 1.53 \times 10^{-2}$. The range of values was relatively small, about a factor of two. The random (no skill) forecast assuming equal probabilities for the 7682 cells in the test region gives a forecast score $\lambda^{f,\mathrm{random}}_i = 2.86 \times 10^{-3}$ for all cells. The best mean forecast score $\lambda^f = 2.84 \times 10^{-2}$ was about a factor of 10 better than the random forecast but roughly a factor of 35 (between one and two orders of magnitude) worse than a perfect forecast.
As we have previously discussed, earthquake forecasts can be either probabilistic or alarm based. The submission rules for RELM were probabilistic. The only forecast that had an alarm-based distribution of forecasts was that of Holliday et al. [17]. A question of interest for future tests of earthquake forecasts is whether they should be alarm or probability based. A systematic study of alarm-based forecasts could be of considerable interest.
Another interesting question is whether the forecasts have a temporal component. Is there a time-dependent component in the data used that changes forecast probabilities significantly? As discussed previously, eight of the test earthquakes were aftershocks of the El Mayor-Cucapah earthquake and eight of the test earthquakes were associated with a precursory swarm. Thus, counting the main shock itself, 17 of the 31 test earthquakes were associated with this earthquake. It appears reasonable to conclude that precursory activation prior to the El Mayor-Cucapah earthquake may have played a significant role in the success of forecasts.
Another swarm of 6 earthquakes during the test period adjacent to Cape Mendocino did not lead to a subsequent larger event during the test period. Swarms of activity in this region occur regularly. In terms of precursory activation this activity would lead to a false alarm. The contrast between the two regions (Cape Mendocino and El Mayor-Cucapah) is an indication of the difficulties in forecasting earthquakes utilizing precursory activation.

Figure 2: Map of the normalized probabilities $\lambda^f_i$ given for the test region by Holliday et al. [17] using their PI-based forecast. The "hotspots" are shown in red. The test earthquakes are also shown.
m: Bin magnitude, m − 0.05 ≤ M < m + 0.05
$N_e$: Number of actual earthquakes
$N^f$: Number of forecast earthquakes
$N_c$: Number of cells
$N_{ce}$: Number of cells with earthquakes
$N^f_i$: Number of forecast earthquakes in cell i
$N^f_{mi}$: Number of forecast earthquakes in magnitude bin m and cell i
$\lambda^f_i$: Forecast score, related to the probability that an earthquake with M ≥ 4.95 will occur in cell i
$\lambda^f$: Mean forecast score for the 22 cells in which earthquakes occurred
$\lambda^{f,\mathrm{random}}_i$: A random (no skill) forecast, $\lambda^{f,\mathrm{random}}_i = 2.86 \times 10^{-3}$
$N_{\lambda\max}$: The number of maximum cell scores.

Table 1: Times of occurrence, locations, and magnitudes of the 31 earthquakes in the test region with M ≥ 4.95 from 1 January 2006 until 31 December 2010. The M = 7.2 El Mayor-Cucapah earthquake is in bold.

Table 2: Cell scores $\lambda^f_i$ of an earthquake with M ≥ 4.95 for the 22 cells in which earthquakes occurred during the test period. The association of cell IDs (A-V) with the earthquake IDs from Table 1 is given. Five submitted forecasts are given: (1) Bird and Liu (B and L), (2) Ebel et al. (Ebel), (3) Helmstetter et al. (Helm.), (4) Holliday et al. (Holl.), and (5) Wiemer and Schorlemmer (W and S). The highest (best) scores are in bold.

Table 3: Comparisons of the forecasts. Column 1: the number of maximum cell scores $N_{\lambda\max}$. Column 2: the mean cell forecast scores $\lambda^f$. Column 3: the number of earthquakes $N^f$ predicted by each forecast. The best scores in each category are in bold.