Lorna McKellar, Abigail McQuatters-Gollop, Tomas Chaigneau, Cristina Vina-Herbon, Sebastian Valanko, Jan Geert Hiddink, Evaluating methods for setting thresholds for good status in marine ecosystems, ICES Journal of Marine Science, Volume 82, Issue 3, March 2025, fsaf019, https://doi.org/10.1093/icesjms/fsaf019
Abstract
Estimating thresholds to distinguish between good and degraded ecosystem states is key for assessing and managing marine environments. Numerous methods are used to estimate thresholds; however, there is no standardized framework to evaluate their accuracy and reliability, which reduces the consistency and transparency of thresholds estimated for ‘good’ status. Statistical robustness of four methods was evaluated by varying stochastic noise, sample size, and shape of the pressure-state relationship of simulated indicator data. Range of natural variation and statistically detectable change methods, which quantify natural variation in undisturbed reference conditions, reliably estimated status thresholds for noisy, small datasets, but thresholds were lower than what would have been estimated without noise present or with a greater sample size. Tipping points and distance to degradation methods, which estimate the point at which a system is about to reach, or has reached, a degraded state, failed to estimate thresholds or fit models that were consistent with the underlying relationship as noise increased and sample size decreased. Therefore, for small or noisy datasets, range of natural variation is most suitable to estimate ecologically meaningful, reliable, and transparent thresholds for good status, while for larger datasets with low noise levels, all four methods are likely to be useful.
Introduction
Policy objectives to improve the condition of our marine ecosystems are increasing in number and ambition. The EU’s Marine Strategy Framework Directive (MSFD) (European Commission 2020) and the UK’s Marine Strategy (UKMS) (DEFRA 2019) state that Good Environmental Status must be achieved by 2030. Under the EU’s Nature Restoration Law, at least 30% of the extent of habitats must be restored from poor to good condition by 2030, increasing to 90% by 2050 (European Commission 2022). Target 2 of the UN Kunming-Montreal Global Biodiversity Framework (KM-GBF) states that 30% of all degraded ecosystems must be restored by 2030, such that biodiversity, functioning, and services are all improved (UN CBD 2023), and the UNCLOS Agreement on Marine Biodiversity of Areas beyond National Jurisdiction (BBNJ) calls for the establishment of threshold setting processes to assess ‘significant and harmful’ environmental impacts (United Nations 2023). Together, such policies aim to restore and conserve nature while promoting sustainable use and integrating environmental protection with human activities. To inform managers on progress towards reaching these overarching goals, science provides informative ecosystem assessments that evaluate changes in overall status by using monitoring data from environmental indicators known to be sensitive to anthropogenic pressures and assessing against predefined categories of condition (i.e. good, degraded) (Rombouts et al. 2013, McQuatters-Gollop et al. 2022). ‘Thresholds’ (Huggett 2005), ‘targets’ (Rossberg et al. 2017), ‘boundaries’ (DEFRA 2014), ‘limits’ (Borja et al. 2013), and ‘triggers’ (Hilton et al. 2022) are all terms which refer to values that distinguish between such categories of ecosystem condition (Smit et al. 2021). The term ‘threshold’ is used within the MSFD, UKMS, BBNJ agreement, and academic literature (Huggett 2005, Hitchin et al. 2023) to define values that distinguish between ecosystem states and, as such, will be used here to ensure relevancy and consistency.
Good and degraded states can be defined differently depending on one’s values and objectives. Within this theoretical analysis, the definition of good status depends on the approach (i.e. method) being used. We acknowledge that there are social-ecological trade-offs that must be made in real-world systems for which ‘good’ may refer to a system that is not considered within the approaches used here (e.g. rapid recovery to a pre-disturbed level under the MSFD and UKMS). There are several reviews describing how to select a good indicator (Link 2005, Rees et al. 2008) and create indices to reflect complex ecosystem functions (Rice and Rochet 2005, Tam et al. 2017, Bundy et al. 2019), but there is far less instruction for decision-makers on how to select a method for estimating thresholds against these carefully selected indicators. Reviews suggest that methods should be ecologically meaningful, easily interpretable, objective, and repeatable (ICES 2022, Hiddink et al. 2023). Methods using quantitative data are often prioritized as they are inherently less subjective compared to using expert opinion; however, it is important to highlight that expert knowledge can consider multiple variables and the contextual nuances of social-ecological systems (i.e. KM-GBF targets), which cannot be addressed by quantitative methods alone. Methods that model the recovery of a system from a degraded state (i.e. outside of the range of natural variation), when pressures are theoretically removed, are important for assessing status (Rossberg et al. 2017), particularly under changing climatic conditions (Thorpe et al. 2023). However, thresholds here assess current ecosystem states, and whether these meet good status now, relative to legislative standards of the relevant jurisdiction (e.g. MSFD provision 26 for European waters).
Methods for estimating status thresholds that use ecological indicator data can require monitoring time series datasets, including reference conditions, or a known relationship between ecosystem state and anthropogenic pressure. In the marine environment, pressure-state relationships can resemble three basic shapes depending on the sensitivity of the indicator (we use this shorthand: concave, convex, and linear: Fig. 1). Indicators which decline quickly with small increases in pressure (convex), can be seen in the proportion of sentinel species (those typical of a habitat and sensitive to pressures) at different levels of physical disturbance (Plaza-Morlote et al. 2023). Other indicators can withstand some level of pressure and only decline more rapidly at higher levels of pressure (concave). Finally, an indicator can linearly decline with increasing pressure, which has been seen in pelagic fish responses to increased fishing pressure (Fu et al. 2020). The ‘tipping points’ method uses pressure-state relationship data and identifies the point at which the ecosystem tips into a degraded state for systems demonstrating concave pressure-state relationships (Fig. 1). These rapid declines in state with a small increase in pressure (Huggett 2005) have been documented in shallow lake systems as they pass a potentially irreversible threshold of eutrophication, turning a clear lake into a turbid, algal-dominated system (Scheffer et al. 1993). The ‘distance to degradation’ method has been used to estimate benthic status thresholds in response to fishing pressure by identifying the point at which an ecosystem has already reached a degraded state for convex pressure-state relationships, before adding an arbitrary percentile to move the threshold away from a degraded state (Fig. 1) (Plaza-Morlote et al. 2023). 
Importantly, this method identifies whether rapid declines in the system will occur but, as it uses an arbitrary percentile to define the final threshold, it does so in a less ecologically meaningful way than the tipping points method (ICES 2022). Another group of methods uses reference conditions to estimate thresholds. The ‘range of natural variation’ method estimates good status as the lower end of the natural variability of an indicator in a theoretically undisturbed (but more likely least impacted) condition (Symstad and Jonas 2014) and has been used to estimate large fish index (Rossberg et al. 2017) and benthic invertebrate faunal thresholds (van Hoey et al. 2015). Finally, the ‘statistically detectable change’ method, which has been used to estimate potting intensity thresholds (Rees et al. 2021), defines good status as the lower 95% confidence interval, determining what is good based more on a statistical, not ecological, concept, compared to the range of natural variation (ICES 2022).

Figure 1. (a) Pressure-state datasets with underlying linear, convex, and concave relationships and associated critical angles. (b) The tipping point threshold is the point of greatest curvature. Only models fitting a concave pressure-state relationship were used to estimate tipping point thresholds because convex pressure-state models do not reflect the point just before an ecosystem tips into a degraded state. (c) Distance to degradation adds an arbitrary percentile (e.g. 50th) to the degradation point to estimate the threshold for good status. Models fitting convex pressure-state relationships were used to estimate thresholds using this method because the degradation point is being estimated, not the tipping point. (d) Range of natural variation uses the lower 5th percentile of reference conditions, whereas statistically detectable change uses the lower 2.5th confidence interval as the threshold for good status.
Monitoring programs often collect data for specific biodiversity components or pressures under financial constraints and across different agencies, which means that the data collected can be infrequent, limited, noisy, and lacking a standardized approach (ICES 2022). Therefore, the effect that noise from sampling error, small-scale spatial variation, and lack of specificity to a single pressure has on estimated thresholds should be evaluated, as even low levels of noise have been shown to preclude the identification of tipping point relationships in pressure-state datasets (Hillebrand et al. 2020). Finally, the number of sampling stations or continuous sampling years (i.e. sample size) will vary across different indicators and regions, as budget constraints, different agencies, and approaches are involved, and this may impact thresholds differently depending on the method being used (Snyder et al. 2014).
Here, we evaluate the statistical robustness of four methods (range of natural variation, statistically detectable change, tipping points, and distance to degradation) in response to pressure-state and reference condition data with different types of variation (pressure-state relationship, noise, and sample size). Datasets were simulated across combinations of these factors to assess the sensitivity of each method and the relative impact on thresholds. Understanding how each method performs will aid in the process of selecting a method that accurately and reliably estimates thresholds within the constraints set by available data and work towards developing a consistent framework for estimating thresholds within ecosystem assessments.
Methods
Pressure-state and reference condition data
Four methods for setting thresholds were evaluated, requiring two types of data. Time series of reference conditions were simulated for the range of natural variation and statistically detectable change methods, and data on ecosystem state as a function of increasing pressure (pressure-state relationship) were simulated for the tipping points and distance to degradation methods. A sequence of pressure values was computationally generated from 0 to 3 in increments of 0.2, for example, representing increasing trawling pressure. Ecosystem state varies as a function of increasing pressure in three pressure-state relationship shapes (see Fig. 1). State values decreased from 1 to 0, for example, representing decreasing fish abundance as a result of increasing fishing pressure.
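The pressure sequence and the three declining pressure-state shapes can be sketched as follows. The exact functional forms are assumptions for illustration; the paper specifies only the pressure range, the state range, and the qualitative shapes (linear, concave, convex).

```python
import numpy as np

# Pressure from 0 to 3 in increments of 0.2, as in the simulation.
pressure = np.arange(0, 3.2, 0.2)

# Three illustrative pressure-state shapes declining from state 1 to 0.
# The specific equations below are assumptions, not the paper's own.
state_linear = 1 - pressure / 3            # steady decline
state_concave = 1 - (pressure / 3) ** 2    # withstands low pressure, then drops
state_convex = (1 - pressure / 3) ** 2     # drops quickly at low pressure
```

All three curves start at 1 (undisturbed) and reach 0 at the maximum pressure, mirroring the state scale described in the text.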
The ecosystem state when free of human pressures was used as the reference condition. Random log-normally distributed values were generated with a mean value of 1 and a standard deviation of the distribution on the log scale of 0.105 to produce a longitudinal time series dataset representative of the mean and variation in abundance in a theoretically undisturbed ecosystem. When log-transformed, the reference conditions fit a normal distribution reasonably well. Although some systems may show more variation than that used here (0.105), this represents a baseline level of noise applied to datasets spanning 5 to 80 years, and the impact of noisier reference conditions was addressed by applying additional sampling error, as described below.
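A minimal sketch of the reference-condition simulation. The paper states a mean of 1 and a log-scale standard deviation of 0.105; a meanlog of 0 is assumed here, which gives a median of exactly 1 and a natural-scale mean of about 1.006.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_reference_conditions(n_years, sdlog=0.105):
    """Log-normal reference-condition time series: roughly mean 1 on the
    natural scale, standard deviation 0.105 on the log scale.
    meanlog = 0 is an assumption about the paper's parameterization."""
    return rng.lognormal(mean=0.0, sigma=sdlog, size=n_years)

ref = simulate_reference_conditions(80)
```

Taking `np.log(ref)` recovers an approximately normal series, matching the log-transformation remark in the text.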
Five sample sizes, doubling from 5 to 80 (n = 5, 10, 20, 40, 80), were used to account for variation in sample size; these bounds were considered representative of empirical reference condition and pressure-state marine datasets (Table S8). An additional sample size of n = 160 was tested (Table S5 and Fig. S2) to demonstrate asymptotic behaviour beyond n = 80.
Standard deviation levels
Adding noise to simulated pressure-state and reference condition datasets was done to replicate sampling error, small-scale spatial variation, and low specificity often present in empirical datasets. To ensure noise was reflective of that in empirical data, the residual standard deviation (assuming a log-normal distribution of the residuals) of forty-five benthic biomass and trawling pressure datasets was used (Hiddink et al. 2017). To ensure comparability, datasets were standardized so that the state equalled one when the pressure equalled zero. Five percentiles from the distribution of residual standard deviation were used for simulated data: 10th, 30th, 50th, 70th, and 90th (Fig. S1), as these were considered a wide enough range to capture the impact of increasing noise on thresholds. A final level of 0% standard deviation was included to determine estimated thresholds without noise present (i.e. a ‘true’ threshold value). Standard deviation levels were applied to reference condition and pressure-state datasets.
Datasets with added noise were generated by taking a sample with n = 1 for each point in the datasets from a log-normal distribution with the desired standard deviation, where the mean was the value before adding noise. This approach means that the magnitude of the variation on the natural scale increases with the mean, as is expected for real abundance datasets, and that negative and zero values were not possible.
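The noise step might be sketched as below. The mean-preserving offset of −sdlog²/2 is our assumption about how "the mean was the value before adding noise" is implemented; without it, the draws would preserve the median rather than the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_lognormal_noise(values, sdlog):
    """Replace each positive value with a single log-normal draw whose
    natural-scale mean equals the original value. Noise magnitude therefore
    scales with the mean, and negative or zero values cannot occur."""
    values = np.asarray(values, dtype=float)
    # mean of lognormal(mu, s) is exp(mu + s^2/2), so offset mu accordingly
    mu = np.log(values) - 0.5 * sdlog**2
    return rng.lognormal(mean=mu, sigma=sdlog)

noisy = add_lognormal_noise([1.0, 0.5, 0.25], sdlog=0.3)
```

Averaging many such draws around a fixed value recovers that value, while every individual draw stays strictly positive, matching the properties described in the text.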
Methods for estimating thresholds
Tipping points estimate the threshold just before the point at which an ecosystem tips into a degraded state, which is the point on the pressure-state curve that has the greatest angle of curvature (ICES 2022). First, linear and quadratic regression models were fitted to simulated pressure-state data, and BIC model selection was used to determine the best fit model for the underlying pressure-state relationship. The tipping points method cannot be used on datasets with underlying linear pressure-state relationships because there is no maximum angle of curvature on a linear slope. On convex (versus concave) pressure-state relationships, the maximum point of curvature is not representative of the point that distinguishes between good and degraded states (Fig. 1). Therefore, datasets were only used to estimate a tipping point if the best fit model had a concave pressure-state relationship (negative quadratic coefficient).
Regression coefficients from datasets where a concave pressure-state relationship was fit were used to determine the point on the pressure axis at which the curve reached the critical tipping point angle, before determining the corresponding state value. The angle of the tipping point varies depending on the shape of the data. If the x- and y-axes had equal units, then the largest curvature would be 45° (ICES 2022). Here, state sequences ranged from 0 to 1, and pressure sequences from 0 to 3. The pressure value that corresponded to the point at which the curve reached a 36° angle was identified using the axis of symmetry formula, which determines the x-coordinate of the vertex of a parabola (Table S3).
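A hedged sketch of the tipping points procedure: fit linear and quadratic models, use BIC to pick the better fit, discard non-concave fits, then locate the state at the critical angle. The slope-based localisation and the rescaling of the 36° angle by the axis ranges are our reading of the text, not the paper's exact formulas (Table S3).

```python
import numpy as np

def tipping_point_threshold(pressure, state, critical_angle_deg=36.0):
    """Return the estimated tipping point state, or None when the best-fit
    model is linear or convex (no tipping point is estimated)."""
    pressure = np.asarray(pressure, dtype=float)
    state = np.asarray(state, dtype=float)
    n = len(pressure)

    def _bic(deg):
        coefs = np.polyfit(pressure, state, deg)
        rss = float(np.sum((state - np.polyval(coefs, pressure)) ** 2))
        rss = max(rss, 1e-12)          # guard against log(0) on exact fits
        k = deg + 2                    # coefficients plus residual variance
        return n * np.log(rss / n) + k * np.log(n), coefs

    bic_lin, _ = _bic(1)
    bic_quad, (a, b, c) = _bic(2)
    if bic_quad >= bic_lin or a >= 0:
        return None  # linear best fit, or convex (positive quadratic term)

    # Critical slope in data units: tan(36 deg) rescaled by the axis ranges
    # (state spans 1 unit, pressure spans ~3) -- an assumption here.
    x_range = float(pressure.max() - pressure.min())
    s_crit = np.tan(np.radians(critical_angle_deg)) * 1.0 / x_range
    p_star = (-s_crit - b) / (2 * a)   # pressure where dy/dx = -s_crit
    return float(np.polyval([a, b, c], p_star))
```

On a clean concave curve this returns a state well above zero, while linear and convex inputs correctly return no threshold.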
Distance to degradation identifies the point at which the ecosystem is mostly degraded before adding an arbitrary percentile to define the threshold for good status. The degradation point was calculated using the tipping point method outlined above, but for concave and convex relationships only. The percentile added to the degradation point (the 36° angle on the pressure-state curve) is based on the sensitivity of the habitat (Plaza-Morlote et al. 2023), with more sensitive habitats having a greater percentile added to move the threshold further from a degraded state. However, as the sensitivity of the ecosystem state cannot be defined for simulated data, the 50th percentile was chosen from a suggested range of the 33rd, 50th, and 66th (Plaza-Morlote et al. 2023) and applied to all datasets.
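One plausible reading of the percentile step is to move the threshold the chosen fraction of the way from the degradation state toward the undisturbed reference state of 1. This interpretation is an assumption; see Plaza-Morlote et al. (2023) for the original formulation.

```python
def distance_to_degradation_threshold(degradation_state, percentile=50,
                                      reference_state=1.0):
    """Shift the threshold `percentile`% of the way from the degradation
    point toward the undisturbed reference state. The linear-interpolation
    form of this shift is an assumption, not the paper's exact formula."""
    frac = percentile / 100.0
    return degradation_state + frac * (reference_state - degradation_state)
```

For example, a degradation state of 0.4 with the 50th percentile gives a threshold of 0.7, halfway back toward the undisturbed state.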
Statistically detectable change estimates the threshold at the point that is statistically significantly different from reference conditions (ICES 2022). The reference condition data used here represent the ecosystem in an undisturbed (or, more likely, least impacted) condition. Therefore, a statistically significant change from a least impacted, and therefore good, reference condition indicates a negative change in ecosystem state. We arbitrarily assumed that ‘statistically detectable change’ referred to any value outside of the 95% confidence interval, although other levels may be used (e.g. 75%, 90%, 99%). The lower 2.5th percentile of the confidence interval was used as the lower threshold and was calculated by subtracting the margin of error from the dataset mean.
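A minimal sketch of that calculation, using a normal (z) critical value for the margin of error; the paper does not specify z versus t.

```python
import numpy as np

def sdc_threshold(reference):
    """Statistically detectable change: the lower bound of a two-sided 95%
    confidence interval for the reference-condition mean, i.e. the mean
    minus the margin of error (normal approximation assumed here)."""
    reference = np.asarray(reference, dtype=float)
    n = len(reference)
    se = reference.std(ddof=1) / np.sqrt(n)
    z = 1.96  # two-sided 95% critical value
    return float(reference.mean() - z * se)
```

Because the standard error shrinks with √n, repeating the same observations at a larger sample size pushes the threshold upward, which is the sample-size sensitivity discussed in the Results.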
Range of natural variation estimates thresholds using the natural variability of an indicator in a least impacted reference condition. The natural variability was calculated as the values within the 90th percentile range to reduce the impact of potential outliers. Any value outside of this range is not considered representative of a good state; therefore, the 5th percentile was used as the lower threshold to distinguish between good and degraded states (Rossberg et al. 2017).
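The range of natural variation calculation reduces to a single percentile of the reference series:

```python
import numpy as np

def rnv_threshold(reference):
    """Range of natural variation: the lower 5th percentile of the
    reference conditions, i.e. the bottom of the central 90% range
    (Rossberg et al. 2017)."""
    return float(np.percentile(reference, 5))
```

Unlike the confidence-interval approach, this uses the spread of the raw values rather than the standard error of the mean, so it is far less sensitive to sample size but directly sensitive to added noise.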
Threshold analysis
Data simulation and threshold estimation steps were run 100 times for each type of data (4 state∼pressure/reference condition relationships × 6 standard deviation levels × 5 sample sizes). ANOVAs identified the factors that accounted for the greatest variance in estimated threshold values for each method. All data analysis was carried out in R (R Core Team 2022). Fitting rates (Table S4) denote the fraction of thresholds estimated from methods using pressure-state datasets (tipping points and distance to degradation) across one hundred simulation runs. Thresholds may not have been estimated if the method could not fit a pressure-state model to data that was particularly noisy or had a small sample size. Fitting rates are not included for range of natural variation and statistically detectable change, as these methods used reference condition data (not pressure-state models), so fitting rates were always 100%. When reporting the results, if the shape of the best fit model differed from the underlying pressure-state relationship used to generate the data (i.e. the level of noise prevented the fitting of the correct shape relationship), the threshold was reported for the underlying pressure-state relationship, not the relationship of the fitted model (see the ‘Results’ section for more information on ‘misfitting’).
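The factorial design could be orchestrated along the following lines, here with the range of natural variation as the example estimator. The noise levels shown are placeholders, not the empirical percentiles derived from Hiddink et al. (2017), and the ANOVA step on the collected results is omitted.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)

sd_levels = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9]  # placeholder noise levels
sample_sizes = [5, 10, 20, 40, 80]
n_runs = 100

rows = []  # (noise level, sample size, estimated threshold)
for sdlog, n in itertools.product(sd_levels, sample_sizes):
    for _ in range(n_runs):
        # baseline natural variation on the log scale
        ref = rng.lognormal(mean=0.0, sigma=0.105, size=n)
        if sdlog > 0:  # add sampling noise on top of natural variation
            ref = rng.lognormal(mean=np.log(ref), sigma=sdlog)
        rows.append((sdlog, n, float(np.percentile(ref, 5))))
```

The `rows` records would then be passed to an ANOVA of threshold ~ noise × sample size, as in the paper; even this toy version reproduces the headline pattern that higher noise drags the 5th-percentile threshold downward.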
Results
Figure 2 gives an example of the simulated datasets and the fitted thresholds using different methods. Thresholds range from 1 (highest state) to 0 (lowest), along either a gradient of increasing pressure (pressure-state data) or a longitudinal time series (reference condition data). Noisy datasets (high standard deviation) do not have fitted thresholds for distance to degradation and tipping points because a linear regression was fitted through the noisy data and no inflection point exists on a straight line (Fig. 2). It is important to note that thresholds >1 (maximum ecosystem state) are technically implausible because an indicator (e.g. crustacean abundance) cannot reach a value >100% of its state on average (i.e. 1). Therefore, methods which estimate thresholds >1 (tipping points and distance to degradation) demonstrate a flaw in their methodologies by producing estimates that cannot possibly be used to assess condition.

Figure 2. Example of simulated data across three pressure-state relationships and reference condition datasets (n = 80) with thresholds estimated by four methods (indicated by dashed lines). Five levels of added noise (standard deviation) were used to evaluate the impact on estimated thresholds. Pressure-state curves fitted by tipping points and distance to degradation are denoted by solid black curved lines. Values >2.0 are not included to display relationships clearly.
Comparing thresholds across methods
The difference between thresholds estimated on data with and without added noise (increasing percentiles of standard deviation to represent sampling error, small-scale spatial variation, and low specificity) demonstrates the magnitude of the effect that increased noise alone has on thresholds. The greater the impact of added noise on estimated thresholds, the less accurate the method is considered to be in response to noisy data. Statistically detectable change and range of natural variation estimated lower thresholds as the level of added noise increased (Fig. 3). Range of natural variation estimated thresholds of 0.8 on data with no added noise, which dropped to 0.57 on data with 10% added noise (n = 80) and to 0.06 on data with 90% added noise, demonstrating the strong effect of increased noise on this method. This effect was seen in thresholds estimated by statistically detectable change on datasets with lower sample sizes (n = 10) (Fig. 4). Thresholds were estimated at 0.8 on data with 10% added noise and at 0.17 on data with 90% added noise. Noisy data had less of an impact on average threshold values estimated by distance to degradation and tipping points (Fig. 3). Tipping points estimated thresholds at 0.75 on data with concave underlying relationships (n = 80) and no added noise and demonstrated very little change at 90% added noise (0.73). Similarly, thresholds estimated by distance to degradation, on datasets with a concave underlying relationship (n = 80), demonstrated little change from no added noise (0.88) to 90% added noise (0.87). However, at low sample sizes (n = 10) and 90% added noise, both methods estimated impossible thresholds >1 for concave pressure-state datasets (Table S5). For datasets with a convex underlying relationship (n = 80), distance to degradation estimated thresholds at 0.63 with no added noise and at 0.54 with 90% added noise. 
A similar pattern can be seen with increasing sample size, as both tipping points and distance to degradation estimated similar average thresholds across all sample sizes (Fig. 5). Based on these results, tipping points and distance to degradation appear to estimate more consistent thresholds in response to noisy, limited data, when compared to the range of natural variation and statistically detectable change. The level of added noise accounted for the most variance in thresholds from methods using reference conditions (range of natural variation: F = 40450.956; statistically detectable change: F = 8989.7) (Table 1) and accounted for far less in thresholds estimated by methods using pressure-state data (tipping points: F = 122.0410; distance to degradation: F = 55.7786) (Table 1).

Figure 3. Ecosystem state thresholds estimated using four methods (statistically detectable change, range of natural variation, distance to degradation, and tipping points) on computationally generated data (n = 80) with 6 levels of added standard deviation (0%–90%). Boxplots are coloured according to the type of underlying pressure-state (linear, concave, convex) or reference condition relationship of the data. For tipping points, only models fitting a concave pressure-state relationship were used; therefore, any threshold estimated by tipping points for a dataset with a linear or convex underlying relationship has been misfitted with a concave pressure-state model. The same applies to thresholds estimated by distance to degradation for datasets with a linear underlying pressure-state relationship. Fitting rates are numerical values displayed above each boxplot for the methods using pressure-state data, which denote the proportion of thresholds estimated out of one hundred runs (i.e. 0.2 equates to 20 thresholds out of 100 simulations) and are colour-coded to the underlying pressure-state relationship.

Figure 4. Thresholds estimated on datasets with n = 10 rather than 80; otherwise, all information is the same as in Fig. 3 (thresholds >1.5 removed to display data clearly: 0.09%, n = 4).

Figure 5. Thresholds estimated for datasets with 50% standard deviation for five sample sizes (5, 10, 20, 40, and 80); otherwise, all information is the same as in Fig. 3 (thresholds >1.5 removed to display data clearly: 0.06%, n = 2).
Table 1. Analysis of variance of factors accounting for variability in threshold estimates from different methods for setting thresholds, ordered by F value.

| Method | Factor/interaction (:) | Df | F value | P value | adj. R² |
|---|---|---|---|---|---|
| Tipping points | Standard deviation (SD) | 1 | 122.0410 | <.001 | 0.12 |
| | SD:size | 1 | 69.7156 | <.001 | |
| | Relationship | 2 | 37.7252 | <.001 | |
| | Size | 1 | 6.0487 | .014 | |
| | SD:relationship | 2 | 4.9726 | .007 | |
| | Size:relationship | 2 | 0.6846 | .505 | |
| | SD:size:relationship | 2 | 0.2277 | .796 | |
| Distance to degradation | Relationship | 2 | 1047.7324 | <.001 | 0.45 |
| | SD | 1 | 55.7786 | <.001 | |
| | SD:relationship | 2 | 21.8139 | <.001 | |
| | SD:size:relationship | 2 | 6.4558 | .002 | |
| | Size | 1 | 3.0399 | .081 | |
| | Size:relationship | 2 | 2.3594 | .095 | |
| | SD:size | 1 | 1.5954 | .207 | |
| Range of natural variation | SD | 2 | 40 450.956 | <.001 | 0.84 |
| | Size | 1 | 1843.845 | <.001 | |
| | SD:size | 1 | 38.434 | <.001 | |
| Statistically detectable change | SD | 2 | 8989.7 | <.001 | 0.62 |
| | Size | 1 | 4321.9 | <.001 | |
| | SD:size | 1 | 1338.9 | <.001 | |
Sample size and pressure-state relationship
Smaller sample sizes increased the variability of thresholds estimated by all methods, but particularly for tipping points and distance to degradation (Fig. 5). The average threshold value estimated by both methods, split by relationship, was not noticeably impacted by decreasing sample size (tipping points: F = 6.049; distance to degradation: F = 3.04) (Table 1) (Fig. 5); however, the range of thresholds estimated for small (n = 5) datasets with 50% added noise and a concave underlying relationship was 2.73, which is much higher than that estimated for datasets with larger sample sizes (n = 80: 0.46). The interaction between standard deviation and sample size accounted for more than half as much variance in tipping point thresholds as standard deviation alone (Table 1). Thresholds estimated by statistically detectable change were more noticeably impacted by a decrease in sample size than those from the range of natural variation (Fig. 5) (statistically detectable change: F = 4321.9; range of natural variation: F = 1843.85) (Table 1), which reflects differences in methodology: statistically detectable change uses the lower 95% CI of reference conditions, which produces wider intervals for smaller sample sizes and therefore lower, and more variable, thresholds. Range of natural variation uses the lower 5th percentile of the full range of reference condition values and is therefore not impacted in the same way.
The shape of the underlying pressure-state relationship was only relevant for tipping points and distance to degradation, which used pressure-state datasets. The type of pressure-state relationship explained the majority of variance in thresholds estimated by distance to degradation (F = 1047.73) (Table 1), as thresholds were consistently higher for datasets with an underlying concave pressure-state relationship, compared to those with an underlying convex pressure-state relationship (Figs 3–5). This reflects the fact that the critical angle on a model fitting a concave pressure-state relationship will always be situated at a higher ecosystem state than on a convex pressure-state relationship (Fig. 1). The tipping points method only used datasets with a concave underlying pressure-state relationship, and therefore, there were no differences in thresholds estimated for different relationship types to compare across.
Fitting and misfitting rates
Fitting rates indicate the fraction of simulation runs, out of one hundred, that produced a threshold for a single dataset. If the fitting rate was 0.2 for a dataset with 50% added noise and n = 10, only 20 thresholds were estimated out of 100 simulated runs. This equates to a 20% chance of estimating a threshold appropriate to the dataset using that method if the available data has a similar level of noise and sample size. We propose an arbitrary minimum acceptable fitting rate of 75%, which must be met for a method to be considered resource-effective for decision-makers. Tipping points only met this fitting rate at low levels of added noise and at the highest sample size (Table S4). For distance to degradation, the fitting rate was met for datasets with an underlying concave pressure-state relationship at low levels of added noise and high sample sizes, and for datasets with an underlying convex pressure-state relationship for almost all levels of added noise at high sample sizes (Table S4). As the level of added noise increased and sample size decreased, fitting rates for tipping points and distance to degradation declined because the best fit model for the pressure-state relationship could not be identified, or a model was fit with a different pressure-state relationship that was not carried forward (Fig. 2). A non-parametric kernel regression approach was tested (Text S1) to address the low levels of fitting demonstrated by these methods under the parametric approach used here. The kernel regression improved the fitting rates of both methods (Figs S3–S6) by allowing much more flexibility in the shape of the pressure-state relationship, so models could be fitted more often through noisy data. However, this resulted in thresholds being estimated for all datasets, including linear pressure-state relationships with no added noise where no threshold should be estimated, demonstrating that tipping points were being estimated whether or not they were actually present (Fig. S3).
Misfitting of models to datasets with different underlying pressure-state relationships explains why thresholds were estimated for datasets with an underlying linear relationship (distance to degradation) and for datasets with underlying linear and convex relationships (tipping points), despite this not being methodologically plausible. Ideally, methods for setting thresholds would demonstrate fitting rates of 100% across all appropriate datasets, but fitting rates should be as low as possible (ideally zero) for dataset types where the method should not estimate a threshold, because a high rate of misfitting would not accurately identify the threshold for good status based on the ecosystem state response to an increase in pressure. Thresholds estimated by distance to degradation and tipping points on datasets with a large sample size and an underlying linear (or, for tipping points, convex) pressure-state relationship had low fitting rates of around 10% or less (Fig. 3), suggesting that misfitting occurs infrequently and is unlikely to be detrimental to the overall accuracy of the thresholds. However, at low sample sizes, the rate of correctly fitting models to datasets with the matching underlying pressure-state relationship decreased to nearly the same rate as misfitting (Fig. 5).
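To make the misfitting mechanism concrete, the model-selection step can be caricatured with a small simulation. This is an illustrative sketch, not the study's actual code: the linear and quadratic candidate models, the AIC comparison, and all parameter values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def aic(rss, n, k):
    """AIC for a least-squares fit with k parameters and residual sum rss."""
    return n * np.log(rss / n) + 2 * k

def select_shape(pressure, state):
    """Fit linear and quadratic models and return the AIC-preferred shape."""
    lin = np.polyfit(pressure, state, 1)
    quad = np.polyfit(pressure, state, 2)
    rss_lin = np.sum((state - np.polyval(lin, pressure)) ** 2)
    rss_quad = np.sum((state - np.polyval(quad, pressure)) ** 2)
    n = len(pressure)
    if aic(rss_quad, n, 3) < aic(rss_lin, n, 2):
        return "concave" if quad[0] < 0 else "convex"
    return "linear"

def shape_rates(true_shape, n, noise_sd, runs=200):
    """How often each shape is selected for data with a known true shape."""
    counts = {"linear": 0, "concave": 0, "convex": 0}
    for _ in range(runs):
        p = np.sort(rng.uniform(0, 1, n))
        s = (1 - p) if true_shape == "linear" else (1 - p ** 2)
        s = s + rng.normal(0, noise_sd, n)  # added observation noise
        counts[select_shape(p, s)] += 1
    return {k: v / runs for k, v in counts.items()}

print(shape_rates("concave", n=100, noise_sd=0.05))  # mostly correct fits
print(shape_rates("concave", n=10, noise_sd=0.50))   # correct-fit rate collapses
print(shape_rates("linear", n=100, noise_sd=0.05))   # occasional misfits
```

Under these assumed settings, correct identification of a concave relationship is near-certain for large, clean datasets but collapses for small, noisy ones, while linear data still yield a residual misfit rate, mirroring the pattern described above.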
Discussion
Here, we evaluated the statistical robustness of four methods for estimating thresholds for good status. We found the range of natural variation and statistically detectable change methods reliably estimated thresholds for noisy and small datasets, but these were lower than those estimated without noise present or with a greater sample size. The tipping points and distance to degradation methods estimated similar average thresholds across all datasets regardless of noise and sample size but, as the level of added noise increased and the sample size decreased, increasingly failed to estimate thresholds or to fit models consistent with the underlying relationship.
Reliability of methods
Varying the amount of noise and the sample size of simulated data affected thresholds differently depending on the method used. Range of natural variation estimated much lower thresholds for datasets with high levels of added noise, as it cannot distinguish between natural variability and sampling error. This aligns with results from terrestrial grassland systems, where large intervals estimated by range of natural variation on noisy data masked changes in ecosystem state (Symstad and Jonas 2014). Thresholds estimated by statistically detectable change showed the same decline when estimated on datasets with smaller sample sizes, reflecting patterns found in simulations of a similar detectable-change method across increasing levels of sampling effort and coefficients of variation (Snyder et al. 2014). The average threshold estimated by tipping points and distance to degradation remained stable across increasing levels of added noise and decreasing sample sizes, but the variability in threshold estimates increased and included impossible values (>1). If indicator data are limited, as is often the case, we would have less confidence in the accuracy of thresholds estimated by tipping points or distance to degradation, and in whether they truly represent the ecosystem state as ‘good’ (as would be defined without noise and small sample sizes).
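These two sensitivities can be illustrated with a toy simulation. The distributions, the 5th-percentile rule for range of natural variation, and the 95% confidence bound for statistically detectable change are all our own illustrative assumptions, not the exact formulations used in the study.

```python
import numpy as np

rng = np.random.default_rng(7)
TRUE_MEAN, NATURAL_SD = 0.8, 0.03  # hypothetical undisturbed reference state

def rnv_threshold(noise_sd, n=5000):
    """Range of natural variation: 5th percentile of reference observations.
    Sampling noise inflates the observed spread and pushes the threshold down."""
    obs = rng.normal(TRUE_MEAN, NATURAL_SD, n) + rng.normal(0, noise_sd, n)
    return np.quantile(obs, 0.05)

def sdc_threshold(n, sd=0.05, reps=500):
    """Statistically detectable change: average lower 95% confidence bound of
    the reference mean; smaller samples widen the interval, lowering the
    threshold. (A normal z-value is used here instead of a t-value.)"""
    bounds = []
    for _ in range(reps):
        obs = rng.normal(TRUE_MEAN, sd, n)
        bounds.append(obs.mean() - 1.96 * obs.std(ddof=1) / np.sqrt(n))
    return float(np.mean(bounds))

for noise in (0.0, 0.05, 0.10):
    print(f"RNV threshold, sampling-noise sd {noise}: {rnv_threshold(noise):.3f}")
for n in (100, 20, 5):
    print(f"SDC threshold, sample size {n}: {sdc_threshold(n):.3f}")
```

Both toy thresholds drift downward, away from the boundary that would be set with noise-free or well-sampled data, which is the pattern reported above.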
Furthermore, tipping points and distance to degradation only achieved an acceptable fitting rate (75%) on datasets with high sample sizes and low levels of noise (Fig. 6), supporting the finding that tipping points are rarely detectable even when they are present in the data (Hillebrand et al. 2020). In some cases, models were misfit as often as they were correctly fit, meaning that thresholds would not accurately represent the boundary between good and degraded status based on the relationship between ecological state and human pressure. That said, managers may need to reconsider whether data containing especially high levels of noise, such as the highest levels simulated here, accurately capture the relationship between pressure and state at all. These results suggest that tipping points and distance to degradation should only be used when data meet the necessary prerequisites (i.e. a low level of noise, a high sample size, and evidence for the shape of the underlying pressure-state relationship); otherwise, thresholds are likely to be spurious and unrepresentative of good status, or missing entirely.

Fitting rate (i.e. proportion of thresholds estimated across 100 simulations) of methods for estimating thresholds. A tick denotes that the method met an arbitrary acceptable fitting rate of 75% for that type of dataset (i.e. 75 thresholds were estimated out of 100 runs). Exact fitting rates can be found in Table S4.
What method should be used?
Our results show that the level of noise, the sample size, and the type of pressure-state relationship in the data affect thresholds differently depending on the method used. If data have low levels of noise and a large sample size, and there is sound justification that a specific underlying pressure-state relationship exists, tipping points could be suitable for setting thresholds for good status. Whilst statistically robust when those criteria are met, distance to degradation estimates the final threshold value using an arbitrary percentile and is therefore not an ecologically meaningful approach. If these data requirements are not met, range of natural variation and statistically detectable change should be preferred, as they will consistently and reliably estimate thresholds with less variability and greater transparency. As highlighted in earlier reviews (ICES 2022, Hiddink et al. 2023), statistically detectable change uses a statistical concept (confidence intervals) to define good status and should therefore be used with caution when drawing conclusions about ecological state boundaries. However, in the face of particularly noisy data, this method estimates higher state thresholds than range of natural variation. Noisy data may result from opportunistic responses to anthropogenic pressure: depending on the spatial scale and magnitude at which the pressure acts, resources may be made available and competitive interactions relaxed, permitting opportunistic, fugitive, or pioneer species to flourish and create noise in the response data (Norkko et al. 2006).
The challenge when using the range of natural variation method is partitioning the natural variability of an indicator from the sampling noise present in the data (Symstad and Jonas 2014); the latter results in lower thresholds for good status than would have been estimated with less sampling noise. Using the recommendations made here, those in management positions should make an informed decision on which method to use and carefully consider the sensitivities that different methods demonstrate in response to limited data.
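One standard way of attempting this partition, where replicate samples of reference conditions exist, is a simple variance-components estimate. The sketch below is hypothetical (survey design, parameter values, and the 5th-percentile rule are our own assumptions, not the study's method): within-site replicate variance estimates the sampling noise, which is then subtracted from the variance of site means.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical reference survey: 100 undisturbed sites, 5 replicate samples each.
TRUE_SITE_SD, SAMPLING_SD = 0.04, 0.08
site_state = rng.normal(0.8, TRUE_SITE_SD, 100)                    # natural variation
obs = site_state[:, None] + rng.normal(0, SAMPLING_SD, (100, 5))   # + sampling noise

# Naive threshold: 5th percentile of raw observations; sampling noise
# inflates the spread and drags the threshold down.
naive = np.quantile(obs, 0.05)

# Partitioned estimate: within-site variance approximates sampling noise, so
# natural variance ~ variance of site means minus its sampling-noise component.
site_means = obs.mean(axis=1)
sampling_var = obs.var(axis=1, ddof=1).mean()
natural_var = max(site_means.var(ddof=1) - sampling_var / obs.shape[1], 0.0)
partitioned = site_means.mean() - 1.645 * np.sqrt(natural_var)

print(f"naive threshold: {naive:.3f}, partitioned threshold: {partitioned:.3f}")
```

The partitioned threshold should sit closer to the true natural range than the naive one; in practice, however, replicate samples of genuine reference conditions are rarely available, which is precisely the difficulty noted above.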
Conclusions
Evaluating the statistical robustness of methods for estimating thresholds for good status is a useful step towards improving the accuracy and reliability of ecosystem assessments, as it provides a clear rationale for choosing one method over another based on the type of data available. The tipping points and distance to degradation methods were often unreliable, estimating spurious thresholds when data were limited, and should therefore be used cautiously unless data with low levels of noise, a large sample size, and information on the type of pressure-state relationship are available. The range of natural variation and statistically detectable change methods transparently and reliably estimated thresholds but can produce lower thresholds for noisy or small datasets. If the threshold for good status is lower than the true natural variability of the system, an indicator can be in an ecologically degraded state yet still be considered good under policy objectives (Symstad and Jonas 2014). Therefore, the type and quality of data should be evaluated to inform which method is most suitable for estimating thresholds under marine legislation (e.g. the EU's MSFD), so that we can more confidently establish the condition of marine environments and track the progress that specific management measures are making towards the overarching policy goal of restoring and conserving nature while promoting sustainable use and integrating environmental protection with human activities.
Funding
This work has received funding from the Natural Environment Research Council (NERC) through United Kingdom Research and Innovation (UKRI) for the Centre of Doctoral Training in Sustainable Management of UK Marine Resources (CDT SuMMeR) under grant agreement NE/W007215/1.
Acknowledgements
We thank two reviewers, especially Professor Jake Rice, for their in-depth analysis and constructive comments, which undoubtedly improved the final manuscript.
Author contribution
Lorna McKellar (Conceptualization, Methodology, Formal analysis, and Writing – original draft), Jan Geert Hiddink (Conceptualization, Methodology, Formal analysis, and Writing – original draft), Abigail McQuatters-Gollop (Supervision and Writing – review & editing), Tomas Chaigneau (Supervision and Writing – review & editing), Cristina Vina-Herbon (Supervision and Writing – review & editing), and Sebastian Valanko (Supervision and Writing – review & editing).
Conflict of interest
None declared.
Data availability
No new data were collected for this paper.