The generalized gamma is a flexible distribution that outperforms alternatives when modelling catch rate data

Abstract

Fisheries scientists use regression models to estimate population quantities, such as biomass or abundance, for use in climate, habitat, stock, and ecosystem assessments. However, these models are sensitive to the probability distribution used to characterize observation error. Here, we introduce the generalized gamma distribution (GGD), which has not been widely used in fisheries science. The GGD has useful properties: (1) it reduces to the lognormal distribution when the shape parameter approaches zero, (2) it reduces to the gamma distribution when the shape and scale parameters are equal, and (3) the coefficient of variation is independent of the mean. We assess the relative performance and robustness of the GGD to estimate biomass density across different observation error types in a simulation experiment. When fit to data generated from the GGD, lognormal, gamma, and Tweedie families, the GGD had low bias, high predictive accuracy, and appropriate confidence interval coverage under most scenarios. Finally, we fit spatiotemporal index standardization models using the R package sdmTMB to 14 species from three trawl surveys from the Gulf of Alaska and coast of British Columbia, Canada. When the Akaike information criterion weight was compared among fits using the lognormal, gamma, and Tweedie families, the GGD was the most commonly selected model. When compared via cross validation, predictive performance favoured the GGD for most (75%) Gulf of Alaska species but was less clear for the smaller British Columbia surveys. We conclude that the GGD is a flexible and effective choice for modelling fisheries-independent survey data, offering several advantages over commonly used distributions. Given its strong performance in both simulation and empirical analyses, we recommend that practitioners consider the GGD when selecting error distributions for these models.

generalized gamma distribution, gamma distribution, lognormal distribution, index standardization, generalized linear model

Introduction

A primary challenge of fisheries research is the accurate and precise estimation of fish population characteristics (e.g. population size and variance) for inclusion in assessments of fish stocks, habitat, and the status of marine ecosystems. Regression models provide researchers with a versatile tool for fitting predictions to various data sources (e.g. observations from fisheries-independent surveys), but these models depend on the skill of the researcher to specify the ‘best’ probability distribution for the data. We define ‘best’ as a distribution with the fewest parameters (to avoid overfitting) that conforms to current understanding of a fish population and that fits observed data well over its range (i.e. residuals are small and homoscedastic). The selection of different probability distributions may lead to sizable differences in the estimation of the scale of a fish population (Thorson et al. 2021). While fisheries researchers commonly apply the lognormal or gamma distributions to model variables observed with a continuous positive response (Maunder and Punt 2004), distributions with more than two parameters may be overlooked or ignored.

Introducing the generalized gamma distribution

The generalized gamma distribution (GGD) is a three-parameter distribution that provides several flexibilities for modelling fisheries data. Amoroso (1925) described a continuous function with four parameters to analyse the distribution of income data. The Amoroso distribution has four parameters (a location parameter, a scale parameter, and two shape parameters) and includes special cases for several useful distributions, such as the lognormal and gamma distributions (Crooks 2019), which are commonly used by fisheries researchers. Stacy (1962) removed the location parameter from the Amoroso distribution and described a generalization of the gamma distribution with three parameters, which also includes the special case for the lognormal distribution. While the properties of the Amoroso and Stacy parameterizations are useful for biological analyses, the parameterization in Stacy (1962) does not lend itself to maximum likelihood estimation (MLE) methods (Prentice 1974). Prentice (1974) applied two transformations to the Stacy (1962) parameterization: a transformation of shape parameter k into a new parameter Q (⁠|$Q = 1/\sqrt{k}$|⁠) and a log transformation of the probability density function (PDF). This parameterization improves numerical stability when estimating |$Q \rightarrow 0$| and allows for |$Q < 0$| (Jackson 2016).

As of the writing of this paper, the GGD has previously been implemented in R packages flexsurv (Jackson 2016), ggamma (Saldanha and Suzuki 2019), and gamlss (Rigby and Stasinopoulos 2005) and in the Python library SciPy (Virtanen et al. 2020), making the VAST (Thorson 2019) and sdmTMB (Anderson et al. 2024) R package implementations described in this paper the first to include this distribution for fisheries analysis.

Fisheries applications

Because the GGD has three parameters, it provides more flexibility in fitting fisheries data than the more commonly used lognormal or gamma distributions. This is advantageous if the reduction in estimation error offsets the increase in model complexity and the potential rise in generalization error. Fisheries researchers may select the GGD for model-based estimates (including spatial, temporal, or spatiotemporal models) of the mean and variance of biomass or abundance of fish species, to standardize indices of abundance, or to specify error distributions in fish stock assessments. Another benefit of the GGD is that MLE methods are able to estimate distribution parameters for lognormal and gamma distributions, neither of which have special cases for the other.

Abundance and biomass data often contain zeros, which can be fit using two-part models (‘hurdle’ or ‘delta’ models; Aitchison 1955, Maunder and Punt 2004) that include two linear predictors: one to model the encounters vs. non-encounters and a distribution such as the lognormal, gamma, or GGD to model positive catch rates or densities. An additional method to handle positive continuous data that contain zeros is to use a compound Poisson–Gamma distribution or Tweedie distribution (Tweedie 1984, Foster and Bravington 2013), where the compound Poisson–Gamma is the Tweedie with its power parameter |$1 < p < 2$|⁠. The Tweedie distribution has three distribution parameters and uses a single linear predictor to model both zero and positive continuous values, which reduces the number of parameters relative to a delta model but also reduces model flexibility.

Here, we compare the performance of models that use the lognormal, gamma, Tweedie, and GGD in (1) a simulation experiment and (2) when applied to time series of observations of effort-standardized catch (i.e. catch per unit effort) from fishery-independent surveys in the Gulf of Alaska (GOA) and off the coast of British Columbia, Canada. By demonstrating the efficiency of the GGD and its ability to resemble both the lognormal and gamma distributions, we propose that fisheries researchers explore using the GGD for modelling fish populations.

Methods

GGD parameterization

We used a mean-coefficient of variation (CV) form of the Prentice (1974) GGD parameterization, where we use the mean (⁠|$\bar{x}$|⁠) instead of location (⁠|$\mu$|⁠) (Table 1). Internally, the generalized linear model linear predictor |$\eta$| (i.e. |$\log (\bar{x})$|⁠) is converted into |$\log (\mu )$| as

$$\begin{eqnarray} \log (\mu ) = \eta - \log \Gamma \left(\frac{k \beta +1}{\beta } \right) + \log \Gamma (k) + \frac{\log (k)}{\beta }, \end{eqnarray}$$

(1)

where |$k = Q^{-2}$|⁠, |$\beta = Q / \sigma$|⁠, and |$\log \Gamma$| is the log gamma function. Special cases of the GGD are that it reduces to the lognormal distribution when |$Q \rightarrow 0$| (Fig. 1b, e, h), and it reduces to the gamma distribution when |$\sigma = Q$| (e.g. Fig. 1f). Thus, with the addition of one parameter, relative to the lognormal or gamma distributions, the GGD can model either distribution, a distribution in between the two, a distribution with lighter tails than the gamma, or a distribution with heavier tails than the lognormal. An additional special case for the GGD is the Weibull distribution when |$Q = 1$| (Fig. 1c, f, i). Finally, the coefficient of variation for the GGD is a function of the |$\sigma$| and Q parameters, unrelated to |$\bar{x}$| (Table 1).

$Comparison of densities for three distributions across three values of the generalized gamma Q shape parameter (columns) and three scale parameter values $\sigma$ (rows). The lognormal (long-dash) and gamma (short-dash) are the same across rows. The location parameter $\mu$ is fixed at 1.8.$

Figure 1.

Comparison of densities for three distributions across three values of the generalized gamma Q shape parameter (columns) and three scale parameter values |$\sigma$| (rows). The lognormal (long-dash) and gamma (short-dash) are the same across rows. The location parameter |$\mu$| is fixed at 1.8.

Open in new tab Download slide

Table 1.

Open in new tab

Selected properties of the GGD.

Property	Equation
PDF	\|$\frac{\|Q\|k^{k}}{x\sigma\Gamma(k)} \exp\left[ k(Qw - e^{Qw}) \right]$\| \|$\text{where } w = \frac{log(x) - \mu}{\sigma}$\|
PDF collapses to gamma when	\|$Q = \sigma$\|
PDF collapses to lognormal as	\|$Q \to 0$\|
Mean	\|$\Gamma \left(\frac{k\beta + 1}{\beta }\right) \cdot \frac{1}{\Gamma (k)} \cdot e^{\mu - \frac{\log (k)}{\beta }}, \quad \text{for } Q > 0$\|
Variance	\|$\left(e^{\mu -\frac{\log (k)}{\beta }}\right)^{\!2} \cdot \left[ \frac{\Gamma \left(\frac{k\beta +2}{\beta }\right)}{\Gamma (k)} - \left(\frac{\Gamma \left(\frac{k\beta +1}{\beta }\right)}{\Gamma (k)} \!\right)^{\!2} \right], \, \, \text{for } Q > 0$\|
CV	\|$\sqrt{ \Gamma(k) \cdot \Gamma\left( \frac{k\beta + 2}{\beta} \right) \cdot \left[ \! \Gamma\left( \frac{k\beta + 1}{\beta}\! \right) \right]^{\!-2}- 1}, \, \,\text{for } Q > 0$\|

Property	Equation
PDF	\|$\frac{\|Q\|k^{k}}{x\sigma\Gamma(k)} \exp\left[ k(Qw - e^{Qw}) \right]$\| \|$\text{where } w = \frac{log(x) - \mu}{\sigma}$\|
PDF collapses to gamma when	\|$Q = \sigma$\|
PDF collapses to lognormal as	\|$Q \to 0$\|
Mean	\|$\Gamma \left(\frac{k\beta + 1}{\beta }\right) \cdot \frac{1}{\Gamma (k)} \cdot e^{\mu - \frac{\log (k)}{\beta }}, \quad \text{for } Q > 0$\|
Variance	\|$\left(e^{\mu -\frac{\log (k)}{\beta }}\right)^{\!2} \cdot \left[ \frac{\Gamma \left(\frac{k\beta +2}{\beta }\right)}{\Gamma (k)} - \left(\frac{\Gamma \left(\frac{k\beta +1}{\beta }\right)}{\Gamma (k)} \!\right)^{\!2} \right], \, \, \text{for } Q > 0$\|
CV	\|$\sqrt{ \Gamma(k) \cdot \Gamma\left( \frac{k\beta + 2}{\beta} \right) \cdot \left[ \! \Gamma\left( \frac{k\beta + 1}{\beta}\! \right) \right]^{\!-2}- 1}, \, \,\text{for } Q > 0$\|

We define μ as the location parameter, σ as the scale parameter, and Q as the family or shape parameter. In this form of the PDF, μ > 0 and σ > 0, and Q ∈ (−∞, +∞). We further define two temporary variables to simplify notation: β = Q/σ and k = Q ⁻². The symbol Γ is the gamma function. Our parameterization is based on Prentice (1974), Lawless (1980), and Jackson (2016). Moments for |$Q > 0$| are derived in Stacy and Mihram (1965).

Table 1.

Open in new tab

Selected properties of the GGD.

Property	Equation
PDF	\|$\frac{\|Q\|k^{k}}{x\sigma\Gamma(k)} \exp\left[ k(Qw - e^{Qw}) \right]$\| \|$\text{where } w = \frac{log(x) - \mu}{\sigma}$\|
PDF collapses to gamma when	\|$Q = \sigma$\|
PDF collapses to lognormal as	\|$Q \to 0$\|
Mean	\|$\Gamma \left(\frac{k\beta + 1}{\beta }\right) \cdot \frac{1}{\Gamma (k)} \cdot e^{\mu - \frac{\log (k)}{\beta }}, \quad \text{for } Q > 0$\|
Variance	\|$\left(e^{\mu -\frac{\log (k)}{\beta }}\right)^{\!2} \cdot \left[ \frac{\Gamma \left(\frac{k\beta +2}{\beta }\right)}{\Gamma (k)} - \left(\frac{\Gamma \left(\frac{k\beta +1}{\beta }\right)}{\Gamma (k)} \!\right)^{\!2} \right], \, \, \text{for } Q > 0$\|
CV	\|$\sqrt{ \Gamma(k) \cdot \Gamma\left( \frac{k\beta + 2}{\beta} \right) \cdot \left[ \! \Gamma\left( \frac{k\beta + 1}{\beta}\! \right) \right]^{\!-2}- 1}, \, \,\text{for } Q > 0$\|

Property	Equation
PDF	\|$\frac{\|Q\|k^{k}}{x\sigma\Gamma(k)} \exp\left[ k(Qw - e^{Qw}) \right]$\| \|$\text{where } w = \frac{log(x) - \mu}{\sigma}$\|
PDF collapses to gamma when	\|$Q = \sigma$\|
PDF collapses to lognormal as	\|$Q \to 0$\|
Mean	\|$\Gamma \left(\frac{k\beta + 1}{\beta }\right) \cdot \frac{1}{\Gamma (k)} \cdot e^{\mu - \frac{\log (k)}{\beta }}, \quad \text{for } Q > 0$\|
Variance	\|$\left(e^{\mu -\frac{\log (k)}{\beta }}\right)^{\!2} \cdot \left[ \frac{\Gamma \left(\frac{k\beta +2}{\beta }\right)}{\Gamma (k)} - \left(\frac{\Gamma \left(\frac{k\beta +1}{\beta }\right)}{\Gamma (k)} \!\right)^{\!2} \right], \, \, \text{for } Q > 0$\|
CV	\|$\sqrt{ \Gamma(k) \cdot \Gamma\left( \frac{k\beta + 2}{\beta} \right) \cdot \left[ \! \Gamma\left( \frac{k\beta + 1}{\beta}\! \right) \right]^{\!-2}- 1}, \, \,\text{for } Q > 0$\|

Simulation testing

We assess the relative performance and robustness of the GGD to estimate biomass density across different observation error types. In a factorial-simulation experiment, we examine the goodness-of-fit, bias in estimates, and predictive accuracy of the GGD, lognormal, gamma, and Tweedie families. The Tweedie family is not a special case of the GGD but we have included it in our analyses because it is widely used in fisheries applications (e.g. Shono 2008, Foster and Bravington 2013, Anderson et al. 2019) and produces index scale estimates that are similar to design-based indices (Thorson et al. 2021).

Using sdmTMB, we simulated 10,000 points across a 100 × 100 cell grid using a spatial model of the form:

$$\begin{eqnarray} \label{cross-sim-eqn} \begin{aligned} \mathbb{E}[y_{\boldsymbol{s}}] &= \mu_{\boldsymbol{s}},\\ \mu_{\boldsymbol{s}} &= g^{-1} \left(\alpha + \omega_{\boldsymbol{s}} \right),\\ \boldsymbol{\omega} &\sim \operatorname{MVNormal} \left( \boldsymbol{0}, \boldsymbol{\Sigma}_{\omega} \right), \end{aligned} \end{eqnarray}$$

(2)

where the expected value of catch |$\mathbb{E}[y_{\boldsymbol{s}}]$| at spatial coordinates |$\boldsymbol{s}$| is the mean at that point in space, given by |$\mu_{\boldsymbol{s}}$|⁠. The mean is equal to an inverse link function |$g^{-1}$| applied to the observed catch |$\alpha$| plus a spatial Gaussian Markov random field (GMRF) value |$\omega_{\boldsymbol{s}}$|⁠. The Gaussian Markov Random Field is structured with a covariance (inverse precision) matrix |$\boldsymbol{\Sigma}_{\omega}$| that approximates the Matérn correlation function with a marginal standard deviation of 1 and range (distance at which correlation decays to ≈ 0.13; Lindgren et al. 2011) of 0.5. From the simulated expected values, we sampled observations for 500 locations and added observation errors from one of our four families.

Our four families used either the Tweedie distribution with a single linear predictor or delta-models, which had two linear predictors. We used a Tweedie distribution with power parameter |$p = 1.5$|⁠. For the delta-models, we simulated encounters (mean encounter probability = 0.5), using a Bernoulli distribution and a logit link. The positive catch rates (mean catch rate = 1) for the delta-models were sampled from one of gamma, lognormal, or the generalized gamma distribution. We standardized the scale parameter of each observation error distribution so that the CV of the observation error was 0.95, which was similar to the mean CV observed across species in the multi-stock analysis. We also included a sequence of Q values (Q = -2, -1, -0.5, -0.001, 0.001, 0.5, 0.95, 1, 2) to assess the families against data with a range of heavy to light tails, to assess the ability for the GGD models to estimate Q, and illustrate the equivalence of the GGD to the lognormal at |$Q = 0$| and to the gamma when |$Q = \sigma$| (here at |$Q = 0.95$|⁠).

We then fit the model specified in equation (2) to each simulated dataset using the delta-GGD, delta-lognormal, delta-gamma, and Tweedie families. Using these model fits, we summed density predictions across a prediction grid to generate an area-weighted biomass index (e.g. Thorson et al. 2015), using a generic bias-correction estimator (Thorson and Kristensen 2016). We repeated this sampling, model fitting, and index calculation 1,000 times, each time generating a new GMRF, new sampling locations, and new observation errors. For each simulation iteration, we calculated the relative error (RE) of the predicted index value (⁠|$\hat{I}$|⁠) compared to the true index (⁠|$I_\mathrm{true}$|⁠), |$\mathrm{RE} = \left(\hat{I} - I_\mathrm{true}\right) / I_\mathrm{true}$|⁠, and calculated the marginal AIC (Akaike Information Criterion; Akaike 1973) weight .

These AIC weights (Burnham and Anderson 2002) can be calculated as

$$\begin{eqnarray} \text{AIC}_i &=& -2 \log(L_i) + 2K_i, \\ \Delta_i &=& \text{AIC}_i - \text{AIC}_{\min}, \\ w_i &=& \frac{\exp\left(-\frac{1}{2} \Delta_i\right)}{\sum_{r=1}^{R} \exp\left(-\frac{1}{2} \Delta_r\right)}, \end{eqnarray}$$

(3)

where |$\log (L_i)$| is the marginal (natural) log likelihood for model i, |$K_i$| is the number of fixed effect parameters, and |$\Delta _i$| is the difference between the Akaike information criterion (AIC) for model i and the model with the minimum AIC. In the denominator, those differences are summed over all R candidate models.

To illustrate goodness of fit in the tails of the distributions, we performed an additional single simulation with an increased sample size. We simulated observations at 2000 points, fit each of the four models, and then calculated randomized quantile residuals (RQR; Dunn and Smyth 1996) for the full dataset (Tweedie) or the positive linear predictor (all other families) making sure to sample random effects from their approximate (multivariate normal) distribution rather than use their empirical Bayes estimates (Waagepetersen 2006, Thorson and Kristensen 2024).

Multi-stock analysis

We compared the performance of the GGD relative to the lognormal, gamma, and Tweedie when applied to trawl survey data from the GOA and off the coast of British Columbia, Canada. We fit models to 14 species from three regions: the GOA, Hecate Strait and Queen Charlotte Sound (HS-QCS), and West Coast Vancouver Island (WCVI). These species occurred in each of the three regions and were also well-sampled in the HS-QCS and WCVI regions, meaning they were encountered in at least 20% of samples over time. We used this encounter cutoff for the HS-QCS and WCVI regions because they have much smaller spatial coverage and lower annual data availability than the GOA (Fig. S1). In the GOA, even with a low encounter rate (e.g. Spotted Ratfish |$\sim 7$|%, 49 encounters/year), the mean annual number of encounters can be comparable to a species that has a much higher encounter rate in a smaller region like WCVI (e.g. Walleye Pollock |$\sim 37$|%, 44 encounters/year; Table S1). Time series spanned 18–20 years in BC survey regions (10–12 sampled years) and 30–33 years (15 sampled years) in the GOA. For the HS-QCS and WCVI surveys, we used an SPDE mesh with a ‘cutoff’ (minimum triangle edge length) of 8 km (Lindgren 2024), and for the GOA surveys, we used a cutoff value that resulted in a mesh with 500 knots (⁠|$\approx$|26.5 km). We fit the following model for each family, with a constant spatial GMRF and independent GMRFs by year:

$$\begin{eqnarray} \begin{aligned} \mu _{\boldsymbol {s},t} &= g^{-1} \left(\alpha _t + O_{\boldsymbol {s},t} + \omega _{\boldsymbol {s}} + \epsilon _{\boldsymbol {s},t} \right),\\ \boldsymbol {\omega } &\sim \operatorname{MVNormal} \left( \boldsymbol {0}, \boldsymbol {\Sigma}_\omega \right),\\ \boldsymbol {\epsilon }_{t} &\sim \operatorname{MVNormal} (\boldsymbol {0}, \boldsymbol {\Sigma}_{\epsilon }), \end{aligned} \end{eqnarray}$$

(4)

where the expected mean |$\mu _{\boldsymbol {s},t}$| at spatial coordinates |$\boldsymbol {s}$| at time (year) t, is equal to an inverse link function |$g^{-1}$| applied to the linear combination of an annual intercept |$\alpha _t$|⁠, offset |$O_{\boldsymbol {s},t}$| for log area swept (for the Tweedie or delta model second linear predictors), and spatial (⁠|$\omega _{\boldsymbol {s}}$|⁠) and spatiotemporal (⁠|$\epsilon _{\boldsymbol {s},t}$|⁠) random effects values. The |$\boldsymbol {\omega }$| and |$\boldsymbol {\epsilon }$| random effect vectors are drawn from GMRFs with covariance (inverse precision) matrices |$\boldsymbol {\Sigma}_\omega$| and |$\boldsymbol {\Sigma}_\epsilon$| structured to approximate a Matérn correlation function while estimating a shared decorrelation rate |$\kappa$| and independent precision-per-distance parameters |$\tau _\omega$| and |$\tau _\epsilon$|⁠. The shared Matérn range |$\rho$| can be calculated as |$\sqrt{8}/\kappa$| (Lindgren et al. 2011).

As in the simulation experiment, we fit the models as either a single linear predictor and the Tweedie distribution or a delta model that had two linear predictors (one predicting encounter probability and one predicting catch conditional on encounter) for the lognormal, gamma, and GGD distributions. We considered models to have converged if the maximum absolute marginal log-likelihood gradient with respect to all fixed effect parameters was |$<0.005$|⁠, the Hessian was positive definite, no random field marginal standard deviations were on a boundary (defined as |$< 0.01$|⁠), and the mean coefficient of variation in the estimated index was |$< 1$|⁠. If the model failed to converge, we checked if any of the spatial or spatiotemporal random fields had collapsed to zero (i.e. the marginal field SD |$< 0.01$|⁠), and if they had, we omitted those random fields (set to zero) and refit the model. In three cases where the delta-GGD model did not converge, we refit the models in two ways: We tested the use of a weak penalty that constrained the year estimates, where |$\alpha _{t} \sim \operatorname{N}(0, 30^{2})$|⁠, or we scaled the catch variable by 0.01 kg because the estimation of the GGD may be sensitive to scale and/or the starting values used for |$\sigma$|⁠. We include these results in the Supplementary Materials (see GGD Model Convergence) to offer readers practical options for troubleshooting convergence issues when fitting the GGD model.

We compared the models using two metrics: AIC and cross validation. First, we calculated marginal AIC weights (⁠|$w_i$|⁠) across all of the fitted models for each stock. Second, we compared the models with 50-fold cross validation. We divided the data randomly into 50 folds: |$D_1, D_2, \dots , D_{50}$|⁠. Then, for each fold j (where |$j = 1, \dots , 50$|⁠), we trained the model on the data excluding |$D_j$| and evaluated the model by calculating the sum of the log likelihood of the out-of-sample data |$D_j$|⁠. We then compared the relative differences in these log likelihoods between families for each stock.

For each stock, we compared the estimated biomass indices across the four models. We chose three example species from the GOA that spanned a range of Q values (Arrowtooth Flounder, Pacific Ocean Perch, and Pacific Spiny Dogfish) to illustrate in the main text and provide comparisons for all other stock combinations in the Supplementary Materials.

Results

Simulation testing

The GGD (‘gengamma’ in figures following the R family function name) performed well across the simulated datasets. RQR from the models fitted with the GGD had residuals consistent with the expected distribution across the simulations from different distributions (Fig. 2). The Tweedie distribution also performed well across most of the simulated datasets in terms of the RQR, though it did not perform as well as the GGD when |$Q < 0$| (i.e. data that were heavier tailed than the lognormal; Fig. S2). Similarly, the gamma distribution performed poorly when |$Q < 0$|⁠, failing to generate residuals consistent with the model expectations at both the upper and lower tails. As expected, residuals from models using the lognormal family (Fig. 2) indicated overprediction of large values as the tails of the simulated data became lighter. For very heavy tails (⁠|$Q < -0.5$|⁠), the GGD had better residual diagnostics than the lognormal.

Figure 2.

Quantile–quantile (QQ) plot of randomized quantile residuals for example simulated data. Columns correspond to the simulated observation likelihood, and rows correspond to the fitted model observation likelihood. Columns are ordered from most heavy-tailed (Q = −1) to most light-tailed (Q = 2). The residuals have been transformed to be normally distributed if the model was consistent with the data. The diagonal line corresponds to the 1:1 line.

Open in new tab Download slide

The GGD produced unbiased estimates of biomass density, had high accuracy for predicting biomass density, and had reliable confidence interval (CI) coverage across different observation error types. The GGD, gamma, and Tweedie families had median relative errors near zero, indicating that indices were not biased regardless of the simulation families used (Fig. 3a). Only the lognormal family had a large relative error when fit to data simulated from |$Q > 0$| or the Tweedie distribution (Fig. 3a). As the simulated data became lighter-tailed (i.e. as Q increased), the mean estimated by the lognormal increased relative to the true mean.

Figure 3.

(a) Relative error, (b) AIC weights, and (c) proportion of 50% CIs that contain the true value (‘CI coverage’) from 1000 simulation-estimation replicates of biomass density. Columns represent the simulated observation likelihood and y-axis labels represent the fitted observation likelihood. Columns are ordered from most heavy-tailed (Q = −1) to most light-tailed (Q = 2). In (a) and (b) points indicate the median value, and violins indicate density. The dashed vertical line in (c) indicates the nominal coverage of 50%. Note that in (a), to improve visual clarity of the median bias near zero, the x-axis has been limited to a maximum value of 1.25, which truncates the right tail of the density for the fitted lognormal relative error.

Open in new tab Download slide

The predictive accuracy of the GGD was robust to the underlying observation process (Fig. 3b). Across values of Q the GGD had the highest AIC weights, except for the special cases where |$Q \rightarrow 0$| and |$Q = \sigma$| (Fig. S3). The lognormal (⁠|$Q \rightarrow 0$|⁠) and gamma (⁠|$Q = \sigma$|⁠) families had the highest AIC weights when fit to their matching simulated data; however, the GGD model had median AIC weights of |$\approx 31$|% (Fig. 3), which is close to the value |$\approx 27 \% = \frac{e^{-1}}{1 + e^{-1}}$| that is expected when the generalized gamma results in the same marginal likelihood as the corresponding model but has a |$\Delta$|AIC that is larger by two units due to the additional parameter Q.

Overall, the GGD had 50% CI coverage that was comparable to or better than the other families, except in the case where the data were simulated using the Tweedie distribution, where the GGD had under-coverage (⁠|$\approx 37$|%; Fig. S3c). When the simulated data were heavy-tailed (⁠|$Q = -1$|⁠), the GGD was the only distribution with the expected CI coverage, whereas the other families showed under-coverage. Coverage for models fitted with the lognormal had under-coverage when tails were lighter, which reflects the increase in relative error seen in Fig. S3a.

Multi-stock analysis

Across most species in each region, the delta-GGD was selected as the best or close to the best model by AIC (Fig. 4). When |$Q \rightarrow 0$| the AIC weight was split between the lognormal and GGD models, which matches the results of the simulation testing. At higher values of Q, in some species (e.g. Shortspine Thornyhead, Arrowtooth Flounder), the AIC weight was split between the gamma and GGD. This happened where the difference between |$\sigma$| and the estimated Q was smallest, and in the HS-QCS and WCVI regions, which were less data-rich than the GOA (Fig. S1). Cross validation results within the GOA usually, but not always, selected the same model as AIC; 8 out of 12 species had the highest out-of-sample log likelihood with the GGD (Fig. 4). Predictive performance was most similar between the lognormal and the GGD near Q = 0 when the distributions are equivalent. In the smaller British Columbia surveys, the GGD was less dominant in predictive performance, but was the best or within 10 units of log likelihood from the best family in |$\approx$|56% of survey-species combinations (Fig. 4).

$Multi-stock AIC weights and relative differences in the summed log likelihood ($\Delta \log L$) of out-of-sample data from 50-fold cross validation. Within each survey, species are ordered by estimated generalized gamma Q value (from least to most heavy-tailed; shown in the right column). The $\Delta \log L$ axis is truncated to −500 to show differences in the best-performing models; missing points indicate models that did not converge in the cross validation. Stocks that did not converge using the GGD on the full dataset were omitted. GOA: Gulf of Alaska; HS-QCS: Hecate Strait and Queen Charlotte Sound; WCVI: West Coast Vancouver Island.$

Figure 4.

Multi-stock AIC weights and relative differences in the summed log likelihood (⁠|$\Delta \log L$|⁠) of out-of-sample data from 50-fold cross validation. Within each survey, species are ordered by estimated generalized gamma Q value (from least to most heavy-tailed; shown in the right column). The |$\Delta \log L$| axis is truncated to −500 to show differences in the best-performing models; missing points indicate models that did not converge in the cross validation. Stocks that did not converge using the GGD on the full dataset were omitted. GOA: Gulf of Alaska; HS-QCS: Hecate Strait and Queen Charlotte Sound; WCVI: West Coast Vancouver Island.

Open in new tab Download slide

Three illustrative examples of spatiotemporal-model-derived area-weighted indices of biomass from the GOA (Arrowtooth Flounder, Pacific Ocean Perch, and Pacific Spiny Dogfish) show that the GGD performed as expected, i.e. resembling the gamma or lognormal distributions depending upon the estimated value for Q. When the estimated Q was high, the index estimated by the delta-GGD model tended to better match the delta-gamma model (e.g. Arrowtooth Flounder in Fig. 5). As Q approached 0, the index estimated by the GGD model became similar in trend and magnitude as the index estimated by the delta-lognormal model (Fig. 5, Figs S7–S9). The scale estimated by the GGD was similar to the design-based scale for Arrowtooth Flounder (Table S2). However, for Pacific Ocean Perch and Pacific Spiny Dogfish, although the GGD had the highest AIC weight, the scale of the delta–gamma was closer to the design-based index (Table S2). Results for all other survey-species combinations are shown in Figs S7–S9.

Figure 5.

Examples of spatiotemporal-model-derived area-weighted indices of biomass for three groundfish species in the GOA across four observation likelihoods. Left column shows the indices of biomass (mean and 95% CI) with AIC weight percentages shown at the top. Right columns illustrate associated quantile–quantile plots for randomized quantile residuals that would be distributed as standard normal if the model were consistent with the data. The diagonal line corresponds to the 1:1 line. The panels are ordered from top to bottom in order of decreasing Q value (i.e. light-tailed to heavy-tailed). For Pacific Spiny Dogfish, we have truncated the y-axis and omitted values of the index estimated using the Tweedie |$>10^4$| to improve visualization of the differences in the indices estimated using the delta-lognormal, delta–gamma, and delta–gamma. The biomass indices for all other stocks are shown in Figs S7–S9.

Open in new tab Download slide

Discussion

We present a parameterization of the GGD in terms of the mean and coefficient of variation that allows its incorporation into existing software for spatiotemporal modelling. In a simulation experiment, the GGD demonstrated robust performance against distribution misspecification, and when tested across 14 species in three regions, the GGD had the highest or close to the highest AIC weight in most cases. Compared to other commonly used distributions (e.g. lognormal, gamma, Tweedie), the GGD offers a flexible alternative for fitting fisheries-independent survey observations.

Across a range of simulated Q values, and when fit to data simulated from other distributions, the GGD performed as expected in the simulation analysis. The GGD closely resembled the lognormal distribution for lognormal-simulated data and the gamma distribution for gamma-simulated data. In contrast, bias increased as simulated Q increased (tails became lighter than lognormal) when the lognormal distribution was fitted. The GGD offers greater flexibility by fitting the lognormal distribution when appropriate but avoiding the bias seen with the lognormal model as tail behaviour diverges from lognormal. Moreover, the GGD’s ability to estimate distributions that are heavier tailed than the lognormal (⁠|$Q < 0$|⁠), with tails between the lognormal and gamma (⁠|$\sigma > Q > 0$|⁠), or with tails lighter than the gamma (⁠|$Q > \sigma$|⁠) adds a new and practical option for use in comparative studies (e.g. Dick 2004, Thorson et al. 2021) and real-world index-standardization. Lack-of-fit in the tails of the distribution when incorrectly fitting the gamma and lognormal distributions was clearly seen in quantile–quantile plots. These plots therefore appear useful to diagnose instances when the GGD might be appropriate.

In practice, the GGD demonstrated high predictive accuracy across the species and regions we studied. The GGD performed similarly to, or better than, the lognormal and gamma models. This suggests that the extra parameter does not lead to overfitting or overparameterization in most cases. Furthermore, the GGD indicates that for some species, a distribution that is more heavy-tailed than the lognormal is more appropriate for index estimation. The risk of selecting the lognormal or gamma distribution for assessing fish populations may be heightened by unpredictable demographic changes over time, and selecting one of these distributions at a point in time may prevent the detection of such changes in the future. From our results, it seems prudent for fisheries researchers to use the GGD instead, which could estimate a different Q as new data are added. In some cases, it may be useful to allow Q to vary through time such as by fitting separate GGD parameters for each year of the survey to detect whether there are shifts in the shape of tails over time.

Despite the additional flexibility of the GGD, in cases where index scale is important, such as when the catchability coefficient is fixed a priori in an assessment model, it is important to compare the model-based index scale with a design-based scale or to a priori use the gamma or Tweedie distributions if a design-based index is unavailable. In our analysis, we show that for GOA Pacific Ocean Perch and GOA Pacific Spiny Dogfish, AIC and the estimated GGD Q parameter identify the lognormal or a distribution that is more heavy-tailed than lognormal, respectively, as parsimonious. However, for both stocks, the index estimated using a gamma distribution more closely matches the design-based index. Therefore, we reiterate the caution from Thorson et al. (2021) that specific purposes might be better suited by a priori model selection and avoid using AIC (or cross validation) as the only criterion for model selection.

We note that using the mean-CV parameterization of the GGD has many potential applications outside of the spatiotemporal model presented here. For example, Albertsen et al. (2017) compared the GGD against other likelihoods when fitting abundance-at-age or alternative age-composition data. However, that study used the Prentice (1974) parameterization (i.e. using a central tendency that does not match the distribution mean). Existing R packages, such as flexsurv, parameterize the GGD using the location (⁠|$\mu$|⁠), scale (⁠|$\sigma$|⁠), and shape (Q) parameters, whereas the packages VAST and sdmTMB use the mean-CV parameterization shown in Table 1. We recommend that further applications consider using the mean-CV parameterization, so that model parameters are estimated to match the mean of data (rather than a central tendency that is difficult to interpret and depends on details of the parameterization). For example, Cadigan and Myers (2001) explored the gamma and lognormal for sequential population models, and Jiao et al. (2004) explored these options for modelling variation in recruitment. Both of these applications could potentially benefit from testing the GGD. Additionally, Cacciapaglia et al. (2024) explored the generalized-gamma distribution for index standardization, using the implementation available in VAST that we document for the first time in the present paper. That study found the GGD was selected by AIC for three of four stocks, and had |$\Delta \mathrm{AIC} < 2$| for the remaining stock compared with the selected lognormal distribution. Finally, Monnahan et al. (unpublished data) have used the GGD to weight indices in stock assessment models, again using the parameterization derived from the present study.

We recommend future research to test the performance of species distribution models using the GGD for other applications such as characterizing ecosystem drivers and projecting spatial distributions under alternative climate scenarios. We also recommend further comparative research to explore biological attributes that are associated with survey catches that have larger outliers than predicted by the lognormal (i.e. |$Q < 0$|⁠). Previous studies have called these ‘extreme catch events’ (ECEs) and have represented ECEs using a mixture distribution that arises from a mixture of dispersed and shoaling habitats (Thorson et al. 2012) or with multivariate-t random fields (Anderson and Ward 2019). These ECEs appear to arise predictably for aggregating and shoaling species (e.g. some rockfishes, dogfish), and indices for these species then typically have substantially higher uncertainty. We therefore recommend that the GGD be used as a generalized method for identifying biological characteristics associated with these ECEs.

Acknowledgements

We thank C.C. Monnahan, who identified the potential role of the generalized gamma distribution for index standardization while contributing to the Discussion section of Thorson et al. (2021). We thank K. Kristensen and the developers of Template Model Builder, without which this work would not be computationally feasible. We thank C.N. Rooper, L.A.K. Barnett, and two anonymous reviewers for helpful comments that improved this manuscript.

Author contribution

J.C.D.: Data curation, Formal analysis, Investigation, Software, Validation, Visualization, Writing—original draft, Writing—review & editing. J.C.: Data curation, Formal analysis, Investigation, Methodology, Writing—original draft, Writing review & editing, Software, Validation. S.C.A.: Methodology, Project administration, Software, Supervision, Validation, Writing—review & editing. J.T.T.: Conceptualization, Formal analysis, Methodology, Project administration, Software, Supervision, Writing review & editing.

Conflict of interest

None declared.

Data availability

Code and data for this analysis are available at https://github.com/pbs-assess/gengamma and are archived at 10.5281/zenodo.15028629.

References

Aitchison

On the distribution of a positive random variable having a discrete probability mass at the origin

J Am Stat Assoc

1955

;

901

Month:	Total Views:
April 2025	317
May 2025	24

Article Contents

The generalized gamma is a flexible distribution that outperforms alternatives when modelling catch rate data

Abstract

Introduction

Introducing the generalized gamma distribution

Fisheries applications

Methods

GGD parameterization

Simulation testing

Multi-stock analysis

Results

Simulation testing

Multi-stock analysis

Discussion

Acknowledgements

Author contribution

Conflict of interest

Data availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only