A comparative analysis of alternative approaches to fitting species-abundance models

Connolly, Sean R.; Thibaut, Loïc M.

doi:10.1093/jpe/rtr043

Abstract

Aims

Fits of species-abundance distributions to empirical data are increasingly used to evaluate models of diversity maintenance and community structure and to infer properties of communities, such as species richness. Two distributions predicted by several models are the Poisson lognormal (PLN) and the negative binomial (NB) distribution; however, at least three different ways to parameterize the PLN have been proposed, which differ in whether unobserved species contribute to the likelihood and in whether the likelihood is conditional upon the total number of individuals in the sample. Each of these has an analogue for the NB. Here, we propose a new formulation of the PLN and NB that includes the number of unobserved species as one of the estimated parameters. We investigate the performance of parameter estimates obtained from this reformulation, as well as the existing alternatives, for drawing inferences about the shape of species abundance distributions and estimation of species richness.

Methods

We simulate the random sampling of a fixed number of individuals from lognormal and gamma community relative abundance distributions, using a previously developed ‘individual-based’ bootstrap algorithm. We use a range of sample sizes, community species richness levels and shape parameters for the species abundance distributions that span much of the realistic range for empirical data, generating 1 000 simulated data sets for each parameter combination. We then fit each of the alternative likelihoods to each of the simulated data sets, and we assess the bias, sampling variance and estimation error for each method.

Important Findings

Parameter estimates behave reasonably well for most parameter values, exhibiting modest levels of median error. However, for the NB, median error becomes extremely large as the NB approaches either of two limiting cases. For both the NB and PLN, >90% of the variation in the error in model parameters across parameter sets is explained by three quantities that corresponded to the proportion of species not observed in the sample, the expected number of species observed in the sample and the discrepancy between the true NB or PLN distribution and a Poisson distribution with the same mean. There are relatively few systematic differences between the four alternative likelihoods. In particular, failing to condition the likelihood on the total sample sizes does not appear to systematically increase the bias in parameter estimates. Indeed, overall, the classical likelihood performs slightly better than the alternatives. However, our reparameterized likelihood, for which species richness is a fitted parameter, has important advantages over existing approaches for estimating species richness from fitted species-abundance models.

Poisson lognormal, negative binomial, Poisson gamma, species-abundance distributions, species richness

Issue Section:

research articles

INTRODUCTION

Ecologists have sought to understand what drives patterns in the commonness and rarity of species for much of the last century (Magurran and Henderson 2011). In high-diversity assemblages, species-by-species estimates of demographic and other ecological parameters, such as competitive ability, cannot be made feasibly for the entire community, due both to the sheer number of species that must be studied and to the fact that the overwhelming majority of species in such communities are rare (McGill et al. 2007). Consequently, studies of commonness and rarity in high-diversity assemblages often focus on the overall frequency distributions of common and rare species, usually termed the ‘species-abundance distribution’ or ‘relative abundance distribution’, without respect to species identity. Such data are frequently fitted with alternative models, either as part of an attempt to identify the mechanisms that drive biodiversity maintenance in the community (Hubbell 2001; Lande et al. 2003) or to assess community properties for conservation purposes (Engen 2007; Gray 1981).

Two probability distributions that commonly recur in theory for species abundance distributions are the lognormal and the gamma distributions (Lande et al. 2003). These two distributions were first proposed when the analysis of commonness and rarity was still in its infancy, largely for phenomenological reasons (Fisher et al. 1943; Preston 1948). More recently, they have been derived from more mechanistic models of community structure. For instance, the lognormal and gamma arise as important special cases of stochastic abundance models (Diserud and Engen 2000; Engen and Lande 1996a, 1996b). The lognormal distribution also arises as a limiting case of the occupancy of high-dimensional niche space (Connolly et al. 2009; May 1975) and as an approximation for second-order departures from pure neutrality (Pueyo 2006). The gamma distribution, or its discrete analogue, the negative binomial (NB) distribution, is generated by several formulations of neutral theory (Etienne et al. 2007; Volkov et al. 2003, 2007).

Most species abundance distributions record counts of individuals, and they are incomplete samples from a community, rather than complete censuses. Thus, in most contemporary approaches to the analysis of species-abundance models, sampling theory is applied to derive a distribution for the probability that a species is represented by a particular number of individuals in a random sample from a community with a particular underlying community abundance distribution (Connolly and Dornelas 2011):

(1)

where P(n|x) is the probability of observing n individuals of a species in a sample, given that its mean abundance in the community is x, and f(x) is the probability distribution of mean abundance (i.e. it approximates the variation among species in mean abundance).

When P(n|x) is a Poisson distribution (i.e. sampling is random) and f(x) is a lognormal distribution, then the corresponding sample follows a Poisson lognormal (PLN) distribution:

(2)

where n is a natural number (n = 0, 1, 2, …), and μ and σ are the mean and standard deviation (SD) of log abundance in the community. When f(x) is the gamma distribution, then the sample follows an NB distribution:

(3)

where a and k are the scale and shape parameters of the gamma distribution, and m = ak is the mean of the resulting NB distribution.

In most cases, when the PLN and NB are fitted to species-abundance data, the log-likelihood that is used has a so-called ‘zero-truncated’ form and implicitly assumes that all species’ abundances are independent of one another:

(4)

In this expression, θ is the vector of parameter values (μ and σ for the PLN, and m and k for NB), S_obs is the number of species observed in the sample, N is the total number of individuals in the sample, s_n is the number of species with abundance n and p(n) is the PLN or NB probability from equation (2) or (3). is thus the probability that a species has abundance n, conditional on the fact that it is observed at least once in the sample (i.e. its abundance is not zero). The product follows from the assumption that all species’ abundances are statistically independent of one another, and the multinomial coefficient in front of the product accounts for all the different ways that the S_obs species can have the observed set of abundances.

There are two unrealistic assumptions associated with the likelihood in equation (4) that are potentially problematic. One is the implicit assumption that a species’ abundance n can be arbitrarily large. Another is the assumption that species’ abundances are statistically independent of one another. In ecological samples, however, the abundance of a species in a sample will be constrained by total sampling effort or sample size and the abundances of the other species. One way to explicitly incorporate these two assumptions in the likelihood is to make it conditional on the total number of individuals observed in the sample:

(5)

where n_i is the abundance of species i in the sample (Etienne and Olff 2005). Here, the numerator is the unconditional likelihood from equation (4), and the denominator normalizes this likelihood by all possible combinations of abundances of the S_obs observed species that would add up to a total sample size of N individuals. Etienne and Olff (2005) applied this likelihood to the PLN, but the expression applies equally well to the NB.

Although the likelihoods in equations (4) and (5) consider only the species that appear in the sample being analysed, they can be used to estimate the total number of species in the community. That is, the fitted parameters θ can be used in equation (2) or (3) to estimate p(0) and the size of the community species pool obtained from:

(6)

(Pielou 1975). However, because this is a derived parameter (i.e. it is not estimated directly by an optimization procedure when the likelihoods in equation (4) or (5) are fitted to species-abundance data), estimating the sampling variance for this potentially ecologically important quantity requires a computationally intensive approach (Connolly et al. 2005; Engen 2007). Moreover, one may wish to constrain multiple samples from the same community to have the same S or to specify a covariate model for variation in S among different communities. When S is a derived rather than a fitted parameter, such constrains are extremely difficult to impose.

Recently, Izsak (2008) proposed re-formulating the PLN likelihood with three fitted parameters, to explicitly estimate the total number of species in the community, S:

(7)

Note that, here, the products run from 0 to N, rather than from 1 to N, to incorporate the species with abundance zero, and the probabilities are not normalized by 1 − p(0). An analogous approach could be adopted for the NB, estimating m, k and S. In addition to requiring estimation of an additional parameter (which we might expect to increase the sampling variance of parameter estimates), this approach has the disadvantage of incorporating hypothetical data (i.e. the unobserved species) into the likelihood. The incorporation of unobserved species is potentially problematic for model selection: if two competing models estimate different numbers of unobserved species, the one that estimates the most unobserved species will tend to have a smaller likelihood, even if the two models fit the observed part of the distribution equally well (see Appendix).

Because of the growing interest in fitting species abundance models to data, we here comparatively evaluate the statistical performance of different approaches to fitting the PLN and NB to species abundance data. First, we propose an alternative parameterization of the PLN and NB, which includes species richness as an estimated parameter, but which has only two fitted parameters, and for which unobserved species do not contribute to the likelihood. We then conduct a simulation study to assess error of parameter estimates using this new likelihood, as well as the classical (equation 4), conditional (equation 5) and three-parameter (equation 7) likelihoods for both these distributions. We also evaluate how differences in bias and sampling variance of parameter estimates contribute to these differences. In particular, we are interested in whether the failure to condition on sample size increases the error associated with the likelihoods that omit this normalization and on whether a reparameterization of the zero-truncated likelihoods provides a satisfactory alternative to the three-parameter likelihood of equation (7) when direct estimation of S is desired. Finally, we identify regression models that explain variation in estimate error among a broad range of parameter sets in terms of functions of those parameters.

APPROXIMATE TWO-PARAMETER LIKELIHOOD WITH SPECIES RICHNESS

The first moment of a PLN distribution is

(Shaban 1988). Consequently, if the true number of species in the community is S, then the expected total number of individuals, given μ and σ, is

(8)

We can solve this to express μ as a function of S and σ, if we substitute the observed sample size, N, for the expected sample size:

(9a)

(9b)

By substituting the right-hand side of equation (9b) for μ in equation (2), the classical (zero-truncated) PLN is reparameterized as a function of the two community-level parameters, σ and S, rather than μ, which depends on sampling intensity. Note that this is a kind of hybrid moment–maximum likelihood approach because the observed total sample size N has been used in place of E(N) in equation (9a).

We can reparameterize the NB in similar fashion by expressing the mean, m, in terms of S and N:

(10)

Thus, S can be explicitly estimated without hypothetical, unobserved species contributing to the likelihood. Also, in contrast to the three-parameter formulation, there are only two fitted parameters (S and the shape parameter σ or k).

METHODS

To evaluate the bias and precision of the parameter estimates produced by the four alternative likelihoods of the PLN and NB, we conduct parametric bootstrap simulations. That is, we simulate data using a set of ‘true’ parameter values, and we compare the estimated values of our parameters with these true values. The data are simulated using a previously developed ‘hypergeometric’ bootstrapping algorithm, in which a fixed number of individuals N is sampled from an underlying relative abundance distribution of S_true species that is either lognormal with shape parameter σ_true or gamma with shape parameter k_true (see Connolly et al. 2009 for a detailed description).

Bootstrap simulations were conducted for a range of parameter values designed to capture much of the realistic range for species abundance data likely to be fitted with species-abundance models (Fig. 1). We consider sample sizes N of 100, 1 000 and 10 000 because this spans the orders of magnitude of sample size exhibited by most species abundance data sets, at least for macroscopic organisms. For example, the largest samples from individual locations (e.g. Dornelas and Connolly 2008; Volkov et al. 2007) are on the order of tens of thousands of individuals. We consider S_true values of 20, 100 and 500.

Figure 1:

illustration of the expected abundance distributions associated with the parameter sets explored in this paper. The left column of panels shows the PLN and the right column the NB. In each panel, color indicates the shape parameter, and line type indicates the sample size (in number of individuals), for species richness levels of (a, b) S = 20, (c, d) S = 100 and (e, f) S = 500.

Open in new tab Download slide

For the PLN, we consider σ_true values of 1, 2 and 3 for the standard deviation (SD) of log abundance. Lande et al. (2003) report σ = 2 as a typical value for natural communities. Moreover, analyses of ensembles of species abundance distributions suggest that estimates very rarely fall below 1 or above 3. For instance, in Connolly et al. (2005), all but one local coral and fish abundance distribution had 1 < σ < 3 (also see Engen et al. (2002, 2008) for insects and Gaston and Blackburn (2000) for birds). From equation (8), these 27 parameter combinations yield estimates of the sampling intensity parameter μ_true that range from about −6 to 6. We excluded one combination of parameter values (N = 100, S_true = 20, σ_true = 3) because the expected number of observed species (i.e. those with abundance >0) was <10. We think it unlikely that anyone would seek to fit a species-abundance model to such a dataset at least in isolation.

For the NB, we wished to explore small values of the shape parameter, k, which produce log series-like distributions (i.e. distributions without a mode on a log scale, k < 1); as well as cases with k > 1, which produce distributions of log abundance with an internal mode. Here, we choose k_true values of 0.02, 0.2, 2 and 20. Ideally, we would have liked to explore even smaller values of k_true. In particular, the log series represents a limiting case of the NB for which k → 0 and fitted estimates of the NB often yield values of k that are extremely small (Connolly et al. 2009; Volkov et al. 2007). However, for very small values of k, p(0) tends to unity (see equation 3). This tends to cause significant problems with convergence of optimization algorithms. These problems are manageable when fitting one or a few data sets but not when tens of thousands of fits are involved, as in the present study. In addition, several parameter combinations led to expected numbers of observed species <10; as with the PLN, these parameter combinations were excluded. These included all cases with k = 0.02 and S = 20, as well as the combination N = 100, k = 0.02 and S = 100.

For all simulated data sets, parameters were estimated by numerical maximization of the model’s log-likelihood, using the optimization function nlminb() in R (R Core Development Team 2011). Inspection of likelihood surfaces suggested that a commonly used alternative, optim(), tended to falsely report successful convergence for abundance distributions with small μ for the PLN, or small or large k for the NB, too frequently to yield reliable sampling distributions of parameter estimates. For the NB, we used the function dnbinom() to calculate p(n) in equation (3). For the PLN, convergence problems were frequently encountered when using the R function integrate() to evaluate equation (2). Therefore, we used Bulmer’s (1974) closed-form approximation for p(n), for n ≥ 25:

(11)

Using a smaller cutoff abundance value (e.g. n > 10, as recommended by Bulmer 1974) increased the prevalence of convergence problems. Conversely, increasing the threshold further (e.g. to n = 100) did not measurably improve either convergence or sampling distributions of parameter estimates. To calculate p(n), we used the function dpolono() in package VGAM (Yee 2010), which allows user specification of the cutoff abundance value from which to apply equation (11). For the constrained PLN and NB likelihoods, we applied a fast Fourier transform to approximate the quantity in the denominator of equation (5) (cf. Etienne and Olff 2005), using the R function fft(). Note that, in fitting the three-parameter likelihood, we convert factorials to gamma functions, in order to estimate S as a real number (rather than an integer, as in Izsak 2008): this helps maintain comparability with the other likelihoods, for which S is a derived parameter and can take non-integer values.

We used the frequency distributions of estimates of the natural logarithm of the community-level parameters of the model (⁠

\hat{S}

and the shape parameters

\hat{σ}

or

\hat{k}

⁠) from the fits to simulated data to evaluate the performance of each of the different likelihood approaches. For the classical and conditional likelihoods, we calculated, from the other fitted parameters (equation 6). In general, we found that these frequency distributions were not consistently skewed, but they were frequently leptokurtic, with a large number of outliers. Therefore, we quantify estimator performance using absolute median error, rather than mean-squared error:

(12)

where

{\hat{θ}}_{i}

is the estimate of parameter θ (log(σ), log(S) or log(k)) for simulated data set i and θ_true is the value of the parameter that was actually used to simulate the data. Similarly, we quantify bias using the median:

(13)

For the same reason, we compared sampling variability in parameter estimates based on the interquartile range (IQR), rather than the SD, of parameter estimates (see, e.g. Cabrera and Watson 1997).

To assess whether estimators were reasonably statistically well behaved (i.e. biases were small relative to their stochastic variability), we normalized MB to IQR. To compare the performance of the different likelihoods, we analysed their absolute median errors across all the parameter sets examined, using paired Wilcoxon signed-ranks tests (with Bonferroni corrections), along with graphical inspection. We inspect median bias and IQR to understand the extent to which systematic differences between likelihood methods in absolute median error are due to differences in bias or sampling variance. We also examine absolute median error as functions of the parameters used to generate the data, in order to identify the characteristics of species abundance distributions that are likely to be associated with reliable or unreliable parameter estimates. Finally, we develop a regression model to better understand what drives variation in absolute median error among our true parameter sets.

RESULTS

Summary statistics

Box plots of parameter estimates (on a natural logarithm scale and normalized to the true parameter values) are shown in Supplementary DataSupplementary Data (see online supplementary material). Median biases are shown in Supplementary Data and Supplementary Data (see online supplementary material), IQR in Supplementary Data and Supplementary Data (see online supplementary material) and absolute median error in Supplementary Data and Supplementary Data (see online supplementary material). Convergence information is reported in Supplementary Data (see online supplementary material).

Comparison of methods: bias and variance

Frequency distributions of the median bias, relative to IQR, across parameter sets, indicated that parameter estimates were well behaved in most cases (Fig. 2). In particular, for the PLN, median biases were always less than IQR, with the exception of two parameter sets for the reparameterized model (Fig. 2a and b). These corresponded to (N = 1 000 and N = 10 000, σ_true = 3, S_true = 100), for which $\hat{S}$ was more strongly negatively biased than it was for the other likelihoods (Supplementary Data, see online supplementary material). For the NB, the majority of parameter sets also produced relatively well-behaved estimates, except when k_true = 0.02 (Fig. 2c and d). In these cases, for both parameters, there were four cases for the three-parameter likelihood and one for the constrained likelihood, for which bias exceeded IQR for one or both parameters. For the three-parameter likelihood, these cases all involved large positive biases in k, and negative biases in S, whereas the opposite was the case for the constrained likelihood (Supplementary Data and Supplementary Data, see online supplementary material).

Figure 2:

median bias, expressed as a proportion of IQR, for (a) log(σ) and (b) log(S) of the PLN and (c) log(k) and (d) log(S) of the NB. Each panel shows the frequency distribution (across parameter sets) of the ratio of median bias to IQR, for each of the four likelihoods considered in this paper.

Open in new tab Download slide

For the NB, parameter estimates also failed to converge in a large proportion of simulations for three parameter sets with N = 100: k_true = 20 and S_true = 100, k_true = 2 and S_true = 500 and k_true = 20 and S_true = 500 (Supplementary Data, see online supplementary material). In these cases, $\hat{K}$ sometimes wandered off towards very large positive values, indicating that the simulated data were close to a Poisson distribution. For these cases, IQRs in k were very large for the classical, reparameterized and constrained likelihoods but not the three-parameter likelihood (Supplementary Data, see online supplementary material).

Comparison of methods: absolute median error

For the PLN, the classical likelihood performed best, exhibiting low absolute median error for both σ and S (Fig. 3; Supplementary Data, see online supplementary material). For σ, the constrained likelihood performed worst, with absolute median error consistently and significantly higher than those of the other likelihoods. Conversely, there was no clear best-performing likelihood, with the classical and reparameterized likelihoods exhibiting similar levels of absolute median error, and the three-parameter likelihood only marginally worse. Poor performance of the constrained likelihood was due to its larger IQR than the alternatives, rather than larger bias (Supplementary Data, see online supplementary material). For S, the classical likelihood had significantly lower median error than all the alternatives, while the other three likelihoods did not exhibit significant differences in median error (Fig. 3; Supplementary Data, see online supplementary material). The good performance of the classical likelihood was due mainly to its lower bias in S, compared to the reparameterized and three-parameter likelihoods (Supplementary Data, see online supplementary material), but to its lower IQR, compared to the constrained likelihood (Supplementary Data, see online supplementary material).

Figure 3:

comparison of absolute median error in PLN parameter estimates across the four likelihoods. Median error in log(σ) is shown in the upper left half of the figure; median error in log(S) is shown in the bottom right. In each panel, the coordinates of each point represent the absolute median error (across simulations) of the two likelihoods being compared, for one of the parameter sets used to simulate the data. In each panel, these median errors were recalculated using only those simulated datasets for which fits converged for both of the likelihoods being compared. The solid line is the unity line. Thus, points above and left of the unity line indicate that the method plotted on the vertical axis exhibits greater median error than that on the horizontal axis and vice versa for points below and right of the unity line. P values report the results of a Wilcoxon signed-ranks test for paired observations. P values indicating statistically significant differences between methods (after Bonferroni correction: critical α = 0.083) are in bold.

Open in new tab Download slide

For the NB, the four likelihoods all exhibited very similar levels of median error. There was some suggestion of greater error in k for the three-parameter likelihood, relative to the classical and reparameterized likelihoods, but this was statistically significant only for the reparameterized likelihood. For S, there was some suggestion of lower error overall for the constrained likelihood, although this was statistically significant only in comparison to the reparameterized model. However, there were two cases for which the constrained likelihood performed particularly poorly (Fig. 4; Supplementary Data6, see online supplementary material), due to a combination of high bias and high sampling variance (Supplementary Data, see online supplementary material). These cases corresponded to the parameter combinations with the largest expected number of unobserved species (>100; parameter sets N = 100 and N = 1 000, k = 0.02, S = 500).

Figure 4:

comparison of absolute median error in NB parameter estimates across the four likelihoods, plotted in the same way as in Fig. 3. Median error in log(k) is shown in the upper left half of the figure; median error in log(S) is shown in the bottom right.

Open in new tab Download slide

Determinants of median error

Patterns of variation in median error as a function of sample size, species richness and shape parameter were qualitatively identical across methods for both the PLN and NB. Therefore, we illustrate here in the main text the patterns using the classical likelihood (Fig. 5). For the PLN, relatively low levels of error (∼20% of the true parameter value or less) for both parameters occurred when one of two conditions were met: either N = 10 000, or N = 1 000 and σ_true < 3 (Fig. 5a and b). For both parameters, median error is an increasing function of σ_true and a decreasing function of N. For σ, when N = 1 000 or 10 000, error decreased with increasing S_true. Conversely, for S, error increased with S_true. For the NB, error tended to be greater overall than for the PLN, particularly when k_true < 1 (Fig. 5c and d). Low levels of median error (∼20% or smaller) in both parameters occurred only for simulations with k_true > 1 and N > 100, and k_true = 0.2, N = 10 000. For k, error decreased as N increased. In general, error also decreased to a minimum at k_true = 2, except when sample size was small and the number of species was large. In this case, sampling variances were extremely large. Error also tended to increase with S_true when k_true > 1 or N = 100. For S, the patterns were similar. However, error remained low as k_true increased from 2 to 20, rather than increasing. Also, the biases tended to be smaller than for k, particularly the bias in S for the small N, high S_true cases with k_true = 2 and k_true = 20, for which the error was large, but not outrageously so.

Figure 5:

absolute median error of (a) log(σ) and (b) log(S) for the PLN and (c) log(k) and (d) log(S) of the NB, for each of the parameter combinations used. Values of the relevant shape parameter are shown on the horizontal axis. Other parameter values are as follows: N = 100 (black), N = 1 000 (red), N = 10 000 (green), S_true = 20 (circles), S_true = 100 (upward-pointing triangles) and S_true = 500 (downward pointing triangles).

Open in new tab Download slide

Examination of the patterns in Fig. 5 led us to consider a regression model, in which variation in median error depends upon three quantities: the incompleteness of the distribution (here measured as p₀, the expected fraction of species not observed in the sample), the expected number of observed species, E(S_obs), and the extent to which the distribution departs from a Poisson distribution. As a measure of the latter, we simply summed the absolute differences in p(n) between PLN or NB and a Poisson distribution with the same mean (i.e. we calculated one minus the percentage overlap in the two distributions). A regression model incorporating these three explanatory variables explained >90% of the variation in median error among parameter sets for both the PLN and NB. The results of this regression using the classical likelihood are reported in Table 1 and Fig. 6, but similar results were obtained using the other likelihoods.

Table 1:

Open in new tab

regressions of absolute median error in the fitted parameters of the PLN and NB

Model	Absolute median error in	Explanatory variables	Coefficients	Standard error	P	R²
PLN	log σ	p(0)	0.983	0.097	9 × 10⁻¹⁰*	0.97
		dist.pois	−0.988	0.178	1 × 10⁻⁵*
		log[E(S_obs)]	−0.483	0.024	2 × 10⁻¹⁵*
		Intercept	−0.006	0.174	0.974
	log S	p(0)	0.862	0.051	2 × 10⁻¹⁴*	0.93
		log[E(S_obs)]	−0.052	0.014	0.0014*
		Intercept	0.254	0.064	0.0006*
NB	log k	p(0)	2.632	0.131	4 × 10⁻¹⁸*	0.97
		log(dist.pois)	−0.513	0.034	6 × 10⁻¹⁵*
		log[E(S_obs)]	−0.490	0.042	3 × 10⁻¹²*
		Intercept	−0.242	0.181	0.192
	log S	p(0)	1.215	0.064	1 × 10⁻¹⁷*	0.94
		log(dist.pois)	0.048	0.016	0.007*
		log[E(S_obs)]	−0.048	0.020	0.025*
		Intercept	0.273	0.088	0.004*

Model	Absolute median error in	Explanatory variables	Coefficients	Standard error	P	R²
PLN	log σ	p(0)	0.983	0.097	9 × 10⁻¹⁰*	0.97
		dist.pois	−0.988	0.178	1 × 10⁻⁵*
		log[E(S_obs)]	−0.483	0.024	2 × 10⁻¹⁵*
		Intercept	−0.006	0.174	0.974
	log S	p(0)	0.862	0.051	2 × 10⁻¹⁴*	0.93
		log[E(S_obs)]	−0.052	0.014	0.0014*
		Intercept	0.254	0.064	0.0006*
NB	log k	p(0)	2.632	0.131	4 × 10⁻¹⁸*	0.97
		log(dist.pois)	−0.513	0.034	6 × 10⁻¹⁵*
		log[E(S_obs)]	−0.490	0.042	3 × 10⁻¹²*
		Intercept	−0.242	0.181	0.192
	log S	p(0)	1.215	0.064	1 × 10⁻¹⁷*	0.94
		log(dist.pois)	0.048	0.016	0.007*
		log[E(S_obs)]	−0.048	0.020	0.025*
		Intercept	0.273	0.088	0.004*

Absolute median error has been log-transformed for the shape parameters σ and k and square-root transformed for species richness S. p(0) is the probability that a species is not observed in the species-abundance sample. E(S_obs) is the expected number of observed species in the sample and ‘dist.pois’ is one minus the percent overlap of the true NB or PLN distribution and a Poisson distribution with the same mean.

*

P values are significant with a confidence level of α = 0.05.

Table 1:

Open in new tab

regressions of absolute median error in the fitted parameters of the PLN and NB

Model	Absolute median error in	Explanatory variables	Coefficients	Standard error	P	R²
PLN	log σ	p(0)	0.983	0.097	9 × 10⁻¹⁰*	0.97
		dist.pois	−0.988	0.178	1 × 10⁻⁵*
		log[E(S_obs)]	−0.483	0.024	2 × 10⁻¹⁵*
		Intercept	−0.006	0.174	0.974
	log S	p(0)	0.862	0.051	2 × 10⁻¹⁴*	0.93
		log[E(S_obs)]	−0.052	0.014	0.0014*
		Intercept	0.254	0.064	0.0006*
NB	log k	p(0)	2.632	0.131	4 × 10⁻¹⁸*	0.97
		log(dist.pois)	−0.513	0.034	6 × 10⁻¹⁵*
		log[E(S_obs)]	−0.490	0.042	3 × 10⁻¹²*
		Intercept	−0.242	0.181	0.192
	log S	p(0)	1.215	0.064	1 × 10⁻¹⁷*	0.94
		log(dist.pois)	0.048	0.016	0.007*
		log[E(S_obs)]	−0.048	0.020	0.025*
		Intercept	0.273	0.088	0.004*

Model	Absolute median error in	Explanatory variables	Coefficients	Standard error	P	R²
PLN	log σ	p(0)	0.983	0.097	9 × 10⁻¹⁰*	0.97
		dist.pois	−0.988	0.178	1 × 10⁻⁵*
		log[E(S_obs)]	−0.483	0.024	2 × 10⁻¹⁵*
		Intercept	−0.006	0.174	0.974
	log S	p(0)	0.862	0.051	2 × 10⁻¹⁴*	0.93
		log[E(S_obs)]	−0.052	0.014	0.0014*
		Intercept	0.254	0.064	0.0006*
NB	log k	p(0)	2.632	0.131	4 × 10⁻¹⁸*	0.97
		log(dist.pois)	−0.513	0.034	6 × 10⁻¹⁵*
		log[E(S_obs)]	−0.490	0.042	3 × 10⁻¹²*
		Intercept	−0.242	0.181	0.192
	log S	p(0)	1.215	0.064	1 × 10⁻¹⁷*	0.94
		log(dist.pois)	0.048	0.016	0.007*
		log[E(S_obs)]	−0.048	0.020	0.025*
		Intercept	0.273	0.088	0.004*

Absolute median error has been log-transformed for the shape parameters σ and k and square-root transformed for species richness S. p(0) is the probability that a species is not observed in the species-abundance sample. E(S_obs) is the expected number of observed species in the sample and ‘dist.pois’ is one minus the percent overlap of the true NB or PLN distribution and a Poisson distribution with the same mean.

*

P values are significant with a confidence level of α = 0.05.

Figure 6:

observed versus predicted median error of (a) log(σ) and (b) log(S) for the PLN and (c) log(k) and (d) log(S) of the NB, illustrating the fit of the regression model from Table 1. The solid line is the unity line, and each point represents one of the parameter combinations used to simulate the data. Note that absolute median error is log-transfomed for the shape parameters σ and k and square root transformed for species richness S.

Open in new tab Download slide

DISCUSSION

For most parameter sets, all the likelihoods perform similarly well, exhibiting relatively low median error and low bias relative to sampling variance. However, for the NB in particular, median error can be extremely large either when k is very small (k = 0.02 in this study) or when N is small (N = 100 in this study; Fig. 5), patterns that are reflected in very high median biases (Supplementary Data, see online supplementary material). While these biases were particularly large for estimates of the shape parameter k, they were also large for the parameter S, at least when k was small. Because fits of the NB to species-abundance data frequently suggest log series-like (i.e. k << 1) distributions (Connolly et al. 2009; Volkov et al. 2007), this suggests that the traditional maximum likelihood approach of fitting the NB to a single species-abundance distribution is problematic.

Our regression models facilitate an explanation of what factors drive the variation in estimator error that we found in this study. Our three explanatory variables were the probability of a species having abundance zero, the expected number of observed species, and the difference between the abundance distribution and the closest Poisson distribution. p₀ is an inverse measure of the ‘completeness’ of the expected abundance distribution, as a representation of the true underlying lognormal or gamma relative abundance distribution. The lower p₀ is, the more ‘unveiled’ that underlying distribution is (sensuPreston 1948). The expected number of observed species is a measure of how complete a picture of this observable part of the abundance distribution the sample will constitute (i.e. how ‘filled in’ the expected abundance distribution will be with observed species abundances). Thus, these two quantities together indicate how effectively the observed data reveal the shape of the underlying community abundance distribution. However, each of our models is derived from assuming Poisson sampling from an underlying gamma or lognormal pattern of relative abundances. For some parameter values (e.g. low μ, low σ_true for the PLN; low m, high k_true for the NB), the Poisson sampling component dominates the variability in observed species abundances, and the overlap between the NB or PLN and the Poisson distribution is high. In such cases, the shape of the sample abundance distribution is less sensitive to the parameters of the underlying gamma or lognormal distribution, and median error is greater. The predictability of absolute median error suggests that a similar set of explanatory variables may be used to develop a comprehensive bias correction technique for the PLN and NB distributions (see Cabrera and Watson 1997 for a similar approach).

Comparisons of the performance of the different likelihoods suggest that, on balance, considering both the PLN and NB, the classical formulation of the likelihood performs better than the alternatives, even though S is a derived rather than fitted parameter in that formulation. It was the only likelihood for which median bias was always less than IQR (Fig. 2). For the PLN, the classical likelihood also yields the lowest error in estimates of σ and equal lowest error in estimation of S. For the NB, there was little to differentiate the likelihoods: the three-parameter likelihood was marginally worse than the classical and reparameterized likelihoods for k, whereas the constrained likelihood was marginally better than the non-constrained likelihoods for S. Only one of these differences was statistically significant, however (Fig. 4).

Our comparisons provide a direct test of whether the statistical performance of parameter estimates for the classical, reparameterized and three-parameter likelihoods are biased by their failure to account for the fact that the potential shapes that a sample abundance distribution can take are constrained by the total number of individuals in the sample. Previous work has investigated this possibility indirectly, by comparing parameter bias for unconstrained likelihoods fitted to two different types of simulated data: data for which the total number of individuals sampled is fixed (as in the simulations used here) and data for which each species’ abundance is an independent Poisson random variable (Connolly and Dornelas 2011; Connolly et al. 2009). The results in this paper are consistent with the conclusions drawn from the earlier simulation studies: biases do not appear to be systematically greater when constraints on total sample size are ignored by the likelihood.

Indeed, contrary to our expectation, the constrained likelihood actually performed worse than the unconstrained likelihoods for the PLN. It is not immediately obvious why consistently larger IQR in σ would result from the normalization that differentiates the constrained likelihood from the other candidates examined in this paper. It may well be that the poorer performance of the constrained likelihood arises from numerical error, rather than the likelihood per se. For example, calculating the normalization constant for the PLN requires numerical evaluation of the integral in equation (2) up to n = 25 for a large range of values of p(n). Thus, any numerical approximation errors would be amplified, relative to the other likelihoods. The fact that the constrained likelihood did not perform worse than the alternatives for the NB would be consistent with such an explanation (assuming that numerical approximation error associated with evaluating the gamma functions in the NB likelihood are small by comparison). Regardless, however, there is certainly limited evidence that the lack of a constraint on total community size in the classical, reparameterized, or three-parameter likelihoods leads to materially poorer parameter estimates.

Our results do suggest that the reparameterized likelihood provides a viable alternative to the three-parameter likelihood, when the direct estimation of total species richness is desired. Indeed, there is some support for using this likelihood in preference to the three-parameter likelihood: the reparameterized likelihood exhibits significantly lower median error in k (but not S) for the NB, and there is a suggestion that it also exhibits lower median error in σ for the PLN (this difference is only marginally non-significant after a Bonferroni correction). We had expected the additional parameter used in the three-parameter likelihood to lead to greater sampling variance, compared to two-parameter likelihoods. Indeed, there does appear to be slightly greater IQR for the three-parameter model for k and σ (Supplementary Data, see online supplementary material), but lower bias also appears to contribute to the better performance of the reparameterized likelihood (Supplementary Data, see online supplementary material).

Because the three-parameter likelihood, all other things being equal, will tend to be a decreasing function of the predicted number of unobserved species ( Appendix), we conjectured that the three-parameter likelihoods would be a poor choice when model selection is one of the goals of analysis, and the competing models predict different numbers of unobserved species. Preliminary support for this conjecture becomes apparent when we fit both the NB and PLN to data simulated with the NB hypergeometric bootstrap and then compare model selection results based on the reparameterized versus the three-parameter likelihood (Table 2). In this case, the true NB distribution consistently estimates a larger proportion of unobserved species, and indeed, this does appear to impair model selection by the three-parameter likelihood. In 31 of 32 parameter sets, the median log-likelihood ratio comparing the NB and PLN fits is greater when model selection is based on the reparameterized likelihood than when it is based on the three-parameter likelihood. In other words, the reparameterized approach favors the true NB distribution more strongly than the three-parameter likelihood (Table 2). Moreover, across parameter sets, the magnitude of the underperformance of the three-parameter likelihood is correlated with the difference in estimated by the NB and the PLN. Indeed, if we exclude those parameter sets where for both the PLN and NB (in such cases, the sampling variance of the difference in is very large, relative to its magnitude), the difference in explains almost all the variation in the relative model selection performance of the two likelihoods (Spearman’s ρ = 0.96, P < 0.0001).

Table 2:

Open in new tab

model selection results for data simulated with the hypergeometric NB bootstrap

True parameters			median(ℒ_NB / ℒ_PLN)			$median [p (0)^]$
N	k	S	Repar	Threepar	Performance difference	NB	PLN
100	0.2	20	0.29	−0.06	0.35	0.160	0.073
1 000	0.2	20	0.91	0.13	0.78	0.134	0.033
10 000	0.2	20	1.84	0.46	1.39	0.090	0.017
100	2	20	0.12	0.00	0.12	0.040	0.030
1 000	2	20	0.57	0.52	0.05	0.001	<0.001
10 000	2	20	0.61	0.59	0.02	<0.001	<0.001
100	20	20	0.00	0.00	0.00	0.007	0.009
1 000	20	20	−0.05	−0.09	0.04	<0.001	<0.001
10 000	20	20	0.05	0.05	0.00	<0.001	<0.001
1 000	0.02	100	0.74	−0.10	0.84	0.328	0.077
10 000	0.02	100	1.44	0.04	1.39	0.408	0.060
100	0.2	100	0.30	0.06	0.24	0.479	0.292
1 000	0.2	100	1.67	1.05	0.62	0.380	0.117
10 000	0.2	100	4.85	2.91	1.94	0.241	0.047
100	2	100	0.03	0.00	0.03	0.379	0.371
1 000	2	100	1.14	0.75	0.39	0.024	0.011
10 000	2	100	3.34	3.21	0.14	<0.001	<0.001
100	20	100	0.00	0.00	0.00	0.375	0.348
1 000	20	100	0.03	0.03	0.00	<0.001	<0.001
10 000	20	100	0.21	0.21	0.00	<0.001	<0.001
100	0.02	500	0.27	−0.02	0.29	0.602	0.304
1 000	0.02	500	1.16	0.54	0.62	0.748	0.195
10 000	0.02	500	3.57	1.74	1.83	0.742	0.139
100	0.2	500	0.06	0.00	0.06	0.741	0.617
1 000	0.2	500	2.27	1.83	0.44	0.587	0.256
10 000	0.2	500	11.00	9.41	1.59	0.382	0.101
100	2	500	0.00	0.00	0.00	0.804	0.757
1 000	2	500	0.50	0.40	0.10	0.244	0.200
10 000	2	500	9.03	8.25	0.78	0.008	0.002
100	20	500	0.00	0.00	0.00	0.854	0.832
1 000	20	500	0.00	0.00	0.00	0.146	0.148
10 000	20	500	−13.96	−13.87	−0.08	<0.001	<0.001

True parameters			median(ℒ_NB / ℒ_PLN)			$median [p (0)^]$
N	k	S	Repar	Threepar	Performance difference	NB	PLN
100	0.2	20	0.29	−0.06	0.35	0.160	0.073
1 000	0.2	20	0.91	0.13	0.78	0.134	0.033
10 000	0.2	20	1.84	0.46	1.39	0.090	0.017
100	2	20	0.12	0.00	0.12	0.040	0.030
1 000	2	20	0.57	0.52	0.05	0.001	<0.001
10 000	2	20	0.61	0.59	0.02	<0.001	<0.001
100	20	20	0.00	0.00	0.00	0.007	0.009
1 000	20	20	−0.05	−0.09	0.04	<0.001	<0.001
10 000	20	20	0.05	0.05	0.00	<0.001	<0.001
1 000	0.02	100	0.74	−0.10	0.84	0.328	0.077
10 000	0.02	100	1.44	0.04	1.39	0.408	0.060
100	0.2	100	0.30	0.06	0.24	0.479	0.292
1 000	0.2	100	1.67	1.05	0.62	0.380	0.117
10 000	0.2	100	4.85	2.91	1.94	0.241	0.047
100	2	100	0.03	0.00	0.03	0.379	0.371
1 000	2	100	1.14	0.75	0.39	0.024	0.011
10 000	2	100	3.34	3.21	0.14	<0.001	<0.001
100	20	100	0.00	0.00	0.00	0.375	0.348
1 000	20	100	0.03	0.03	0.00	<0.001	<0.001
10 000	20	100	0.21	0.21	0.00	<0.001	<0.001
100	0.02	500	0.27	−0.02	0.29	0.602	0.304
1 000	0.02	500	1.16	0.54	0.62	0.748	0.195
10 000	0.02	500	3.57	1.74	1.83	0.742	0.139
100	0.2	500	0.06	0.00	0.06	0.741	0.617
1 000	0.2	500	2.27	1.83	0.44	0.587	0.256
10 000	0.2	500	11.00	9.41	1.59	0.382	0.101
100	2	500	0.00	0.00	0.00	0.804	0.757
1 000	2	500	0.50	0.40	0.10	0.244	0.200
10 000	2	500	9.03	8.25	0.78	0.008	0.002
100	20	500	0.00	0.00	0.00	0.854	0.832
1 000	20	500	0.00	0.00	0.00	0.146	0.148
10 000	20	500	−13.96	−13.87	−0.08	<0.001	<0.001

The NB and PLN were both fitted to all simulated data sets, using the reparameterized (Repar) and three-parameter (Threepar) likelihoods. Then, model selection performance was assessed by calculating, for each simulated dataset, the ratio of the NB maximum log-likelihood to that of the PLN (so that larger positive values indicate stronger support for the true NB distribution over the PLN). This was done once using the reparameterized likelihood and once using the three-parameter likelihood. Performance difference is the difference between the log-likelihood ratios obtained using these two approaches (positive values indicate that model selection based on the reparameterized likelihood supports the true model more strongly than model selection based on the three-parameter likelihood). The final two columns report the estimated fraction of unobserved species, calculated from the maximum likelihood estimates for the three-parameter NB and PLN likelihoods. All medians are calculated across the simulated data sets for a given combination of true parameter values.

Table 2:

Open in new tab

model selection results for data simulated with the hypergeometric NB bootstrap

True parameters			median(ℒ_NB / ℒ_PLN)			$median [p (0)^]$
N	k	S	Repar	Threepar	Performance difference	NB	PLN
100	0.2	20	0.29	−0.06	0.35	0.160	0.073
1 000	0.2	20	0.91	0.13	0.78	0.134	0.033
10 000	0.2	20	1.84	0.46	1.39	0.090	0.017
100	2	20	0.12	0.00	0.12	0.040	0.030
1 000	2	20	0.57	0.52	0.05	0.001	<0.001
10 000	2	20	0.61	0.59	0.02	<0.001	<0.001
100	20	20	0.00	0.00	0.00	0.007	0.009
1 000	20	20	−0.05	−0.09	0.04	<0.001	<0.001
10 000	20	20	0.05	0.05	0.00	<0.001	<0.001
1 000	0.02	100	0.74	−0.10	0.84	0.328	0.077
10 000	0.02	100	1.44	0.04	1.39	0.408	0.060
100	0.2	100	0.30	0.06	0.24	0.479	0.292
1 000	0.2	100	1.67	1.05	0.62	0.380	0.117
10 000	0.2	100	4.85	2.91	1.94	0.241	0.047
100	2	100	0.03	0.00	0.03	0.379	0.371
1 000	2	100	1.14	0.75	0.39	0.024	0.011
10 000	2	100	3.34	3.21	0.14	<0.001	<0.001
100	20	100	0.00	0.00	0.00	0.375	0.348
1 000	20	100	0.03	0.03	0.00	<0.001	<0.001
10 000	20	100	0.21	0.21	0.00	<0.001	<0.001
100	0.02	500	0.27	−0.02	0.29	0.602	0.304
1 000	0.02	500	1.16	0.54	0.62	0.748	0.195
10 000	0.02	500	3.57	1.74	1.83	0.742	0.139
100	0.2	500	0.06	0.00	0.06	0.741	0.617
1 000	0.2	500	2.27	1.83	0.44	0.587	0.256
10 000	0.2	500	11.00	9.41	1.59	0.382	0.101
100	2	500	0.00	0.00	0.00	0.804	0.757
1 000	2	500	0.50	0.40	0.10	0.244	0.200
10 000	2	500	9.03	8.25	0.78	0.008	0.002
100	20	500	0.00	0.00	0.00	0.854	0.832
1 000	20	500	0.00	0.00	0.00	0.146	0.148
10 000	20	500	−13.96	−13.87	−0.08	<0.001	<0.001

True parameters			median(ℒ_NB / ℒ_PLN)			$median [p (0)^]$
N	k	S	Repar	Threepar	Performance difference	NB	PLN
100	0.2	20	0.29	−0.06	0.35	0.160	0.073
1 000	0.2	20	0.91	0.13	0.78	0.134	0.033
10 000	0.2	20	1.84	0.46	1.39	0.090	0.017
100	2	20	0.12	0.00	0.12	0.040	0.030
1 000	2	20	0.57	0.52	0.05	0.001	<0.001
10 000	2	20	0.61	0.59	0.02	<0.001	<0.001
100	20	20	0.00	0.00	0.00	0.007	0.009
1 000	20	20	−0.05	−0.09	0.04	<0.001	<0.001
10 000	20	20	0.05	0.05	0.00	<0.001	<0.001
1 000	0.02	100	0.74	−0.10	0.84	0.328	0.077
10 000	0.02	100	1.44	0.04	1.39	0.408	0.060
100	0.2	100	0.30	0.06	0.24	0.479	0.292
1 000	0.2	100	1.67	1.05	0.62	0.380	0.117
10 000	0.2	100	4.85	2.91	1.94	0.241	0.047
100	2	100	0.03	0.00	0.03	0.379	0.371
1 000	2	100	1.14	0.75	0.39	0.024	0.011
10 000	2	100	3.34	3.21	0.14	<0.001	<0.001
100	20	100	0.00	0.00	0.00	0.375	0.348
1 000	20	100	0.03	0.03	0.00	<0.001	<0.001
10 000	20	100	0.21	0.21	0.00	<0.001	<0.001
100	0.02	500	0.27	−0.02	0.29	0.602	0.304
1 000	0.02	500	1.16	0.54	0.62	0.748	0.195
10 000	0.02	500	3.57	1.74	1.83	0.742	0.139
100	0.2	500	0.06	0.00	0.06	0.741	0.617
1 000	0.2	500	2.27	1.83	0.44	0.587	0.256
10 000	0.2	500	11.00	9.41	1.59	0.382	0.101
100	2	500	0.00	0.00	0.00	0.804	0.757
1 000	2	500	0.50	0.40	0.10	0.244	0.200
10 000	2	500	9.03	8.25	0.78	0.008	0.002
100	20	500	0.00	0.00	0.00	0.854	0.832
1 000	20	500	0.00	0.00	0.00	0.146	0.148
10 000	20	500	−13.96	−13.87	−0.08	<0.001	<0.001

The NB and PLN were both fitted to all simulated data sets, using the reparameterized (Repar) and three-parameter (Threepar) likelihoods. Then, model selection performance was assessed by calculating, for each simulated dataset, the ratio of the NB maximum log-likelihood to that of the PLN (so that larger positive values indicate stronger support for the true NB distribution over the PLN). This was done once using the reparameterized likelihood and once using the three-parameter likelihood. Performance difference is the difference between the log-likelihood ratios obtained using these two approaches (positive values indicate that model selection based on the reparameterized likelihood supports the true model more strongly than model selection based on the three-parameter likelihood). The final two columns report the estimated fraction of unobserved species, calculated from the maximum likelihood estimates for the three-parameter NB and PLN likelihoods. All medians are calculated across the simulated data sets for a given combination of true parameter values.

Although biases in parameter estimates tended to be less than the IQR, especially for the classical likelihoods (Fig. 2), there were some parameter combinations where both biases and sampling variances were large, and thus, median error in parameter estimates was high. This was particularly the case for the NB distribution when sample size was small, species richness was large and the shape of the observed abundance distribution was relatively insensitive to changes in k (i.e. k << 1, where distributions approach the log series, and k >> 1, where distributions approach the Poisson. If the goal is simply to fit the observed species-abundance distribution, this high median error is relatively unimportant because the shape of the distribution is relatively insensitive to changes in parameter values in this range. However, because it is also possible to estimate species richness or infer demographic parameters from species abundance (e.g. k is the immigration rate in the neutral model of Volkov et al. 2007), some exploration of alternative approaches is needed in future work. We suspect that the most productive avenues will include the integration of multiple sources of information. For instance, if there are multiple local species-abundance samples from the same community, it is possible that better parameter estimates can be obtained by fitting multivariate species abundances (e.g. Etienne 2009). Alternatively, if independent sources of information about NB parameters are available (e.g. estimates of S from presence-absence data or species range maps), then this information may be incorporated within a Bayesian framework.

We conclude with several recommendations for species abundance analysis using the PLN and NB. Firstly, when the goal is simply to fit a distribution and draw inferences from the parameters, the classical formulation of the PLN likelihood appears to perform best overall, even though S is not estimated directly. Secondly, conditioning the likelihood on the total number of individuals in the sample appears to have a relatively small effect on the quality of parameter estimation. There is certainly no evidence of consistently greater median error in parameter estimates when this conditioning is not used; indeed, error in σ is larger for the constrained likelihood. Thus, the choice of whether or not to condition on the total number of sampled individuals should probably be determined by the purpose of analysis. For example, if a goal of analysis is model selection, and the alternative model is conditional upon a particular number of individuals (as in most neutral models), then conditioning the PLN or NB in similar fashion will ensure that likelihoods are comparable (Etienne and Olff 2005). Finally, the reparameterized (approximate) likelihood works as well as, or (in the case of the NB, for estimates of k) better than, the three-parameter likelihood. Thus, when direct estimates of S are required, and model selection is intended, the reparameterized formulation may be preferable.

SUPPLEMENTARY MATERIAL

Supplementary Figures S1–Supplementary Data and Supplementary Tables S1–Supplementary Data are available at Journal of Plant Ecology online.

FUNDING

Australian Research Council (grants CE0561435 and DP0880544) and James Cook University.

We thank F. He and Sun Yat Sen University for supporting SRC’s participation in the International Symposium in Biodiversity and Theoretical Ecology from which this contribution arose. Mizue Hisano provided assistance with statistical programming and figure preparation. We also thank R. Etienne for suggestions and R. Izsak for assistance with a technical point, at early stages of the work.

Conflict of interest statement. None declared.

APPENDIX

Consider an idealized case in which two models predict exactly the same observed abundance distribution (i.e. the two models have identical zero-truncated likelihoods), but they differ in the number of unobserved species they predict. For each model, the ratio of the non-truncated to the zero-truncated log-likelihood will be

(A.1)

The derivative of equation (A.1) with respect to s₀ is always negative, and thus, the model predicting the larger number of unobserved species (i.e. the larger p(0)) would have the smaller three-parameter likelihood. The derivative of (A.1) is messy, but it is possible to confirm by brute force evaluation that it is negative for moderately small (e.g. <100) values S_obs and s₀. For large values of S_obs and s₀, we can apply Stirling’s approximation:

to obtain a clearer analytical result:

and then differentiating:

References

Bulmer

MG

,

Fitting the Poisson lognormal distribution to species-abundance data

,

Biometrics

,

1974

, vol.

30

(pg.

101

-

10

)

Google Scholar

Crossref

WorldCat

Cabrera

J

Watson

GS

,

Simulation methods for mean and median bias reduction in parametric estimation

,

J Stat Plan Inference

,

1997

, vol.

57

(pg.

143

-

52

)

Google Scholar

Crossref

WorldCat

Connolly

SR

Dornelas

M

Magurran

AE

McGill

BJ

,

Fitting and empirical evaluation of species abundance models

,

Biological Diversity: Frontiers in Measurement and Assessment

,

2011

Oxford

Oxford University Press

(pg.

123

-

40

)

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Connolly

SR

Dornelas

M

Bellwood

DR

et al. ,

Testing species abundance models: a new bootstrap approach applied to Indo-Pacific coral reefs

,

Ecology

,

2009

, vol.

90

(pg.

3138

-

49

)

Connolly

SR

Hughes

TP

Bellwood

DR

et al. ,

Community structure of corals and reef fishes at multiple scales

,

Science

,

2005

, vol.

309

(pg.

1363

-

5

)

Diserud

OH

Engen

S

,

A general and dynamic species abundance model, embracing the lognormal and the gamma models

,

Am Nat

,

2000

, vol.

155

(pg.

497

-

511

)

Dornelas

M

Connolly

SR

,

Multiple modes in a coral species abundance distribution

,

Ecol Lett

,

2008

, vol.

11

pg.

1008

pg.

1016

Engen

S

,

Heterogeneous communities with lognormal species abundance distribution: species-area curves and sustainability

,

J Theor Biol

,

2007

, vol.

249

(pg.

791

-

803

)

Engen

S

Lande

R

,

Population dynamic models generating species abundance distributions of the gamma type

,

J Theor Biol

,

1996

, vol.

178

(pg.

325

-

31

)

Google Scholar

Crossref

WorldCat

Engen

S

Lande

R

,

Population dynamic models generating the lognormal species abundance distribution

,

Math Biosci

,

1996

, vol.

132

(pg.

169

-

83

)

Engen

S

Lande

R

Walla

T

et al. ,

Analyzing spatial structure of communities using the two- dimensional Poisson lognormal species abundance model

,

Am Nat

,

2002

, vol.

160

(pg.

60

-

73

)

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Engen

S

Saether

BE

Sverdrup-Thygeson

A

et al. ,

Assessment of species diversity from species abundance distributions at different localities

,

Oikos

,

2008

, vol.

117

(pg.

738

-

48

)

Google Scholar

Crossref

WorldCat

Etienne

RS

,

Improved estimation of neutral model parameters for multiple samples with different degrees of dispersal limitation

,

Ecology

,

2009

, vol.

90

(pg.

847

-

52

)

Etienne

RS

Alonso

D

McKane

AJ

,

The zero-sum assumption in neutral biodiversity theory

,

J Theor Biol

,

2007

, vol.

248

(pg.

522

-

36

)

Etienne

RS

Olff

H

,

Confronting different models of community structure to species-abundance data: a Bayesian model comparison

,

Ecol Lett

,

2005

, vol.

8

(pg.

493

-

504

)

Fisher

RA

Corbet

AS

Williams

CB

,

The relation between the number of species and the number of individuals in a random sample of an animal population

,

J Anim Ecol

,

1943

, vol.

12

(pg.

42

-

58

)

Google Scholar

Crossref

WorldCat

Gaston

KJ

Blackburn

TM

,

Pattern and Process in Macroecology

,

2000

Blackwell

Oxford

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Gray

JS

,

Detecting pollution induced changes in communities using the log-normal distribution of individuals among species

,

Mar Pollut Bull

,

1981

, vol.

12

(pg.

173

-

6

)

Google Scholar

Crossref

WorldCat

Hubbell

SP

,

The Unified Neutral Theory of Biodiversity and Biogeography

,

2001

Princeton, NJ

Princeton University Press

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Izsak

R

,

Maximum likelihood fitting of the Poisson lognormal distribution

,

Environ Ecol Stat

,

2008

, vol.

15

(pg.

143

-

56

)

Google Scholar

Crossref

WorldCat

Lande

R

Engen

S

Saether

BE

,

Stochastic Population Dynamics in Ecology and Conservation

,

2003

New York

Oxford University Press

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Magurran

AE

Henderson

PA

Magurran

AE

McGill

BJ

,

Commonness and rarity

,

Biological Diversity: Frontiers in Measurement and Assessment

,

2011

Oxford

Oxford University Press

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

May

RM

Cody

ML

Diamond

JM

,

Patterns of species abundance and diversity

,

Ecology and Evolution of Communities

,

1975

Cambridge

Belknap Press of Harvard University Press

(pg.

81

-

120

)

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

McGill

BJ

Etienne

RS

Gray

JS

et al. ,

Species abundance distributions: moving beyond single-prediction theories to integration within an ecological framework

,

Ecol Lett

,

2007

, vol.

10

(pg.

995

-

1015

)

Pielou

EC

,

Ecological Diversity

,

1975

New York

Wiley

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Preston

FW

,

The commonness, and rarity, of species

,

Ecology

,

1948

, vol.

29

(pg.

254

-

83

)

Google Scholar

Crossref

WorldCat

Pueyo

S

,

Diversity: between neutrality and structure

,

Oikos

,

2006

, vol.

112

(pg.

392

-

405

)

Google Scholar

Crossref

WorldCat

Shaban

SA

Crow

EL

Shimizu

K

,

Poisson-lognormal distributions

,

Lognormal Distributions: Theory and Applications

,

1988

Boca Raton, FL

CRC Press

(pg.

195

-

210

)

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

R Core Development Team

,

R: A Language and Environment for Statistical Computing

,

2011

Vienna, Austria

R Foundation for Statistical Computing

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Volkov

I

Banavar

JR

Hubbell

SP

et al. ,

Neutral theory and relative species abundance in ecology

,

Nature

,

2003

, vol.

424

(pg.

1035

-

7

)

Volkov

I

Banavar

JR

Hubbell

SP

et al. ,

Patterns of relative species abundance in rainforests and coral reefs

,

Nature

,

2007

, vol.

450

(pg.

45

-

9

)

Yee

TW

,

The VGAM package for categorical data analysis

,

J Stat Software

,

2010

, vol.

32

(pg.

1

-

34

)

Google Scholar

Crossref

WorldCat

© The Author 2012. Published by Oxford University Press on behalf of the Institute of Botany, Chinese Academy of Sciences and the Botanical Society of China. All rights reserved. For permissions, please email: journals.permissions@oup.com

Download all slides

Month:	Total Views:
December 2016	1
January 2017	2
February 2017	9
March 2017	8
April 2017	1
May 2017	2
June 2017	2
August 2017	3
September 2017	8
October 2017	12
November 2017	5
December 2017	15
January 2018	12
February 2018	14
March 2018	27
April 2018	27
May 2018	23
June 2018	17
July 2018	12
August 2018	14
September 2018	11
October 2018	13
November 2018	13
December 2018	19
January 2019	22
February 2019	19
March 2019	12
April 2019	37
May 2019	29
June 2019	14
July 2019	15
August 2019	13
September 2019	17
October 2019	24
November 2019	11
December 2019	28
January 2020	19
February 2020	15
March 2020	15
April 2020	15
May 2020	12
June 2020	15
July 2020	28
August 2020	8
September 2020	23
October 2020	11
November 2020	10
December 2020	13
January 2021	5
February 2021	5
March 2021	9
April 2021	17
May 2021	3
June 2021	9
July 2021	10
August 2021	13
September 2021	25
October 2021	19
November 2021	26
December 2021	9
January 2022	20
February 2022	14
March 2022	17
April 2022	23
May 2022	14
June 2022	8
July 2022	16
August 2022	16
September 2022	15
October 2022	22
November 2022	7
December 2022	10
January 2023	13
February 2023	7
March 2023	5
April 2023	8
May 2023	23
June 2023	13
July 2023	24
August 2023	13
September 2023	12
October 2023	29
November 2023	7
December 2023	19
January 2024	16
February 2024	15
March 2024	12
April 2024	23
May 2024	9
June 2024	10
July 2024	22
August 2024	5
September 2024	15
October 2024	14
November 2024	12
December 2024	13
January 2025	4
February 2025	4
March 2025	19
April 2025	22

Article Contents

A comparative analysis of alternative approaches to fitting species-abundance models

Abstract

INTRODUCTION

APPROXIMATE TWO-PARAMETER LIKELIHOOD WITH SPECIES RICHNESS

METHODS

RESULTS

Summary statistics

Comparison of methods: bias and variance

Comparison of methods: absolute median error

Determinants of median error

DISCUSSION

SUPPLEMENTARY MATERIAL

FUNDING

APPENDIX

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

A comparative analysis of alternative approaches to fitting species-abundance models

Abstract

INTRODUCTION

APPROXIMATE TWO-PARAMETER LIKELIHOOD WITH SPECIES RICHNESS

METHODS

RESULTS

Summary statistics

Comparison of methods: bias and variance

Comparison of methods: absolute median error

Determinants of median error

DISCUSSION

SUPPLEMENTARY MATERIAL

FUNDING

APPENDIX

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only