MLNe: Simulating and estimating effective size and migration rate from temporal changes in allele frequencies

Fig. 2.

Profile log-likelihood curve generated by MLNe for a simulated dataset. The maximum-likelihood estimate of N_e is 121, with a 95% confidence interval of (65, 241).

Although a GUI has many advantages, it also has some disadvantages. For example, a GUI is inconvenient for analyzing multiple datasets in batch mode. In a simulation study, however, many replicate datasets usually need to be analyzed, and it is desirable to call MLNe directly from a program to conduct the analyses. In the new version of MLNe, the computational kernel is written in Fortran 2003 and has been compiled for Linux, Mac, and Windows 10. It can be run in an x-terminal on Linux and Mac, and in an MS-DOS window on Windows. When run in non-GUI mode, MLNe reads a genotype data file and a corresponding parameter file, runs the analysis, and writes the results to a few files.

A hurdle in applying the previous version (1.0) of MLNe is that it requires the counts of each allele at each locus in each temporal sample as the input data. This means the raw genotype data must be pre-processed in two steps. First, the unique alleles observed at each locus across temporal samples must be identified from the genotype data. Second, the copies of each unique allele at each locus in each temporal sample must be counted. Both steps require some coding for all but extremely small datasets. The new version 2.0 completes the data pre-processing automatically. It accepts the raw genotype data in any of three formats (a genotype encoded by two integers; a genotype encoded by a single integer 0, 1, or 2 giving the number of copies of the reference allele at a diallelic marker; or the GenePop format), together with an input file of analysis parameters.
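The two pre-processing steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MLNe's Fortran code: the function name `count_alleles`, the nested-list data layout, and the use of 0 as a missing-allele code are all assumptions made for this example.

```python
from collections import Counter

def count_alleles(samples):
    """Identify unique alleles per locus and count their copies per sample.

    samples[s][i][l] is a pair (a1, a2) of integer allele codes for
    individual i at locus l in temporal sample s. The code 0 is treated
    as a missing allele (an assumption of this sketch).
    Returns (alleles, counts), where alleles[l] is the sorted list of
    unique alleles at locus l across all samples, and counts[s][l][k]
    is the number of copies of alleles[l][k] in sample s.
    """
    n_samples = len(samples)
    n_loci = len(samples[0][0])
    # Tally allele copies separately for each sample and locus.
    tallies = [[Counter() for _ in range(n_loci)] for _ in range(n_samples)]
    for s, sample in enumerate(samples):
        for individual in sample:
            for l, genotype in enumerate(individual):
                for allele in genotype:
                    if allele != 0:
                        tallies[s][l][allele] += 1
    # Step 1: unique alleles at each locus, pooled over temporal samples.
    alleles = [
        sorted(set().union(*(tallies[s][l] for s in range(n_samples))))
        for l in range(n_loci)
    ]
    # Step 2: copies of each unique allele in each sample (0 if unobserved).
    counts = [
        [[tallies[s][l][a] for a in alleles[l]] for l in range(n_loci)]
        for s in range(n_samples)
    ]
    return alleles, counts

# Two temporal samples, two individuals each, two loci (made-up data).
example = [
    [[(1, 2), (3, 3)], [(1, 1), (3, 4)]],
    [[(2, 2), (4, 4)], [(0, 0), (3, 5)]],
]
example_alleles, example_counts = count_alleles(example)
```

An allele seen in only one temporal sample still gets a (zero) count in the other, which is what makes the counts comparable across samples.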

Simulations

MLNe has a built-in simulation module that can be used to simulate genotype data with user-defined parameters, such as the model (isolation or migration), the true value of N_e, the number of individuals sampled at each time point, the sampling interval between temporal samples, and the number and polymorphism of marker loci. On obtaining values of these parameters through the GUI (Fig. 3), MLNe runs an individual-based forward simulation under the Wright–Fisher model to generate genotype data and writes them to a file. It also generates a corresponding input file of parameters for analyzing the genotype data. The two files are then used by MLNe to obtain estimates of N_e (and m for the migration model).
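To make the idea of an individual-based forward simulation concrete, the following Python sketch simulates an isolated Wright–Fisher population and draws two temporal samples from it. It is not MLNe's simulation module. The function name and parameters are hypothetical, and the sketch assumes a monoecious diploid population with random mating (selfing allowed), equal initial allele frequencies, free recombination among loci, and sampling that does not remove individuals from the population.

```python
import random

def simulate_temporal_samples(n_e, n_loci, n_alleles, sample_size,
                              interval, seed=None):
    """Simulate two temporal samples from an isolated Wright-Fisher population.

    Each individual is a list of loci; each locus is a pair of allele codes
    in 1..n_alleles. Returns (first_sample, last_sample), taken `interval`
    generations apart.
    """
    rng = random.Random(seed)
    # Founding generation: alleles drawn with equal frequencies (assumption).
    population = [
        [(rng.randrange(1, n_alleles + 1), rng.randrange(1, n_alleles + 1))
         for _ in range(n_loci)]
        for _ in range(n_e)
    ]
    first_sample = rng.sample(population, sample_size)
    for _ in range(interval):
        offspring = []
        for _ in range(n_e):
            # Each parent is drawn at random with replacement.
            mother = rng.choice(population)
            father = rng.choice(population)
            # Free recombination: every locus segregates independently.
            child = [(rng.choice(mother[l]), rng.choice(father[l]))
                     for l in range(n_loci)]
            offspring.append(child)
        population = offspring
    last_sample = rng.sample(population, sample_size)
    return first_sample, last_sample

first, last = simulate_temporal_samples(
    n_e=50, n_loci=4, n_alleles=2, sample_size=10, interval=3, seed=1)
```

Under these assumptions the population size equals N_e, so the simulated allele frequency changes between the two samples reflect drift at the specified N_e plus sampling error.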

Fig. 3.

New simulation project wizard.

In addition to investigating factors affecting the power and accuracy of the temporal approaches, simulations are also valuable for optimizing the experimental design. It is useful, for example, to determine the sampling intensities (number of markers, number and temporal interval of samples, and sample sizes) needed to yield accurate N_e estimates. Before initiating a project, one can use simulations to generate data under conditions similar to those of the planned project, and analyze the simulated data to get a feel for the estimation power and accuracy. For the same reason, simulations are also valuable for training and educational purposes.

The simulation module is capable of simulating genotype data for hundreds of thousands of individuals at hundreds of thousands of loci (see an example below). However, it is worth noting that the simulation assumes free recombination among loci, which is clearly violated when many genomic markers are simulated for any species. In the presence of linkage, temporal methods could yield estimated 95% confidence intervals that are anti-conservative (i.e. too narrow), although they are expected to yield good point estimates of N_e regardless of the linkage among markers.

Flexible models and multiple methods

Methods under two population genetics models, isolation and migration, are implemented in MLNe. The isolation model is the one assumed in nearly all temporal methods since the seminal work of Krimbas and Tsakas (1971). In this model, a population is assumed to be isolated, without immigration from other populations, during the period between the first and last samples taken from it. Therefore, the changes in allele frequency at a neutral marker locus over the relatively short sampling period (so that mutation is negligible) come solely from genetic drift and reflect the average N_e of the population during the period. Using both a moment estimator (Nei and Tajima 1981) and a likelihood estimator (Wang 2001), MLNe analyzes the data and yields N_e estimates with 95% confidence intervals, as demonstrated in Fig. 2.
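As an illustration of how a moment estimator turns allele frequency change into an N_e estimate, the sketch below uses the widely cited form of Nei and Tajima's standardized variance, F_c, with the usual correction for sampling error when individuals are sampled before reproduction and not replaced. This is an assumption about the general approach, not a description of MLNe's implementation, which may pool loci or correct for sampling differently. The function name and arguments are hypothetical.

```python
def moment_ne(freqs_first, freqs_last, s_first, s_last, generations):
    """Moment estimate of N_e from two temporal samples (isolation model).

    freqs_first[l][k] and freqs_last[l][k] are sample frequencies of allele k
    at locus l in the first and last samples; s_first and s_last are the
    numbers of diploid individuals sampled; generations is the interval t.
    Returns (fc, ne).
    """
    terms = []
    for x, y in zip(freqs_first, freqs_last):        # loop over loci
        for xi, yi in zip(x, y):                     # loop over alleles
            denom = (xi + yi) / 2 - xi * yi
            if denom > 0:                            # skip uninformative alleles
                terms.append((xi - yi) ** 2 / denom)
    # F_c averaged over all alleles of all loci (weights loci by allele number).
    fc = sum(terms) / len(terms)
    # Subtract the variance expected from sampling 2*S gene copies each time.
    drift = fc - 1 / (2 * s_first) - 1 / (2 * s_last)
    # If sampling error explains all of F_c, the estimate is infinite.
    ne = generations / (2 * drift) if drift > 0 else float("inf")
    return fc, ne

# One diallelic locus, 50 individuals per sample, 10 generations apart.
fc_example, ne_example = moment_ne([[0.5, 0.5]], [[0.6, 0.4]], 50, 50, 10)
```

In this made-up example F_c is 0.04, of which 0.02 is attributed to sampling, leaving a drift signal of 0.02 over 10 generations and hence an N_e estimate of 250.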

The migration model removes the restrictive assumption of a single isolated population and considers a metapopulation consisting of a small (focal) population and an infinitely large source population providing immigrants into the focal population (Wang and Whitlock 2003). The methods are robust to violations of this assumption and can be applied approximately to a finite source population composed of one or more small subpopulations (Wang and Whitlock 2003). MLNe implements a moment estimator and a likelihood estimator developed by Wang and Whitlock (2003) to jointly estimate the N_e of the focal population and the immigration rate (m) into it.

For both models, MLNe allows for and uses any number of temporal samples in the estimation. For more than two samples, the average N_e and m over the entire sampling period are estimated directly by the likelihood method, while N_e and m for each sampling period are estimated by the moment estimator and their harmonic and arithmetic means over multiple sampling periods are reported, respectively.
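The way per-interval moment estimates are combined can be shown with a short Python sketch. The numbers are invented for illustration, the function name is hypothetical, and the sketch assumes equal weighting of sampling periods, which may not match how MLNe weights intervals of different lengths.

```python
from statistics import harmonic_mean, mean

def summarise_intervals(ne_estimates, m_estimates):
    """Combine per-interval moment estimates over multiple sampling periods.

    N_e is summarised by its harmonic mean and m by its arithmetic mean,
    with every sampling period weighted equally (assumption).
    """
    return harmonic_mean(ne_estimates), mean(m_estimates)

# Three sampling periods with made-up estimates.
ne_overall, m_overall = summarise_intervals([100.0, 200.0, 400.0],
                                            [0.01, 0.02, 0.03])
```

The harmonic mean is the natural summary for N_e because the amount of drift per generation is proportional to 1/N_e, so periods of small N_e dominate the long-term average.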

Parallel computation

When genomic data from many markers are used to estimate the N_e of a large population (say, N_e in the tens of thousands), the likelihood estimator becomes computationally demanding and may take a long time to complete an analysis. This is especially so for a metapopulation in the migration model, where N_e and m are inferred jointly. To speed up the analysis, MLNe uses both the Message Passing Interface (MPI) and OpenMP to analyze the data in parallel with multiple processes and multiple threads per process. The numbers of MPI processes and of OpenMP threads per process are both set by the user according to the data size and computer capacity. While MPI processes can use multiple nodes of a computer with distributed memory or multiple cores of a computer with shared memory, OpenMP threads use the cores of a single node with shared memory. Roughly, the computational efficiency depends on the total number of parallel threads, which is the product of the number of MPI processes and the number of OpenMP threads per MPI process.

To demonstrate the computational speedup of applying MPI and OpenMP and the capacity of MLNe to handle large genomic data sampled from a large population, I simulated data from an isolated large population of N_e = 60,000 using the simulation module. Two small samples separated by 10 generations, each containing only 50 individuals, were taken from the population, and each sampled individual was genotyped at 100,000 single nucleotide polymorphism (SNP) loci. The data were analyzed by MLNe with a maximal N_e set at 100,000, using 2, 3, 4, 5, or 6 nodes of a Linux cluster. Each node of the cluster has two 20-core Intel Xeon Gold 6248 2.5 GHz processors with 192 gigabytes of 2933 MHz DDR4 RAM, and each physical core has two logical cores by hyperthreading. Therefore, each node has 2 × 20 × 2 = 80 logical cores, which are used as OpenMP threads in MLNe. The total number of parallel threads used in analyzing the data is thus 80n, where n (= 2, 3, 4, 5, 6) is the number of nodes, which is also the number of MPI processes. The times taken to analyze the data using different numbers of nodes (threads) are compared in Fig. 4.
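The thread arithmetic in this example can be written out explicitly. The short Python sketch below only reproduces the counts stated in the text (one MPI process per node, one OpenMP thread per logical core); the function names are hypothetical and nothing here reflects MLNe's own options or launch commands.

```python
def logical_cores_per_node(sockets, cores_per_socket, threads_per_core):
    """Logical cores on one node: sockets x physical cores x hardware threads."""
    return sockets * cores_per_socket * threads_per_core

def total_parallel_threads(mpi_processes, omp_threads_per_process):
    """Total threads = MPI processes x OpenMP threads per process."""
    return mpi_processes * omp_threads_per_process

# Two 20-core processors per node, two logical cores per physical core.
threads_per_node = logical_cores_per_node(2, 20, 2)

# One MPI process per node, for n = 2..6 nodes.
threads_by_nodes = {n: total_parallel_threads(n, threads_per_node)
                    for n in range(2, 7)}
```

The resulting totals, 160 to 480 threads in steps of 80, are the thread counts implied by the 80n rule for the runs compared in Fig. 4.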

Fig. 4.

Running time (minutes) as a function of the number of parallel threads (x axis) for an example dataset.

The example in Fig. 4 shows that (1) MLNe has the capacity to handle genomic data in estimating the N_e of an extremely large population and (2) running time decreases as the total number of parallel threads used in an analysis increases. The speedup from parallelization does not increase linearly with the number of threads, perhaps because of the communication cost among threads (OpenMP) and processes (MPI).

Conclusion

MLNe is a powerful software package implementing multiple population genetics models (migration and isolation) and multiple statistical methods under each model for estimating N_e (and m for the migration model) from any number of temporally spaced samples of individuals, with each individual genotyped at either a few microsatellite loci or many thousands of SNPs. It can be run on multiple computer platforms with or without a GUI, and has a built-in simulation module for generating simulated temporal genotype data. It uses both MPI and OpenMP for parallel computation, exploiting multiple computer nodes with distributed memory and multiple cores within a node with shared memory. As a result, it can handle genomic data for estimating the N_e and migration rate of very large populations. I hope it will become a valuable tool for conservation genetics research and teaching.

References

Anderson EC, Williamson EG, Thompson EA. Monte Carlo evaluation of the likelihood for Ne from temporally spaced samples. Genetics. 2000;156(4):2109–2118.

Beaumont MA. Estimation of population growth or decline in genetically monitored populations. Genetics. 2003;164(3):1139–1160.

Berthier P, Beaumont MA, Cornuet JM, Luikart G. Likelihood-based estimation of the effective population size using temporal changes in allele frequencies: a genealogical approach. Genetics. 2002;160(2):741–751.

Crow JF, Kimura M. An introduction to population genetics theory. New York (NY): Harper and Row; 1970.

Do C, Waples RS, Peel D, Macbeth GM, Tillett BJ, Ovenden JR. NeEstimator v2: re-implementation of software for the estimation of contemporary effective population size (Ne) from genetic data. Mol Ecol Resour. 2014;14(1):209–214.

Fisher RA. The genetical theory of natural selection. Oxford (UK): Oxford University Press; 1930.

Hui TYJ, Burt A. Estimating effective population size from temporally spaced samples with a novel, efficient maximum-likelihood algorithm. Genetics. 2015;200(1):285–293.

Krimbas CB, Tsakas S. The genetics of Dacus oleae V. Changes of esterase polymorphism in a natural population following insecticide control: selection or drift? Evolution. 1971;25(3):454–460.

Laval G, SanCristobal M, Chevalet C. Maximum-likelihood and Markov chain Monte Carlo approaches to estimate inbreeding and effective size from allele frequency changes. Genetics. 2003;164(3):1189–1204.

Luikart G, Ryman N, Tallmon DA, Schwartz MK, Allendorf FW. Estimation of census and effective population sizes: the increasing usefulness of DNA-based approaches. Conserv Genet. 2010;11(2):355–373.

Nei M, Tajima F. Genetic drift and estimation of effective population-size. Genetics. 1981;98(3):625–640.

Pollak E. A new method for estimating the effective population size from allele frequency changes. Genetics. 1983;104(3):531–548.

Schwartz MK, Luikart G, Waples RS. Genetic monitoring as a promising tool for conservation and management. Trends Ecol Evol. 2007;22(1):25–33.

Wang J. A pseudo-likelihood method for estimating effective population size from temporally spaced samples. Genet Res. 2001;78(3):243–257.

Wang J. Estimation of effective population sizes from data on genetic markers. Philos Trans R Soc Lond B Biol Sci. 2005;360(1459):1395–1409.

Wang J, Santiago E, Caballero A. Prediction and estimation of effective population size. Heredity. 2016;117(4):193–206.

Wang J, Whitlock MC. Estimating effective population size and migration rates from genetic samples over space and time. Genetics. 2003;163(1):429–446.

Waples RS. A generalised approach for estimating effective population size from temporal changes in allele frequency. Genetics. 1989;121(2):379–391.

Williamson EG, Slatkin M. Using maximum likelihood to estimate population size from temporal changes in allele frequencies. Genetics. 1999;152(2):755–761.

Wright S. Evolution in Mendelian populations. Genetics. 1931;16(2):97–159.