Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods

Akihiko Nishimura, David B. Dunson and Jianfeng Lu

Biometrika, Volume 107, Issue 2, June 2020, Pages 365–380, https://doi.org/10.1093/biomet/asz083
Summary
Hamiltonian Monte Carlo has emerged as a standard tool for posterior computation. In this article we present an extension that can efficiently explore target distributions with discontinuous densities. Our extension in particular enables efficient sampling from ordinal parameters through the embedding of probability mass functions into continuous spaces. We motivate our approach through a theory of discontinuous Hamiltonian dynamics and develop a corresponding numerical solver. The proposed solver is the first of its kind, with a remarkable ability to exactly preserve the Hamiltonian. We apply our algorithm to challenging posterior inference problems to demonstrate its wide applicability and competitive performance.
1. Introduction
Markov chain Monte Carlo is routinely used to generate samples from posterior distributions. While specialized algorithms exist for restricted model classes, general-purpose algorithms are often inefficient and scale poorly in the number of parameters. Originally proposed by Duane et al. (1987) and popularized in the statistics community through the works of Neal (1996, 2010), Hamiltonian Monte Carlo promises better scalability (Neal, 2010; Beskos et al., 2013) and has enjoyed wide-ranging successes as one of the most reliable approaches in general settings (Gelman et al., 2013; Kruschke, 2014; Monnahan et al., 2017).
However, a fundamental limitation of Hamiltonian Monte Carlo is the lack of support for discrete parameters (Gelman et al., 2015; Monnahan et al., 2017). The difficulty stems from the fact that the construction of Hamiltonian Monte Carlo proposals relies on a numerical solution of a differential equation. The use of a surrogate differentiable target distribution may be possible in special cases (Zhang et al., 2012), but approximating a discrete parameter of a likelihood by a continuous one is difficult in general (Berger et al., 2012).
This article presents discontinuous Hamiltonian Monte Carlo, an extension that can efficiently explore spaces involving ordinal discrete parameters as well as continuous ones. The algorithm can also handle discontinuities in piecewise smooth posterior densities, which for example arise from models with structural change points (Chib, 1998; Wagner et al., 2002), latent thresholds (Neelon & Dunson, 2004; Nakajima & West, 2013) and pseudolikelihoods (Bissiri et al., 2016).
Discontinuous Hamiltonian Monte Carlo retains the generality that makes Hamiltonian Monte Carlo suitable for automatic posterior inference. For any given target distribution, each iteration only requires evaluations of the density and of the following quantities up to normalizing constants: (i) full conditional densities of discrete parameters, and (ii) either the gradient of the log density with respect to continuous parameters or their individual full conditional densities. Evaluations of full conditionals can be done algorithmically and efficiently through directed acyclic graph frameworks, taking advantage of conditional independence structures (Lunn et al., 2009). Algorithmic evaluation of the gradient is also efficient (Griewank & Walther, 2008) and its implementations are widely available as open-source modules (Carpenter et al., 2015).
In our framework the discrete parameters are first embedded into a continuous space, inducing parameters with piecewise constant densities. A key theoretical insight is that Hamiltonian dynamics with a discontinuous potential energy can be integrated analytically near its discontinuity in a way that exactly preserves the total energy. This fact was realized by Pakman & Paninski (2013) and used to sample from binary distributions through embedding them into a continuous space. This framework was extended by Afshar & Domke (2015) to handle more general discontinuous distributions and then by Dinh et al. (2017) to settings where the parameter space involves phylogenetic trees. All these frameworks, however, run into serious computational issues when dealing with more complex discontinuities and thus fail as general-purpose algorithms.
We introduce novel techniques to obtain a practical sampling algorithm for discrete parameters and, more generally, target distributions with discontinuous densities. In dealing with discontinuous targets, we propose a Laplace distribution for the momentum variable as a more effective alternative to the usual Gaussian distribution. We develop an efficient integrator of the resulting Hamiltonian dynamics by splitting the differential operator into its coordinatewise components.
2. Hamiltonian Monte Carlo for discrete and discontinuous distributions
2.1. Review of Hamiltonian Monte Carlo
Given a parameter |$\theta \sim \pi_{\Theta}(\cdot)$| of interest, Hamiltonian Monte Carlo introduces an auxiliary momentum variable |$p$| and samples from a joint distribution |$\pi(\theta, p) = \pi_{\Theta}(\theta) \pi_P(p)$| for some symmetric distribution |$\pi_P(p) \propto \exp\{- K(p)\}$|. The function |$K(p)$| is referred to as the kinetic energy and |$U(\theta) = - \log \pi_{\Theta}(\theta)$| as the potential energy. The total energy |$H(\theta, p) = U(\theta) + K(p)$| is often called the Hamiltonian. A proposal is generated by simulating trajectories of Hamiltonian dynamics where the evolution of the state |$(\theta, p)$| is governed by Hamilton’s equations:
|$\textrm{d} \theta / \textrm{d} t = \nabla_p K(p), \qquad \textrm{d} p / \textrm{d} t = - \nabla_{\theta} U(\theta). \qquad (1)$|
The next section shows how we can turn the problem of dealing with a discrete parameter |$\theta$| into that of dealing with a discontinuous target density. We then proceed to make sense of the differential equation (1) when |$\pi_{\Theta}(\theta)$|, and hence |$U(\theta)$|, is discontinuous.
2.2. Dealing with discrete parameters via embedding
Let |$N$| denote a discrete parameter with the prior distribution |$\pi_N(\cdot)$|, and assume, without loss of generality, that |$N$| takes positive integer values. For example, the inference goal may be estimation of the population size |$N$| given the observation |$y \, | \, q, N \sim \mathrm{Bi}(q, N)$|. We embed |$N$| into a continuous space by introducing a latent parameter |$\tilde{N}$| whose relationship to |$N$| is defined as
|$N = n \quad \text{if and only if} \quad a_n \leqslant \tilde{N} < a_{n+1},$|
for a strictly increasing sequence of real numbers |$0 = a_1 < a_2 < a_3 < \,{\ldots}\,$|. To match the prior distribution on |$N$|, we define the corresponding prior density on |$\tilde{N}$| as
|$\pi_{\tilde{N}}(\tilde{n}) = \pi_N(n) \, (a_{n+1} - a_n)^{-1} \quad \text{for} \quad a_n \leqslant \tilde{n} < a_{n+1},$|
where the Jacobian-like factor |$(a_{n+1} - a_n)^{-1}$| adjusts for embedding into nonuniform intervals.
Although the choice |$a_n = n$| for all |$n$| is a natural one, a nonuniform embedding is useful in effectively carrying out a parameter transformation of |$N$|. For example, a log-transform |$a_n = \log n$| may be used to avoid a heavy-tailed distribution on |$\tilde{N}$| or to reduce correlation between |$\tilde{N}$| and the rest of the parameters. Mixing of many Markov chain Monte Carlo algorithms, including discontinuous Hamiltonian Monte Carlo, can be substantially improved by such parameter transformations (Roberts & Rosenthal, 2009; Thawornwattana et al., 2018).
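As a concrete sketch of the embedding, the following Python fragment (with hypothetical function names, not taken from the accompanying code) uses the log-transformed choice |$a_n = \log n$|, recovers |$N$| from |$\tilde{N}$| and evaluates the induced log density.

```python
import numpy as np

def embedding_edges(n_max):
    # Edges a_1, ..., a_{n_max + 1} of the log-transformed embedding a_n = log n,
    # so that a_1 = 0 and the bin for N = n is [a_n, a_{n+1}).
    return np.log(np.arange(1, n_max + 2))

def latent_to_discrete(n_tilde, a):
    # N = n if and only if a_n <= N_tilde < a_{n+1}.
    return int(np.searchsorted(a, n_tilde, side='right'))

def log_density_latent(n_tilde, a, log_pmf):
    # pi(N_tilde) = pi_N(n) / (a_{n+1} - a_n) on the bin containing N_tilde.
    n = latent_to_discrete(n_tilde, a)
    return log_pmf(n) - np.log(a[n] - a[n - 1])
```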
While the above strategy can be applied whether or not the discrete parameter values have a natural ordering, embedding the values in an arbitrary order likely induces a multimodal continuous distribution. The mixing rate of discontinuous Hamiltonian Monte Carlo generally suffers from multimodality due to the energy conservation property of the dynamics (Neal, 2010).
2.3. How Hamiltonian Monte Carlo fails on discontinuous target densities
Having recast the discrete parameter problem as a discontinuous one, we focus the rest of our discussion on discontinuous targets. An integrator is an algorithm that numerically approximates an evolution of the exact solution to a differential equation. Hamiltonian Monte Carlo requires reversible and volume-preserving integrators to guarantee symmetry of its proposal distributions; see § 4.1 and Neal (2010). These proposals are generated as follows:
(i) Sample the momentum from its marginal distribution |$p \sim \pi_P(\cdot)$|.
(ii) Using a reversible and volume-preserving integrator, approximate |$\{\theta(t), p(t)\}_{t \geqslant 0}$|, the solution of the differential equation (1) with the initial condition |$\{\theta(0), p(0)\} = (\theta, p)$|. Use the approximate solution |$(\theta^*, p^*) \approx \{\theta(\tau), p(\tau)\}$| for some |$\tau > 0$| as a proposal.
The proposal |$(\theta^*, p^*)$| is then accepted with Metropolis probability (Metropolis et al., 1953)
|$\min\left[ 1, \, \exp\{ H(\theta, p) - H(\theta^*, p^*) \} \right],$|
where |$H(\theta, p) = - \log \pi(\theta, p)$| denotes the Hamiltonian. With an accurate integrator, the acceptance probability of |$(\theta^*, p^*)$| can be close to |$1$| because the exact dynamics conserves the energy: |$H\{\theta(t), p(t)\}= H\{\theta(0), p(0)\}$| for all |$t \geqslant 0$|. The integrator of choice for Hamiltonian Monte Carlo is the leapfrog scheme, which approximates the evolution |$\{\theta(t), p(t)\} \to \{\theta(t + \epsilon), p(t + \epsilon)\}$| by the successive updates
|$p \gets p - \frac{\epsilon}{2} \nabla_{\theta} U(\theta), \qquad \theta \gets \theta + \epsilon \, \nabla_p K(p), \qquad p \gets p - \frac{\epsilon}{2} \nabla_{\theta} U(\theta). \qquad (2)$|
When |$\pi_\Theta(\cdot)$| is smooth, approximating the evolution |$\{\theta(0), p(0)\} \to \{\theta(\tau), p(\tau)\}$| with |$L = \lfloor \tau / \epsilon \rfloor$| leapfrog steps results in a global error of order |$O(\epsilon^2)$| so that |$H(\theta^*, p^*) = H(\theta, p) + O(\epsilon^2)$| (Hairer et al., 2006). Hamiltonian Monte Carlo’s high acceptance rates and scaling properties critically depend on this second-order accuracy (Beskos et al., 2013). When |$\pi_\Theta(\cdot)$| has a discontinuity, however, the leapfrog updates in (2) fail to account for the instantaneous change in |$\pi_\Theta(\cdot)$|, incurring an unbounded error that does not decrease even as |$\epsilon \to 0$|; see the Supplementary Material.
2.4. Theory of discontinuous Hamiltonian dynamics
Suppose |$U(\theta)$| is piecewise smooth; that is, the domain has a partition |$\mathbb{R}^d = \cup_k \Omega_k$| such that |$U(\theta)$| is smooth on the interior of |$\Omega_k$| and the boundary |$\partial \Omega_k$| is a |$(d - 1)$|-dimensional piecewise smooth manifold. While the classical definition of the gradient |$\nabla U(\theta)$| makes no sense on a discontinuity set |$\cup_k \partial \Omega_k$|, it can be defined through a notion of distributional derivatives, and the corresponding Hamilton equations (1) can be interpreted as a measure-valued differential inclusion (Stewart, 2000). In this case, a solution is in general nonunique, unlike that of a smooth ordinary differential equation. To find a solution that preserves critical properties of Hamiltonian dynamics, we rely on a so-called selection principle (Ambrosio, 2008) as follows.
Define a sequence of smooth approximations |$U_\delta(\theta)$| of |$U(\theta)$| for |$\delta > 0$| through the convolution |$U_\delta(\theta) := \int U(\eta) \phi_\delta(\theta - \eta) \, \textrm{d} \eta$|, where |$\phi_\delta(\theta) = \delta^{-d} \phi(\delta^{-1} \theta)$| is a compactly supported smooth function with |$\phi \geqslant 0$| and |$\int \phi = 1$|. Now let |$\{\theta_{\delta}(t), p_{\delta}(t)\}_{t \geqslant 0}$| be the solution of Hamilton’s equations with the potential energy |$U_\delta$|. The pointwise limit |$\{\theta(t), p(t)\} = \lim_{\delta \to 0} \{\theta_{\delta}(t), p_{\delta}(t)\}$| can be shown to exist for almost every initial condition on some time interval |$[0, T]$| (Hirsch & Smale, 1974). The collections of the trajectories |$\left\{ \{\theta(t), p(t)\}_{t \geqslant 0} : \{\theta(0), p(0)\} \in \mathbb{R}^d \right\}$| then define the dynamics corresponding to |$U(\theta)$|. This construction provides a rigorous mathematical foundation for the special cases of discontinuous Hamiltonian dynamics derived by Pakman & Paninski (2013) and Afshar & Domke (2015) through physical heuristics.
The behaviour of the limiting dynamics near the discontinuity is deduced as follows. Suppose that the trajectory |$\{\theta(t), p(t)\}$| hits the discontinuity at an event time |$t_e$| and let |$t_e^-$| and |$t_e^+$| denote infinitesimal moments before and after that. At a discontinuity point |$\theta \in \partial \Omega_k$|, we have
|$\lim_{\delta \to 0} \nabla_{\theta} U_\delta(\theta) \, / \, \| \nabla_{\theta} U_\delta(\theta) \| = \nu(\theta), \qquad (3)$|
where |$\nu(\theta)$| denotes a unit vector orthogonal to |$\partial \Omega_k$|, pointing in the direction of higher potential energy. The relations (3) and |$\textrm{d} p_{\delta} / \textrm{d} t = - \nabla_{\theta} U_\delta$| imply that the only change in |$p(t)$| upon encountering the discontinuity occurs in the direction of |$\nu_e = \nu\{\theta(t_e)\}$|:
|$p(t_e^+) = p(t_e^-) - \gamma \, \nu_e$|
for some |$\gamma > 0$|. There are two possible types of change in |$p$| depending on the potential energy difference |$\Delta U_e$| at the discontinuity, which we formally define as
|$\Delta U_e = \lim_{s \to 0^+} U\{\theta(t_e) + s \nu_e\} - \lim_{s \to 0^+} U\{\theta(t_e) - s \nu_e\}.$|
When the momentum does not provide enough kinetic energy to overcome the potential energy increase |$\Delta U_e$|, the trajectory bounces against the plane orthogonal to |$\nu_e$|. Otherwise, the trajectory moves through the discontinuity by transferring kinetic energy to potential energy. Either way, the magnitude of an instantaneous change |$\gamma$| can be determined via the energy conservation law: the kinetic energy given up at the event, |$K\{p(t_e^-)\} - K\{p(t_e^-) - \gamma \nu_e\}$|, must equal |$\Delta U_e$| when the trajectory crosses the discontinuity and must equal zero when it bounces.
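For illustration, take a standard Gaussian momentum with |$K(p) = p^{\mathrm{T}} p / 2$|, the setting of the existing extensions cited above; the explicit formula here is a restatement under that assumption rather than a quotation from this article. Writing |$p_e = \nu_e^{\mathrm{T}} p(t_e^-)$| for the momentum component normal to the discontinuity, the kinetic energy available for the jump is |$p_e^2 / 2$|. If |$p_e^2 > 2 \Delta U_e$|, the trajectory crosses and |$\gamma = p_e - (p_e^2 - 2 \Delta U_e)^{1/2}$|, so that the normal component shrinks to |$(p_e^2 - 2 \Delta U_e)^{1/2}$|; otherwise |$\gamma = 2 p_e$| and the normal component is simply reflected.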
Figure 1, which is explained in more detail in § 3, provides a visual illustration of the trajectory behaviour at a discontinuity.

Fig. 1. An example trajectory |$\theta(t)$| of discontinuous Hamiltonian dynamics.
3. Integrator for discontinuous dynamics via Laplace momentum
3.1. Limitation of Gaussian momentum-based approaches
Use of non-Gaussian momentum distributions has received limited attention in the Hamiltonian Monte Carlo literature. Correspondingly, the existing discontinuous extensions all rely on Gaussian momenta (Pakman & Paninski, 2013; Afshar & Domke, 2015; Dinh et al., 2017).
In developing a general-purpose algorithm for sampling from discontinuous targets, however, dynamics based on a Gaussian momentum have a serious shortcoming. In order to approximate the dynamics accurately, the integrator must detect every single discontinuity encountered by a trajectory and then compute the potential energy difference each time; see Algorithm S1 in the Supplementary Material. To see why this is a serious problem, consider a discrete parameter |$N \in \mathbb{Z}^+$| with a substantial posterior uncertainty, say |$\textrm{var}(N \, | \, \text{data})^{1/2} \approx 1000$|. We can then expect a Metropolis move like |$N \to N \pm 1000$| to be accepted with a moderate probability, costing only a single likelihood evaluation. On the other hand, if we were to sample a continuously embedded counterpart |$\tilde{N}$| of |$N$| using discontinuous Hamiltonian Monte Carlo with the Gaussian momentum-based Algorithm S1 in the Supplementary Material, a transition of the corresponding magnitude necessarily requires 1000 likelihood evaluations. The algorithm is made practically useless by such a high computational cost for otherwise simple parameter updates.
3.2. Hamiltonian dynamics based on Laplace momentum
The above example exposes a major challenge in turning discontinuous Hamiltonian dynamics into a practical general-purpose sampling algorithm: an integrator must rely only on a small number of target density evaluations to jump through multiple discontinuities while approximately preserving the total energy. We employ a Laplace momentum |$\pi_P(p) \propto \prod_i \exp(- m_i^{-1} | p_i |)$| to provide a solution. While similar distributions have been considered for improving numerical stability of traditional Hamiltonian Monte Carlo (Zhang et al., 2016; Lu et al., 2017; Livingstone et al., 2019), we exploit a unique feature of the Laplace momentum in a fundamentally new way.
Hamilton’s equation under the independent Laplace momentum is given by
|$\textrm{d} \theta / \textrm{d} t = m^{-1} \odot \textrm{sign}(p), \qquad \textrm{d} p / \textrm{d} t = - \nabla_{\theta} U(\theta), \qquad (5)$|
where |$m^{-1} = (m_1^{-1}, \ldots, m_d^{-1})$| and |$\odot$| denotes elementwise multiplication. A key characteristic of the dynamics in (5) is that |${\textrm{d} \theta}/{\textrm{d} t}$| depends only on the signs of the |$p_i$| and not on their magnitudes. In particular, if we know that the |$p_i(t)$| do not change their signs on the time interval |$t \in [\tau, \tau + \epsilon]$|, then we also know that
|$\theta(\tau + \epsilon) = \theta(\tau) + \epsilon \, m^{-1} \odot \textrm{sign}\{p(\tau)\},$|
irrespective of the intermediate values |$U\{\theta(t)\}$| along the trajectory |$\{\theta(t), p(t)\}$| for |$t \in$||$[\tau, \tau + \epsilon]$|. Our integrator critically takes advantage of this property so that it can jump through multiple discontinuities of |$U(\theta)$| in just a single target density evaluation.
3.3. Integrator for Laplace momentum via operator splitting
Operator splitting is a technique to approximate the solution of a differential equation by decomposing it into components, each of which can be solved more easily (McLachlan & Quispel, 2002). More commonly used Hamiltonian splitting methods are special cases (Neal, 2010). A convenient splitting scheme for (5) can be devised by considering the equation for each coordinate |$(\theta_i, p_i)$| while keeping the other parameters |$(\theta_{{-} i}, p_{{-} i})$| fixed:
|$\textrm{d} \theta_i / \textrm{d} t = m_i^{-1} \textrm{sign}(p_i), \qquad \textrm{d} p_i / \textrm{d} t = - \partial U(\theta) / \partial \theta_i, \qquad \textrm{d} \theta_{-i} / \textrm{d} t = 0, \qquad \textrm{d} p_{-i} / \textrm{d} t = 0. \qquad (6)$|
There are two possible behaviours for the solution |$\{\theta(t), p(t)\}$| of (6) for |$t \in [\tau, \tau + \epsilon]$|, depending on the initial momentum |$p_i(\tau)$|. Let |$\theta^*(t)$| denote a potential path of |$\theta(t)$|:
|$\theta^*_i(t) = \theta_i(\tau) + (t - \tau) \, m_i^{-1} \textrm{sign}\{p_i(\tau)\}, \qquad \theta^*_{-i}(t) = \theta_{-i}(\tau).$|
In the case that the initial momentum is large enough that |$m_i^{-1} \vert p_i(\tau) \vert > U\big\{\theta^*(t) \big\} - U\{{\theta}(\tau)\}$| for all |$t \in [\tau, \tau + \epsilon]$|, we have
|$\theta(t) = \theta^*(t), \qquad p_i(t) = p_i(\tau) - \textrm{sign}\{p_i(\tau)\} \, m_i \left[ U\{\theta^*(t)\} - U\{\theta(\tau)\} \right] \quad \text{for} \ t \in [\tau, \tau + \epsilon].$|
Otherwise, the momentum |$p_i$| flips (|$p_i \gets - p_i$|) and the trajectory |$\theta(t)$| reverses its course at the event time |$t_e$| given by
|$t_e = \inf\left[ t \in (\tau, \tau + \epsilon] : m_i^{-1} |p_i(\tau)| \leqslant U\{\theta^*(t)\} - U\{\theta(\tau)\} \right].$|
See Fig. 1 for a visual illustration of the trajectory |$\theta(t)$|. The trajectory has enough kinetic energy to move across the first discontinuity by transferring some kinetic energy to potential energy. Across the second discontinuity, however, the trajectory has insufficient kinetic energy to compensate for the potential energy increase and bounces back as a result.
By emulating the behaviour of the solution |$\{\theta(t), p(t)\}$|, we obtain an efficient integrator of the coordinatewise equation (6) as given in Algorithm 1. While the parameter |$\theta$| does not get updated when |$m_i^{-1} | p_i | < \Delta U$| (the condition on line 5 fails), the momentum flip |$p_i \gets - p_i$| on line 9 ensures that the next numerical integration step leads the trajectory toward a higher density of |$\pi_\Theta(\theta)$|. This can be viewed as a generalization of the guided random walk idea of Gustafson (1998).
Algorithm 1. Coordinatewise integrator for dynamics with Laplace momentum.
1 Function CoordIntegrator|$(\theta, p, i, \epsilon)$|:
2 |$\quad \theta^* \gets \theta$|
3 |$\quad \theta^*_i \gets \theta^*_i + \epsilon m_i^{-1} \textrm{sign}(p_i)$|
4 |$\quad \Delta U \gets U(\theta^*) - U(\theta)$|
5 |$\quad$| if |$m_i^{-1} |p_i| > \Delta U$| then
6 |$\quad \quad \quad \theta_i \gets \theta_i^*$|
7 |$\quad \quad \quad p_i \gets p_i - \textrm{sign}(p_i) m_i \Delta U$|
8 |$\quad$| else
9 |$\quad \quad \quad p_i \gets - p_i$|
10 |$\quad$| return |$\theta, \, p$|
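A direct Python transcription of Algorithm 1 might look as follows; this is a minimal sketch in which `log_post`, returning |$\log \pi_\Theta(\theta)$| up to a constant, and the vector `mass` of the |$m_i$| are assumed inputs rather than names from the accompanying code.

```python
import numpy as np

def coord_integrator(theta, p, i, eps, log_post, mass):
    # One coordinatewise update of Algorithm 1; mass[i] plays the role of m_i
    # and U(theta) = -log_post(theta).
    theta_star = theta.copy()
    theta_star[i] += eps * np.sign(p[i]) / mass[i]
    delta_U = log_post(theta) - log_post(theta_star)    # U(theta*) - U(theta)
    p = p.copy()
    if np.abs(p[i]) / mass[i] > delta_U:
        theta = theta_star                              # move, paying the energy cost
        p[i] -= np.sign(p[i]) * mass[i] * delta_U
    else:
        p[i] = -p[i]                                    # insufficient kinetic energy: flip
    return theta, p
```

A single call costs one evaluation of `log_post`, regardless of how many discontinuities lie between |$\theta$| and |$\theta^*$|.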
The solution of the original unsplit differential equation (5) is approximated by sequentially updating each coordinate of |$(\theta, p)$| with Algorithm 1, as illustrated in Fig. 2. The reversibility of the resulting proposal is guaranteed by randomly permuting the order of the coordinate updates. Alternatively, one can split the operator symmetrically to obtain a reversible integrator (McLachlan & Quispel, 2002). The integrator does not reproduce the exact solution, but nonetheless preserves the Hamiltonian exactly. This remains true with any stepsize |$\epsilon$|, but for good mixing the stepsize needs to be chosen small enough that the condition on line 5 is satisfied with high probability; see the Supplementary Material.

Fig. 2. A trajectory of Laplace momentum-based Hamiltonian dynamics |$\{\theta_1(t), \theta_2(t)\}$| approximated by the coordinatewise integrator of Algorithm 1. The log density of the target distribution changes in increments of |$0.5$| and has ‘banana-shaped’ contours. Each numerical integration step consists of the coordinatewise update along the horizontal axis followed by that along the vertical axis. The order of the coordinate updates is randomized at the beginning of numerical integration to ensure reversibility. The trajectory initially travels in the direction of the initial momentum, a process marked by the numbers 1–5. At the fifth numerical integration step, however, the trajectory does not have sufficient kinetic energy to take a step upward and hence flips the momentum downward. Such momentum flips also occur at the eighth and ninth numerical integration steps, again changing the direction of the trajectory. The rest of the trajectory follows the same pattern.
3.4. Mixing momentum distributions for continuous and discrete parameters
The potential energy |$U(\theta)$| is a smooth function of |$\theta_i$| if both the prior and likelihood depend smoothly on |$\theta_i$|. The coordinatewise update of Algorithm 1 leads to a valid proposal mechanism whether or not |$U(\theta)$| has discontinuities along |$\theta_i$|. If |$U(\theta)$| varies smoothly along some coordinates of |$\theta$|, however, we can devise an integrator that takes advantage of the smooth dependence.
We first write |$\theta = (\theta_I, \theta_J)$|, where the collections of indices |$I$| and |$J$| are defined as
|$I = \{ i : \theta_i \mapsto U(\theta) \ \text{is smooth} \}, \qquad J = \{ j : \theta_j \mapsto U(\theta) \ \text{may have discontinuities} \}. \qquad (7)$|
More precisely, we assume that the parameter space has a partition |$\mathbb{R}^{|I|} \times \mathbb{R}^{|J|} = \cup_k \mathbb{R}^{|I|} \times \Omega_k$| such that |$U(\theta)$| is smooth on |$\mathbb{R}^{|I|} \times \Omega_k$| for each |$k$|. We write |$p = (p_I, p_J)$| correspondingly and define the distribution of |$p$| as a product of Gaussian and Laplace so that
|$\pi_P(p) \propto \exp\left( - \tfrac{1}{2} \, p_I^{\mathrm{T}} M_I^{-1} p_I \right) \prod_{j \in J} \exp\left( - m_j^{-1} |p_j| \right),$|
where |$M_I$| and |$M_J = \textrm{diag}(m_J)$| are mass matrices (Neal, 2010). With a slight abuse of terminology, we will refer to |$(\theta_J, p_J)$| as discontinuous parameters.
When mixing Gaussian and Laplace momenta, we approximate the dynamics via an integrator based again on operator splitting; we update the smooth parameter |$(\theta_I, p_I)$| first, then the discontinuous parameter |$(\theta_J, p_J)$|, followed by another update of |$(\theta_I, p_I)$|. The discontinuous parameters are updated coordinatewise as described in § 3.3. The update of |$(\theta_I, p_I)$| is based on a decomposition familiar from the leapfrog scheme:
|$\textrm{d} \theta_I / \textrm{d} t = M_I^{-1} p_I, \ \ \textrm{d} p_I / \textrm{d} t = 0 \qquad \text{and} \qquad \textrm{d} \theta_I / \textrm{d} t = 0, \ \ \textrm{d} p_I / \textrm{d} t = \nabla_{\theta_I} \log \pi(\theta).$|
Algorithm 2 describes the integrator with all the ingredients put together. When mixing in Gaussian momentum, the integrator continues to preserve the Hamiltonian accurately, if not exactly, with a global error rate of |$O(\epsilon^2)$|; see the Supplementary Material. The advantages of separately treating continuous and discontinuous parameters in this manner are discussed in the Supplementary Material.
Algorithm 2. Integrator for discontinuous Hamiltonian Monte Carlo.
Function DiscIntegrator|$\left\{\theta, p, \epsilon, \varphi = \texttt{Permute}(J)\right\}$|:
|$\quad p_I \gets p_I + \dfrac{\epsilon}{2} \nabla_{\theta_I} \log \pi(\theta)$|
|$\quad \theta_I \gets \theta_I + \dfrac{\epsilon}{2} \, M_I^{-1} p_I $|
|$\quad$|for |$j \ \mathrm{in} \ J$| do
|$\quad \quad \theta, \, p \gets \texttt{CoordIntegrator}\{\theta, p, \varphi(j), \epsilon\}$|// Update discontinuous params
|$\quad \theta_I \gets \theta_I + \dfrac{\epsilon}{2} \, M_I^{-1} p_I $|
|$\quad p_I \gets p_I + \dfrac{\epsilon}{2} \nabla_{\theta_I} \log \pi(\theta)$|
|$\quad$|return |$\theta, \, p$|
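Algorithm 2 translates into Python along the following lines; as with the previous sketch, the function and argument names are ours, a diagonal mass matrix is assumed for the Gaussian component, and the update order for the discontinuous coordinates is supplied by the caller.

```python
def disc_integrator(theta, p, eps, log_post, grad_log_post, mass,
                    smooth_idx, coord_order):
    # One step of Algorithm 2: leapfrog-style half-steps for the smooth
    # coordinates (indices I) surrounding coordinatewise updates of the
    # discontinuous coordinates (indices J) in the supplied order.
    theta, p = theta.copy(), p.copy()
    p[smooth_idx] += 0.5 * eps * grad_log_post(theta)[smooth_idx]
    theta[smooth_idx] += 0.5 * eps * p[smooth_idx] / mass[smooth_idx]
    for j in coord_order:
        theta, p = coord_integrator(theta, p, j, eps, log_post, mass)
    theta[smooth_idx] += 0.5 * eps * p[smooth_idx] / mass[smooth_idx]
    p[smooth_idx] += 0.5 * eps * grad_log_post(theta)[smooth_idx]
    return theta, p
```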
4. Theoretical properties of discontinuous Hamiltonian Monte Carlo
4.1. Reversibility of the discontinuous Hamiltonian Monte Carlo transition kernel
As for existing Hamiltonian Monte Carlo variants, the reversibility of discontinuous Hamiltonian Monte Carlo is a direct consequence of the reversibility and volume-preserving property of our integrator in Algorithm 2 (Neal, 2010; Fang et al., 2014). We thus focus on establishing these properties of our integrator. Let |$\Psi$| denote a bijective map on the space |$(\theta, p)$| corresponding to the approximation of discontinuous Hamiltonian dynamics through multiple numerical integration steps. An integrator is reversible if |$\Psi$| satisfies
|$(R \circ \Psi)^{-1} = R \circ \Psi,$|
where |$R: (\theta, p) \to (\theta, - p)$| is the momentum flip operator. Due to the updates of discrete parameters in a random order, the map |$\Psi$| induced by our integrator is nondeterministic. Consequently, our integrator has the unconventional feature of being reversible in distribution only, |$(R \circ \Psi)^{-1} \overset{d}{=} R \circ \Psi$|, which is sufficient for the resulting Markov chain to be reversible.
For a piecewise smooth potential energy |$U(\theta)$|, the coordinatewise integrator of Algorithm 1 is volume preserving and reversible for any coordinate index |$i$| except on a set of Lebesgue measure zero. Moreover, updating multiple coordinates with the integrator in a random order |$\varphi(1), \ldots, \varphi(d)$| is again reversible in distribution provided that the random permutation |$\varphi$| satisfies |$\left\{\varphi(1), \varphi(2), \ldots, \varphi(d)\right\} \overset{d}{=} \{\varphi(d), \varphi(d - 1), \ldots, \varphi(1)\}$|.
For a piecewise smooth potential energy |$U(\theta)$|, the integrator of Algorithm 2 is volume preserving and reversible except on a set of Lebesgue measure zero.
The proofs are in the Supplementary Material, where we also establish the reversibility and volume-preserving property of discontinuous Hamiltonian dynamics under alternative kinetic energies.
4.2. Irreducibility via randomized stepsize
Reducible behaviours in Hamiltonian Monte Carlo are rarely observed in practice, despite the subtleties in theoretical analysis (Livingstone et al., 2018; Bou-Rabee & Sanz-Serna, 2017; Durmus et al., 2019). However, care needs to be taken when applying the coordinatewise integrator for discontinuous Hamiltonian Monte Carlo; its use with a fixed stepsize |$\epsilon$| results in a reducible Markov chain which is not ergodic. To see the issue, consider the transition probability of multiple iterations of discontinuous Hamiltonian Monte Carlo based on the integrator of Algorithm 2. Given the initial state |$\theta_0$|, the integrator of Algorithm 1 moves the |$i$|th coordinate of |$\theta$| only by the distance |$\pm \epsilon m_i^{-1}$| regardless of the values of the momentum variable. The transition probability in the |$\theta$|-space with |$p$| marginalized out, therefore, is supported on a grid
|$\left\{ \theta : \theta_j \in \theta_{0,j} + \epsilon \, m_j^{-1} \mathbb{Z} \ \text{for all} \ j \in J \right\},$|
where |$\theta_J$| as in (7) denotes the coordinates of |$\theta$| with discontinuous conditionals.
This pathological behaviour can be avoided by randomizing the stepsize at each iteration, say |$\epsilon \sim \textrm{Un}(\epsilon_{\min}, \epsilon_{\max})$|. Randomizing the stepsize additionally addresses the possibility that smaller stepsizes are required in some regions of the parameter space (Neal, 2010). While the coordinatewise integrator does not suffer from the stability issue of the leapfrog scheme, the quantity |$\epsilon m_i^{-1}$| nonetheless needs to match the length scale of |$\theta_i$|; see the Supplementary Material for more details.
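Assembling the pieces, one iteration of the sampler might look as in the sketch below, with the stepsize drawn afresh each iteration as just described and the coordinate order randomized at the start of each trajectory; the momentum refreshment and the acceptance probability |$\min[1, \exp\{H(\theta, p) - H(\theta^*, p^*)\}]$| follow § 2.3, `disc_integrator` and `coord_integrator` are the earlier sketches, and `rng` is assumed to be a numpy random generator.

```python
def dhmc_iteration(theta, eps_range, n_step, log_post, grad_log_post,
                   mass, smooth_idx, disc_idx, rng):
    # One discontinuous HMC iteration: randomized stepsize, momentum
    # refreshment, n_step integrator steps and a Metropolis correction.
    eps = rng.uniform(*eps_range)
    coord_order = rng.permutation(disc_idx)
    p = np.zeros_like(theta)
    p[smooth_idx] = np.sqrt(mass[smooth_idx]) * rng.normal(size=len(smooth_idx))
    p[disc_idx] = rng.laplace(scale=mass[disc_idx])

    def hamiltonian(theta, p):
        return (- log_post(theta)
                + 0.5 * np.sum(p[smooth_idx] ** 2 / mass[smooth_idx])
                + np.sum(np.abs(p[disc_idx]) / mass[disc_idx]))

    theta_prop, p_prop = theta, p
    for _ in range(n_step):
        theta_prop, p_prop = disc_integrator(theta_prop, p_prop, eps, log_post,
                                             grad_log_post, mass, smooth_idx, coord_order)
    accept_prob = min(1.0, np.exp(hamiltonian(theta, p) - hamiltonian(theta_prop, p_prop)))
    return (theta_prop, True) if rng.uniform() < accept_prob else (theta, False)
```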
4.3. Metropolis-within-Gibbs with momentum as a special case
Consider a version of discontinuous Hamiltonian Monte Carlo in which all the parameters are updated with the coordinatewise integrator of Algorithm 1; in other words, the integrator of Algorithm 2 is applied with |$J = \{1, \ldots, d\}$| and an empty indexing set |$I$|. This version is a generalization of random-scan Metropolis-within-Gibbs, also known as one-variable-at-a-time Metropolis. We therefore refer to this version as Metropolis-within-Gibbs with momentum.
We use |$\pi_{\mathcal{E}}(\cdot)$| and |$\pi_{\Phi}(\cdot)$| to denote the distribution of stepsize |$\epsilon$| and of a permutation |$\varphi$| of |$\{1, \ldots, d\}$|, where |$\pi_{\Phi}(\cdot)$| satisfies |$\left\{\varphi(1), \ldots, \varphi(d)\right\} \overset{d}{=} \{\varphi(d), \ldots, \varphi(1)\}$|. With these notations, each iteration of Metropolis-within-Gibbs with momentum can be expressed as follows:
(i) Draw |$\epsilon \sim \pi_{\mathcal{E}}(\cdot)$|, |$\varphi \sim \pi_{\Phi}(\cdot)$| and |$p_j \sim \textrm{Laplace}(\textrm{scale} = m_j)$| for |$j = 1, \ldots, d$|.
(ii) Repeat for |$L$| times a sequential update of the coordinate |$(\theta_j, p_j)$| for |$j = \varphi(1), \ldots, \varphi(d)$| via Algorithm 1 with stepsize |$\epsilon$|.
In this version of discontinuous Hamiltonian Monte Carlo the integrator exactly preserves the Hamiltonian, and the acceptance-rejection step can be omitted.
When |$L = 1$|, the above algorithm recovers random-scan Metropolis-within-Gibbs. This can be seen by realizing that lines 5–9 of Algorithm 1 coincide with the standard Metropolis acceptance-rejection procedure for |$\theta_j$|. More precisely, the coordinatewise integrator updates |$\theta_j$| to |$\theta_j + \epsilon m_j^{-1} \textrm{sign}(p_j)$|, yielding the proposal |$\theta^*$| of Algorithm 1, only if
|$\frac{\pi_\Theta(\theta^*)}{\pi_\Theta(\theta)} > \exp\left( - m_j^{-1} |p_j| \right) \overset{d}{=} \textrm{Un}(0, 1),$|
where the last distributional equality follows from the fact that |$m_j^{-1} |p_j| \overset{d}{=} \textrm{Ex}(1)$|. To summarize, when taking only one numerical integration step, the version of discontinuous Hamiltonian Monte Carlo considered here coincides with Metropolis-within-Gibbs with a random scan order |$\varphi \sim \pi_\Phi(\cdot)$| and a symmetric proposal |$\theta_j \pm \epsilon m_j^{-1}$| for each parameter with |$\epsilon \sim \pi_{\mathcal{E}}(\cdot)$|. We could also consider a version of discontinuous Hamiltonian Monte Carlo with a fixed stepsize |$\epsilon=1$|, but with a mass matrix randomized |$(m_1^{-1}, \ldots, m_d^{-1}) \sim \pi_{M^{-1}}(\cdot)$| before each numerical integration step; this version would correspond to a more standard Metropolis-within-Gibbs with independent univariate proposals.
Being a generalization of Metropolis-within-Gibbs, discontinuous Hamiltonian Monte Carlo is guaranteed superior performance:
Under any efficiency metric, which may account for computational costs per iteration, an optimally tuned discontinuous Hamiltonian Monte Carlo is guaranteed to outperform a class of random-scan Metropolis-within-Gibbs samplers as described above.
In particular, an optimally tuned discontinuous Hamiltonian Monte Carlo will inherit the geometric ergodicity of a corresponding Metropolis-within-Gibbs sampler, sufficient conditions for which are investigated in Johnson et al. (2013). In practice, the addition of momentum to Metropolis-within-Gibbs allows for a more efficient update of correlated parameters, as empirically confirmed in the Supplementary Material.
5. Numerical results
5.1. Experimental set-up, benchmarks and efficiency metric
We use two challenging posterior inference problems to demonstrate the efficiency of discontinuous Hamiltonian Monte Carlo as a general-purpose sampler. Additional numerical results in the Supplementary Material further illustrate the breadth of its capability. Code to reproduce the simulation results is available at https://github.com/aki-nishimura/discontinuous-hmc.
Few general and efficient approaches currently exist for sampling from a discrete parameter or a discontinuous target density. For each problem, therefore, we pick a few of the most appropriate general-purpose samplers to benchmark against. Chopin & Ridgway (2017) compared a variety of algorithms on posterior distributions of binary classification problems. One of their conclusions is that while random-walk Metropolis is known to scale poorly in the number of parameters (Roberts et al., 1997), Metropolis with a properly adapted proposal covariance is competitive with alternatives even in a 180-dimensional space. As one of our benchmarks, therefore, we use random-walk Metropolis with proposal covariances proportional to estimated target covariances (Roberts et al., 1997; Haario et al., 2001). When the conditional densities can be evaluated efficiently and no strong dependence exists among the parameters, Metropolis-within-Gibbs with componentwise adaptation can scale better than joint sampling via random-walk Metropolis (Haario et al., 2005). This approach is thus used as another benchmark.
For models with discrete parameters, we also compare with the no-U-turn / Gibbs approach (Salvatier et al., 2016). Conditionally on discrete parameters, continuous parameters are updated by the no-U-turn sampler of Hoffman & Gelman (2014). The standard implementation then updates discrete parameters with univariate Metropolis, but here we implement full conditional univariate updates via manually optimized multinomial samplings. In our examples, these multinomial samplings take little time relative to continuous parameter updates, tilting the comparison in favour of no-U-turn / Gibbs. We use the identity mass matrix for the no-U-turn sampler to make a fair comparison to discontinuous Hamiltonian Monte Carlo with the identity mass.
In all our numerical results, continuous parameters with range constraints are transformed into unconstrained ones to facilitate sampling. More precisely, the constraint |$\theta > 0$| is handled by a log transform |$\theta \to \log \theta$| and |$\theta \in [0, 1]$| by a logit transform |$\theta \to \log \left\{ \theta / (1 - \theta) \right\}$| as done in Stan and PyMC (Salvatier et al., 2016; Stan Development Team, 2016). For each example, the stepsize and path length for discontinuous Hamiltonian Monte Carlo were manually adjusted over short preliminary runs by visually examining trace plots. The stepsize for the continuous parameter updates of no-U-turn / Gibbs was adjusted analogously.
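As a small illustration of the transformation step, and assuming the usual change-of-variables correction that Stan and PyMC apply internally, a positivity-constrained coordinate could be handled as follows (function names are hypothetical).

```python
import numpy as np

def log_density_unconstrained(phi, log_density_constrained):
    # Work with phi = log(theta) for theta > 0; the log-Jacobian of
    # theta = exp(phi) is sum(phi) and is added to the transformed density.
    theta = np.exp(phi)
    return log_density_constrained(theta) + np.sum(phi)
```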
The efficiencies of the algorithms are compared through effective sample sizes (Geyer, 2011). As is commonly done in the Markov chain Monte Carlo literature, we compute the effective sample sizes of the first and second moment estimators for each parameter and report the minimum value across all the parameters. Effective sample sizes are estimated using the method of batch means with 25 batches (Geyer, 2011), averaged over the estimates from 8 independent chains.
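For reference, the batch-means estimator described above can be sketched as follows for a single scalar functional, for instance a parameter or its square; the minimum over parameters and the average over the 8 chains are then taken outside this function.

```python
import numpy as np

def ess_batch_means(draws, n_batch=25):
    # Effective sample size via batch means: the variance of the batch means,
    # scaled by the batch size, estimates the asymptotic variance of the mean.
    draws = np.asarray(draws)
    n = len(draws) // n_batch * n_batch
    batch_means = draws[:n].reshape(n_batch, -1).mean(axis=1)
    asymptotic_var = (n // n_batch) * batch_means.var(ddof=1)
    return n * draws[:n].var(ddof=1) / asymptotic_var
```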
5.2. Jolly–Seber model: estimation of unknown open population size and survival rate from multiple capture-recapture data
The Jolly–Seber model and its extensions are widely used in ecology to estimate unknown population sizes along with related parameters of interest (Schwarz & Seber, 1999). The model is motivated by the following experimental design. Individuals from a particular species are captured, marked and released back to the environment. This procedure is repeated over multiple capture occasions. At each occasion, the numbers of marked and unmarked individuals among the captured ones are recorded. Individuals survive from one capture occasion to another with an unknown survival rate. The population is assumed to be open so that individuals may enter, either through birth or immigration, or leave the area under study.
In order to be consistent with the literature on capture-recapture models, the notations within this section will deviate from the rest of the paper. Assuming that data are collected over |$i = 1, \ldots, T$| capture occasions, the unknown parameters are |$\{U_i, p_i\}_{i=1}^{T}$| and |$\{\phi_i\}_{i=1}^{T-1}$|, representing, respectively, the number of unmarked individuals in the population at the time of the |$i$|th capture occasion (|$U_i$|), the probability of capture at the |$i$|th occasion (|$p_i$|) and the probability of survival from the |$i$|th to the |$(i+1)$|th occasion (|$\phi_i$|).
We assign standard objective priors |$p_i, \phi_i \sim \textrm{Un}(0, 1)$| and |$\pi(U_1) \propto U_1^{-1}$|. The parameters |$U_2, \ldots, U_T$| require a more complex prior elicitation; this is described in § 7 of the Supplementary Material, along with the likelihood function and other details on the Jolly–Seber model.
We take the black-kneed capsid population data from Jolly (1965) as summarized in Seber (1982). The data record the capture-recapture information over |$T = 13$| successive capture occasions, giving rise to a 38-dimensional posterior distribution involving 13 discrete parameters. We use the log-transformed embedding for the discrete parameters |$U_i$|. The proposal covariance for random-walk Metropolis is chosen by precomputing the true posterior covariance with a long adaptive Metropolis chain (Haario et al., 2001) and then scaling it according to Roberts et al. (1997). Discontinuous Hamiltonian Monte Carlo can also take advantage of the posterior covariance information, so we also try using a diagonal mass matrix whose entries are set according to the estimated posterior variance of each parameter; see the Supplementary Material for details. Starting from stationarity, we run |$10^4$| iterations of discontinuous Hamiltonian Monte Carlo and no-U-turn / Gibbs and |$5 \times 10^5$| iterations of Metropolis.
The performance of each algorithm is summarized in Table 1. The table clearly indicates the superior performance of discontinuous Hamiltonian Monte Carlo over no-U-turn / Gibbs and Metropolis, with approximately 60-fold and 7-fold efficiency increases, respectively, when using a diagonal mass matrix. The posterior distribution exhibits high negative correlations between |$U_i$| and |$p_i$|; see Fig. 3(a). All the algorithms record the worst effective sample size in |$p_1$|, but the mixing of no-U-turn / Gibbs suffers most as |$U_i$| and |$p_i$| are updated conditionally on each other.
Table 1. Performance summary of each algorithm on the Jolly–Seber model example. The term |$(\pm \ldots)$| indicates the error estimate, twice the standard deviations, of our effective sample size estimators. The path length is averaged over each iteration.

| | ESS per 100 samples | ESS per minute | Path length | Iteration time |
|---|---|---|---|---|
| DHMC (diagonal) | 45.5 (± 5.2) | 424 | 45 | 87.7 |
| DHMC (identity) | 24.1 (± 2.6) | 126 | 77.5 | 157 |
| no-U-turn / Gibbs | 1.04 (± 0.087) | 6.38 | 150 | 133 |
| Metropolis | 0.0714 (± 0.016) | 58.5 | 1 | 1 |

DHMC, discontinuous Hamiltonian Monte Carlo; ESS, effective sample size; Iteration time, the computational time per iteration of each algorithm relative to the fastest one.

Fig. 3. (a) The posterior marginal of |$(p_1, U_1)$| with parameter transformations, estimated from the Monte Carlo samples. (b) The posterior conditional density of the intercept parameter in the generalized Bayes example. The other parameters are fixed at the posterior draw that recorded the highest posterior density among the Monte Carlo samples. The density is not continuous since the loss function is not.
5.3. Generalized Bayesian belief update based on loss functions
Motivated by model misspecification and difficulty in modelling all aspects of a data-generating process, Bissiri et al. (2016) proposed a generalized Bayesian framework which replaces the loglikelihood with a surrogate based on a utility function. Given an additive loss |$\ell(y, \theta)$| for the data |$y$| and parameter of interest |$\theta$|, the prior |$\pi(\theta)$| is updated to obtain the generalized posterior:
|$\pi(\theta \,|\, y) \propto \exp\{ - \ell(y, \theta) \} \, \pi(\theta). \qquad (10)$|
While (10) coincides with a pseudolikelihood-type approach, Bissiri et al. (2016) derived the formula as a coherent and optimal update from a decision-theoretic perspective.
Here, we consider a binary classification problem with an error-rate loss:
|$\ell(y, \beta) = \sum_{i} 1\{ y_i \, x_i^{\mathrm{T}} \beta < 0 \}, \qquad (11)$|
where |$y_i \in \{-1, 1\}$|, |$x_i$| is a vector of predictors and |$\beta$| is a regression coefficient. The target distribution of the form (10) based on the loss function (11) was suggested as a challenging test case by Chopin & Ridgway (2017). We use the SECOM data from the UCI machine learning repository, which records various sensor data that can be used to predict the production quality of a semiconductor, measured as pass or fail. We first remove the predictors with more than 20 missing cases and then remove the observations that still had missing predictors, leaving 1477 cases with 376 predictors. All the predictors are normalized and the regression coefficients |$\beta_i$| are given |${N}(0, 1)$| priors. Figure 3(b) illustrates the complexity of the target distribution.
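To make the resulting discontinuous target concrete, its log density, up to an additive constant and with hypothetical array names `X` for the |$1477 \times 376$| predictor matrix and `y` for the |$\pm 1$| labels, can be sketched as follows.

```python
import numpy as np

def log_generalized_posterior(beta, X, y):
    # Negative error-rate loss of (11) plus the N(0, 1) log prior on beta;
    # the first term is piecewise constant in beta, hence the discontinuities.
    n_misclassified = np.sum(y * (X @ beta) < 0)
    log_prior = -0.5 * np.sum(beta ** 2)
    return -n_misclassified + log_prior
```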
In tuning the proposal covariance of Metropolis for this example, adaptive Metropolis performed so poorly that we instead use |$10^5$| iterations of discontinuous Hamiltonian Monte Carlo to estimate the posterior covariance. Scaling the proposal covariance for random-walk Metropolis according to Roberts et al. (1997) resulted in an acceptance probability of less than 0.04, so we scaled the proposal covariance to achieve the acceptance probability of 0.234 with stochastic optimization (Andrieu & Thoms, 2008). We also found the posterior correlation to be very modest in this example, with the ratio of the largest to smallest eigenvalues of the estimated posterior covariance matrix being |$46 \approx 6.8^2$|. This suggested that coordinatewise updates may be competitive, so we implemented Metropolis-within-Gibbs as an additional benchmark. The parameters are updated one at a time with the acceptance rate calibrated around 0.44, as recommended by Gelman et al. (1996). We ran discontinuous Hamiltonian Monte Carlo for |$10^4$| iterations, Metropolis for |$10^7$| iterations, and Metropolis-within-Gibbs for |$5 \times 10^4$| iterations from stationarity.
Table 2 summarizes the performance of each algorithm. Discontinuous Hamiltonian Monte Carlo with identity mass matrix outperforms Metropolis and Metropolis-within-Gibbs by a factor of 330 and 2, respectively. Using a diagonal mass matrix yields only a minor improvement here as the posterior displays similar scales of uncertainty in all the parameters. The mixing of Metropolis suffers substantially from the dimensionality of the target. Conditional updates of Metropolis-within-Gibbs mix well in this example due to weak dependence among the parameters. On the other hand, as demonstrated in the example here and in § 5.2, discontinuous Hamiltonian Monte Carlo not only scales well in the number of parameters, but also efficiently handles distributions with strong correlations.
Table 2. Performance summary of each algorithm on the generalized Bayesian posterior example. The term |$(\pm \ldots)$| is the error estimate of our effective sample size estimators. The path length is averaged over each iteration.

| | ESS per 100 samples | ESS per minute | Path length | Iteration time |
|---|---|---|---|---|
| DHMC (identity) | 26.3 (± 3.2) | 76 | 25 | 972 |
| Metropolis | 0.00809 (± 0.0018) | 0.227 | 1 | 1 |
| Metropolis-within-Gibbs | 0.514 (± 0.039) | 39.8 | 1 | 36.2 |

DHMC, discontinuous Hamiltonian Monte Carlo; ESS, effective sample size; Iteration time, the computational time per iteration of each algorithm relative to the fastest one.
Acknowledgement
Dunson was supported by the National Science Foundation and Office of Naval Research. Lu was supported by the National Science Foundation.
Supplementary material available at Biometrika online includes the proofs and additional theoretical results on properties of discontinuous Hamiltonian dynamics, the algorithm’s connection to the zig-zag sampler, error analysis of the discontinuous dynamics integrator, as well as additional numerical results.
References
Stan Development Team (