Hugh Dickinson, Dominic Adams, Vihang Mehta, Claudia Scarlata, Lucy Fortson, Stephen Serjeant, Coleman Krawczyk, Sandor Kruk, Chris Lintott, Kameswara Bharadwaj Mantha, Brooke D Simmons, Mike Walmsley, Galaxy Zoo: Clump Scout – Design and first application of a two-dimensional aggregation tool for citizen science, Monthly Notices of the Royal Astronomical Society, Volume 517, Issue 4, December 2022, Pages 5882–5911, https://doi.org/10.1093/mnras/stac2919
ABSTRACT
Galaxy Zoo: Clump Scout is a web-based citizen science project designed to identify and spatially locate giant star-forming clumps in galaxies that were imaged by the Sloan Digital Sky Survey Legacy Survey. We present a statistically driven software framework that is designed to aggregate two-dimensional annotations of clump locations provided by multiple independent Galaxy Zoo: Clump Scout volunteers and generate a consensus label that identifies the locations of probable clumps within each galaxy. The statistical model on which our framework is based allows us to assign false-positive probabilities to each of the clumps we identify, to estimate the skill levels of each of the volunteers who contribute to Galaxy Zoo: Clump Scout, and to quantitatively assess the reliability of the consensus labels that are derived for each subject. We apply our framework to a data set containing 3 561 454 two-dimensional points, which constitute 1 739 259 annotations of 85 286 distinct subjects provided by 20 999 volunteers. Using this data set, we identify 128 100 potential clumps distributed among 44 126 galaxies. This data set can be used to study the prevalence and demographics of giant star-forming clumps in low-redshift galaxies. The code for our aggregation software framework is publicly available at https://github.com/ou-astrophysics/BoxAggregator
1 INTRODUCTION
One of the main goals for modern observational cosmology is to discover and understand how galaxies and their constituent substructures have assembled and evolved throughout cosmic history.
During the last two decades, a large body of observational data has been assembled, which shows strong evidence for a substantial evolution in the dominant mode of star formation in galaxies between z ∼ 3 and 0.2 (e.g. Madau & Dickinson 2014; Murata et al. 2014; Guo et al. 2015; Shibuya et al. 2016; Guo et al. 2018).
Early observations using the Hubble Space Telescope (HST) revealed that typical massive galaxies (|$M_{\star }\gtrsim 10^{10}\mathrm{M}_{\odot }$|), populating the z ∼ 2 star forming main sequence (Noeske et al. 2007), exhibit thick, gas-rich, clumpy discs with star formation rates |$\dot{M}_{\star }\sim 100\mathrm{M}_{\odot }\, \mathrm{yr}^{-1}$| (e.g. Elmegreen, Elmegreen & Sheets 2004a; Elmegreen, Elmegreen & Hirst 2004b; Genzel et al. 2011). Many of these z ∼ 2 galaxies were found to exhibit discrete, subgalactic regions of enhanced star formation (hereafter referred to as ‘clumps’) with apparent radii |${\lesssim}1\,{\rm kpc}$| and stellar masses |$M_{\star } \gtrsim 10^7\mathrm{M}_{\odot }$| (Elmegreen 2007). More recent evidence suggests that these clumps may in fact be aggregations of smaller substructures that could not be resolved by HST (e.g. Wuyts et al. 2014; Fisher et al. 2017; Dessauges-Zavadsky & Adamo 2018), but this remains to be confirmed. The prevalence of giant star-forming clumps at high redshift and the overall characteristics of their host galaxies are in stark contrast with the thin, uniform and generally quiescent (|$\dot{M}_{\star }\sim 1\,\mathrm{M}_{\odot }\, \mathrm{yr}^{-1}$|) disc morphologies that prevail among star-forming galaxies in the local Universe (e.g. Simard et al. 2011; Willett et al. 2013a).
The mechanisms that drove this evolution of star formation activity, their onset epochs and the time-scales over which they operated, remain to be fully established. If they can be accurately determined, the abundances of clumps within galaxies at different redshifts, together with their spatial distributions and intrinsic properties, provide obvious diagnostics for the transition from clumpy to more diffuse star formation. Historically, the most extensive surveys of clumpy star formation have relied on HST imaging and focused on intermediate and high-redshift galaxies (e.g. Murata et al. 2014; Guo et al. 2015; Guo et al. 2018). A common conclusion of these studies is that the overall fraction of massive (|$M_{\star }\gtrsim 10^{9.5}\mathrm{M}_{\odot }$|), clumpy star forming galaxies decreases rapidly for z ≲ 2 and falls below ∼5 per cent by z ∼ 0.2.
The scarcity of clumpy galaxies in the local Universe makes the task of identifying them in large numbers much more challenging, and related studies at low redshift have entailed focused investigations of small samples containing ∼50 galaxies, or fewer (see, however, Mehta et al. 2021). Identifying enough low-redshift clumpy galaxies to enable accurate inference of their overall population demographics and characteristics requires wide-field imaging surveys that encompass a large fraction of the sky and a reliable method for discovering candidate systems. In recent years, extensive ground-based surveys like the Sloan Digital Sky Survey Legacy Survey (SDSS; York et al. 2000) and the Dark Energy Camera Legacy Survey (DECaLS; Dey et al. 2019) have delivered publicly available wide-field imaging data that make systematic searches for large numbers of low-redshift clumpy galaxies possible. Galaxy Zoo: Clump Scout (Adams et al. 2022) is a citizen science project that used SDSS imaging data and was designed to let volunteers from the general public identify clumpy galaxies and the clumps they contain. Multiple volunteers inspect images of galaxies and provide two-dimensional annotations marking the locations of any clumps the galaxies contain.
One of the most challenging aspects of collecting data using a citizen science approach is calibrating the reliability of the responses that volunteers provide. Translating astrophysical analyses into a citizen science context can be difficult because the subject matter and related concepts are often not familiar to non-experts. This unfamiliarity can result in annotations that are noisy with large variations between the responses of different volunteers. The traditional approach for mitigating such noise is to collect a large number of independent annotations and derive an average result representing the overall consensus between volunteers. This has two obvious disadvantages: firstly, volunteer effort may be wasted if more responses are accumulated than are actually required to mitigate the variation between responses and secondly, even after a large number of responses have been collected, there is no formal guarantee that the consensus is accurate or sufficiently precise.
To address these issues, more quantitative approaches have been developed that attempt to infer statistical estimates for the reliability of consensus derived from citizen science annotations and classifications. For example, Marshall et al. (2016) developed the Space Warps Analysis Pipeline (SWAP) which used a binomial model for a simple true-or-false response to derive a Bayesian estimate for the probability that astrophysical images included signatures of strong gravitational lensing. The SWAP algorithm was also used by Wright et al. (2017) to accelerate consensus for citizen-science classification of potential supernova flashes and assign false-alarm probabilities to candidate events. Later, Beck et al. (2018) showed that applying SWAP to galaxy morphology labels collected via the Galaxy Zoo platform (Lintott et al. 2008; Willett et al. 2013b) increased the rate of classification by 500 per cent and reduced the volunteer effort that was required by a factor of ∼6.5, relative to the Galaxy Zoo standard requirement for 40 volunteers to inspect each galaxy.
In this paper, we build on the principle of SWAP and develop an aggregation approach to derive quantitative estimates for the reliability of two-dimensional labels of clump locations within galaxies based on annotations provided by Galaxy Zoo: Clump Scout volunteers. Like SWAP, we rely on a statistical model to derive probabilistic estimates for several quantities that determine the reliability of a label that represents the consensus of multiple independent annotations. Two-dimensional annotations are more complex than the simple binary classification tasks that SWAP was designed to process, and our statistical model is necessarily also more complicated. We base our approach on a method that was initially presented by Branson, Van Horn & Perona (2017) (hereafter BVP17), who tested their algorithm on small and relatively noise-free annotation data sets that contained a few thousand annotations and were collected from paid workers on the Amazon Mechanical Turk platform.1 We have developed a new implementation of this algorithm that is computationally efficient enough to process millions of independent annotations provided for tens of thousands of images by the Galaxy Zoo: Clump Scout volunteers. Our goal is to find out whether this algorithm can be used successfully to derive complicated two-dimensional labels with quantitative reliability estimates in a mass-participation citizen-science context using noisy annotations provided by a cohort of non-expert volunteers. We also aim to determine whether the reliability estimates we derive can be used to accelerate the labelling process and reduce the amount of volunteer effort that is required to accurately label the clumps in each galaxy.
The remainder of this paper is organized as follows: In Section 2, we describe how the imaging data presented to volunteers in Galaxy Zoo: Clump Scout were selected and prepared. In Section 3, we outline the annotation workflow that volunteers used to annotate the images and the training they received. In Section 4, we provide details of the statistical model that underpins our aggregation algorithm. In Section 5, we explain how our algorithm actually computes the labels it derives. In Section 6, we present the results of applying our algorithm to the Galaxy Zoo: Clump Scout data and analyse the quantitative reliability metrics that are generated. In Section 7, we discuss the implications of these results in the context of the goals outlined above and the suitability of citizen science as a method for complex astrophysical image analysis. Finally, in Section 8, we summarize our findings and conclude.
2 DATA
In this section, we briefly describe the galaxy selection criteria and the image preparation pipeline used for Galaxy Zoo: Clump Scout. A much more detailed description is provided by Adams et al. (2022).
2.1 Galaxy image selection
The galaxy images used in Galaxy Zoo: Clump Scout comprise three subsets of the sample that was visually inspected and morphologically classified by volunteers contributing to the Galaxy Zoo 2 (GZ2) citizen science project (Willett et al. 2013a). The criteria that were used to select these subsets are described in detail in Adams et al. (2022). For convenience, this section summarizes the most relevant properties of the galaxies that were inspected by the Galaxy Zoo: Clump Scout volunteers.
A primary sample of 53 613 galaxies with 0.02 ≤ z ≤ 0.25 was selected based on the morphological labels provided by GZ2 volunteers. We anticipated that the presence of obvious star-forming clumps in images of smooth elliptical galaxies was very unlikely so for this primary sample, we limited our selection to galaxies for which more than 50 per cent of volunteers responded negatively2 to the question ‘Is the galaxy simply smooth and rounded, with no sign of a disk?’.
To estimate the number of clumpy galaxies that were observed by SDSS, but which were excluded from our primary sample, we also include a smaller, secondary sample. This sample contains 4937 galaxies for which fewer than 50 per cent of GZ2 volunteers identified features or a disc and was selected within a more restricted redshift range 0.02 ≤ z ≤ 0.075.
Finally, Galaxy Zoo: Clump Scout volunteers also annotated a sample of 26 736 galaxies matching the selection criteria used for the primary sample, but which had simulated emission from clumps with known photometric and physical properties superimposed (see Adams et al. 2022, for details of the simulation procedure). Annotations of these simulated clumps were used by Adams et al. (2022) to derive an estimate of the Galaxy Zoo: Clump Scout sample completeness for clumps with specified photometric properties.
Stellar mass estimates for galaxies in all three samples were taken from the SDSS DR7 MPA-JHU value-added catalogue (Kauffmann et al. 2003; Brinchmann et al. 2004). All three samples include galaxies with stellar masses |$10^{8.5}\mathrm{M}_{\odot } \lesssim M_{\star } \lesssim 10^{12}\mathrm{M}_{\odot }$|.
2.2 Galaxy image preparation
3 COLLECTING ANNOTATIONS
To identify the locations of clumps within their host galaxies, we designed a web-based citizen science project using the Zooniverse project builder interface.4
3.1 Volunteer training
For non-expert volunteers, identifying genuine clumps among the potentially complex features of their host galaxies can be daunting. To improve volunteers’ confidence and help them to provide accurate annotations, we provided several pedagogical and training resources. Following the approach of other Zooniverse projects, we designed a detailed practical tutorial explaining each step of the annotation workflow. This tutorial was automatically presented to volunteers when they joined the project and remained available for reference thereafter. Additional reference images and explanatory text were provided using the Field Guide feature of the Zooniverse interface. A separate About section of the project provided pedagogical material explaining the scientific motivation of the project. Finally, to guide the progress of first-time volunteers, we provided expert labels for a small subset of our galaxy images. Ten such images were interspersed with decreasing frequency among the first ∼20 subjects that each volunteer inspected. We implemented a system to provide real-time feedback for volunteer annotations of expert-labelled galaxy images and inform them if they missed genuine clumps or mistakenly annotated an object that experts had disregarded. This feedback system was designed to refine volunteers’ expectations regarding the visual appearance of genuine clumps during the early stages of their engagement with the project.
3.2 The annotation workflow
Volunteers following the Galaxy Zoo: Clump Scout workflow inspect a sequence of single subject galaxy images (hereafter ‘subjects’) that are randomly drawn from a global subject set. The subject selection ensures that no volunteer inspects the same image more than once and each subject is inspected by a group of approximately 20 volunteers. Each volunteer first annotates the two-dimensional location of the central bulge of the central galaxy in the image if it is visible, before proceeding to annotate the locations of any clumps they can discern. To mitigate against the possibility that volunteers would disregard genuine clumps with appearances that confound their expectations, we provided an opportunity to mark clumps as ‘unusual’. We investigate the impact of including or discarding this unusual clump subset in Section 6.
The full Galaxy Zoo: Clump Scout data set contains 3 561 454 click locations, which constitute 1 739 259 annotations of 85 286 distinct subjects provided by 20 999 volunteers.
3.3 Initial annotation processing
We expect that even the largest individual clumps will be at best marginally resolved for the lowest redshift galaxies in our data sample. This implies that almost all clumps will appear as point sources with a light profile equal to the instrumental point-spread function (PSF). Our data preparation procedure (Section 2.2) results in subject images that have different pixel sampling of the PSF depending on the angular size of the central host galaxy. To account for this fact, we transform the two-dimensional point estimates for clump locations that volunteers provide into square boxes with side-length equal to twice the full width at half maximum (FWHM) of the pertinent subject’s PSF. Assigning a finite, instrumentally motivated clump extension allows us to identify groups of volunteer clicks with separations that are smaller than the PSF. A prior assumption of our data aggregation approach is that it is impossible for a single volunteer to mark separate clumps within the same subject that are closer than twice the PSF FWHM.5 It is likely that any such multiplets that volunteers do provide represent noise peaks in contrast-enhanced subject images or are simply accidents. In Section 4, we describe how our aggregation algorithm effectively deduplicates multiple nearby annotations by individual volunteers.
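The point-to-box conversion described above can be sketched as follows. This is a minimal illustration in which the function name and the box representation (corner coordinates in pixels) are our own assumptions, not the released BoxAggregator interface:

```python
def click_to_box(x, y, psf_fwhm):
    """Expand a volunteer click at (x, y) into a square box of
    side-length 2 * psf_fwhm, returned as (x_min, y_min, x_max, y_max).

    All quantities are in pixel units of the subject image, so the box
    size tracks the image-specific sampling of the PSF.
    """
    half_side = psf_fwhm  # half of the 2 * FWHM side-length
    return (x - half_side, y - half_side, x + half_side, y + half_side)
```

For example, a click at (10, 20) in an image whose PSF FWHM is 1.5 pixels becomes the box (8.5, 18.5, 11.5, 21.5).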
3.4 A scale-free distance metric

Geometric illustration of the ratio between the area of the intersection between two boxes (dotted region) and the area of their union (dashed region). We use the complement of this ratio as a scale-free distance metric bounded between zero and unity.
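The metric illustrated in the caption is the Jaccard distance: one minus the intersection-over-union of the two boxes. A sketch for axis-aligned boxes stored as (x_min, y_min, x_max, y_max) tuples (our own hypothetical representation) might look like:

```python
def jaccard_distance(box_a, box_b):
    """Complement of intersection-over-union for two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max). Returns a value in [0, 1]:
    0 for identical boxes, 1 for disjoint boxes.
    """
    # Corners of the intersection rectangle (if any)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return 1.0 - inter / union
```

Identical boxes give 0, disjoint boxes give 1, and partial overlaps fall in between, independent of the absolute box scale.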
4 DATA AGGREGATION MODEL
The core of our data aggregation approach is based on a custom implementation of the probabilistic model and algorithm proposed by BVP17. In this section, we present a detailed description of the model, and explain how it is used to optimize the efficiency of clump detection using the volunteers’ annotations. We recognize that this paper contains a lot of somewhat complicated notation, so to aid the reader we have included a reference table of the most commonly recurring symbols in Table C1.
4.1 Overview
We construct a global model that simultaneously considers NS individual elements of the full subject set |$S\equiv \lbrace s_{i}\rbrace _{i=1}^{N_{\mathrm{S}}}$| and individual members of the entire volunteer cohort V. Each subject si ∈ S is inspected by a randomly selected group of volunteers Vi ⊆ V, who each provide a set of independent two-dimensional annotations of visible clump locations |$Z_{i}\equiv \lbrace z_{ij}\rbrace _{j=1}^{|V_{i}|}$|. Throughout this paper, we will use the notation |X| to denote the number of elements in the set X, so here |Vi| denotes the number of volunteers who annotate the subject si. For convenience, we define Sj ⊆ S to denote the subset of subjects that are inspected by the jth volunteer. For every subject si, we define a true label yi to encode the unknown locations of all real clumps in the image. Using only the information provided by the global set of volunteer annotations |$Z\equiv \bigcup _{i}Z_{i}$|, we wish to derive a separate estimated label |$\hat{y}_{i}$| for each subject that closely approximates yi. Our goal is to minimize the mismatch between |$\hat{y}_{i}$| and yi, while keeping the number of volunteers who annotate the subject si as small as possible, and thereby to optimize our use of volunteers’ effort. We facilitate this aim by computing a ‘risk’ metric |$\mathcal {R}_{i}$| for each subject that represents a weighted combination of quantitative magnitude estimates for several sources of approximation error in the estimated label (see Section 5.7 for more details). We expect that the risk for a particular subject will decrease as the number of volunteer annotations for that subject increases. Accordingly, by choosing an appropriate global risk threshold |$\mathcal {R}_{i} < \tau$|, we aim to be able to confidently retire individual subjects from the classification pool as soon as the expected error is acceptably small.
This approach differs from many traditional crowd-sourcing techniques, which require a fixed number of volunteers to inspect each subject. Such approaches are generally less efficient because stable consensus between volunteers is often achieved before the prescribed number of annotations have been gathered. An additional benefit of our approach is that particularly difficult subjects can be segregated for expert inspection if their risk remains high after many volunteers have inspected the subject.
4.2 Associating subject annotations with subject labels
Each of the volunteer annotations zij ∈ Zi forms a set of |Bij| ≥ 0 square boxes |$z_{ij}=\big\lbrace b_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| that encodes the locations of any clumps that the volunteer perceived in the subject si. Analogously, we model the true clump locations for si as an abstract set of |Bi| ≥ 0 rectangular boxes such that |$y_{i}\equiv \big\lbrace b_{i}^{l}\big\rbrace _{l=1}^{|B_{i}|}$|. The concrete sizes and shapes of these boxes are ultimately determined by our aggregation algorithm, but for subject si they are guaranteed to be at least as large as the boxes comprising the volunteer annotations for that subject. Our goal is to associate each of the click locations corresponding to volunteer annotations for a particular subject with a single true clump location. Formally, we aim to associate each of the concrete elements of Zi with a single abstract element of yi. This task is complicated for several reasons. Different volunteers may annotate different subsets of clumps and the order in which they do so is not defined nor even constrained. Volunteers may miss some real clumps, so there may be elements of yi that have no counterpart annotations in a particular zij. Conversely, the set of annotations provided by a particular volunteer for a particular subject may contain false positives, so some elements of a particular zij may not correspond with any elements of yi.
Fig. 2 provides a schematic illustration of the process by which we associate volunteer annotations with probable clump locations and Section 5.3 explains the notation and the computational details. Formally, our aggregation algorithm computes an optimal set of mapping indices |$\big\lbrace a_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| such that each volunteer-provided box |$b_{ij}^{k}\in z_{ij}$| is associated with real clump location |$b_{i}^{a_{ij}^{k}}\in y_{i}$|. The possibility of false positive boxes in zij is accounted for by defining a singleton ‘|$\varnothing$|’ element to which they can be associated.

Schematic illustration of how elements of volunteers’ annotations are associated with elements of the subject label yi. We illustrate a case in which three volunteers have provided three independent annotations of the same subject. Volunteers 1 and 2 both annotate subsets of the real clumps in the image. Volunteer 3 mistakenly marks two foreground stars as clumps. The central column lists the value of |$\lbrace a_{ij}^{k}\rbrace$| computed for each of the boxes forming the volunteers’ annotations. For volunteers 1 and 2, these values define the index of the corresponding box in yi. Both annotations provided by volunteer 3 probably mark foreground stars and neither is marked by another volunteer. In this toy example, the algorithm maps both to the ‘|$\varnothing$|’ element, thereby defining them as false positives.
4.3 Modelling volunteer skill
For a given subject, the visibility of clumps to a particular volunteer, and the positional accuracy with which they are able to annotate the clumps they do perceive, is likely to be influenced by several factors. These may include: domain expertise, experience gained from time spent contributing to Galaxy Zoo: Clump Scout, confusion regarding the detailed task instructions, and even the screen size and resolution of the device they typically use to provide annotations.
To model the impact of these factors, we consider three scenarios, which relate a particular volunteer’s annotations to the locations of real clumps in the subject image. Consider the annotations provided by the jth volunteer in our cohort. In the first scenario, the volunteer provides a true positive by marking a location that corresponds to a real clump. The positional accuracy of such annotations is governed by the volunteer-specific parameter σj.
In the second scenario, the volunteer provides a false positive by marking a location which does not correspond to the location of a real clump. We model the rate of false positive annotations for volunteer j by considering each mark they provide as a Bernoulli trial with ‘success’ probability |$p_{j}^{\mathrm{fp}}$|.
Finally, volunteer j may provide an implicit false negative by failing to mark the location of a real clump. We model the false negative rate for volunteer j by considering each opportunity to mark a real clump location as a Bernoulli trial with ‘success’ probability |$p_{j}^{\mathrm{fn}}$|.
Hereafter, we refer collectively to the three model parameters |$\mathcal {S}_{j} \equiv \big\lbrace \sigma _{j}, p_{j}^{\mathrm{fp}}, p_{j}^{\mathrm{fn}}\big\rbrace$| as volunteer j’s ‘skill’ parameters.
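To make the three scenarios concrete, the sketch below simulates the clicks that a volunteer with skill parameters |$\mathcal {S}_{j}$| might produce for one subject. The generative details here (Gaussian positional scatter of width σj for true positives, at most one uniformly placed spurious click per image) are simplifying assumptions of ours for illustration, not the exact likelihood used by the model:

```python
import random

def simulate_annotation(true_clumps, p_fp, p_fn, sigma, image_size=100.0, rng=None):
    """Simulate one volunteer's clicks for a subject (illustration only).

    true_clumps : list of (x, y) real clump locations in pixels.
    p_fp        : chance of adding a spurious (false positive) click.
    p_fn        : chance of missing (false negative) any given real clump.
    sigma       : positional scatter, in pixels, of true positive clicks.
    """
    rng = rng or random.Random()
    clicks = []
    for x, y in true_clumps:
        # Bernoulli trial: the clump is missed with probability p_fn
        if rng.random() >= p_fn:
            # True positive, with Gaussian positional scatter
            clicks.append((rng.gauss(x, sigma), rng.gauss(y, sigma)))
    # Bernoulli trial: a spurious click is added with probability p_fp
    if rng.random() < p_fp:
        clicks.append((rng.uniform(0.0, image_size), rng.uniform(0.0, image_size)))
    return clicks
```

A perfectly skilled volunteer (p_fp = p_fn = 0, σ = 0) reproduces the true clump locations exactly; degrading any parameter produces the noisy annotations the aggregation model must contend with.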
4.4 Modelling subject difficulty
4.5 Modelling volunteer annotations
The third term considers the Jaccard distances dkl between any true positive (i.e. |$a_{ij}^{k}\ne \varnothing$|) box |$b_{ij}^{k}\in z_{ij}$| and their counterparts |$b_{i}^{l}\in y_{i}$| as well as the subject’s difficulty |$\mathcal {D}_{i}$| and the volunteer’s skill |$\mathcal {S}_{j}$|.
4.6 Global model and parameter priors
|$\pi (\mathcal {D}_{i})$| models the prior probabilities of observing the difficulty parameters associated with the ith subject.
|$\pi (\mathcal {S}_{j})$| models the prior probability of observing the volunteer skill parameters associated with the jth volunteer.
π(yi) models the prior probability that the unknown true label for si is yi. For simplicity, we assume that all possible labels are equally likely.
For practical reasons, we choose prior distributions for each parameter that are the conjugate priors6 of that parameter for the corresponding likelihood model distribution. This choice facilitates straightforward computation of model parameter updates when new annotations are collected.
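For example, the Bernoulli likelihoods for |$p_{j}^{\mathrm{fp}}$| and |$p_{j}^{\mathrm{fn}}$| have Beta conjugate priors, for which a posterior update reduces to adding counts. The sketch below assumes, as the hyper-parameter notation suggests, that each prior is specified by a prior mean (e.g. |$p_{0}^{\mathrm{fp}}$|) and a pseudo-count weight (e.g. |$n_{\beta }^{\mathrm{fp}}$|); the precise parameterization used by the framework is given in Appendix A:

```python
def update_beta_posterior(alpha, beta, successes, trials):
    """Conjugate update of a Beta(alpha, beta) prior on a Bernoulli
    rate after observing `successes` in `trials` new trials."""
    return alpha + successes, beta + (trials - successes)

# A prior mean p0 = 0.1 with pseudo-count weight n_beta = 500 maps to
# Beta(p0 * n_beta, (1 - p0) * n_beta) = Beta(50, 450), whose mean is 0.1.
alpha, beta = 0.1 * 500, 0.9 * 500
# Observing 30 false positives in 100 marks pulls the estimate towards 0.3.
alpha, beta = update_beta_posterior(alpha, beta, successes=30, trials=100)
posterior_mean = alpha / (alpha + beta)  # (50 + 30) / (500 + 100)
```

Because the pseudo-count assigned to the false negative prior is much smaller (n_beta^fn = 50 versus 500), the corresponding rate estimate becomes data-dominated after far fewer annotations.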
The initial values for the parameters of our prior models |$\big\lbrace p_{0}^{\mathrm{fp}}, p_{0}^{\mathrm{fn}}, n_{\beta }^{\mathrm{fp}}, n_{\beta }^{\mathrm{fn}}, \sigma _{0,S}^{2}, n_{\chi , S}, \sigma _{0,V}^{2}, n_{\chi , V}\big\rbrace$| are hyper-parameters of our algorithm which must be chosen a priori. Table 1 lists the values that we assign to each of these hyper-parameters when processing the Galaxy Zoo: Clump Scout data set.
Framework hyper-parameter values used to process the Galaxy Zoo: Clump Scout data set.

Parameter | Value
---|---
|$p_{0}^{\mathrm{fp}}$| | 0.1
|$p_{0}^{\mathrm{fn}}$| | 0.1
|$n_{\beta }^{\mathrm{fp}}$| | 500
|$n_{\beta }^{\mathrm{fn}}$| | 50
|$\sigma _{0,S}^{2}$| | 0.1
nχ, S | 10
|$\sigma _{0,V}^{2}$| | 0.1
nχ, V | 10
fV | 0.1
dmax | 0.9
In Appendix A, we provide detailed rationale for our choice of prior distribution models and show how they yield estimates for our likelihood model parameters that become increasingly data-dominated as more annotations are collected.
5 COMPUTING AGGREGATED LABELS
Fig. 3 provides a schematic overview of how our implementation computes aggregated labels for subjects. In subsequent subsections we describe the illustrated operations in detail.

5.1 The working batch
To minimize the dependence of aggregated clump locations on our choice of model prior hyper-parameters, we design our aggregation framework to process elements from a dynamically maintained working batch containing data and metadata for ≲ 25 thousand classifications.7 Each element in the working batch represents a single click location marking a clump as part of the annotation provided by a single volunteer.
To populate the working batch, we select subjects that have been inspected by at least three volunteers and have at least one annotated clump. For each selected subject, we assemble all its available annotation data and append them to the working batch in a single block of elements. This ensures that any subject retirement decision is made on the basis of all available information. We specify a minimum target batch size and new blocks are added until the size of the working batch exceeds this target. If five or more volunteers inspect a subject and none annotate a clump, we assume that no clumps are present and preemptively retire the subject instead of adding its data to the working batch. Whenever a volunteer inspects a subject that has at least one clump annotation, but does not annotate any clumps themselves, we append a single empty classification element to the working batch. We require records of these empty classifications in order to compute the probability that a particular volunteer fails to annotate a real clump, i.e. |$p_{j}^{\mathrm{fn}}$|.
After processing a single batch of classification data, the most likely outcome is that only a subset of the corresponding subjects will have |$\mathcal {R}_{i} < \tau$| (see Section 4.1 and Section 5.7) and be deemed sufficiently low risk for retirement. We update the working batch by removing the classification data for retired subjects and replenishing it with new blocks of classification data for active subjects. Once a subject is retired, the aggregated estimated label |$\hat{y}_{i}$| is considered final and any subsequently submitted classifications for that subject will not be included in subsequent batches.
We impose a maximum lifetime for any data element by specifying the maximum number of batch replenishment cycles that it can persist within the working batch. Subjects whose data remain after this lifetime has expired are retired and flagged for inspection by experts. This forced retirement strategy prevents the working batch from becoming stale and dominated by inherently difficult or high-risk subjects that never retire normally.
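The batch lifecycle described in this subsection can be summarized in a short sketch. This is an illustrative simplification under our own assumptions (the batch is keyed by subject, the target size counts blocks rather than individual classifications, and `compute_risk` stands in for the risk computation of Section 5.7); it is not the released implementation:

```python
def replenishment_cycle(batch, compute_risk, tau, max_age, pending, target_size):
    """Run one working-batch update cycle (illustrative sketch).

    batch       : dict mapping subject id -> {"block": ..., "age": int}.
    compute_risk: returns the risk R_i for a subject's annotation block.
    tau         : global risk threshold for normal retirement.
    max_age     : replenishment cycles a block may survive before being
                  force-retired and flagged for expert inspection.
    pending     : queue of (subject_id, block) pairs awaiting processing.
    target_size : minimum number of blocks kept in the working batch.
    """
    retired, flagged = [], []
    for sid in list(batch):
        entry = batch[sid]
        entry["age"] += 1
        if compute_risk(entry["block"]) < tau:   # low risk: retire normally
            retired.append(sid)
            del batch[sid]
        elif entry["age"] >= max_age:            # stale: force-retire for experts
            flagged.append(sid)
            del batch[sid]
    while len(batch) < target_size and pending:  # top up with new blocks
        sid, block = pending.pop(0)
        batch[sid] = {"block": block, "age": 0}
    return retired, flagged
```

Subjects leaving by the first branch are retired normally; subjects leaving by the second are force-retired and queued for expert inspection.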
5.2 Initialization
To complete the initialization phase for each new working batch, we use the algorithm described in Section 5.3 to perform preliminary clustering of overlapping volunteer annotations for each subject. The subsequent subsections explain how we apply iterative expectation maximization to refine the initial clusters, while simultaneously computing the maximum likelihood solution of equation (9).
5.3 Computing box associations
For each subject si ∈ S, we follow the approach of BVP17 and implement a Facility Location algorithm (Mahdian et al. 2001) to approximately8 derive the maximum likelihood mapping |$A=\big\lbrace a_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| between the click locations comprising individual volunteers’ annotations |$z_{ij} = \big\lbrace b_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| and the set |$y_{i} = \big\lbrace b_{i}^{l}\big\rbrace _{l=1}^{|B_{i}|}$| (see Section 4.2 and Fig. 2).
Facility location algorithms form clusters with a specific topology comprising one or more cities, uniquely connected to a single, central facility.9 This topology is illustrated in Fig. 4.

Top: The topology of the clusters that are assembled by the Facility Location algorithm. In this case, the set of boxes has been partitioned into three clusters. Within each cluster, the central facility (F 1-3) is connected to one or more cities (C 1-5). Each city is connected to exactly one facility. Bottom: Possible arrangement of aggregated box clusters corresponding to the illustrated topology for an image after inspection by three volunteers. Blue boxes |$b_{i}^{l}$| correspond to facilities (F 1-3) and red boxes |$b_{ij}^{k}$| correspond with the cities (C 1-5). Note that each volunteer may contribute at most one box to each cluster and in this case the same volunteer contributed the boxes that were assigned facility status.
Our implementation finds disjoint, spatially concentrated subsets of the boxes in Zi, which we then identify with true clump locations |$b_{i}^{l}\in y_{i}$|. We label each of these aggregated clusters with the index l and denote them as |$Z_{i}^{l}$|. Establishing a new cluster entails labelling a particular box |$b_{ij}^{k}\in Z_{i}$| as a facility and connecting at least one other box |$b_{ij^{\prime }}^{k^{\prime }}$| that was provided by a different volunteer. Note that by associating box |$b_{ij}^{k}$| with cluster |$Z_{i}^{l}$| as either a city or a facility, we establish the mapping |$a_{ij}^{k} = l$|. Each box in the set of volunteer annotations is associated with at most one true clump and each subset may contain at most one box per volunteer. These constraints reflect our assumption that separate marks provided by the same volunteer are intended to indicate separate clumps.
The facility location algorithm is designed to compute the box-to-cluster mapping that minimizes Ci, which simultaneously yields the approximate maximum likelihood solution of equation (5) for given volunteer skill and image difficulty parameters.
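A minimal sketch of this clustering step, assuming a greedy variant of the facility location heuristic with a Jaccard-style connection cost; the cost constants, box format (x1, y1, x2, y2), and function names are illustrative and do not reproduce the paper's exact cost equations:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_facility_location(boxes, volunteers, open_cost=1.0, d_max=0.9):
    """Greedy sketch: repeatedly open the facility whose cluster yields the
    largest cost saving. Each cluster admits at most one box per volunteer,
    mirroring the constraint described in the text."""
    unassigned = set(range(len(boxes)))
    clusters = {}
    while unassigned:
        best = None
        for f in sorted(unassigned):
            members, used = [f], {volunteers[f]}
            # Connect nearby boxes (small Jaccard distance) from other volunteers.
            for c in sorted(unassigned - {f},
                            key=lambda c: 1.0 - iou(boxes[f], boxes[c])):
                if (1.0 - iou(boxes[f], boxes[c])) <= d_max and volunteers[c] not in used:
                    members.append(c)
                    used.add(volunteers[c])
            # Saving: connections gained minus the cost of opening this facility.
            saving = (len(members) - 1) - open_cost
            if best is None or saving > best[0]:
                best = (saving, f, members)
        _, f, members = best
        clusters[f] = members
        unassigned -= set(members)
    return clusters
```

The one-box-per-volunteer rule is what prevents a single enthusiastic volunteer from manufacturing a cluster on their own: a cluster only forms when marks from different volunteers coincide.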
To derive the aggregated estimate for the subject label |$\hat{y}_{i}$|, we merge the individual boxes comprising each cluster by computing the mean coordinates of their corresponding vertex indices.11 This yields a rectangular representation for each true clump location that is at least as large as each of the boxes comprising the set of annotations for the ith subject, Zi.
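The merging step reduces to a coordinate-wise mean over a cluster's member boxes; a minimal sketch, assuming boxes are stored as (x1, y1, x2, y2) tuples:

```python
def merge_cluster(boxes):
    """Aggregate a cluster of boxes by averaging each vertex coordinate
    across the cluster's members."""
    return tuple(sum(v) / len(boxes) for v in zip(*boxes))

# Two overlapping annotations merge into an intermediate consensus box.
merged = merge_cluster([(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)])
```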
Table 1 lists the values we adopt for fV and dmax .
5.4 Computing image difficulty
5.5 Computing volunteer skill
As a consequence of our prior specifications the formulations of equations (29), (30), and (31) can all be factored into terms that depend only on the current working batch and terms that depend only on prior information. This allows us to straightforwardly update the skill parameters of returning volunteers without having to reconsider the annotations they contributed to previous working batches.
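Under a conjugate Beta-Binomial reading of this factorization, which we assume here purely for illustration (it is not the paper's exact equations 29-31), the update reduces to carrying forward pseudo-counts that encode all prior batches and adding only the current batch's counts:

```python
def update_skill(prior_alpha, prior_beta, batch_fp, batch_tp):
    """Factored skill update sketch: the running Beta pseudo-counts summarize
    every previous working batch, so a returning volunteer's false-positive
    rate is refreshed using only the counts from the current batch."""
    alpha = prior_alpha + batch_fp   # false-positive events this batch
    beta = prior_beta + batch_tp     # true-positive events this batch
    p_fp = alpha / (alpha + beta)    # posterior-mean skill estimate
    return alpha, beta, p_fp
```

Because the posterior pseudo-counts are sufficient statistics, annotations from earlier batches never need to be revisited.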
5.6 Computing maximum likelihood labels
Once the associated clusters have been defined and the subject difficulties and volunteer skills have been computed, we are able to compute the likelihood of each subject’s estimated label using equations (8), (6), and (5). Practically, we compute the log-likelihood for each subject, and sum these to derive a global likelihood for all annotation data that comprise the current working batch.
Recall (Section 5.3) that we use a simplified set of facility location costs to derive an initial clustering solution for each new working batch. These costs are used for initialization because they can be computed without having estimated volunteer skills or subject difficulties, but they will generally not yield a set of clusters that correspond with the maximum likelihood solution of equation (5) for any subject. Similarly, the likelihood model parameters that we compute based on the initial clustering solution are unlikely to be good estimates of the subject difficulties or volunteer skills. As illustrated by the red boxes in Fig. 3, we use an iterative approach to derive the maximum likelihood solution for equation (5) and the corresponding best estimates of the likelihood model parameters.
After the initial set of volunteer skills have been computed, we recompute the box associations for all subjects using the nominal facility location costs specified in equations (16), (17), and (19). Using these clusters we recompute the likelihood model parameters and the corresponding subject label likelihoods. We repeat this procedure until the sum of log-likelihoods for all subjects converges to its maximum value.
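This iteration can be sketched as a generic E/M-style loop; the three callables below stand in for the framework's actual association, parameter, and likelihood computations, which are not reproduced here:

```python
def iterate_to_convergence(e_step, m_step, loglike, params,
                           tol=1e-8, max_iter=200):
    """Alternate cluster (E) and parameter (M) updates until the summed
    log-likelihood stops improving. A sketch of the paper's scheme, with
    the domain-specific steps supplied as callables."""
    prev = -float("inf")
    for _ in range(max_iter):
        assoc = e_step(params)          # recompute box associations
        params = m_step(assoc)          # recompute skills / difficulties
        ll = loglike(assoc, params)     # summed log-likelihood
        if abs(ll - prev) < tol:
            break
        prev = ll
    return params, ll

# Toy usage: estimating a mean, where the "E-step" is trivial.
data = [1.0, 2.0, 3.0]
params, ll = iterate_to_convergence(
    lambda m: data,
    lambda d: sum(d) / len(d),
    lambda d, m: -sum((x - m) ** 2 for x in d),
    params=0.0)
```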
5.7 Computing subject risks
In Section 4.1, we introduced the concept of a ‘risk’ metric |$\mathcal {R}_{i}$| that can be computed for any subject si and used to quantitatively determine whether the estimated label |$\hat{y}_{i}$| is sufficiently representative of the unknown true label yi to be scientifically useful. Specifying a risk that decreases monotonically as the reliability of |$\hat{y}_{i}$| increases enables a principled decision to retire the subject si when its risk falls below a predefined threshold value, which we denote τ.
The weight terms αfp, αfn, and ασ are hyper-parameters that allow the properties of the clump sample for retired subjects to be tuned for particular scientific investigations. For a specific value of τ, increasing the value of αfp relative to the other weights will result in a purer clump sample, while a relative increase in αfn increases the sample completeness. Specifying a larger value for ασ will result in more accurate clump locations, which may be useful for studies considering the radial distribution of clumps within their host galaxies.
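Assuming equation (32) is the linear weighted combination that these weights suggest (an assumption made for illustration), the overall subject risk can be sketched as:

```python
def subject_risk(n_fp, n_fn, n_sigma, a_fp=1.0, a_fn=1.0, a_sigma=2.0):
    """Overall subject risk as a weighted sum of the expected error counts.
    Default weights follow Table 2; the linear form mirrors the role of
    equation (32) but is not guaranteed to be its exact expression."""
    return a_fp * n_fp + a_fn * n_fn + a_sigma * n_sigma
```

Raising one weight relative to the others makes the corresponding error type more expensive, so subjects retire later unless that error component is suppressed.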
Our approach for estimating the expected number of genuine clumps that are not represented in the estimated label for the ith subject (i.e. the number of false negatives) emulates the one used by BVP17. We begin by using the facility location algorithm to re-cluster the annotations for each subject, subject to three additional constraints that are based on the original maximum likelihood solution.
(i) Volunteer boxes that were originally associated with true positive clusters are not considered as potential cities. This means that the only way that true positive annotations can contribute to clusters is by becoming facilities.
(ii) Only annotations that were not defined as facilities originally are considered as potential facilities. This prevents rediscovery of the clumps that were indicated by the maximum likelihood solution for the subject.
(iii) There is no dummy facility available, so all annotations must either become a facility or connect to an existing facility, regardless of how high the connection or establishment costs are.

Computing the random coincidence probability using all boxes in the working batch. Left-hand panel: Shaded boxes represent all elements in the first working batch. Solid boundaries indicate groups of boxes that coincided using the dmax = 0.9 criterion. Note that large boxes may validly encompass all or most of smaller ones without coinciding if the ratio of the box areas in normalized coordinates is less than dmax = 0.9. Boxes that did not coincide with any others are shown using dashed lines. Middle panel: The elements of B∩ coloured according to the number of boxes they were found to coincide with. Right-hand panel: Two-dimensional map showing the mean probability that one or more boxes in the working batch will accidentally coincide at a given two-dimensional location.
5.8 Subject retirement and batch finalization
Computing the expected false positive, false negative and inaccurate true positive counts (i.e. |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, and |$N_{i}^{\sigma }(\delta)$|) independently for each subject allows us to define a compound retirement criterion that specifies maximum permissible values, |$N_{i,\max }^{\mathrm{fp}}$|, |$N_{i,\max }^{\mathrm{fn}}$|, and |$N_{i,\max }^{\sigma }$|, for each of these quantities as well as a threshold τ on the overall subject risk. Table 2 lists the thresholds we use in practice as well as the values we adopt for the coefficients specified in equation (32).
Parameters used to determine subject retirement and compute overall subject risk.
Parameter | Value
---|---
αfp | 1
αfn | 1
ασ | 2
δ | 0.5
|$N_{i,\max }^{\mathrm{fp}}$| | 1
|$N_{i,\max }^{\mathrm{fn}}$| | 0.3
|$N_{i,\max }^{\sigma }$| | 3
τ | 5
Once the subject risks have been computed, we retire those subjects for which the overall risk |$\mathcal {R}_{i} < \tau$| and |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, and |$N_{i}^{\sigma }$| are all less than their specified maximum permissible values, before removing their elements from the working batch. We also identify and remove any stale subject data that have persisted for the maximum allowed number of batch replenishment cycles without retiring. Such subjects are likely very difficult or complicated, so we mark them for expert inspection, assessment and labelling. For the remaining subjects that were not retired, we re-initialise their difficulty parameters and discard any associated clusters that were established when the working batch was processed.
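The compound retirement test, using the threshold values listed in Table 2 as defaults, can be sketched as a simple predicate:

```python
def should_retire(risk, n_fp, n_fn, n_sigma,
                  tau=5.0, max_fp=1.0, max_fn=0.3, max_sigma=3.0):
    """Compound retirement criterion: a subject retires only when its
    overall risk AND every expected error count fall below the permitted
    maxima (defaults from Table 2)."""
    return (risk < tau
            and n_fp < max_fp
            and n_fn < max_fn
            and n_sigma < max_sigma)
```

A subject with acceptable overall risk can therefore still be held back if any single error component remains too large.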
Annotation data that were provided by a single volunteer for different subjects can appear in separate working batches, especially if volunteers return to the project regularly over an extended period of time. It is also possible that only a subset of the subjects annotated by a volunteer in a single working batch are retired when batch processing completes. If a volunteer’s annotation data persist between batches, those persistent data should not be used to update volunteer skills multiple times during multiple batch processing cycles. This could lead to pathological subjects unfairly inflating or reducing the skill parameter values (|$p_{j}^{\mathrm{fp}}$|, |$p_{j}^{\mathrm{fn}}$|, |$\sigma ^{2}_{j}$|) for a particular volunteer. To avoid this scenario, we restore the volunteer skills that were cached at the start of the latest cycle and update them using only annotations for subjects that did retire.
The batch processing cycle then restarts by acquiring new annotation data and repopulating the working batch.
6 RESULTS
Recall that the full Galaxy Zoo: Clump Scout data set (Z) contains 3 561 454 click locations, which constitute 1 739 259 annotations of 85 286 distinct subjects provided by 20 999 volunteers, and that approximately 20 volunteers inspected each subject. Using this data set, we identify 128 100 potential clumps distributed among 44 126 galaxies. Fig. 6 shows five examples of galaxies in which clumps were detected.

Examples of clump-hosting galaxies, illustrating the ability of our framework to exclude false-positive annotations. The left-hand column shows galaxy images as they were seen by volunteers. The second column overlays all volunteer annotations on a grey-scale image of the same galaxy. In the third column, volunteer annotations that were assigned to a facility and identified as clumps are shown in colour. Annotations that were assigned to the dummy facility are shown in black. The fourth column shows the clump locations that we ultimately identify.
6.1 Testing the effect of volunteer multiplicity
We expect that the performance of our aggregation framework will vary depending upon the number of volunteers who inspect each subject. To investigate this dependence, we assemble 17 subsamples of annotations |$\lbrace \tilde{Z}_{n}\rbrace _{n=3}^{20}\in Z$|, that contain between 3 and 20 annotations per galaxy. Each |$\tilde{Z}_{n}$| is constructed by randomly sampling n annotations for each subject si ∈ S. For example, |$\tilde{Z}_{5}$| includes five randomly sampled annotations for each galaxy in the Galaxy Zoo: Clump Scout subject set. We then use our aggregation framework to derive the set of corresponding estimated subject labels |$\hat{Y}(\tilde{Z}_{n})\equiv \lbrace \hat{y}_{i,n} \rbrace _{i=1}^{\vert S\vert }$|, where |$\hat{y}_{i,n} = \hat{y}_{i}(Z=\tilde{Z}_{n})$| is the label for si based only on the n annotations for that subject within |$\tilde{Z}_{n}$|.15 In subsequent sections, we will examine the differences between results derived using these different restricted data sets. Note that the data set containing 20 annotations per subject, denoted |$\tilde{Z}_{20}$|, is not quite the full Galaxy Zoo: Clump Scout data set Z because the Zooniverse interface occasionally collects more than 20 annotations per subject.
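The construction of each restricted data set |$\tilde{Z}_{n}$| amounts to per-subject sampling without replacement; a minimal sketch, in which the dictionary layout and fixed seed are assumptions for illustration:

```python
import random

def subsample_annotations(annotations_by_subject, n, seed=0):
    """Build a restricted data set by drawing n annotations uniformly at
    random (without replacement) for every subject; subjects with fewer
    than n annotations keep everything they have."""
    rng = random.Random(seed)
    return {sid: rng.sample(anns, min(n, len(anns)))
            for sid, anns in annotations_by_subject.items()}
```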
6.2 Aggregated clump properties
Our aggregation algorithm assigns a separate false positive probability |$p^{\mathrm{fp}}_{l}$| to each clump it identifies (see Section 5.7). The left-hand panel of Fig. 7 shows the distribution of this false positive probability for clumps detected using 20 annotations per subject, which is strongly bimodal with ≈90 per cent of clumps having |$p^{\mathrm{fp}}_{l} < 0.2$| or |$p^{\mathrm{fp}}_{l} > 0.8$|. The right-hand panel shows how the distribution of the false positive probabilities for all identified clumps evolves as more volunteers annotate each subject. For fewer than five annotations per subject (i.e. n ≲ 5), the estimates for the clumps’ false positive probabilities remain somewhat prior-dominated and the distributions are unimodal with medians close to the hyper-parameter value |$p_{0}^{\mathrm{fp}}=0.1$|. For more than five annotations per subject (i.e. n > 5), the distributions become progressively more bimodal, which increases their interquartile ranges. The distribution medians decrease monotonically as the number of annotations per subject n → 20, which indicates that providing more volunteer annotations per subject allows our framework to more confidently predict the presence of clumps.

Left-hand panel: Distribution of estimated false positive probability |$p^{\mathrm{fp}}_{l}$| for clumps identified using 20 annotations per subject (i.e. using |$\tilde{Z}_{20}$|). The distribution is strongly bimodal with ≈90 per cent of clumps having |$p^{\mathrm{fp}}_{l} < 0.2$| or |$p^{\mathrm{fp}}_{l} > 0.8$|. The inset shows the distribution for |$p^{\mathrm{fp}}_{l} < 0.01$|. Right-hand panel: Distributions of |$p^{\mathrm{fp}}_{l}$| corresponding to n between 3 and 20 volunteer annotations per subject. The distribution medians decrease monotonically from ≈0.04 for n = 3 to ≈5 × 10−5 for n = 20, while the distribution interquartile ranges become wider as more volunteers annotate each subject. We use a ‘logistic’ scaling for the y-axis to highlight the development of the bimodal structure for large n. Note that the colour scale shows the number density of clumps to account for the fact that the two-dimensional histogram bins cover different areas.
For every bounding box in each subject’s maximum likelihood label, we also compute the probability |$p^{\sigma }_{l}$| that the Jaccard distance between it and the unknown true location of the clump exceeds δ = 0.5. The left-hand panel of Fig. 8 shows the distribution of |$p^{\sigma }_{l}$| for clumps detected using 20 annotations per subject, while the right-hand panel shows how the distribution of |$p^{\sigma }_{l}$| evolves as more volunteers annotate each subject. Again, our model priors appear to dominate for fewer than five annotations per subject and the distribution medians decrease monotonically as the number of annotations per subject n → 20. This pattern indicates that providing more volunteer annotations per subject allows our framework to more precisely determine the locations of clumps.

Left-hand panel: Distribution of the estimated probability that an individual clump location is inaccurate (|$p^{\sigma }_{l}$|) for clumps identified using 20 annotations per subject (i.e. using |$\tilde{Z}_{20}$|). The distribution is concentrated close to zero with all clumps having |$p^{\sigma }_{l}\lesssim 0.3$|. The inset shows the distribution for |$p^{\sigma }_{l} < 0.01$|. Right-hand panel: Distributions of |$p^{\sigma }_{l}$| corresponding to n between 3 and 20 volunteer annotations per subject. The distribution medians decrease monotonically from ≈0.05 for n = 3 to ≈4 × 10−4 for n = 20, while the distribution interquartile ranges become wider as more volunteers annotate each subject. Note that the colour scale shows the number density of clumps to account for the fact that the two-dimensional histogram bins cover different areas.
Fig. 9 illustrates the spatial distribution of the detected clump locations in bins of estimated clump false positive probability |$p^{\mathrm{fp}}_{l}$|. We observe that 99.9 per cent of clumps with |$p^{\mathrm{fp}}_{l}\lesssim 0.5$| (i.e. likely true positives) are located within a central circular region occupying 20 per cent of the area of their corresponding images. In contrast, clumps with |$p^{\mathrm{fp}}_{l}\gtrsim 0.5$| (i.e. likely false positives) are 10 times more likely to fall outside this region. This central concentration of confidently identified clumps is reassuring because it reflects the typical footprints of the target galaxies in each subject image, which is where we would reasonably expect to find genuine clumps. For all clumps, regardless of their estimated false positive probability, we observe a clear under-density at the centre of the distribution, which likely reflects the fact that most volunteers correctly distinguish the target galaxies’ central bulges from clumps.

Detected clump locations in normalized image coordinates in bins of estimated clump false positive probability, pfp. For pfp ≲ 0.5, 99.9 per cent of detected clumps have |$R_{\mathrm{clump}}=\sqrt{(X_{\mathrm{clump}}/X_{\max })^{2}+(Y_{\mathrm{clump}}/Y_{\max })^{2}} < 0.25$|. In contrast, 10 times more clumps (∼1 per cent) with |$p^{\mathrm{fp}}_{l}\gtrsim 0.5$| have Rclump > 0.25.
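The Rclump statistic quoted in this caption is a normalized radial distance; a minimal helper implementing that definition:

```python
import math

def radial_position(x, y, x_max, y_max):
    """Normalized radial coordinate of a clump within its image,
    following the R_clump definition in the Fig. 9 caption."""
    return math.hypot(x / x_max, y / y_max)
```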
6.3 Comparison with expert annotations
To quantify the degree of correspondence between the clumps identified by volunteers and those identified by professional astronomers, we used the Galaxy Zoo: Clump Scout interface to collect annotations from three expert astronomers for 1000 randomly selected subjects and compared the recovered clump locations with those derived from volunteer clicks by our aggregation framework.
For each subject in this expert-annotated image set, we consider the 17 different estimated labels |$\hat{y}_{i,n}$| that were computed using 3 ≤ n ≤ 20 volunteer annotations per subject (see Section 6.1). We then filter each of these 17 labels by selecting a subsample of its bounding boxes that have associated false positive probabilities |$p^{\mathrm{fp}}_{l}$| that are less than a selectable threshold value, which we denote p⋆, fp. By setting p⋆, fp close to zero, we expect to select only the bounding boxes that mark real clumps. Conversely, we expect that setting p⋆, fp close to one results in a subsample that is likely to contain more false positive bounding boxes. We use the symbol |$\hat{Y}^{\star }_{n}(p^{\star ,\mathrm{fp}})$| to denote the set of estimated labels for all expert-annotated subjects that were computed using n volunteer annotations per subject and filtered to include only those bounding boxes with false positive probabilities less than p⋆, fp.
For a particular false positive filtering threshold p⋆, fp and number of annotations per subject n, we consider the filtered labels for all 1000 expert-annotated subjects and define |$N_{n}^{\mathrm{FP}}$| to be the total number of empirically false positive aggregated clump bounding boxes in |$\hat{Y}^{\star }_{n}(p^{\star ,\mathrm{fp}})$| that contain zero expert click locations. Conversely, |$N_{n}^{\mathrm{FN}}$| denotes the total number of expert clicks located outside of any aggregated box, which we designate as false negatives. We identify the remaining |$N_{n}^{\mathrm{TP}}$| aggregated boxes that coincided with an expert click location as true positives.
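Assuming |$\mathcal{C}_{n}$| and |$\mathcal{P}_{n}$| follow the standard recall and precision definitions (which the text's usage suggests), they derive from these counts as:

```python
def completeness_purity(n_tp, n_fp, n_fn):
    """Completeness (recall) and purity (precision) from empirical
    true positive, false positive and false negative counts."""
    completeness = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    purity = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    return completeness, purity
```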
Fig. 10 illustrates how the completeness and purity of our aggregated clump sample depend on n. In the left-hand panel, we plot |$\mathcal {C}_{n}$| and |$\mathcal {P}_{n}$| values derived using the whole expert-identified clump sample as a ground-truth set. The values plotted in the right-hand panel are derived by comparing a restricted set of nominally normal ground-truth clumps, which experts did not identify as ‘unusual’ (see Section 3.2), with aggregated clumps that the majority of identifying volunteers classified as normal in appearance. In both panels, the crosses show the ‘optimal’ completeness and purity values that maximize the hypotenuse |$\sqrt{\mathcal {C}(p^{\star ,\mathrm{fp}})^{2} + \mathcal {P}(p^{\star ,\mathrm{fp}})^{2}}$| over all possible p⋆, fp thresholds. For comparison, the square and triangular points in Fig. 10, respectively, illustrate the maximum values of completeness and purity that can be achieved independently.

Purity versus completeness for different numbers of volunteers per subject. The left-hand panel shows values derived using the full volunteer label sets and all expert-identified clumps as a benchmark sample, while the values shown in the right-hand panel are derived by comparing the sets of clumps which experts and volunteers identified as ‘normal’. Squares indicate the values of the maximum possible completeness and purity for each number of volunteers, which can generally not be realized simultaneously. Crosses indicate the optimal completeness and purity values that can be simultaneously realized for each volunteer count.
For both the full and the restricted ground truth sets, we observe a general trend that increasing the number of volunteers who inspect each subject increases the optimal sample completeness at the expense of reducing purity. Using the expert classifications as a benchmark, it is clear that our most complete aggregated clump samples suffer substantial contamination. In the most extreme case, using the ‘normal’ clump comparison sets for n = 20 and letting p⋆, fp = 1 yields ∼97 per cent completeness, but only ∼35 per cent purity. The high level of contamination indicates that volunteers are much more optimistic than experts when annotating clumps, i.e. volunteers will mark features that experts will ignore. Moreover, while completeness values generally improve when comparing the restricted ‘normal’ clump samples, the corresponding purity values are substantially worse than those derived from the full clump samples. This degradation in purity for the ‘normal’ clump subset likely indicates that volunteers and experts disagree about the definition of a ‘normal’ clump, with volunteers being less likely to label a clump as unusual.
The top row of Fig. 11 shows the g, r, and i band flux distributions16 for aggregated clumps that are empirically determined to be false positive and true positive when comparing them with expert clump annotations. To better represent the appearance of the clumps that volunteers and experts actually see, the band-limited fluxes shown in Fig. 11 are independently scaled in the same way as the corresponding bands of the Galaxy Zoo: Clump Scout subject images (see Section 2.2). The distributions reveal that empirically false-positive clumps are ∼5–10 times fainter on average than empirically true positive clumps. The bottom row of Fig. 11 shows all non-redundant flux ratios for the g, r, and i bands. In general, the empirically false positive clumps are brighter in the g band and would appear bluer in the subject images. Overall, the distributions in Fig. 11 suggest that volunteers are more likely to mark faint features than experts, particularly when those features appear blue. Fig. 18 shows typical examples of the faint blue features that volunteers annotate but experts ignore.

Top row: Flux distributions in g, r, and i bands for clumps that are empirically determined to be false positive or true positive by comparing with expert clump annotations. Dashed vertical lines indicate the distribution means. Bottom row: Flux ratio distributions for clumps that are empirically determined to be false positive or true positive by comparing with expert clump annotations. Dashed vertical lines indicate the distribution medians. In both rows, the fluxes in each band are scaled in the same way as the corresponding bands of the subject images (see Section 2.2) to better reflect the data that volunteers actually see.
Fig. 12 illustrates the degree of correspondence between the value of |$p^{\mathrm{fp}}_{l}$| assigned to each clump by our aggregation framework and their empirical categorization as true or false positives. The figure compares the distributions of |$p^{\mathrm{fp}}_{l}$| for empirically true positive and false positive clumps identified using all available annotations for the expert-annotated subject set. The distributions represent the restricted subset of clumps in |$\hat{Y}_{20}$| that the majority of volunteers labelled as ‘normal’. However, we recognize that volunteers and experts may disagree about what criteria define a ‘normal’ clump. Therefore, to avoid conflating this categorical disagreement with genuine cases in which experts and volunteers mark different features (regardless of the annotation tool used), we consider any expert-identified clump when assigning true-positive or false-positive labels. The majority of aggregated clumps in both categories have very low estimated false positive probabilities (|$p^{\mathrm{fp}}_{l}\ll 1$|), indicating a high degree of consensus between volunteers, even though this consensus disagrees with the expert annotations. Although clumps in both empirical categories have estimated |$p^{\mathrm{fp}}_{l}$| values spanning the full range [0, 1], we note that 95 per cent of empirically true-positive clumps have |$p^{\mathrm{fp}}_{l} < 0.3$| compared with only 68 per cent of empirical false positives. This reinforces the evidence implicit in Fig. 10 that the aggregated clump sample can be made purer with respect to the expert sample by applying a threshold on |$p^{\mathrm{fp}}_{l}$|.

Distribution of estimated clump false positive probability (|$p^{\mathrm{fp}}_{l}$|) values for aggregated clump locations that coincide with expert annotations (orange) and those that did not (blue). We use coincidence with any expert clump to establish the true-positive or false-positive categories, but only aggregated clumps that the majority of volunteers labelled as ‘normal’ are considered. The inset shows a zoomed view of the distributions for |$p^{\mathrm{fp}}_{l} < 0.01$|.
6.4 Volunteer skill parameters
Our aggregation framework allows us to monitor the evolution of volunteers’ skill parameters as they spend time in the project. The top panel of Fig. 13 shows the distribution of the Galaxy Zoo: Clump Scout volunteers’ subject classification counts. The distribution is bottom-heavy with a median of three subjects per volunteer and 19 859 volunteers (∼95 per cent) annotating fewer than 10 images, and only 176 volunteers (∼0.8 per cent) annotating more than 200.17 The remaining panels of Fig. 13 illustrate how our estimates of the volunteers’ skill parameters evolve as volunteers inspect and annotate increasing numbers of subjects. For all three skill parameters, the mean and median of the maximum likelihood estimates increase monotonically from their prior values as volunteers annotate more subjects. The relatively slow evolution of |$p^{\mathrm{fp}}_{j}$| for subject inspection counts below ∼10 reflects the strong regularization that results from setting the hyper-parameter |$n_{\beta }^{\mathrm{fp}}=500$| (see Table 1).

Evolution of volunteer skill parameter statistics versus number of subjects inspected. The top panel shows the distribution of the number of volunteers who have inspected at least as many subjects as indicated by the upper boundary of each bin. This means volunteers who annotate many subjects will contribute to several bins. However, their skill parameters are sampled at the point that they had inspected the maximum number of subjects represented by a particular bin. Statistics for the different volunteer skill parameters |$p_{j}^{\mathrm{fp}}$|, |$p_{j}^{\mathrm{fn}}$| and σj are shown in the upper-middle, lower-middle, and bottom panels, respectively. Red and blue markers plot the median and mean skill parameter of all volunteers contributing to a particular bin, respectively. The orange band illustrates the inter-quartile ranges of the bin-wise distributions. Dotted and dashed lines indicate the 5th and 95th percentiles, respectively.
6.5 Subject risk and its components
The distributions shown in Fig. 14 reveal how the expected numbers of false positive bounding boxes |$N_{i}^{\mathrm{fp}}$|, missed clumps (or false negatives) |$N_{i}^{\mathrm{fn}}$| and inaccurate clump locations |$N_{i}^{\mathrm{\sigma }}$| (see Section 5.7) evolve for the subjects in the Galaxy Zoo: Clump Scout subject set as more volunteers annotate them. For the majority of subjects, our framework estimates values less than one for all risk components, regardless of how many volunteers annotated them. The distributions of |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, and |$N_{i}^{\mathrm{\sigma }}$| become broader and their median values decrease monotonically as n → 20. This pattern indicates that for the majority of subjects, increasing the number of volunteers who annotate each subject improves the reliability of their consensus labels.

Evolution of the distributions for components of subject risk as the number of volunteer annotations per subject increases. Distributions for the expected numbers of false positive bounding boxes |$N_{i}^{\mathrm{fp}}$|, missed clumps (or false negatives) |$N_{i}^{\mathrm{fn}}$|, and inaccurate clump locations |$N_{i}^{\mathrm{\sigma }}$| are shown in the upper left, lower left-hand, and lower right-hand panels, respectively. The upper right-hand panel shows the distributions of |$N_{i}^{\mathrm{fp}}$| after discarding individual clumps with false positive probabilities |$p^{\mathrm{fp}}_{l} > 0.85$|. Note that the y-axis changes from logarithmic to linear scaling at the values indicated by the black horizontal dashed lines to better illustrate the evolution of structures in each distribution.
A minority of subjects have estimated values for one or more of |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, or |$N_{i}^{\mathrm{\sigma }}$| that are greater than one. For this subset of subjects, their associated risk component distributions appear to stabilize after five or more volunteers have annotated each subject. We suggest that estimates for subjects that are annotated by fewer than five volunteers (i.e. for n ≲ 5) are noise-dominated or prior-dominated and somewhat unreliable. The structure that is visible in the distributions of |$N_{i}^{\mathrm{fp}}$| in the upper-left-hand panel is produced by a strong bimodality in the distribution of false positive probabilities (|$p^{\mathrm{fp}}_{l}$|) for the clumps in the corresponding sets of estimated labels (i.e. the clumps in the corresponding |$\hat{Y}_{n}$| – see Fig. 7). For each clump in the estimated label for a particular subject, its false positive probability is very likely to be close to zero or one. The expected number of false positive clumps in a subject’s estimated label is derived by summing a term that includes these probabilities in its denominator, so the distributions of |$N_{i}^{\mathrm{fp}}$| will naturally be concentrated into peaks around integer values. Similar structures that are visible in the distributions of |$N_{i}^{\mathrm{fn}}$| are produced by a strong bimodality in the summand in equation (39). The fraction of subjects for which |$N_{i}^{\mathrm{fp}} > 1$| peaks at ∼10 per cent for n = 15 and decreases to ∼8 per cent for n ∼ 20. In contrast, the fraction of subjects for which |$N_{i}^{\mathrm{fn}} > 1$| does not peak, but increases quasi-monotonically to reach ∼2 per cent as n → 20. The fraction of subjects for which |$N_{i}^{\sigma } > 1$| is negligible and <0.05 per cent for all n.
The overall median values for the estimated numbers of missed clumps and inaccurate clump locations per subject both decrease monotonically as the number of volunteers who inspect each subject increases. However, the overall median for the expected number of false positive clumps per subject increases slowly until n = 13 before beginning to decrease. We assess the feasibility of reducing |$N_{i}^{\mathrm{fp}}$| by discarding aggregated clumps with high individual false positive probabilities. The upper right-hand panel of Fig. 14 shows the effect of filtering clumps with |$p^{\mathrm{fp}}_{l} > 0.85$| on the distribution of |$N_{i}^{\mathrm{fp}}$|. Applying this filter substantially reduces the estimated number of false positive clumps after five or more volunteers annotate each subject and moreover, the fraction of subjects for which the expected number of false positive clumps per subject exceeds one now peaks at ∼0.1 per cent for n = 5 and decreases rapidly thereafter.
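The filtering step itself is a simple threshold on each clump's estimated false positive probability; a minimal sketch in which the 'p_fp' field name is an assumption about how clump records are stored:

```python
def filter_clumps(clumps, p_fp_max=0.85):
    """Discard aggregated clumps whose estimated false positive
    probability exceeds the chosen threshold (0.85 in the text).
    Each clump record is assumed to carry a 'p_fp' field."""
    return [c for c in clumps if c["p_fp"] <= p_fp_max]

# Example: one likely false positive is removed.
kept = filter_clumps([{"p_fp": 0.01}, {"p_fp": 0.90}, {"p_fp": 0.50}])
```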
We note that filtering clumps based solely on their estimated false-positive probabilities may inadvertently discard real clumps if |$p^{\mathrm{fp}}_{l}$| does not correlate appropriately with observable quantities like brightness and colour that can indicate whether a particular feature is a genuine clump or spurious.18 Fig. 15 illustrates the overall effect of discarding clumps with individual false positive probabilities larger than 0.85 on the number of clumps per galaxy that our framework identifies using different numbers of volunteer annotations per subject. The impact is strongest for n ≳ 7 but the overall effect is small with ≲0.5 fewer clumps identified per galaxy. The left-hand panel of Fig. 16 plots fluxes in the g, r, and i bands versus the estimated individual false positive probability (|$p^{\mathrm{fp}}_{l}$|) for all clumps that our framework identifies using 20 annotations per subject. In all three bands, the mean flux of clumps with |$p^{\mathrm{fp}}_{l} < 0.2$| is ∼1.5 times larger than the mean flux for clumps with |$p^{\mathrm{fp}}_{l} > 0.2$|. The right-hand panel of Fig. 16 plots the non-redundant flux ratios i/g, r/g, and i/r versus |$p^{\mathrm{fp}}_{l}$|. On average, clumps with low estimated false positive probability appear brighter in bluer bands. Overall, we observe a pattern whereby clumps that appear brighter and bluer in the subject images tend to have lower |$p^{\mathrm{fp}}_{l}$|. We verified that this pattern does not change significantly when clumps are filtered according to the fraction of volunteers that labelled them as ‘unusual’. This is reassuring because real clumps are expected to be bright and blue in colour and suggests that filtering clumps based on |$p^{\mathrm{fp}}_{l}$| is well motivated physically. 
The correlations with flux and colour also resemble the empirical patterns described in Section 6.3, where we observed that the sample of clumps that coincided with expert clump annotations were brighter and bluer than the sample of clumps that did not.
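The per-band flux contrast between confident and less confident clumps can be reproduced schematically. The snippet below is a hypothetical sketch: the function name, the toy data, and the 0.2 split are illustrative (the split value is taken from the discussion above, not from the released catalogue schema).

```python
import numpy as np

def mean_flux_contrast(flux, p_fp, split=0.2):
    """Ratio of the mean flux of confident clumps (p_fp < split) to
    that of less confident clumps (p_fp >= split) in a single band."""
    flux = np.asarray(flux, dtype=float)
    p_fp = np.asarray(p_fp, dtype=float)
    confident = flux[p_fp < split]
    doubtful = flux[p_fp >= split]
    return confident.mean() / doubtful.mean()

# Toy data: confident clumps carry roughly 1.5x the mean flux,
# mimicking the contrast reported for the g, r, and i bands.
flux = np.array([3.0, 2.9, 3.1, 2.0, 1.9, 2.1])
p_fp = np.array([0.05, 0.10, 0.15, 0.60, 0.80, 0.90])
contrast = mean_flux_contrast(flux, p_fp)  # ~1.5
```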

Evolution of the distribution of the number of clumps per galaxy as more volunteers inspect and annotate each subject. The red markers and lines plot the distribution means for the different numbers of volunteers per subject. Top panel: Number of clumps per galaxy with any value for their estimated false positive probability |$p^{\mathrm{fp}}_{l}$|. Bottom panel: Number of clumps per galaxy with |$p^{\mathrm{fp}}_{l} < 0.85$|.

Left-hand panel: Clump flux in the g, r, and i bands versus estimated clump false positive probability |$p^{\mathrm{fp}}_{l}$|. Right-hand panel: Clump flux ratios between the g, r, and i bands versus estimated clump |$p^{\mathrm{fp}}_{l}$|. In both panels, the fluxes in each band are scaled in the same way as the corresponding bands of the subject images (see Section 2.2) to better reflect the data that volunteers actually see. On average, clumps with low |$p^{\mathrm{fp}}_{l}$| appear brighter and bluer in the subject images.
Fig. 17 shows how the fractions of subjects that are retired for different reasons vary as more volunteers annotate each subject. More than 90 per cent of subjects meet the subject retirement criterion specified in Section 5.8 regardless of how many volunteers annotate each subject. Of the remaining subjects, ∼7–9 per cent become stale after persisting in the working batch for more than 10 replenishment cycles and are removed. The fraction of stale subjects peaks for n = 6 annotations per subject and decreases monotonically thereafter as more annotations per subject are used. Fewer than 1 per cent of subjects failed to retire for any n. The fraction of unretired subjects is maximally 0.9 per cent for n = 3 and falls to <0.1 per cent for n = 20. We comment that for n < 6 the computation of |$\mathcal {R}$| and its components is likely to be dominated by our model priors, and therefore the apparent decrease in the number of stale subjects should probably not be interpreted as improved performance within this domain.

Fraction of subjects retired for different reasons versus number of volunteers per subject.
7 DISCUSSION
Using the annotations provided by the Galaxy Zoo: Clump Scout volunteers, our framework has identified a large catalogue of potential clumps. In addition, our aggregation framework provides quantitative metrics for the reliability of the estimated subject labels it computes. These diagnostics allow us to better understand how volunteers interpreted the definition of a clump that they were provided with and how they executed the annotation task.
The observable properties of the clumps we detect appear plausible, both in terms of their spatial distribution within the subject images and their fluxes in the SDSS g, r, and i bands. The central concentration of confidently identified clumps in Fig. 9 is reassuring, because it reflects the typical footprints of the target galaxies in each subject image, which is where we would reasonably expect to find genuine clumps. For clumps with any estimated false positive probability |$p^{\mathrm{fp}}_{l}$|, we observe a clear under-density at the centre of the distribution, which likely reflects the fact that most volunteers correctly distinguish the target galaxies’ central bulges from clumps.
The clump flux and colour distributions in Fig. 16 reveal that brighter, bluer clumps tend to have lower false positive probabilities (|$p^{\mathrm{fp}}_{l}$|). This trend is also reassuring because real clumps are expected to be bright and blue in colour and suggests that filtering clumps based on |$p^{\mathrm{fp}}_{l}$| is well motivated physically. The correlations with flux and colour also resemble the empirical patterns described in Section 6.3, where we observed that the sample of clumps that coincided with expert clump annotations were brighter and bluer than the sample of clumps that did not.
By comparing expert labels for 1000 subjects with those estimated by our framework using volunteer annotations, we showed that volunteers are much more optimistic than experts when annotating clumps. Overall, the distributions in Fig. 11 suggest that volunteers are more likely to mark faint features than experts, particularly when those features appear blue. This results in aggregated clump samples for the 1000 test subjects that appear quite heavily contaminated with respect to the expert labels. Moreover, this apparent contamination worsens if clumps that experts or the majority of volunteers labelled as ‘unusual’ are discarded. This degradation in purity for the ‘normal’ clump subset likely indicates that volunteers and experts disagree about the definition of a ‘normal’ clump, with volunteers being less likely to label a clump as unusual.
Using Fig. 12, we illustrated that our framework tends to estimate lower false positive probabilities for clumps that were marked by both volunteers and experts. The formulation of our likelihood model means that smaller estimated false positive probabilities correlate broadly with a greater degree of consensus between skilled volunteers that a clump exists at a particular location. Therefore, it seems that while many volunteers mark features that experts would not identify as clumps, features that experts do mark tend to have also been marked by a majority of the more skilled volunteers who inspected the corresponding subject. The correlation between clumps’ false positive probabilities and their expert classifications also reinforces the evidence implicit in Fig. 10 that the aggregated clump sample can be made purer with respect to the expert sample by applying a threshold on |$p^{\mathrm{fp}}_{l}$|.
We note that while using a visual labelling approach to identify clumps provides more flexibility than relying on a fixed set of brightness or colour thresholds, it is also unavoidably subjective. To illustrate how this subjectivity may be impacting the empirically determined purity and completeness of our clump sample, Fig. 18 shows typical examples of the faint blue features that volunteers annotate but experts ignore. Many of these do appear clump-like and it is not always obvious why experts have not marked them. Based on these observations, we suggest that the sample of clumps identified by our framework using volunteer annotations may not be as severely contaminated as Fig. 10 implies. We also note that the clump samples our framework derives are generally very complete and include the majority of expert-labelled clumps. This means that subsets of clumps for particular scientific analyses can be selected from a nominally impure sample using physically motivated criteria based on directly observable or derived characteristics of the individual clumps. For example, Adams et al. (2022) derive samples of bright clumps by using criteria based on photometry extracted from clumps and their host galaxies.

Six curated but representative examples of subject images that show agreements and disagreements between volunteers and experts. Features labelled as clumps by volunteers but ignored by experts are highlighted by white boxes. Red boxes highlight features that were annotated by both experts and volunteers. Red circles highlight features that were annotated by experts but not by volunteers. Volunteers tend to mark fainter features than experts, particularly if those features appear blue in colour. None of the features highlighted in this figure were labelled as ‘odd’ by a majority of volunteers or the experts who marked them.
In addition to providing quantitative estimates for the reliability of individual clump labels, our framework allows us to investigate the performance of individual volunteers and the entire volunteer cohort. The positive gradients of the skill parameter evolution curves in Fig. 13 decrease with increasing number of subjects inspected (their second derivatives are negative except in the final bin, which contains relatively few volunteers). This suggests that the volunteer skill parameters may converge to stable asymptotic values for very large numbers of inspected subjects. The fact that this convergence was not achieved for the Galaxy Zoo: Clump Scout data set likely indicates that the global maximum likelihood solution is dominated by the large number of volunteers who inspect very few images and may provide noisy annotations due to their relative inexperience.
The noisiness of volunteer annotations probably indicates that identifying clumps within star forming galaxies, which can have complex underlying morphologies, is relatively difficult for inexperienced non-experts. In Section 6.4, we noted that most volunteers only annotated a small number of galaxies and may not have had time to learn the visible characteristics of genuine clumps. While it may be the case that the task of clump identification is too difficult for typical Zooniverse volunteers, this seems unlikely and there are several plausible strategies for making complex and subtle image analysis tasks more feasible for citizen scientists. The most obvious is to improve the amount and quality of the initial training that is provided to volunteers. However, Zooniverse volunteers are accustomed to participating in projects with minimal tutorial material, so imposing a more rigorous training requirement may discourage widespread participation. As discussed in Section 3.1, the volunteers who contributed to Galaxy Zoo: Clump Scout received real-time feedback for a small number of expert-labelled subjects that they annotated during the early stages of their participation. Providing more detailed feedback for a larger sample of subjects may help volunteers to better understand the task they are being asked to perform.
Some Zooniverse projects also provide a dedicated tutorial workflow with an accompanying video tutorial in which experts annotate the same subjects that volunteers see and explain their reasoning.19 When using feedback as a training tool, it is important that the feedback subjects contain galaxies and clumps that are properly representative of the global populations within the full subject set, but it is difficult to ensure that this is the case unless the experts themselves inspect a large number of subjects. Moreover, the feedback messages that volunteers receive must be carefully chosen to avoid discouraging volunteers if their annotations disagree with those of experts.
An alternative to explicit training and feedback that was pioneered by the Gravity Spy project20 involves incrementally increasing the difficulty of subjects that volunteers inspect and annotate as they spend longer engaged with the project and their skill improves (Zevin et al. 2017). Using this ‘leveling up’ approach requires an a priori metric for the relative difficulty of subjects for volunteers, as well as ongoing assessment of volunteers’ skills. While our framework naturally fulfills the latter requirement, it does not facilitate prior segregation of subjects to populate the different difficulty levels. It might be possible to formulate a heuristic approach to estimating subject difficulty based on observable properties of the clumps’ host galaxies, but that is beyond the scope of this paper.
As we discuss in Section 4.1, the consensus reliability metrics that our framework computes may enable quantitatively motivated early retirement of subjects if it can be established that a stable consensus solution has been reached. In Section 5.7, we described how our framework formulates a subject retirement criterion based on estimated metrics that are proxies for the completeness (|$N_{i}^{\mathrm{fn}}$|), purity (|$N_{i}^{\mathrm{fp}}$|), and accuracy (|$N_{i}^{\sigma }$|) of that subject’s label. Fig. 17 seems to show that more than 90 per cent of subjects fulfil this criterion, even when only n = 3 volunteers inspect each subject. However, the distributions shown in Fig. 14 appear to be noise or prior dominated for n ≲ 5, and we suggest that estimates of the subject risk |$\mathcal {R}$| and its components |$\big\lbrace N_{i}^{\mathrm{fp}},N_{i}^{\mathrm{fn}},N_{i}^{\sigma }\big\rbrace$| for that domain should be treated with some caution.
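The retirement test can be sketched as a simple conjunction of thresholds on the three risk components. The threshold values below are illustrative placeholders, not the values listed in Table 2.

```python
def should_retire(n_fp, n_fn, n_sigma,
                  t_fp=1.0, t_fn=1.0, t_sigma=1.0):
    """Retire a subject once all three estimated risk components
    (expected false positives, false negatives, and inaccurate true
    positives in its label) fall below their thresholds.
    Threshold values here are illustrative placeholders."""
    return n_fp < t_fp and n_fn < t_fn and n_sigma < t_sigma

# A subject with all components well below one would retire early;
# one expected missed clump keeps the subject in the working batch.
should_retire(0.1, 0.3, 0.02)   # True
should_retire(0.1, 1.4, 0.02)   # False
```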
In Fig. 14, we showed that discarding clumps with estimated individual false positive probabilities |$p_{l}^{\mathrm{fp}} > 0.85$| substantially reduces the number of subject labels that are expected to include one or more false positive clumps and that this number reduces rapidly once more than seven volunteers have annotated each subject.
We interpret the fact that the estimated number of missed clumps per subject (|$N_{i}^{\mathrm{fn}}$|) increases as more volunteers annotate each subject as an effect of some of those volunteers marking very faint features. Potential false-negative clumps identified by the second constrained run of the facility location algorithm (see Section 5.7) are typically on the threshold of identification by our framework, which normally means that several volunteers have marked them.21 If the fraction of highly optimistic volunteers within the overall cohort is small, then a relatively large number of volunteers must inspect each subject for faint features to reach the threshold where they are considered potential false negatives. The increase in |$N_{i}^{\mathrm{fn}}$| as the number of annotations per subject n → 20 is then an indication that more faint features are reaching, but not surpassing, our framework’s detection threshold. Fig. 15 provides an empirical estimate for the number of clumps per galaxy that are missed when fewer volunteers inspect each subject. Although the mean number of identified clumps per galaxy does increase in the interval 7 < n < 20, the rate of increase is very slow and increasing n from 7 to 20 results in just 0.5 more clumps with individual false positive probabilities |$p_{l}^{\mathrm{fp}} < 0.85$| per galaxy on average. In line with our previous observations regarding volunteer optimism, we suggest that many of these additional clumps may in fact be faint, blue features within the target galaxies. As Fig. 10 illustrates, our comparison with expert labels also suggests that n ∼ 7 provides the best compromise between the completeness and purity of our aggregated clump sample.
Empirically, it seems that at least five volunteers must inspect each subject to obtain a stable solution for the subject labels and that the majority of genuine clumps could be identified by our framework for most subjects using the annotations provided by approximately seven volunteers. Increasing the number of volunteers beyond this threshold seems to introduce more noise into the annotation data and also results in progressively fainter features being identified. Retiring the majority of subjects after inspection by seven volunteers, if it could have been well motivated, would have reduced the volunteer effort required for the Galaxy Zoo: Clump Scout project by a factor >2. Unfortunately, we must acknowledge that the reliability metrics computed by our framework do not seem to converge in a way that is useful to facilitate an early retirement decision. For most subjects, our framework predicts expected numbers of false positives, false negatives and inaccurate true positives that are less than one for any number of annotations (i.e. |$N_{i}^{\mathrm{fp}},N_{i}^{\mathrm{fn}},N_{i}^{\sigma }\ll 1\,\,\forall \, n$|) and so these subjects would have been retired when n < 7 based on the thresholds listed in Table 2. As we show in Fig. 10, retiring subjects this early would yield a lower sample completeness, even for the brighter clumps that experts also identified.
Moreover, while the predicted numbers of subject labels containing false positive or inaccurate clump locations both decrease for n ≳ 7 as n → 20, the predicted number of subject labels that are missing real clumps increases. Under any retirement criterion predicated on |$N_{i}^{\mathrm{fn}}\ll 1$|, considering the annotations from more volunteers would result in more subjects becoming stale in the working batch and therefore requiring inspection by experts. Fortunately, in the case of Galaxy Zoo: Clump Scout, the fraction of subjects for which the estimated number of false negative clumps |$N_{i}^{\mathrm{fn}} > 1$| for any n is <3 per cent of the overall data set (∼2500 subjects), so visual inspection by experts would be feasible.
8 SUMMARY AND CONCLUSION
In this paper, we have presented a software framework that uses a probabilistic model to aggregate multiple annotations that mark two-dimensional locations in images of distant galaxies and derive a consensus label based on those annotations. The annotations themselves were provided via the Galaxy Zoo: Clump Scout citizen science project by non-expert volunteers who were asked to mark the locations of giant star forming clumps within the target galaxies. Among a sample of 85 286 galaxy images that were inspected by volunteers, our software framework identified 44 126 that contained at least one visible clump and detected 128 100 potential clumps overall.
To empirically evaluate the validity of the clumps we identify, we compared our aggregated labels with annotations provided by expert astronomers for a subset of 1000 galaxy images. We found that Galaxy Zoo: Clump Scout volunteers are much more optimistic than experts, and are willing to mark much fainter features as potential clumps, particularly if those features appear blue in colour. However, volunteers also mark the vast majority of bright clumps that experts identify, so although the sample of clumps we identify is ∼50 per cent contaminated with respect to the expert identifications, it is |${\gtrsim}90\ \hbox{per cent}$| complete.
In addition to our empirical evaluation, we have used the statistical model that underpins our framework to compute quantitative metrics for the reliability of the overall aggregated labels that we derive for each image. These metrics suggest that a stable consensus label for most images is achieved after ∼7 volunteers have annotated each one, which is <50 per cent of the 20 annotations that were collected for each image via Galaxy Zoo: Clump Scout and would represent a significant saving in volunteer effort. However, the annotation data are quite noisy, with large variation between the numbers of locations that are marked by different volunteers, and this noise makes it difficult to define a robust ‘early retirement’ criterion that could be used to safely curtail collection of annotations before 20 have been acquired.
We suggest that the noisy annotation data reflect the fact that inexperienced non-experts find the task of identifying clumps difficult, or that the task was not properly explained. In Section 7, we discuss how different approaches to volunteer training could be used to help volunteers better distinguish the visible characteristics of genuine clumps from those of the faint blue features that many ultimately marked. On the other hand, one of the benefits of using citizen science to identify clumps is that it avoids being overly prescriptive regarding the definition of a clump. Galaxy Zoo: Clump Scout represents the first extensive wide-field search for clumpy galaxies in the local Universe and it may be that low-redshift clumps have different properties to their more distant counterparts. Using strict thresholds on brightness or colour might result in an unexpected population of fainter clumps being missed. Moreover, the sample of clumps identified by volunteers appears to be very complete and so, if a subset of bright clumps is required for science analysis, such a sample can be straightforwardly constructed using photometric measurements for each clump (e.g. Adams et al. 2022).
Although our framework was developed to aggregate annotations for a specific citizen science project, its applicability is more general. A large number of projects running on the Zooniverse platform collect two-dimensional image annotations. Many of those projects consider subjects that are more familiar to non-experts and may be less prone to noise. In such cases, our framework may be able to substantially reduce the amount of effort and time taken to reach consensus for each subject.
ACKNOWLEDGEMENTS
HD and SS were partly supported by the ESCAPE project; ESCAPE – The European Science Cluster of Astronomy & Particle Physics ESFRI Research Infrastructures has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement no. 824064. SS also thanks the Science and Technology Facilities Council for financial support under grant ST/P000584/1. MW gratefully acknowledges support from the Alan Turing Institute, grant reference EP/V030302/1. This research is partially supported by the National Science Foundation under grants AST 1716602 and IIS 2006894. This material is based upon work supported by the National Aeronautics and Space Administration (NASA) under Grant No. HST-AR-15792.002-A. This publication uses data generated via the Zooniverse.org platform, development of which is funded by generous support, including a Global Impact Award from Google, and by a grant from the Alfred P. Sloan Foundation.
This research made use of the open-source python scientific computing ecosystem, including NumPy (Harris et al. 2020), Matplotlib (Hunter 2007), and Pandas (McKinney 2010). This research made use of Astropy, a community-developed core Python package for Astronomy (The Astropy Collaboration et al. 2018). This research made use of Numba (Lam, Pitrou & Seibert 2015).
DATA AVAILABILITY
The data underlying this article were used in Adams et al. (2022) and can be obtained as a machine-readable table by downloading the associated article data from https://doi.org/10.3847/1538-4357/ac6512.
Footnotes
A negative response corresponds to selecting the answer ‘Features or disk.’
This minimum size criterion is designed to handle galaxies that have very small, incorrectly measured Petrosian radii.
Even if volunteers are able to submit such nearby marks, our algorithm is designed to only recognize one of them. The choice of which nearby clicks to discard depends on the clicks provided by other volunteers.
Specifying a conjugate prior π(θ) for parameter θ in Bayes’s rule yields a posterior distribution p(θ|z) ∝ π(θ) · p(z|θ) that has the same functional form as the prior itself. Note that in general the conjugate prior depends on both the likelihood model and the parameter of interest. For example, the variance and mean of a Gaussian likelihood function have different conjugate priors.
Although our implementation does not explicitly limit batch sizes in practice, we found that model data storage requirements for batches containing ≳25 thousand classifications exhausted the 32 GB memory capacity of our available hardware.
The chosen algorithm implements approximate computation of the maximum log-likelihood solution and is guaranteed to find a solution for which the log-likelihood is at most 1.61 times the optimal one.
This nomenclature reflects a common application of facility location algorithms to optimize distribution of some essential commodity from facilities located at a small number of locations within a larger network of cities.
Concretely, let |$\mathbf {r}$| be a generalized two-dimensional coordinate and the index m enumerate the corner vertices of a box, beginning in the upper-left and proceeding along the box edges in a clockwise direction, then
Recall that insufficient accuracy implies that the Jaccard distance between the estimated and true clump locations is likely to exceed the value of the hyper-parameter δ.
As described in Section 3.3, volunteer boxes have a side-length equal to twice the FWHM of the subject’s PSF, and may have different absolute pixel dimensions. When computing |$N_{i}^{\mathrm{fn}}$|, we account for this by using normalized image coordinates {x′, y′} ≡ {x/xmax , y/ymax } to define box boundaries when we compute the Jaccard distance between boxes in the global set.
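A minimal sketch of the geometry described here, assuming boxes are axis-aligned and represented as (x_min, y_min, x_max, y_max) tuples in normalized image coordinates. The tuple representation is our assumption for illustration; the text specifies only the side length and the normalization.

```python
def click_to_box(x, y, fwhm):
    """Expand a volunteer click at (x, y) into a square box whose side
    equals twice the subject PSF's FWHM, in normalized coordinates."""
    return (x - fwhm, y - fwhm, x + fwhm, y + fwhm)

def jaccard_distance(a, b):
    """Jaccard distance (1 - intersection over union) between two
    axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 1.0 - inter / union if union > 0.0 else 1.0

# Identical boxes have distance 0; disjoint boxes have distance 1.
box = click_to_box(0.5, 0.5, 0.05)
jaccard_distance(box, box)  # 0.0
```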
Recall (Section 4.2) that we define an annotation |$z_{ij}=\big\lbrace b_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| to be the set of box markings provided by a particular volunteer when they inspect a particular subject, so the number of annotations is generally less than the size of the working batch.
Note that the labels for each subject may in principle depend on all annotations in |$\tilde{Z}_{n}$| via those annotations’ influence on the volunteers’ skill parameters.
See Section 3 in Adams et al. (2022) for a detailed explanation of how clump fluxes are computed.
This skewed non-uniform distribution for the number of annotations per volunteer is also seen in many other Zooniverse projects (e.g. Spiers et al. 2019).
Indeed, for this reason Adams et al. (2022) apply a very permissive |$p^{\mathrm{fp}}_{l}$| threshold before filtering further based on observable clump parameters.
The precise number of marks required depends on the skill parameters of the volunteers who provide them.
REFERENCES
APPENDIX A: MODEL PARAMETER PRIORS
In this section, we derive formulae that we use to compute and update the priors for our likelihood model’s volunteer skill and subject difficulty parameters. Crucially for the efficiency of our framework, these formulae can all be factored into terms that depend only on the current working batch and terms that depend only on prior information. This allows us to straightforwardly update the skill parameters of returning volunteers without having to reconsider the annotations they contributed to previous working batches.
A1 Beta priors for pfp and pfn
Our model for volunteer skill assumes that the event in which volunteer j incorrectly provides a false positive clump annotation is Bernoulli |${\rm Bern}\big(p_{j}^{\mathrm{fp}}\big)$|, and similarly the event that a volunteer misses a real clump is |${\rm Bern}\big(p_{j}^{\mathrm{fn}}\big)$|. Note that in general |$p_{j}^{\mathrm{fp}} \ne 1 - p_{j}^{\mathrm{fn}}$|.
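Because the Beta distribution is conjugate to the Bernoulli likelihood, the posterior over |$p_{j}^{\mathrm{fp}}$| (or |$p_{j}^{\mathrm{fn}}$|) is again a Beta whose parameters simply accumulate event counts. A minimal sketch; the hyperparameter values and the flat Beta(1, 1) starting prior are chosen purely for illustration.

```python
def beta_update(alpha, beta, n_events, n_trials):
    """Conjugate update for a Bernoulli rate: a Beta(alpha, beta) prior
    combined with n_events occurrences in n_trials yields a
    Beta(alpha + n_events, beta + n_trials - n_events) posterior."""
    return alpha + n_events, beta + n_trials - n_events

def beta_mean(alpha, beta):
    """Posterior mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# e.g. a volunteer who produced 3 false positives over 40 clump marks,
# starting from a flat Beta(1, 1) prior:
a, b = beta_update(1.0, 1.0, 3, 40)   # Beta(4, 38)
p_fp_hat = beta_mean(a, b)            # ~0.095
```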
A2 Scaled inverse χ2 priors for σ2
We use scaled inverse χ2 priors to compute posterior distributions over the variance parameters of our Gaussian models for dj and di, l.
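For a Gaussian likelihood with known (zero) mean, the scaled inverse χ2 prior is conjugate for the variance, so the posterior parameters accumulate the observed squared offsets. The sketch below assumes that simplified zero-mean setting, with illustrative prior hyperparameters; it is not the paper's full factored update.

```python
import numpy as np

def scaled_inv_chi2_update(nu0, sigma0_sq, offsets):
    """Conjugate update for the variance of a zero-mean Gaussian:
    a Scaled-Inv-chi2(nu0, sigma0_sq) prior combined with observed
    offsets d gives nu_n = nu0 + n and
    sigma_n_sq = (nu0 * sigma0_sq + sum(d**2)) / nu_n."""
    d = np.asarray(offsets, dtype=float)
    nu_n = nu0 + d.size
    sigma_n_sq = (nu0 * sigma0_sq + np.sum(d ** 2)) / nu_n
    return nu_n, sigma_n_sq

# Two observed Jaccard-distance offsets update the variance estimate:
nu_n, s_n = scaled_inv_chi2_update(2, 0.1, [0.1, 0.2])  # nu_n = 4, s_n ~ 0.0625
```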
APPENDIX B: DOMAIN-SPECIFIC TERMS
In this section, we provide short definitions for some of the potentially unfamiliar terms that are used in this paper.
Subject A single image of a galaxy that volunteers are shown via the Zooniverse platform and inspect to search for clumps.
Volunteer A member of the public who participated in the Galaxy Zoo: Clump Scout project by inspecting one or more subjects and using the Zooniverse platform interface to search for and annotate the locations of clumps.
Annotation A set of click locations provided by a single volunteer as they inspect a single subject. The click locations are later expanded into a set of square boxes as explained in Section 4.2.
Label A set of zero or more rectangular bounding boxes, derived by our aggregation framework for a single subject image, that estimates the locations of any clumps it contains.
Skill A compound metric, describing a particular volunteer, that estimates the probability that they will mark a spurious clump, the probability that they will miss a real clump, and the accuracy of the locations they provide for any real clumps they mark.
Difficulty A quantitative metric for the degree to which the properties of a single subject image affect the ability of volunteers to perceive and accurately label any clumps it contains.
Risk A metric that is designed to quantify the reliability and scientific utility of a single subject’s consensus label.
Retire Stop collecting annotations for a subject.
APPENDIX C: TABLE OF SYMBOLS
In this section, we provide a reference table for symbols that recur in multiple sections of this paper.
Table of the most commonly recurring symbols used in this paper. We divide the symbols into categories and provide a brief description of how they should be interpreted. Complete descriptions of each symbol are provided in the main text at the point they are first introduced.
Category | Symbol | Description
---|---|---
Object indices | i | Index over subjects. |
j | Index over volunteers. | |
k | Index over a volunteer j’s clump identifications for a single subject. | |
l | Index over aggregated clump locations. | |
Subjects | S | The global set of subject images. |
Sj | The set of subject images inspected by volunteer j. | |
si | A single subject image in S. | |
|$\mathcal {R}_{i}$| | The risk for subject si. | |
|$N_{i}^{\mathrm{fp}}$| | The expected number of spurious clump locations (false positives) in the label for subject i. | |
|$N_{i}^{\mathrm{fn}}$| | The expected number of missed clumps (false negatives) in the label for subject i. | |
|$N_{i}^{\sigma }$| | The expected number of nominally true positive clump locations in the label for subject i that differ from the (unknown) true clump location by a Jaccard distance greater than 0.5. | |
Subject difficulties | |${\sigma _{i}^{l}}^{2}$| | The variance of a Gaussian model for the Jaccard distance offset between the estimated location of the lth detected clump for subject i and its corresponding (unknown) true location. |
|$\mathcal {D}_{i}$| | The difficulty of subject i, defined as the set of |${\sigma _{i}^{l}}^{2}$| values for all detected clumps in the image. | |
Volunteers | V | The global set of volunteers. |
Vi | The subset of volunteers who inspected subject i. | |
Volunteer skills | |$p_{j}^{\mathrm{fp}}$| | The probability that volunteer j will click on a spurious clump. |
|$p_{j}^{\mathrm{fn}}$| | The probability that volunteer j will miss a real clump. | |
|$\sigma _{j}^{2}$| | The variance of a Gaussian model for the Jaccard distance offset between volunteer j’s true positive click locations and the corresponding (unknown) true clump locations, independent of subject. | |
|$\mathcal {S}_{j}$| | The skill of volunteer j defined as the set |$\lbrace p_{j}^{\mathrm{fp}}, p_{j}^{\mathrm{fn}}, \sigma _{j}^{2}\rbrace$|. | |
Annotations | Z | The global set of volunteer annotations. |
Zi | The set of annotations for a single subject image provided by all the volunteers who inspected it. | |
|$\tilde{Z}_{n}$| | A randomly selected subset of Z containing exactly n annotations per subject. | |
zij | A single annotation provided by volunteer j after inspecting subject i. | |
Bij | The set of boxes, corresponding to click locations provided by volunteer j for subject i. | |
Bi | The set of all boxes, corresponding to click locations provided for subject i by all volunteers who inspected it. | |
|$b_{ij}^{k}$| | A single box, corresponding to the location of a single click provided by volunteer j for subject i. | |
|${\sigma _{ij}^{k}}^{2}$| | The variance of a Gaussian model for the Jaccard distance offset between volunteer j’s kth true positive click location for subject i, and its corresponding (unknown) true clump location. | |
|$a_{ij}^{k}$| | An integer value that maps the kth click in volunteer j’s annotation of subject i to a specific clump in that subject’s estimated label (or to the dummy facility if it is deemed to be a false positive). | |
Labels | Y | The global set of subject labels. |
yi | The unknown true label for subject i. | |
|$b_{i}^{l}$| | A single box comprising part of the unknown true label for subject i. | |
|$\hat{y}_{i}$| | The estimated label for subject i that is computed by our framework. | |
|$\hat{b}_{i}^{l}$| | A single box comprising part of the estimated label for subject i. | |
|$p^{\mathrm{fp}}_{l}$| | The probability that the lth clump in the estimated label for a subject is a false positive. | |
|$p^{\sigma }_{l}$| | The probability that the Jaccard distance between the lth clump in the estimated label and the corresponding (unknown) true clump location exceeds 0.5. |
Table of the most commonly recurring symbols used in this paper. We divide the symbols into categories and provide a brief description of how they should be interpreted. Complete descriptions of each symbol are provided in the main text at the point they are first introduced.
APPENDIX D: COMPARISON WITH THE SCIKIT-LEARN MEANSHIFT CLUSTERING ALGORITHM
We emphasize that the aim of this paper is not to present a novel and very complicated clustering algorithm. Indeed, our focus is the likelihood model that we use to estimate the Galaxy Zoo: Clump Scout volunteers' skills, the difficulty of the subjects that they inspect, and the reliability of the consensus labels that we derive. None the less, we recognize that there are many well-established clustering algorithms in the literature and that some of them may outperform our framework's ability to actually detect clumps, even if they cannot provide the same auxiliary information about the final subject labels. Presenting an exhaustive comparison between our framework and every alternative algorithm is beyond the scope of this paper. However, we have tested several of the methods available from the scikit-learn Python package (Pedregosa et al. 2011). In this section, we present a representative comparison between our framework and the scikit-learn MeanShift clustering algorithm. We set the MeanShift algorithm's bandwidth parameter equal to the size of the SDSS imaging PSF for each subject image and left all other parameters at their default values.
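The MeanShift configuration described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the click coordinates are simulated and the PSF width is an assumed placeholder value, whereas the real comparison uses the volunteer clicks and per-subject SDSS PSF size for each image.

```python
# Minimal sketch of running scikit-learn's MeanShift on volunteer click
# locations, with the bandwidth set to the imaging PSF size and all other
# parameters left at their defaults.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(42)

# Simulated volunteer clicks: two tight groups of (x, y) pixel positions,
# standing in for clicks on two clumps in one subject image.
clicks = np.vstack([
    rng.normal(loc=(50.0, 60.0), scale=1.5, size=(15, 2)),
    rng.normal(loc=(120.0, 80.0), scale=1.5, size=(12, 2)),
])

psf_fwhm_pixels = 3.5  # assumed PSF size in pixels for this illustration

# Bandwidth equal to the PSF size; every other parameter at its default.
ms = MeanShift(bandwidth=psf_fwhm_pixels)
labels = ms.fit_predict(clicks)

print(ms.cluster_centers_)  # one consensus (x, y) location per detected group
```

Each row of `cluster_centers_` plays the role of one detected clump location, which is what the per-subject clump counts compared below are based on.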
Fig. D1 shows the distribution of the difference between the number of clumps detected by our framework and the number detected using the MeanShift algorithm for each subject in the Galaxy Zoo: Clump Scout subject set. For the majority of subjects, our framework detects more clumps than the MeanShift algorithm. In Fig. D2, we show some representative subjects for which our framework detects more clumps than the MeanShift algorithm and in Fig. D3, we show subjects for which the reverse is true. It is not obvious from these figures that either algorithm is particularly biased towards detecting clumps with specific properties. There is some evidence that our algorithm detects fainter potential clumps than the MeanShift algorithm, and it seems less vulnerable to misidentifying objects like stars and background galaxies as clumps. Even when such objects are detected by our framework, they tend to be assigned false positive probabilities greater than 0.8. In some cases, our framework fails to detect clumps that many volunteers identify. We speculate that this is a result of a small number of volunteers with very high $p_{j}^{\mathrm{fp}}$ identifying the clump, which causes our framework to deem other volunteers' clicks as false positives as well.
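The per-subject comparison underlying Fig. D1 amounts to a simple tally. The sketch below uses hypothetical subject identifiers and clump counts purely to illustrate the bookkeeping; the real numbers come from the two detection runs over the full subject set.

```python
# Illustrative sketch: per-subject difference between the number of clumps
# found by our framework and by MeanShift, histogrammed as in Fig. D1.
from collections import Counter

# Hypothetical clump counts keyed by subject id (placeholders, not real data).
framework_counts = {"subj-001": 3, "subj-002": 1, "subj-003": 2}
meanshift_counts = {"subj-001": 2, "subj-002": 1, "subj-003": 1}

# Positive values mean our framework detected more clumps than MeanShift
# for that subject; negative values mean the reverse.
differences = {
    sid: framework_counts[sid] - meanshift_counts[sid]
    for sid in framework_counts
}

# Frequency of each count difference across subjects.
histogram = Counter(differences.values())
print(histogram)  # -> Counter({1: 2, 0: 1})
```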

Distribution of the difference between the number of clumps detected by our framework and the number detected using the MeanShift algorithm for each subject in the Galaxy Zoo: Clump Scout subject set.

Examples of clump-hosting galaxies for which our framework detects more clumps than the scikit-learn MeanShift algorithm. The first column shows galaxy images as they were seen by volunteers. The second column overlays all volunteer annotations on a grey-scale image of the same galaxy. The coloured boxes in the third column show the clump locations that our framework identifies. Dashed boxes indicate clumps to which our framework assigns false positive probabilities $p_{l}^{\mathrm{fp}} > 0.8$. Finally, the red circles in the fourth column show the clumps detected by the MeanShift algorithm.

Examples of clump-hosting galaxies for which our framework detects fewer clumps than the scikit-learn MeanShift algorithm. The images, boxes, and circles shown in the various columns have the same meaning as in Fig. D2.