Hugh Dickinson, Dominic Adams, Vihang Mehta, Claudia Scarlata, Lucy Fortson, Stephen Serjeant, Coleman Krawczyk, Sandor Kruk, Chris Lintott, Kameswara Bharadwaj Mantha, Brooke D Simmons, Mike Walmsley, Galaxy Zoo: Clump Scout – Design and first application of a two-dimensional aggregation tool for citizen science, Monthly Notices of the Royal Astronomical Society, Volume 517, Issue 4, December 2022, Pages 5882–5911, https://doi.org/10.1093/mnras/stac2919
ABSTRACT
Galaxy Zoo: Clump Scout is a web-based citizen science project designed to identify and spatially locate giant star-forming clumps in galaxies that were imaged by the Sloan Digital Sky Survey Legacy Survey. We present a statistically driven software framework that is designed to aggregate two-dimensional annotations of clump locations provided by multiple independent Galaxy Zoo: Clump Scout volunteers and generate a consensus label that identifies the locations of probable clumps within each galaxy. The statistical model on which our framework is based allows us to assign false-positive probabilities to each of the clumps we identify, to estimate the skill levels of each of the volunteers who contribute to Galaxy Zoo: Clump Scout, and to quantitatively assess the reliability of the consensus labels that are derived for each subject. We apply our framework to a data set containing 3 561 454 two-dimensional points, which constitute 1 739 259 annotations of 85 286 distinct subjects provided by 20 999 volunteers. Using this data set, we identify 128 100 potential clumps distributed among 44 126 galaxies. This data set can be used to study the prevalence and demographics of giant star-forming clumps in low-redshift galaxies. The code for our aggregation software framework is publicly available at https://github.com/ou-astrophysics/BoxAggregator
1 INTRODUCTION
One of the main goals for modern observational cosmology is to discover and understand how galaxies and their constituent substructures have assembled and evolved throughout cosmic history.
During the last two decades, a large body of observational data has been assembled, which shows strong evidence for a substantial evolution in the dominant mode of star formation in galaxies between z ∼ 3 and 0.2 (e.g. Madau & Dickinson 2014; Murata et al. 2014; Guo et al. 2015; Shibuya et al. 2016; Guo et al. 2018).
Early observations using the Hubble Space Telescope (HST) revealed that typical massive galaxies (|$M_{\star }\gtrsim 10^{10}\mathrm{M}_{\odot }$|), populating the z ∼ 2 star forming main sequence (Noeske et al. 2007), exhibit thick, gas-rich, clumpy discs with star formation rates |$\dot{M}_{\star }\sim 100\mathrm{M}_{\odot }\, \mathrm{yr}^{-1}$| (e.g. Elmegreen, Elmegreen & Sheets 2004a; Elmegreen, Elmegreen & Hirst 2004b; Genzel et al. 2011). Many of these z ∼ 2 galaxies were found to exhibit discrete, subgalactic regions of enhanced star formation (hereafter referred to as ‘clumps’) with apparent radii |${\lesssim}1\,{\rm kpc}$| and stellar masses |$M_{\star } \gtrsim 10^7\mathrm{M}_{\odot }$| (Elmegreen 2007). More recent evidence suggests that these clumps may in fact be aggregations of smaller substructures that could not be resolved by HST (e.g. Wuyts et al. 2014; Fisher et al. 2017; Dessauges-Zavadsky & Adamo 2018), but this remains to be confirmed. The prevalence of giant star-forming clumps at high redshift and the overall characteristics of their host galaxies are in stark contrast with the thin, uniform and generally quiescent (|$\dot{M}_{\star }\sim 1\,\mathrm{M}_{\odot }\, \mathrm{yr}^{-1}$|) disc morphologies that prevail among star-forming galaxies in the local Universe (e.g. Simard et al. 2011; Willett et al. 2013a).
The mechanisms that drove this evolution of star formation activity, their onset epochs and the time-scales over which they operated, remain to be fully established. If they can be accurately determined, the abundances of clumps within galaxies at different redshifts, together with their spatial distributions and intrinsic properties, provide obvious diagnostics for the transition from clumpy to more diffuse star formation. Historically, the most extensive surveys of clumpy star formation have relied on HST imaging and focused on intermediate and high-redshift galaxies (e.g. Murata et al. 2014; Guo et al. 2015; Guo et al. 2018). A common conclusion of these studies is that the overall fraction of massive (|$M_{\star }\gtrsim 10^{9.5}\mathrm{M}_{\odot }$|), clumpy star forming galaxies decreases rapidly for z ≲ 2 and falls below ∼5 per cent by z ∼ 0.2.
The scarcity of clumpy galaxies in the local Universe makes the task of identifying them in large numbers much more challenging, and related studies at low redshift have entailed focused investigations of small samples containing ∼50 galaxies, or fewer (see, however, Mehta et al. 2021). Identifying enough low-redshift clumpy galaxies to enable accurate inference of their overall population demographics and characteristics requires wide-field imaging surveys that encompass a large fraction of the sky and a reliable method for discovering candidate systems. In recent years, extensive ground-based surveys like the Sloan Digital Sky Survey Legacy Survey (SDSS; York et al. 2000) and the Dark Energy Camera Legacy Survey (DECaLS; Dey et al. 2019) have delivered publicly available wide-field imaging data that make systematic searches for large numbers of low-redshift clumpy galaxies possible. Galaxy Zoo: Clump Scout (Adams et al. 2022) is a citizen science project that used SDSS imaging data and was designed to let volunteers from the general public identify clumpy galaxies and the clumps they contain. Multiple volunteers inspect images of galaxies and provide two-dimensional annotations marking the locations of any clumps the galaxies contain.
One of the most challenging aspects of collecting data using a citizen science approach is calibrating the reliability of the responses that volunteers provide. Translating astrophysical analyses into a citizen science context can be difficult because the subject matter and related concepts are often not familiar to non-experts. This unfamiliarity can result in annotations that are noisy with large variations between the responses of different volunteers. The traditional approach for mitigating such noise is to collect a large number of independent annotations and derive an average result representing the overall consensus between volunteers. This has two obvious disadvantages: firstly, volunteer effort may be wasted if more responses are accumulated than are actually required to mitigate the variation between responses and secondly, even after a large number of responses have been collected, there is no formal guarantee that the consensus is accurate or sufficiently precise.
To address these issues, more quantitative approaches have been developed that attempt to infer statistical estimates for the reliability of consensus derived from citizen science annotations and classifications. For example, Marshall et al. (2016) developed the Space Warps Analysis Pipeline (SWAP) which used a binomial model for a simple true-or-false response to derive a Bayesian estimate for the probability that astrophysical images included signatures of strong gravitational lensing. The SWAP algorithm was also used by Wright et al. (2017) to accelerate consensus for citizen-science classification of potential supernova flashes and assign false-alarm probabilities to candidate events. Later, Beck et al. (2018) showed that applying SWAP to galaxy morphology labels collected via the Galaxy Zoo platform (Lintott et al. 2008; Willett et al. 2013b) increased the rate of classification by 500 per cent and reduced the volunteer effort that was required by a factor of ∼6.5, relative to the Galaxy Zoo standard requirement for 40 volunteers to inspect each galaxy.
In this paper, we build on the principle of SWAP and develop an aggregation approach to derive quantitative estimates for the reliability of two-dimensional labels of clump locations within galaxies based on annotations provided by Galaxy Zoo: Clump Scout volunteers. Like SWAP, we rely on a statistical model to derive probabilistic estimates for several quantities that determine the reliability of a label that represents the consensus of multiple independent annotations. Two-dimensional annotations are more complex than the simple binary classification tasks that SWAP was designed to process, and our statistical model is necessarily also more complicated. We base our approach on a method that was initially presented by Branson, Van Horn & Perona (2017) (hereafter BVP17), who tested their algorithm on small and relatively noise-free annotation data sets that contained a few thousand annotations and were collected from paid workers on the Amazon Mechanical Turk platform.1 We have developed a new implementation of this algorithm that is computationally efficient enough to process millions of independent annotations provided for tens of thousands of images by the Galaxy Zoo: Clump Scout volunteers. Our goal is to find out whether this algorithm can be used successfully to derive complicated two-dimensional labels with quantitative reliability estimates in a mass-participation citizen-science context using noisy annotations provided by a cohort of non-expert volunteers. We also aim to determine whether the reliability estimates we derive can be used to accelerate the labelling process and reduce the amount of volunteer effort that is required to accurately label the clumps in each galaxy.
The remainder of this paper is organized as follows: In Section 2, we describe how the imaging data presented to volunteers in Galaxy Zoo: Clump Scout were selected and prepared. In Section 3, we outline the annotation workflow that volunteers used to annotate the images and the training they received. In Section 4, we provide details of the statistical model that underpins our aggregation algorithm. In Section 5, we explain how our algorithm actually computes the labels it derives. In Section 6, we present the results of applying our algorithm to the Galaxy Zoo: Clump Scout data and analyse the quantitative reliability metrics that are generated. In Section 7, we discuss the implications of these results in the context of the goals outlined above and the suitability of citizen science as a method for complex astrophysical image analysis. Finally, in Section 8, we summarize our findings and conclude.
2 DATA
In this section, we briefly describe the galaxy selection criteria and the image preparation pipeline used for Galaxy Zoo: Clump Scout. A much more detailed description is provided by Adams et al. (2022).
2.1 Galaxy image selection
The galaxy images used in Galaxy Zoo: Clump Scout comprise three subsets of the sample that was visually inspected and morphologically classified by volunteers contributing to the Galaxy Zoo 2 (GZ2) citizen science project (Willett et al. 2013a). The criteria that were used to select these subsets are described in detail in Adams et al. (2022). For convenience, this section summarizes the most relevant properties of the galaxies that were inspected by the Galaxy Zoo: Clump Scout volunteers.
A primary sample of 53 613 galaxies with 0.02 ≤ z ≤ 0.25 was selected based on the morphological labels provided by GZ2 volunteers. We anticipated that the presence of obvious star-forming clumps in images of smooth elliptical galaxies was very unlikely so for this primary sample, we limited our selection to galaxies for which more than 50 per cent of volunteers responded negatively2 to the question ‘Is the galaxy simply smooth and rounded, with no sign of a disk?’.
To estimate the number of clumpy galaxies that were observed by SDSS, but which were excluded from our primary sample, we also include a smaller, secondary sample. This sample contains 4937 galaxies for which fewer than 50 per cent of GZ2 volunteers identified features or a disc and was selected within a more restricted redshift range 0.02 ≤ z ≤ 0.075.
Finally, Galaxy Zoo: Clump Scout volunteers also annotated a sample of 26 736 galaxies matching the selection criteria used for the primary sample, but which had simulated emission from clumps with known photometric and physical properties superimposed (see Adams et al. 2022, for details of the simulation procedure). Annotations of these simulated clumps were used by Adams et al. (2022) to derive an estimate of the Galaxy Zoo: Clump Scout sample completeness for clumps with specified photometric properties.
Stellar mass estimates for galaxies in all three samples were taken from the SDSS DR7 MPA-JHU value-added catalogue (Kauffmann et al. 2003; Brinchmann et al. 2004). All three samples include galaxies with stellar masses |$10^{8.5}\mathrm{M}_{\odot } \lesssim M_{\star } \lesssim 10^{12}\mathrm{M}_{\odot }$|.
2.2 Galaxy image preparation
3 COLLECTING ANNOTATIONS
To identify the locations of clumps within their host galaxies, we designed a web-based citizen science project using the Zooniverse project builder interface.4
3.1 Volunteer training
For non-expert volunteers, identifying genuine clumps among the potentially complex features of their host galaxies can be daunting. To improve volunteers’ confidence and help them to provide accurate annotations, we provided several pedagogical and training resources. Following the approach of other Zooniverse projects, we designed a detailed practical tutorial explaining each step of the annotation workflow. This tutorial was automatically presented to volunteers when they joined the project and remained available for reference thereafter. Additional reference images and explanatory text were provided using the Field Guide feature of the Zooniverse interface. A separate About section of the project provided pedagogical material explaining the scientific motivation of the project. Finally, to guide the progress of first-time volunteers, we provided expert labels for a small subset of our galaxy images. Ten such images were interspersed with decreasing frequency among the first ∼20 subjects that each volunteer inspected. We implemented a system to provide real-time feedback for volunteer annotations of expert-labelled galaxy images and inform them if they missed genuine clumps or mistakenly annotated an object that experts had disregarded. This feedback system was designed to refine volunteers’ expectations regarding the visual appearance of genuine clumps during the early stages of their engagement with the project.
3.2 The annotation workflow
Volunteers following the Galaxy Zoo: Clump Scout workflow inspect a sequence of single subject galaxy images (hereafter ‘subjects’) that are randomly drawn from a global subject set. The subject selection ensures that no volunteer inspects the same image more than once and each subject is inspected by a group of approximately 20 volunteers. Each volunteer first annotates the two-dimensional location of the central bulge of the central galaxy in the image if it is visible, before proceeding to annotate the locations of any clumps they can discern. To mitigate against the possibility that volunteers would disregard genuine clumps with appearances that confound their expectations, we provided an opportunity to mark clumps as ‘unusual’. We investigate the impact of including or discarding this unusual clump subset in Section 6.
The full Galaxy Zoo: Clump Scout data set contains 3 561 454 click locations, which constitute 1 739 259 annotations of 85 286 distinct subjects provided by 20 999 volunteers.
3.3 Initial annotation processing
We expect that even the largest individual clumps will be at best marginally resolved for the lowest redshift galaxies in our data sample. This implies that almost all clumps will appear as point sources with a light profile equal to the instrumental point-spread function (PSF). Our data preparation procedure (Section 2.2) results in subject images that have different pixel sampling of the PSF depending on the angular size of the central host galaxy. To account for this fact, we transform the two-dimensional point estimates for clump locations that volunteers provide into square boxes with side-length equal to twice the full width at half maximum (FWHM) of the pertinent subject’s PSF. Assigning a finite, instrumentally motivated clump extension allows us to identify groups of volunteer clicks with separations that are smaller than the PSF. A prior assumption of our data aggregation approach is that it is impossible for a single volunteer to mark separate clumps within the same subject that are closer than twice the PSF FWHM.5 It is likely that any such multiplets that volunteers do provide represent noise peaks in contrast-enhanced subject images or are simply accidents. In Section 4, we describe how our aggregation algorithm effectively deduplicates multiple nearby annotations by individual volunteers.
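The point-to-box conversion described above can be sketched as follows. This is a minimal illustration in which the function name and the box representation (corner coordinates in pixels) are our own assumptions, not the released BoxAggregator interface:

```python
def click_to_box(x, y, psf_fwhm):
    """Expand a volunteer click at (x, y) into a square box of
    side-length 2 * psf_fwhm, returned as (x_min, y_min, x_max, y_max).

    All quantities are in pixel units of the subject image, so the box
    size tracks the image-specific sampling of the PSF.
    """
    half_side = psf_fwhm  # half of the 2 * FWHM side-length
    return (x - half_side, y - half_side, x + half_side, y + half_side)
```

For example, a click at (10, 20) in an image whose PSF FWHM is 1.5 pixels becomes the box (8.5, 18.5, 11.5, 21.5).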
3.4 A scale-free distance metric

Geometric illustration of the ratio between the area of the intersection between two boxes (dotted region) and the area of their union (dashed region). We use the complement of this ratio as a scale-free distance metric bounded between zero and unity.
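The metric illustrated in the caption is the Jaccard distance: one minus the intersection-over-union of the two boxes. A sketch for axis-aligned boxes stored as (x_min, y_min, x_max, y_max) tuples (our own hypothetical representation) might look like:

```python
def jaccard_distance(box_a, box_b):
    """Complement of intersection-over-union for two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max). Returns a value in [0, 1]:
    0 for identical boxes, 1 for disjoint boxes.
    """
    # Corners of the intersection rectangle (if any)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return 1.0 - inter / union
```

Identical boxes give 0, disjoint boxes give 1, and partial overlaps fall in between, independent of the absolute box scale.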
4 DATA AGGREGATION MODEL
The core of our data aggregation approach is based on a custom implementation of the probabilistic model and algorithm proposed by BVP17. In this section, we present a detailed description of the model, and explain how it is used to optimize the efficiency of clump detection using the volunteers’ annotations. We recognize that this paper contains a lot of somewhat complicated notation, so to aid the reader we have included a reference table of the most commonly recurring symbols in Table C1.
4.1 Overview
We construct a global model that simultaneously considers NS individual elements of the full subject set |$S\equiv \lbrace s_{i}\rbrace _{i=1}^{N_{\mathrm{S}}}$| and individual members of the entire volunteer cohort V. Each subject si ∈ S is inspected by a randomly selected group of volunteers Vi ⊆ V, who each provide a set of independent two-dimensional annotations of visible clump locations |$Z_{i}\equiv \lbrace z_{ij}\rbrace _{j=1}^{|V_{i}|}$|. Throughout this paper, we will use the notation |X| to denote the number of elements in the set X, so here |Vi| denotes the number of volunteers who annotate the subject si. For convenience, we define Sj ⊆ S to denote the subset of subjects that are inspected by the jth volunteer. For every subject si, we define a true label yi to encode the unknown locations of all real clumps in the image. Using only the information provided by the global set of volunteer annotations |$Z\equiv \bigcup _{i}Z_{i}$|, we wish to derive a separate estimated label |$\hat{y}_{i}$| for each subject that closely approximates yi. Our goal is to minimize the mismatch between |$\hat{y}_{i}$| and yi, while keeping the number of volunteers who annotate the subject si as small as possible, and thereby to optimize our use of volunteers’ effort. We facilitate this aim by computing a ‘risk’ metric |$\mathcal {R}_{i}$| for each subject that represents a weighted combination of quantitative magnitude estimates for several sources of approximation error in the estimated label (see Section 5.7 for more details). We expect that the risk for a particular subject will decrease as the number of volunteer annotations for that subject increases. Accordingly, by choosing an appropriate global risk threshold |$\mathcal {R}_{i} < \tau$|, we aim to be able to confidently retire individual subjects from the classification pool as soon as the expected error is acceptably small.
This approach differs from many traditional crowd-sourcing techniques, which require a fixed number of volunteers to inspect each subject. Such approaches are generally less efficient because stable consensus between volunteers is often achieved before the prescribed number of annotations have been gathered. An additional benefit of our approach is that particularly difficult subjects can be segregated for expert inspection if their risk remains high after many volunteers have inspected the subject.
4.2 Associating subject annotations with subject labels
Each of the volunteer annotations zij ∈ Zi forms a set of |Bij| ≥ 0 square boxes |$z_{ij}=\big\lbrace b_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| that encodes the locations of any clumps that the volunteer perceived in the subject si. Analogously, we model the true clump locations for si as an abstract set of |Bi| ≥ 0 rectangular boxes such that |$y_{i}\equiv \big\lbrace b_{i}^{l}\big\rbrace _{l=1}^{|B_{i}|}$|. The concrete sizes and shapes of these boxes are ultimately determined by our aggregation algorithm, but for subject si they are guaranteed to be at least as large as the boxes comprising the volunteer annotations for that subject. Our goal is to associate each of the click locations corresponding to volunteer annotations for a particular subject with a single true clump location. Formally, we aim to associate each of the concrete elements of Zi with a single abstract element of yi. This task is complicated for several reasons. Different volunteers may annotate different subsets of clumps and the order in which they do so is not defined nor even constrained. Volunteers may miss some real clumps, so there may be elements of yi that have no counterpart annotations in a particular zij. Conversely, the set of annotations provided by a particular volunteer for a particular subject may contain false positives, so some elements of a particular zij may not correspond with any elements of yi.
Fig. 2 provides a schematic illustration of the process by which we associate volunteer annotations with probable clump locations and Section 5.3 explains the notation and the computational details. Formally, our aggregation algorithm computes an optimal set of mapping indices |$\big\lbrace a_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| such that each volunteer-provided box |$b_{ij}^{k}\in z_{ij}$| is associated with real clump location |$b_{i}^{a_{ij}^{k}}\in y_{i}$|. The possibility of false positive boxes in zij is accounted for by defining a singleton ‘|$\varnothing$|’ element to which they can be associated.

Schematic illustration of how elements of volunteers’ annotations are associated with elements of the subject label yi. We illustrate a case in which three volunteers have provided three independent annotations of the same subject. Volunteers 1 and 2 both annotate subsets of the real clumps in the image. Volunteer 3 mistakenly marks two foreground stars as clumps. The central column lists the value of |$\lbrace a_{ij}^{k}\rbrace$| computed for each of the boxes forming the volunteers’ annotations. For volunteers 1 and 2, these values define the index of the corresponding box in yi. Both annotations provided by volunteer 3 probably mark foreground stars and neither is marked by another volunteer. In this toy example, the algorithm maps both to the ‘|$\varnothing$|’ element, thereby defining them as false positives.
4.3 Modelling volunteer skill
For a given subject, the visibility of clumps to a particular volunteer, and the positional accuracy with which they are able to annotate the clumps they do perceive, is likely to be influenced by several factors. These may include: domain expertise, experience gained from time spent contributing to Galaxy Zoo: Clump Scout, confusion regarding the detailed task instructions, and even the screen size and resolution of the device they typically use to provide annotations.
To model the impact of these factors, we consider three scenarios, which relate a particular volunteer’s annotations to the locations of real clumps in the subject image. Consider the annotations provided by the jth volunteer in our cohort. In the first scenario, the volunteer provides a true positive by marking a location that corresponds to a real clump. The positional accuracy of such annotations is governed by the volunteer-specific parameter σj.
In the second scenario, the volunteer provides a false positive by marking a location which does not correspond to the location of a real clump. We model the rate of false positive annotations for volunteer j by considering each mark they provide as a Bernoulli trial with ‘success’ probability |$p_{j}^{\mathrm{fp}}$|.
Finally, volunteer j may provide an implicit false negative by failing to mark the location of a real clump. We model the false negative rate for volunteer j by considering each opportunity to mark a real clump location as a Bernoulli trial with ‘success’ probability |$p_{j}^{\mathrm{fn}}$|.
Hereafter, we refer collectively to the three model parameters |$\mathcal {S}_{j} \equiv \big\lbrace \sigma _{j}, p_{j}^{\mathrm{fp}}, p_{j}^{\mathrm{fn}}\big\rbrace$| as volunteer j’s ‘skill’ parameters.
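To make the three scenarios concrete, the sketch below simulates the clicks that a volunteer with skill parameters |$\mathcal {S}_{j}$| might produce for one subject. The generative details here (Gaussian positional scatter of width σj for true positives, at most one uniformly placed spurious click per image) are simplifying assumptions of ours for illustration, not the exact likelihood used by the model:

```python
import random

def simulate_annotation(true_clumps, p_fp, p_fn, sigma, image_size=100.0, rng=None):
    """Simulate one volunteer's clicks for a subject (illustration only).

    true_clumps : list of (x, y) real clump locations in pixels.
    p_fp        : chance of adding a spurious (false positive) click.
    p_fn        : chance of missing (false negative) any given real clump.
    sigma       : positional scatter, in pixels, of true positive clicks.
    """
    rng = rng or random.Random()
    clicks = []
    for x, y in true_clumps:
        # Bernoulli trial: the clump is missed with probability p_fn
        if rng.random() >= p_fn:
            # True positive, with Gaussian positional scatter
            clicks.append((rng.gauss(x, sigma), rng.gauss(y, sigma)))
    # Bernoulli trial: a spurious click is added with probability p_fp
    if rng.random() < p_fp:
        clicks.append((rng.uniform(0.0, image_size), rng.uniform(0.0, image_size)))
    return clicks
```

A perfectly skilled volunteer (p_fp = p_fn = 0, σ = 0) reproduces the true clump locations exactly; degrading any parameter produces the noisy annotations the aggregation model must contend with.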
4.4 Modelling subject difficulty
4.5 Modelling volunteer annotations
The third term considers the Jaccard distances dkl between any true positive (i.e. |$a_{ij}^{k}\ne \varnothing$|) box |$b_{ij}^{k}\in z_{ij}$| and their counterparts |$b_{i}^{l}\in y_{i}$| as well as the subject’s difficulty |$\mathcal {D}_{i}$| and the volunteer’s skill |$\mathcal {S}_{j}$|.
4.6 Global model and parameter priors
|$\pi (\mathcal {D}_{i})$| models the prior probabilities of observing the difficulty parameters associated with the ith subject.
|$\pi (\mathcal {S}_{j})$| models the prior probability of observing the volunteer skill parameters associated with the jth volunteer.
π(yi) models the prior probability that the unknown true label for si is yi. For simplicity, we assume that all possible labels are equally likely.
For practical reasons, we choose prior distributions for each parameter that are the conjugate priors6 of that parameter for the corresponding likelihood model distribution. This choice facilitates straightforward computation of model parameter updates when new annotations are collected.
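For example, the Bernoulli likelihoods for |$p_{j}^{\mathrm{fp}}$| and |$p_{j}^{\mathrm{fn}}$| have Beta conjugate priors, for which a posterior update reduces to adding counts. The sketch below assumes, as the hyper-parameter notation suggests, that each prior is specified by a prior mean (e.g. |$p_{0}^{\mathrm{fp}}$|) and a pseudo-count weight (e.g. |$n_{\beta }^{\mathrm{fp}}$|); the precise parameterization used by the framework is given in Appendix A:

```python
def update_beta_posterior(alpha, beta, successes, trials):
    """Conjugate update of a Beta(alpha, beta) prior on a Bernoulli
    rate after observing `successes` in `trials` new trials."""
    return alpha + successes, beta + (trials - successes)

# A prior mean p0 = 0.1 with pseudo-count weight n_beta = 500 maps to
# Beta(p0 * n_beta, (1 - p0) * n_beta) = Beta(50, 450), whose mean is 0.1.
alpha, beta = 0.1 * 500, 0.9 * 500
# Observing 30 false positives in 100 marks pulls the estimate towards 0.3.
alpha, beta = update_beta_posterior(alpha, beta, successes=30, trials=100)
posterior_mean = alpha / (alpha + beta)  # (50 + 30) / (500 + 100)
```

Because the pseudo-count assigned to the false negative prior is much smaller (n_beta^fn = 50 versus 500), the corresponding rate estimate becomes data-dominated after far fewer annotations.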
The initial values for the parameters of our prior models |$\big\lbrace p_{0}^{\mathrm{fp}}, p_{0}^{\mathrm{fn}}, n_{\beta }^{\mathrm{fp}}, n_{\beta }^{\mathrm{fn}}, \sigma _{0,S}^{2}, n_{\chi , S}, \sigma _{0,V}^{2}, n_{\chi , V}\big\rbrace$| are hyper-parameters of our algorithm which must be chosen a priori. Table 1 lists the values that we assign to each of these hyper-parameters when processing the Galaxy Zoo: Clump Scout data set.
Framework hyper-parameter values used to process the Galaxy Zoo: Clump Scout data set.

Parameter | Value
---|---
|$p_{0}^{\mathrm{fp}}$| | 0.1
|$p_{0}^{\mathrm{fn}}$| | 0.1
|$n_{\beta }^{\mathrm{fp}}$| | 500
|$n_{\beta }^{\mathrm{fn}}$| | 50
|$\sigma _{0,S}^{2}$| | 0.1
nχ, S | 10
|$\sigma _{0,V}^{2}$| | 0.1
nχ, V | 10
fV | 0.1
dmax | 0.9
In Appendix A, we provide detailed rationale for our choice of prior distribution models and show how they yield estimates for our likelihood model parameters that become increasingly data-dominated as more annotations are collected.
5 COMPUTING AGGREGATED LABELS
Fig. 3 provides a schematic overview of how our implementation computes aggregated labels for subjects. In subsequent subsections we describe the illustrated operations in detail.

5.1 The working batch
To minimize the dependence of aggregated clump locations on our choice of model prior hyper-parameters, we design our aggregation framework to process elements from a dynamically maintained working batch containing data and metadata for ≲ 25 thousand classifications.7 Each element in the working batch represents a single click location marking a clump as part of the annotation provided by a single volunteer.
To populate the working batch, we select subjects that have been inspected by at least three volunteers and have at least one annotated clump. For each selected subject, we assemble all its available annotation data and append them to the working batch in a single block of elements. This ensures that any subject retirement decision is made on the basis of all available information. We specify a minimum target batch size and new blocks are added until the size of the working batch exceeds this target. If five or more volunteers inspect a subject and none annotate a clump, we assume that no clumps are present and preemptively retire the subject instead of adding its data to the working batch. Whenever a volunteer inspects a subject that has at least one clump annotation, but does not annotate any clumps themselves, we append a single empty classification element to the working batch. We require records of these empty classifications in order to compute the probability that a particular volunteer fails to annotate a real clump, i.e. |$p_{j}^{\mathrm{fn}}$|.
After processing a single batch of classification data, the most likely outcome is that only a subset of the corresponding subjects will have |$\mathcal {R}_{i} < \tau$| (see Section 4.1 and Section 5.7) and be deemed sufficiently low risk for retirement. We update the working batch by removing the classification data for retired subjects and replenishing it with new blocks of classification data for active subjects. Once a subject is retired, the aggregated estimated label |$\hat{y}_{i}$| is considered final and any subsequently submitted classifications for that subject will not be included in subsequent batches.
We impose a maximum lifetime for any data element by specifying the maximum number of batch replenishment cycles that it can persist within the working batch. Subjects whose data remain after this lifetime has expired are retired and flagged for inspection by experts. This forced retirement strategy prevents the working batch from becoming stale and dominated by inherently difficult or high-risk subjects that never retire normally.
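The batch lifecycle described in this subsection can be summarized in a short sketch. This is an illustrative simplification under our own assumptions (the batch is keyed by subject, the target size counts blocks rather than individual classifications, and `compute_risk` stands in for the risk computation of Section 5.7); it is not the released implementation:

```python
def replenishment_cycle(batch, compute_risk, tau, max_age, pending, target_size):
    """Run one working-batch update cycle (illustrative sketch).

    batch       : dict mapping subject id -> {"block": ..., "age": int}.
    compute_risk: returns the risk R_i for a subject's annotation block.
    tau         : global risk threshold for normal retirement.
    max_age     : replenishment cycles a block may survive before being
                  force-retired and flagged for expert inspection.
    pending     : queue of (subject_id, block) pairs awaiting processing.
    target_size : minimum number of blocks kept in the working batch.
    """
    retired, flagged = [], []
    for sid in list(batch):
        entry = batch[sid]
        entry["age"] += 1
        if compute_risk(entry["block"]) < tau:   # low risk: retire normally
            retired.append(sid)
            del batch[sid]
        elif entry["age"] >= max_age:            # stale: force-retire for experts
            flagged.append(sid)
            del batch[sid]
    while len(batch) < target_size and pending:  # top up with new blocks
        sid, block = pending.pop(0)
        batch[sid] = {"block": block, "age": 0}
    return retired, flagged
```

Subjects leaving by the first branch are retired normally; subjects leaving by the second are force-retired and queued for expert inspection.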
5.2 Initialization
To complete the initialization phase for each new working batch, we use the algorithm described in Section 5.3 to perform preliminary clustering of overlapping volunteer annotations for each subject. The subsequent subsections explain how we apply iterative expectation maximization to refine the initial clusters, while simultaneously computing the maximum likelihood solution of equation (9).
5.3 Computing box associations
For each subject si ∈ S, we follow the approach of BVP17 and implement a Facility Location algorithm (Mahdian et al. 2001) to approximately8 derive the maximum likelihood mapping |$A=\big\lbrace a_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| between the click locations comprising individual volunteers’ annotations |$z_{ij} = \big\lbrace b_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| and the set |$y_{i} = \big\lbrace b_{i}^{l}\big\rbrace _{l=1}^{|B_{i}|}$| (see Section 4.2 and Fig. 2).
Facility location algorithms form clusters with a specific topology comprising one or more cities, uniquely connected to a single, central facility.9 This topology is illustrated in Fig. 4.

Top: The topology of the clusters that are assembled by the Facility Location algorithm. In this case, the set of boxes has been partitioned into three clusters. Within each cluster, the central facility (F 1-3) is connected to one or more cities (C 1-5). Each city is connected to exactly one facility. Bottom: Possible arrangement of aggregated box clusters corresponding to the illustrated topology for an image after inspection by three volunteers. Blue boxes |$b_{i}^{l}$| correspond to facilities (F 1-3) and red boxes |$b_{ij}^{k}$| correspond with the cities (C 1-5). Note that each volunteer may contribute at most one box to each cluster and in this case the same volunteer contributed the boxes that were assigned facility status.
Our implementation finds disjoint, spatially concentrated subsets of the boxes in Zi, which we then identify with true clump locations |$b_{i}^{l}\in y_{i}$|. We label each of these aggregated clusters with the index l and denote them as |$Z_{i}^{l}$|. Establishing a new cluster entails labelling a particular box |$b_{ij}^{k}\in Z_{i}$| as a facility and connecting at least one other box |$b_{ij^{\prime }}^{k^{\prime }}$| that was provided by a different volunteer. Note that by associating box |$b_{ij}^{k}$| with cluster |$Z_{i}^{l}$| as either a city or a facility, we establish the mapping |$a_{ij}^{k} = l$|. Each box in the set of volunteer annotations is associated with at most one true clump and each subset may contain at most one box per volunteer. These constraints reflect our assumption that separate marks provided by the same volunteer are intended to indicate separate clumps.
The facility location algorithm is designed to compute the box-to-cluster mapping that minimizes Ci, which simultaneously yields the approximate maximum likelihood solution of equation (5) for given volunteer skill and image difficulty parameters.
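A minimal sketch of this clustering step, assuming a greedy variant of the facility location heuristic with a Jaccard-style connection cost; the cost constants, box format (x1, y1, x2, y2), and function names are illustrative and do not reproduce the paper's exact cost equations:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_facility_location(boxes, volunteers, open_cost=1.0, d_max=0.9):
    """Greedy sketch: repeatedly open the facility whose cluster yields the
    largest cost saving. Each cluster admits at most one box per volunteer,
    mirroring the constraint described in the text."""
    unassigned = set(range(len(boxes)))
    clusters = {}
    while unassigned:
        best = None
        for f in sorted(unassigned):
            members, used = [f], {volunteers[f]}
            # Connect nearby boxes (small Jaccard distance) from other volunteers.
            for c in sorted(unassigned - {f},
                            key=lambda c: 1.0 - iou(boxes[f], boxes[c])):
                if (1.0 - iou(boxes[f], boxes[c])) <= d_max and volunteers[c] not in used:
                    members.append(c)
                    used.add(volunteers[c])
            # Saving: connections gained minus the cost of opening this facility.
            saving = (len(members) - 1) - open_cost
            if best is None or saving > best[0]:
                best = (saving, f, members)
        _, f, members = best
        clusters[f] = members
        unassigned -= set(members)
    return clusters
```

The one-box-per-volunteer rule is what prevents a single enthusiastic volunteer from manufacturing a cluster on their own: a cluster only forms when marks from different volunteers coincide.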
To derive the aggregated estimate for the subject label |$\hat{y}_{i}$|, we merge the individual boxes comprising each cluster by computing the mean coordinates of their corresponding vertex indices.11 This yields a rectangular representation for each true clump location that is at least as large as each of the boxes comprising the set of annotations for the ith subject, Zi.
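The merging step reduces to a coordinate-wise mean over a cluster's member boxes; a minimal sketch, assuming boxes are stored as (x1, y1, x2, y2) tuples:

```python
def merge_cluster(boxes):
    """Aggregate a cluster of boxes by averaging each vertex coordinate
    across the cluster's members."""
    return tuple(sum(v) / len(boxes) for v in zip(*boxes))

# Two overlapping annotations merge into an intermediate consensus box.
merged = merge_cluster([(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)])
```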
Table 1 lists the values we adopt for fV and dmax .
5.4 Computing image difficulty
5.5 Computing volunteer skill
As a consequence of our prior specifications the formulations of equations (29), (30), and (31) can all be factored into terms that depend only on the current working batch and terms that depend only on prior information. This allows us to straightforwardly update the skill parameters of returning volunteers without having to reconsider the annotations they contributed to previous working batches.
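Under a conjugate Beta-Binomial reading of this factorization, which we assume here purely for illustration (it is not the paper's exact equations 29-31), the update reduces to carrying forward pseudo-counts that encode all prior batches and adding only the current batch's counts:

```python
def update_skill(prior_alpha, prior_beta, batch_fp, batch_tp):
    """Factored skill update sketch: the running Beta pseudo-counts summarize
    every previous working batch, so a returning volunteer's false-positive
    rate is refreshed using only the counts from the current batch."""
    alpha = prior_alpha + batch_fp   # false-positive events this batch
    beta = prior_beta + batch_tp     # true-positive events this batch
    p_fp = alpha / (alpha + beta)    # posterior-mean skill estimate
    return alpha, beta, p_fp
```

Because the posterior pseudo-counts are sufficient statistics, annotations from earlier batches never need to be revisited.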
5.6 Computing maximum likelihood labels
Once the associated clusters have been defined and the subject difficulties and volunteer skills have been computed, we are able to compute the likelihood of each subject’s estimated label using equations (8), (6), and (5). Practically, we compute the log-likelihood for each subject, and sum these to derive a global likelihood for all annotation data that comprise the current working batch.
Recall (Section 5.3) that we use a simplified set of facility location costs to derive an initial clustering solution for each new working batch. These costs are used for initialization because they can be computed without having estimated volunteer skills or subject difficulties, but they will generally not yield a set of clusters that correspond with the maximum likelihood solution of equation (5) for any subject. Similarly, the likelihood model parameters that we compute based on the initial clustering solution are unlikely to be good estimates of the subject difficulties or volunteer skills. As illustrated by the red boxes in Fig. 3, we use an iterative approach to derive the maximum likelihood solution for equation (5) and the corresponding best estimates of the likelihood model parameters.
After the initial set of volunteer skills have been computed, we recompute the box associations for all subjects using the nominal facility location costs specified in equations (16), (17), and (19). Using these clusters we recompute the likelihood model parameters and the corresponding subject label likelihoods. We repeat this procedure until the sum of log-likelihoods for all subjects converges to its maximum value.
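This iteration can be sketched as a generic E/M-style loop; the three callables below stand in for the framework's actual association, parameter, and likelihood computations, which are not reproduced here:

```python
def iterate_to_convergence(e_step, m_step, loglike, params,
                           tol=1e-8, max_iter=200):
    """Alternate cluster (E) and parameter (M) updates until the summed
    log-likelihood stops improving. A sketch of the paper's scheme, with
    the domain-specific steps supplied as callables."""
    prev = -float("inf")
    for _ in range(max_iter):
        assoc = e_step(params)          # recompute box associations
        params = m_step(assoc)          # recompute skills / difficulties
        ll = loglike(assoc, params)     # summed log-likelihood
        if abs(ll - prev) < tol:
            break
        prev = ll
    return params, ll

# Toy usage: estimating a mean, where the "E-step" is trivial.
data = [1.0, 2.0, 3.0]
params, ll = iterate_to_convergence(
    lambda m: data,
    lambda d: sum(d) / len(d),
    lambda d, m: -sum((x - m) ** 2 for x in d),
    params=0.0)
```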
5.7 Computing subject risks
In Section 4.1, we introduced the concept of a ‘risk’ metric |$\mathcal {R}_{i}$| that can be computed for any subject si and used to quantitatively determine whether the estimated label |$\hat{y}_{i}$| is sufficiently representative of the unknown true label yi to be scientifically useful. Specifying a risk that decreases monotonically as the reliability of |$\hat{y}_{i}$| increases enables a principled decision to retire the subject si when its risk falls below a predefined threshold value, which we denote τ.
The weight terms αfp, αfn, and ασ are hyper-parameters that allow the properties of the clump sample for retired subjects to be tuned for particular scientific investigations. For a specific value of τ, increasing the value of αfp relative to the other weights will result in a purer clump sample, while a relative increase in αfn increases the sample completeness. Specifying a larger value for ασ will result in more accurate clump locations, which may be useful for studies considering the radial distribution of clumps within their host galaxies.
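Assuming equation (32) is the linear weighted combination that these weights suggest (an assumption made for illustration), the overall subject risk can be sketched as:

```python
def subject_risk(n_fp, n_fn, n_sigma, a_fp=1.0, a_fn=1.0, a_sigma=2.0):
    """Overall subject risk as a weighted sum of the expected error counts.
    Default weights follow Table 2; the linear form mirrors the role of
    equation (32) but is not guaranteed to be its exact expression."""
    return a_fp * n_fp + a_fn * n_fn + a_sigma * n_sigma
```

Raising one weight relative to the others makes the corresponding error type more expensive, so subjects retire later unless that error component is suppressed.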
Our approach for estimating the expected number of genuine clumps that are not represented in the estimated label for the ith subject (i.e. the number of false negatives) emulates the one used by BVP17. We begin by using the facility location algorithm to re-cluster the annotations for each subject, subject to three additional constraints that are based on the original maximum likelihood solution.
(i) Volunteer boxes that were originally associated with true positive clusters are not considered as potential cities. This means that the only way that true positive annotations can contribute to clusters is by becoming facilities.
(ii) Only annotations that were not defined as facilities originally are considered as potential facilities. This prevents rediscovery of the clumps that were indicated by the maximum likelihood solution for the subject.
(iii) There is no dummy facility available, so all annotations must either become a facility or connect to an existing facility, regardless of how high the connection or establishment costs are.

Computing the random coincidence probability using all boxes in the working batch. Left-hand panel: Shaded boxes represent all elements in the first working batch. Solid boundaries indicate groups of boxes that coincided using the dmax = 0.9 criterion. Note that large boxes may validly encompass all or most of smaller ones without coinciding if the ratio of the box areas in normalized coordinates is less than dmax = 0.9. Boxes that did not coincide with any others are shown using dashed lines. Middle panel: The elements of B∩ coloured according to the number of boxes they were found to coincide with. Right-hand panel: Two-dimensional map showing the mean probability that one or more boxes in the working batch will accidentally coincide at a given two-dimensional location.
5.8 Subject retirement and batch finalization
Computing the expected false positive, false negative and inaccurate true positive counts (i.e. |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, and |$N_{i}^{\sigma }(\delta)$|) independently for each subject allows us to define a compound retirement criterion that specifies maximum permissible values, |$N_{i,\max }^{\mathrm{fp}}$|, |$N_{i,\max }^{\mathrm{fn}}$|, and |$N_{i,\max }^{\sigma }$|, for each of these quantities as well as a threshold τ on the overall subject risk. Table 2 lists the thresholds we use in practice as well as the values we adopt for the coefficients specified in equation (32).
Parameters used to determine subject retirement and compute overall subject risk.
Parameter | Value
---|---
αfp | 1
αfn | 1
ασ | 2
δ | 0.5
|$N_{i,\max }^{\mathrm{fp}}$| | 1
|$N_{i,\max }^{\mathrm{fn}}$| | 0.3
|$N_{i,\max }^{\sigma }$| | 3
τ | 5
Once the subject risks have been computed, we retire those subjects for which the overall risk |$\mathcal {R}_{i} < \tau$| and |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, and |$N_{i}^{\sigma }$| are all less than their specified maximum permissible values, before removing their elements from the working batch. We also identify and remove any stale subject data that have persisted for the maximum allowed number of batch replenishment cycles without retiring. Such subjects are likely very difficult or complicated, so we mark them for expert inspection, assessment and labelling. For the remaining subjects that were not retired, we re-initialise their difficulty parameters and discard any associated clusters that were established when the working batch was processed.
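The compound retirement test, using the threshold values listed in Table 2 as defaults, can be sketched as a simple predicate:

```python
def should_retire(risk, n_fp, n_fn, n_sigma,
                  tau=5.0, max_fp=1.0, max_fn=0.3, max_sigma=3.0):
    """Compound retirement criterion: a subject retires only when its
    overall risk AND every expected error count fall below the permitted
    maxima (defaults from Table 2)."""
    return (risk < tau
            and n_fp < max_fp
            and n_fn < max_fn
            and n_sigma < max_sigma)
```

A subject with acceptable overall risk can therefore still be held back if any single error component remains too large.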
Annotation data that were provided by a single volunteer for different subjects can appear in separate working batches, especially if volunteers return to the project regularly over an extended period of time. It is also possible that only a subset of the subjects annotated by a volunteer in a single working batch are retired when batch processing completes. If a volunteer’s annotation data persist between batches, those persistent data should not be used to update volunteer skills multiple times during multiple batch processing cycles. This could lead to pathological subjects unfairly inflating or reducing the skill parameter values (|$p_{j}^{\mathrm{fp}}$|, |$p_{j}^{\mathrm{fn}}$|, |$\sigma ^{2}_{j}$|) for a particular volunteer. To avoid this scenario, we restore the volunteer skills that were cached at the start of the latest cycle and update them using only annotations for subjects that did retire.
The batch processing cycle then restarts by acquiring new annotation data and repopulating the working batch.
6 RESULTS
Recall that the full Galaxy Zoo: Clump Scout data set (Z) contains 3 561 454 click locations, which constitute 1 739 259 annotations of 85 286 distinct subjects provided by 20 999 volunteers, and that approximately 20 volunteers inspected each subject. Using this data set, we identify 128 100 potential clumps distributed among 44 126 galaxies. Fig. 6 shows five examples of galaxies in which clumps were detected.

Examples of clump-hosting galaxies, illustrating the ability of our framework to exclude false-positive annotations. The left-hand column shows galaxy images as they were seen by volunteers. The second column overlays all volunteer annotations on a grey-scale image of the same galaxy. In the third column, volunteer annotations that were assigned to a facility and identified as clumps are shown in colour. Annotations that were assigned to the dummy facility are shown in black. The fourth column shows the clump locations that we ultimately identify.
6.1 Testing the effect of volunteer multiplicity
We expect that the performance of our aggregation framework will vary depending upon the number of volunteers who inspect each subject. To investigate this dependence, we assemble 17 subsamples of annotations |$\lbrace \tilde{Z}_{n}\rbrace _{n=3}^{20}\in Z$|, that contain between 3 and 20 annotations per galaxy. Each |$\tilde{Z}_{n}$| is constructed by randomly sampling n annotations for each subject si ∈ S. For example, |$\tilde{Z}_{5}$| includes five randomly sampled annotations for each galaxy in the Galaxy Zoo: Clump Scout subject set. We then use our aggregation framework to derive the set of corresponding estimated subject labels |$\hat{Y}(\tilde{Z}_{n})\equiv \lbrace \hat{y}_{i,n} \rbrace _{i=1}^{\vert S\vert }$|, where |$\hat{y}_{i,n} = \hat{y}_{i}(Z=\tilde{Z}_{n})$| is the label for si based only on the n annotations for that subject within |$\tilde{Z}_{n}$|.15 In subsequent sections, we will examine the differences between results derived using these different restricted data sets. Note that the data set containing 20 annotations per subject, denoted |$\tilde{Z}_{20}$|, is not quite the full Galaxy Zoo: Clump Scout data set Z because the Zooniverse interface occasionally collects more than 20 annotations per subject.
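The construction of each restricted data set |$\tilde{Z}_{n}$| amounts to per-subject sampling without replacement; a minimal sketch, in which the dictionary layout and fixed seed are assumptions for illustration:

```python
import random

def subsample_annotations(annotations_by_subject, n, seed=0):
    """Build a restricted data set by drawing n annotations uniformly at
    random (without replacement) for every subject; subjects with fewer
    than n annotations keep everything they have."""
    rng = random.Random(seed)
    return {sid: rng.sample(anns, min(n, len(anns)))
            for sid, anns in annotations_by_subject.items()}
```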
6.2 Aggregated clump properties
Our aggregation algorithm assigns a separate false positive probability |$p^{\mathrm{fp}}_{l}$| to each clump it identifies (see Section 5.7). The left-hand panel of Fig. 7 shows the distribution of this false positive probability for clumps detected using 20 annotations per subject, which is strongly bimodal with ≈90 per cent of clumps having |$p^{\mathrm{fp}}_{l} < 0.2$| or |$p^{\mathrm{fp}}_{l} > 0.8$|. The right-hand panel shows how the distribution of the false positive probabilities for all identified clumps evolves as more volunteers annotate each subject. For fewer than five annotations per subject (i.e. n ≲ 5), the estimates for the clumps’ false positive probabilities remain somewhat prior-dominated and the distributions are unimodal with medians close to the hyper-parameter value |$p_{0}^{\mathrm{fp}}=0.1$|. For more than five annotations per subject (i.e. n > 5), the distributions become progressively more bimodal, which increases their interquartile ranges. The distribution medians decrease monotonically as the number of annotations per subject n → 20, which indicates that providing more volunteer annotations per subject allows our framework to more confidently predict the presence of clumps.

Left-hand panel: Distribution of estimated false positive probability |$p^{\mathrm{fp}}_{l}$| for clumps identified using 20 annotations per subject (i.e. using |$\tilde{Z}_{20}$|). The distribution is strongly bimodal with ≈90 per cent of clumps having |$p^{\mathrm{fp}}_{l} < 0.2$| or |$p^{\mathrm{fp}}_{l} > 0.8$|. The inset shows the distribution for |$p^{\mathrm{fp}}_{l} < 0.01$|. Right-hand panel: Distributions of |$p^{\mathrm{fp}}_{l}$| corresponding to n between 3 and 20 volunteer annotations per subject. The distribution medians decrease monotonically from ≈0.04 for n = 3 to ≈5 × 10−5 for n = 20, while the distribution interquartile ranges become wider as more volunteers annotate each subject. We use a ‘logistic’ scaling for the y-axis to highlight the development of the bimodal structure for large n. Note that the colour scale shows the number density of clumps to account for the fact that the two-dimensional histogram bins cover different areas.
For every bounding box in each subject’s maximum likelihood label, we also compute the probability |$p^{\sigma }_{l}$| that the Jaccard distance between it and the unknown true location of the clump exceeds δ = 0.5. The left-hand panel of Fig. 8 shows the distribution of |$p^{\sigma }_{l}$| for clumps detected using 20 annotations per subject, while the right-hand panel shows how the distribution of |$p^{\sigma }_{l}$| evolves as more volunteers annotate each subject. Again, our model priors appear to dominate for fewer than five annotations per subject and the distribution medians decrease monotonically as the number of annotations per subject n → 20. This pattern indicates that providing more volunteer annotations per subject allows our framework to more precisely determine the locations of clumps.

Left-hand panel: Distribution of the estimated probability that an individual clump location is inaccurate (|$p^{\sigma }_{l}$|) for clumps identified using 20 annotations per subject (i.e. using |$\tilde{Z}_{20}$|). The distribution is concentrated close to zero with all clumps having |$p^{\sigma }_{l}\lesssim 0.3$|. The inset shows the distribution for |$p^{\sigma }_{l} < 0.01$|. Right-hand panel: Distributions of |$p^{\sigma }_{l}$| corresponding to n between 3 and 20 volunteer annotations per subject. The distribution medians decrease monotonically from ≈0.05 for n = 3 to ≈4 × 10−4 for n = 20, while the distribution interquartile ranges become wider as more volunteers annotate each subject. Note that the colour scale shows the number density of clumps to account for the fact that the two-dimensional histogram bins cover different areas.
Fig. 9 illustrates the spatial distribution of the detected clump locations in bins of estimated clump false positive probability |$p^{\mathrm{fp}}_{l}$|. We observe that 99.9 per cent of clumps with |$p^{\mathrm{fp}}_{l}\lesssim 0.5$| (i.e. likely true positives) are located within a central circular region occupying 20 per cent of the area of their corresponding images. In contrast, clumps with |$p^{\mathrm{fp}}_{l}\gtrsim 0.5$| (i.e. likely false positives) are 10 times more likely to fall outside this region. This central concentration of confidently identified clumps is reassuring because it reflects the typical footprints of the target galaxies in each subject image, which is where we would reasonably expect to find genuine clumps. For all clumps, regardless of their estimated false positive probability, we observe a clear under-density at the centre of the distribution, which likely reflects the fact that most volunteers correctly distinguish the target galaxies’ central bulges from clumps.

Detected clump locations in normalized image coordinates in bins of estimated clump false positive probability, pfp. For pfp ≲ 0.5, 99.9 per cent of detected clumps have |$R_{\mathrm{clump}}=\sqrt{(X_{\mathrm{clump}}/X_{\max })^{2}+(Y_{\mathrm{clump}}/Y_{\max })^{2}} < 0.25$|. In contrast, 10 times more clumps (∼1 per cent) with |$p^{\mathrm{fp}}_{l}\gtrsim 0.5$| have Rclump > 0.25.
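The Rclump statistic quoted in this caption is a normalized radial distance; a minimal helper implementing that definition:

```python
import math

def radial_position(x, y, x_max, y_max):
    """Normalized radial coordinate of a clump within its image,
    following the R_clump definition in the Fig. 9 caption."""
    return math.hypot(x / x_max, y / y_max)
```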
6.3 Comparison with expert annotations
To quantify the degree of correspondence between the clumps identified by volunteers and those identified by professional astronomers, we used the Galaxy Zoo: Clump Scout interface to collect annotations from three expert astronomers for 1000 randomly selected subjects and compared the recovered clump locations with those derived from volunteer clicks by our aggregation framework.
For each subject in this expert-annotated image set, we consider the 17 different estimated labels |$\hat{y}_{i,n}$| that were computed using 3 ≤ n ≤ 20 volunteer annotations per subject (see Section 6.1). We then filter each of these 17 labels by selecting a subsample of its bounding boxes that have associated false positive probabilities |$p^{\mathrm{fp}}_{l}$| that are less than a selectable threshold value, which we denote p⋆, fp. By setting p⋆, fp close to zero, we expect to select only the bounding boxes that mark real clumps. Conversely, we expect that setting p⋆, fp close to one results in a subsample that is likely to contain more false positive bounding boxes. We use the symbol |$\hat{Y}^{\star }_{n}(p^{\star ,\mathrm{fp}})$| to denote the set of estimated labels for all expert-annotated subjects that were computed using n volunteer annotations per subject and filtered to include only those bounding boxes with false positive probabilities less than p⋆, fp.
For a particular false positive filtering threshold p⋆, fp and number of annotations per subject n, we consider the filtered labels for all 1000 expert-annotated subjects and define |$N_{n}^{\mathrm{FP}}$| to be the total number of empirically false positive aggregated clump bounding boxes in |$\hat{Y}^{\star }_{n}(p^{\star ,\mathrm{fp}})$| that contain zero expert click locations. Conversely, |$N_{n}^{\mathrm{FN}}$| denotes the total number of expert clicks located outside of any aggregated box, which we designate as false negatives. We identify the remaining |$N_{n}^{\mathrm{TP}}$| aggregated boxes that coincided with an expert click location as true positives.
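Assuming |$\mathcal{C}_{n}$| and |$\mathcal{P}_{n}$| follow the standard recall and precision definitions (which the text's usage suggests), they derive from these counts as:

```python
def completeness_purity(n_tp, n_fp, n_fn):
    """Completeness (recall) and purity (precision) from empirical
    true positive, false positive and false negative counts."""
    completeness = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    purity = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    return completeness, purity
```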
Fig. 10 illustrates how the completeness and purity of our aggregated clump sample depend on n. In the left-hand panel, we plot |$\mathcal {C}_{n}$| and |$\mathcal {P}_{n}$| values derived using the whole expert-identified clump sample as a ground-truth set. The values plotted in the right-hand panel are derived by comparing a restricted set of nominally normal ground-truth clumps, which experts did not identify as ‘unusual’ (see Section 3.2), with aggregated clumps that the majority of identifying volunteers classified as normal in appearance. In both panels, the crosses show the ‘optimal’ completeness and purity values that maximize the hypotenuse |$\sqrt{\mathcal {C}(p^{\star ,\mathrm{fp}})^{2} + \mathcal {P}(p^{\star ,\mathrm{fp}})^{2}}$| over all possible p⋆, fp thresholds. For comparison, the square and triangular points in Fig. 10, respectively, illustrate the maximum values of completeness and purity that can be achieved independently.

Purity versus completeness for different numbers of volunteers per subject. The left-hand panel shows values derived using the full volunteer label sets and all expert-identified clumps as a benchmark sample, while the values shown in the right-hand panel are derived by comparing the sets of clumps which experts and volunteers identified as ‘normal’. Squares indicate the values of the maximum possible completeness and purity for each number of volunteers, which can generally not be realized simultaneously. Crosses indicate the optimal completeness and purity values that can be simultaneously realized for each volunteer count.
For both the full and the restricted ground truth sets, we observe a general trend that increasing the number of volunteers who inspect each subject increases the optimal sample completeness at the expense of reducing purity. Using the expert classifications as a benchmark, it is clear that our most complete aggregated clump samples suffer substantial contamination. In the most extreme case, using the ‘normal’ clump comparison sets for n = 20 and letting p⋆, fp = 1 yields ∼97 per cent completeness, but only ∼35 per cent purity. The high level of contamination indicates that volunteers are much more optimistic than experts when annotating clumps, i.e. volunteers will mark features that experts will ignore. Moreover, while completeness values generally improve when comparing the restricted ‘normal’ clump samples, the corresponding purity values are substantially worse than those derived from the full clump samples. This degradation in purity for the ‘normal’ clump subset likely indicates that volunteers and experts disagree about the definition of a ‘normal’ clump, with volunteers being less likely to label a clump as unusual.
The top row of Fig. 11 shows the g, r, and i band flux distributions16 for aggregated clumps that are empirically determined to be false positive and true positive when comparing them with expert clump annotations. To better represent the appearance of the clumps that volunteers and experts actually see, the band-limited fluxes shown in Fig. 11 are independently scaled in the same way as the corresponding bands of the Galaxy Zoo: Clump Scout subject images (see Section 2.2). The distributions reveal that empirically false-positive clumps are ∼5–10 times fainter on average than empirically true positive clumps. The bottom row of Fig. 11 shows all non-redundant flux ratios for the g, r, and i bands. In general, the empirically false positive clumps are brighter in the g band and would appear bluer in the subject images. Overall, the distributions in Fig. 11 suggest that volunteers are more likely to mark faint features than experts, particularly when those features appear blue. Fig. 18 shows typical examples of the faint blue features that volunteers annotate but experts ignore.

Top row: Flux distributions in g, r, and i bands for clumps that are empirically determined to be false positive or true positive by comparing with expert clump annotations. Dashed vertical lines indicate the distribution means. Bottom row: Flux ratio distributions for clumps that are empirically determined to be false positive or true positive by comparing with expert clump annotations. Dashed vertical lines indicate the distribution medians. In both rows, the fluxes in each band are scaled in the same way as the corresponding bands of the subject images (see Section 2.2) to better reflect the data that volunteers actually see.
Fig. 12 illustrates the degree of correspondence between the value of |$p^{\mathrm{fp}}_{l}$| assigned to each clump by our aggregation framework and their empirical categorization as true or false positives. The figure compares the distributions of |$p^{\mathrm{fp}}_{l}$| for empirically true positive and false positive clumps identified using all available annotations for the expert-annotated subject set. The distributions represent the restricted subset of clumps in |$\hat{Y}_{20}$| that the majority of volunteers labelled as ‘normal’. However, we recognize that volunteers and experts may disagree about what criteria define a ‘normal’ clump. Therefore, to avoid conflating this categorical disagreement with genuine cases in which experts and volunteers mark different features (regardless of the annotation tool used), we consider any expert-identified clump when assigning true-positive or false-positive labels. The majority of aggregated clumps in both categories have very low estimated false positive probabilities (|$p^{\mathrm{fp}}_{l}\ll 1$|), indicating a high degree of consensus between volunteers, even though this consensus disagrees with the expert annotations. Although clumps in both empirical categories have estimated |$p^{\mathrm{fp}}_{l}$| values spanning the full range [0, 1], we note that 95 per cent of empirically true-positive clumps have |$p^{\mathrm{fp}}_{l} < 0.3$| compared with only 68 per cent of empirical false positives. This reinforces the evidence implicit in Fig. 10 that the aggregated clump sample can be made purer with respect to the expert sample by applying a threshold on |$p^{\mathrm{fp}}_{l}$|.

Distribution of estimated clump false positive probability (|$p^{\mathrm{fp}}_{l}$|) values for aggregated clump locations that coincide with expert annotations (orange) and those that did not (blue). We use coincidence with any expert clump to establish the true-positive or false-positive categories, but only aggregated clumps that the majority of volunteers labelled as ‘normal’ are considered. The inset shows a zoomed view of the distributions for |$p^{\mathrm{fp}}_{l} < 0.01$|.
6.4 Volunteer skill parameters
Our aggregation framework allows us to monitor the evolution of volunteers’ skill parameters as they spend time in the project. The top panel of Fig. 13 shows the distribution of the Galaxy Zoo: Clump Scout volunteers’ subject classification counts. The distribution is bottom-heavy with a median of three subjects per volunteer and 19 859 volunteers (∼95 per cent) annotating fewer than 10 images, and only 176 volunteers (∼0.8 per cent) annotating more than 200.17 The remaining panels of Fig. 13 illustrate how our estimates of the volunteers’ skill parameters evolve as volunteers inspect and annotate increasing numbers of subjects. For all three skill parameters, the mean and median of the maximum likelihood estimates increase monotonically from their prior values as volunteers annotate more subjects. The relatively slow evolution of |$p^{\mathrm{fp}}_{j}$| for subject inspection counts below ∼10 reflects the strong regularization that results from setting the hyper-parameter |$n_{\beta }^{\mathrm{fp}}=500$| (see Table 1).

Evolution of volunteer skill parameter statistics versus number of subjects inspected. The top panel shows the distribution of the number of volunteers who have inspected at least as many subjects as indicated by the upper boundary of each bin. This means volunteers who annotate many subjects will contribute to several bins. However, their skill parameters are sampled at the point that they had inspected the maximum number of subjects represented by a particular bin. Statistics for the different volunteer skill parameters |$p_{j}^{\mathrm{fp}}$|, |$p_{j}^{\mathrm{fn}}$| and σj are shown in the upper-middle, lower-middle, and bottom panels, respectively. Red and blue markers plot the median and mean skill parameter of all volunteers contributing to a particular bin, respectively. The orange band illustrates the inter-quartile ranges of the bin-wise distributions. Dotted and dashed lines indicate the 5th and 95th percentiles, respectively.
6.5 Subject risk and its components
The distributions shown in Fig. 14 reveal how the expected numbers of false positive bounding boxes |$N_{i}^{\mathrm{fp}}$|, missed clumps (or false negatives) |$N_{i}^{\mathrm{fn}}$| and inaccurate clump locations |$N_{i}^{\mathrm{\sigma }}$| (see Section 5.7) evolve for the subjects in the Galaxy Zoo: Clump Scout subject set as more volunteers annotate them. For the majority of subjects, our framework estimates values less than one for all risk components, regardless of how many volunteers annotated them. The distributions of |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, and |$N_{i}^{\mathrm{\sigma }}$| become broader and their median values decrease monotonically as n → 20. This pattern indicates that for the majority of subjects, increasing the number of volunteers who annotate each subject improves the reliability of their consensus labels.

Evolution of the distributions for components of subject risk as the number of volunteer annotations per subject increases. Distributions for the expected numbers of false positive bounding boxes |$N_{i}^{\mathrm{fp}}$|, missed clumps (or false negatives) |$N_{i}^{\mathrm{fn}}$|, and inaccurate clump locations |$N_{i}^{\mathrm{\sigma }}$| are shown in the upper left, lower left-hand, and lower right-hand panels, respectively. The upper right-hand panel shows the distributions of |$N_{i}^{\mathrm{fp}}$| after discarding individual clumps with false positive probabilities |$p^{\mathrm{fp}}_{l} > 0.85$|. Note that the y-axis changes from logarithmic to linear scaling at the values indicated by the black horizontal dashed lines to better illustrate the evolution of structures in each distribution.
A minority of subjects have estimated values for one or more of |$N_{i}^{\mathrm{fp}}$|, |$N_{i}^{\mathrm{fn}}$|, or |$N_{i}^{\mathrm{\sigma }}$| that are greater than one. For this subset of subjects, their associated risk component distributions appear to stabilize after five or more volunteers have annotated each subject. We suggest that estimates for subjects that are annotated by fewer than five volunteers (i.e. for n ≲ 5) are noise-dominated or prior-dominated and somewhat unreliable. The structure that is visible in the distributions of |$N_{i}^{\mathrm{fp}}$| in the upper-left-hand panel is produced by a strong bimodality in the distribution of false positive probabilities (|$p^{\mathrm{fp}}_{l}$|) for the clumps in the corresponding sets of estimated labels (i.e. the clumps in the corresponding |$\hat{Y}_{n}$| – see Fig. 7). For each clump in the estimated label for a particular subject, its false positive probability is very likely to be close to zero or one. The expected number of false positive clumps in a subject’s estimated label is derived by summing a term that includes these probabilities in its denominator, so the distributions of |$N_{i}^{\mathrm{fp}}$| will naturally be concentrated into peaks around integer values. Similar structures that are visible in the distributions of |$N_{i}^{\mathrm{fn}}$| are produced by a strong bimodality in the summand in equation (39). The fraction of subjects for which |$N_{i}^{\mathrm{fp}} > 1$| peaks at ∼10 per cent for n = 15 and decreases to ∼8 per cent for n ∼ 20. In contrast, the fraction of subjects for which |$N_{i}^{\mathrm{fn}} > 1$| does not peak, but increases quasi-monotonically to reach ∼2 per cent as n → 20. The fraction of subjects for which |$N_{i}^{\sigma } > 1$| is negligible and <0.05 per cent for all n.
The overall median values for the estimated numbers of missed clumps and inaccurate clump locations per subject both decrease monotonically as the number of volunteers who inspect each subject increases. However, the overall median for the expected number of false positive clumps per subject increases slowly until n = 13 before beginning to decrease. We assess the feasibility of reducing |$N_{i}^{\mathrm{fp}}$| by discarding aggregated clumps with high individual false positive probabilities. The upper right-hand panel of Fig. 14 shows the effect of filtering clumps with |$p^{\mathrm{fp}}_{l} > 0.85$| on the distribution of |$N_{i}^{\mathrm{fp}}$|. Applying this filter substantially reduces the estimated number of false positive clumps after five or more volunteers annotate each subject and moreover, the fraction of subjects for which the expected number of false positive clumps per subject exceeds one now peaks at ∼0.1 per cent for n = 5 and decreases rapidly thereafter.
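The filtering step itself is a simple threshold on each clump's estimated false positive probability; a minimal sketch in which the 'p_fp' field name is an assumption about how clump records are stored:

```python
def filter_clumps(clumps, p_fp_max=0.85):
    """Discard aggregated clumps whose estimated false positive
    probability exceeds the chosen threshold (0.85 in the text).
    Each clump record is assumed to carry a 'p_fp' field."""
    return [c for c in clumps if c["p_fp"] <= p_fp_max]

# Example: one likely false positive is removed.
kept = filter_clumps([{"p_fp": 0.01}, {"p_fp": 0.90}, {"p_fp": 0.50}])
```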
We note that filtering clumps based solely on their estimated false-positive probabilities may inadvertently discard real clumps if |$p^{\mathrm{fp}}_{l}$| does not correlate appropriately with observable quantities like brightness and colour that can indicate whether a particular feature is a genuine clump or spurious.18 Fig. 15 illustrates the overall effect of discarding clumps with individual false positive probabilities larger than 0.85 on the number of clumps per galaxy that our framework identifies using different numbers of volunteer annotations per subject. The impact is strongest for n ≳ 7 but the overall effect is small with ≲0.5 fewer clumps identified per galaxy. The left-hand panel of Fig. 16 plots fluxes in the g, r, and i bands versus the estimated individual false positive probability (|$p^{\mathrm{fp}}_{l}$|) for all clumps that our framework identifies using 20 annotations per subject. In all three bands, the mean flux of clumps with |$p^{\mathrm{fp}}_{l} < 0.2$| is ∼1.5 times larger than the mean flux for clumps with |$p^{\mathrm{fp}}_{l} > 0.2$|. The right-hand panel of Fig. 16 plots the non-redundant flux ratios i/g, r/g, and i/r versus |$p^{\mathrm{fp}}_{l}$|. On average, clumps with low estimated false positive probability appear brighter in bluer bands. Overall, we observe a pattern whereby clumps that appear brighter and bluer in the subject images tend to have lower |$p^{\mathrm{fp}}_{l}$|. We verified that this pattern does not change significantly when clumps are filtered according to the fraction of volunteers that labelled them as ‘unusual’. This is reassuring because real clumps are expected to be bright and blue in colour and suggests that filtering clumps based on |$p^{\mathrm{fp}}_{l}$| is well motivated physically. 
The correlations with flux and colour also resemble the empirical patterns described in Section 6.3, where we observed that the sample of clumps that coincided with expert clump annotations were brighter and bluer than the sample of clumps that did not.
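The per-band flux contrast between confident and less confident clumps can be reproduced schematically. The snippet below is a hypothetical sketch: the function name, the toy data, and the 0.2 split are illustrative (the split value is taken from the discussion above, not from the released catalogue schema).

```python
import numpy as np

def mean_flux_contrast(flux, p_fp, split=0.2):
    """Ratio of the mean flux of confident clumps (p_fp < split) to
    that of less confident clumps (p_fp >= split) in a single band."""
    flux = np.asarray(flux, dtype=float)
    p_fp = np.asarray(p_fp, dtype=float)
    confident = flux[p_fp < split]
    doubtful = flux[p_fp >= split]
    return confident.mean() / doubtful.mean()

# Toy data: confident clumps carry roughly 1.5x the mean flux,
# mimicking the contrast reported for the g, r, and i bands.
flux = np.array([3.0, 2.9, 3.1, 2.0, 1.9, 2.1])
p_fp = np.array([0.05, 0.10, 0.15, 0.60, 0.80, 0.90])
contrast = mean_flux_contrast(flux, p_fp)  # ~1.5
```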

Evolution of the distribution of the number of clumps per galaxy as more volunteers inspect and annotate each subject. The red markers and lines plot the distribution means for the different numbers of volunteers per subject. Top panel: Number of clumps per galaxy with any value for their estimated false positive probability |$p^{\mathrm{fp}}_{l}$|. Bottom panel: Number of clumps per galaxy with |$p^{\mathrm{fp}}_{l} < 0.85$|.

Left-hand panel: Clump flux in the g, r, and i bands versus estimated clump false positive probability |$p^{\mathrm{fp}}_{l}$|. Right-hand panel: Clump flux ratios between the g, r, and i bands versus estimated clump |$p^{\mathrm{fp}}_{l}$|. In both panels, the fluxes in each band are scaled in the same way as the corresponding bands of the subject images (see Section 2.2) to better reflect the data that volunteers actually see. On average, clumps with low |$p^{\mathrm{fp}}_{l}$| appear brighter and bluer in the subject images.
Fig. 17 shows how the fractions of subjects that are retired for different reasons vary as more volunteers annotate each subject. More than 90 per cent of subjects meet the subject retirement criterion specified in Section 5.8 regardless of how many volunteers annotate each subject. Of the remaining subjects, ∼7–9 per cent become stale after persisting in the working batch for more than 10 replenishment cycles and are removed. The fraction of stale subjects peaks for n = 6 annotations per subject and decreases monotonically thereafter as more annotations per subject are used. Fewer than 1 per cent of subjects failed to retire for any n. The fraction of unretired subjects is maximally 0.9 per cent for n = 3 and falls to <0.1 per cent for n = 20. We comment that for n < 6 the computation of |$\mathcal {R}$| and its components is likely to be dominated by our model priors, and therefore the apparent decrease in the number of stale subjects should probably not be interpreted as improved performance within this domain.

Fraction of subjects retired for different reasons versus number of volunteers per subject.
7 DISCUSSION
Using the annotations provided by the Galaxy Zoo: Clump Scout volunteers, our framework has identified a large catalogue of potential clumps. In addition, our aggregation framework provides quantitative metrics for the reliability of the estimated subject labels it computes. These diagnostics allow us to better understand how volunteers interpreted the definition of a clump that they were provided with and how they executed the annotation task.
The observable properties of the clumps we detect appear plausible, both in terms of their spatial distribution within the subject images and their fluxes in the SDSS g, r, and i bands. The central concentration of confidently identified clumps in Fig. 9 is reassuring, because it reflects the typical footprints of the target galaxies in each subject image, which is where we would reasonably expect to find genuine clumps. For clumps with any estimated false positive probability |$p^{\mathrm{fp}}_{l}$|, we observe a clear under-density at the centre of the distribution, which likely reflects the fact that most volunteers correctly distinguish the target galaxies’ central bulges from clumps.
The clump flux and colour distributions in Fig. 16 reveal that brighter, bluer clumps tend to have lower false positive probabilities (|$p^{\mathrm{fp}}_{l}$|). This trend is also reassuring because real clumps are expected to be bright and blue in colour and suggests that filtering clumps based on |$p^{\mathrm{fp}}_{l}$| is well motivated physically. The correlations with flux and colour also resemble the empirical patterns described in Section 6.3, where we observed that the sample of clumps that coincided with expert clump annotations were brighter and bluer than the sample of clumps that did not.
By comparing expert labels for 1000 subjects with those estimated by our framework using volunteer annotations, we showed that volunteers are much more optimistic than experts when annotating clumps. Overall, the distributions in Fig. 11 suggest that volunteers are more likely to mark faint features than experts, particularly when those features appear blue. This results in aggregated clump samples for the 1000 test subjects that appear quite heavily contaminated with respect to the expert labels. Moreover, this apparent contamination worsens if clumps that experts or the majority of volunteers labelled as ‘unusual’ are discarded. This degradation in purity for the ‘normal’ clump subset likely indicates that volunteers and experts disagree about the definition of a ‘normal’ clump, with volunteers being less likely to label a clump as unusual.
Using Fig. 12, we illustrated that our framework tends to estimate lower false positive probabilities for clumps that were marked by both volunteers and experts. The formulation of our likelihood model means that smaller estimated false positive probabilities correlate broadly with a greater degree of consensus between skilled volunteers that a clump exists at a particular location. Therefore, it seems that while many volunteers mark features that experts would not identify as clumps, features that experts do mark tend to have also been marked by a majority of the more skilled volunteers who inspected the corresponding subject. The correlation between clumps’ false positive probabilities and their expert classifications also reinforces the evidence implicit in Fig. 10 that the aggregated clump sample can be made purer with respect to the expert sample by applying a threshold on |$p^{\mathrm{fp}}_{l}$|.
We note that while using a visual labelling approach to identify clumps provides more flexibility than relying on a fixed set of brightness or colour thresholds, it is also unavoidably subjective. To illustrate how this subjectivity may be impacting the empirically determined purity and completeness of our clump sample, Fig. 18 shows typical examples of the faint blue features that volunteers annotate but experts ignore. Many of these do appear clump-like and it is not always obvious why experts have not marked them. Based on these observations, we suggest that the sample of clumps identified by our framework using volunteer annotations may not be as severely contaminated as Fig. 10 implies. We also note that the clump samples our framework derives are generally very complete and include the majority of expert-labelled clumps. This means that subsets of clumps for particular scientific analyses can be selected from a nominally impure sample using physically motivated criteria based on directly observable or derived characteristics of the individual clumps. For example, Adams et al. (2022) derive samples of bright clumps by using criteria based on photometry extracted from clumps and their host galaxies.

Six curated but representative examples of subject images that show agreements and disagreements between volunteers and experts. Features labelled as clumps by volunteers but ignored by experts are highlighted by white boxes. Red boxes highlight features that were annotated by both experts and volunteers. Red circles highlight features that were annotated by experts but not by volunteers. Volunteers tend to mark fainter features than experts, particularly if those features appear blue in colour. None of the features highlighted in this figure were labelled as ‘odd’ by a majority of volunteers or the experts who marked them.
In addition to providing quantitative estimates for the reliability of individual clump labels, our framework allows us to investigate the performance of individual volunteers and the entire volunteer cohort. The positive gradients of the skill parameter evolution curves in Fig. 13 decrease with increasing number of subjects inspected (their second derivatives are negative except in the final bin, which contains relatively few volunteers). This suggests that the volunteer skill parameters may converge to stable asymptotic values for very large numbers of inspected subjects. The fact that this convergence was not achieved for the Galaxy Zoo: Clump Scout data set likely indicates that the global maximum likelihood solution is dominated by the large number of volunteers who inspect very few images and may provide noisy annotations due to their relative inexperience.
The noisiness of volunteer annotations probably indicates that identifying clumps within star forming galaxies, which can have complex underlying morphologies, is relatively difficult for inexperienced non-experts. In Section 6.4, we noted that most volunteers only annotated a small number of galaxies and may not have had time to learn the visible characteristics of genuine clumps. While it may be the case that the task of clump identification is too difficult for typical Zooniverse volunteers, this seems unlikely and there are several plausible strategies for making complex and subtle image analysis tasks more feasible for citizen scientists. The most obvious is to improve the amount and quality of the initial training that is provided to volunteers. However, Zooniverse volunteers are accustomed to participating in projects with minimal tutorial material, so imposing a more rigorous training requirement may discourage widespread participation. As discussed in Section 3.1, the volunteers who contributed to Galaxy Zoo: Clump Scout received real-time feedback for a small number of expert-labelled subjects that they annotated during the early stages of their participation. Providing more detailed feedback for a larger sample of subjects may help volunteers to better understand the task they are being asked to perform.
Some Zooniverse projects also provide a dedicated tutorial workflow with an accompanying video tutorial in which experts annotate the same subjects that volunteers see and explain their reasoning.19 When using feedback as a training tool, it is important that the feedback subjects contain galaxies and clumps that are properly representative of the global populations within the full subject set, but it is difficult to ensure that this is the case unless the experts themselves inspect a large number of subjects. Moreover, the feedback messages that volunteers receive must be carefully chosen to avoid discouraging volunteers if their annotations disagree with those of experts.
An alternative to explicit training and feedback that was pioneered by the Gravity Spy project20 involves incrementally increasing the difficulty of subjects that volunteers inspect and annotate as they spend longer engaged with the project and their skill improves (Zevin et al. 2017). Using this ‘leveling up’ approach requires an a priori metric for the relative difficulty of subjects for volunteers, as well as ongoing assessment of volunteers’ skills. While our framework naturally fulfills the latter requirement, it does not facilitate prior segregation of subjects to populate the different difficulty levels. It might be possible to formulate a heuristic approach to estimating subject difficulty based on observable properties of the clumps’ host galaxies, but that is beyond the scope of this paper.
As we discuss in Section 4.1, the consensus reliability metrics that our framework computes may enable quantitatively motivated early retirement of subjects if it can be established that a stable consensus solution has been reached. In Section 5.7, we described how our framework formulates a subject retirement criterion based on estimated metrics that are proxies for the completeness (|$N_{i}^{\mathrm{fn}}$|), purity (|$N_{i}^{\mathrm{fp}}$|), and accuracy (|$N_{i}^{\sigma }$|) of that subject’s label. Fig. 17 seems to show that more than 90 per cent of subjects fulfil this criterion, even when only n = 3 volunteers inspect each subject. However, the distributions shown in Fig. 14 appear to be noise or prior dominated for n ≲ 5, and we suggest that estimates of the subject risk |$\mathcal {R}$| and its components |$\big\lbrace N_{i}^{\mathrm{fp}},N_{i}^{\mathrm{fn}},N_{i}^{\sigma }\big\rbrace$| for that domain should be treated with some caution.
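The retirement test can be sketched as a simple conjunction of thresholds on the three risk components. The threshold values below are illustrative placeholders, not the values listed in Table 2.

```python
def should_retire(n_fp, n_fn, n_sigma,
                  t_fp=1.0, t_fn=1.0, t_sigma=1.0):
    """Retire a subject once all three estimated risk components
    (expected false positives, false negatives, and inaccurate true
    positives in its label) fall below their thresholds.
    Threshold values here are illustrative placeholders."""
    return n_fp < t_fp and n_fn < t_fn and n_sigma < t_sigma

# A subject with all components well below one would retire early;
# one expected missed clump keeps the subject in the working batch.
should_retire(0.1, 0.3, 0.02)   # True
should_retire(0.1, 1.4, 0.02)   # False
```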
In Fig. 14, we showed that discarding clumps with estimated individual false positive probabilities |$p_{l}^{\mathrm{fp}} > 0.85$| substantially reduces the number of subject labels that are expected to include one or more false positive clumps and that this number reduces rapidly once more than seven volunteers have annotated each subject.
We interpret the fact that the estimated number of missed clumps per subject (|$N_{i}^{\mathrm{fn}}$|) increases as more volunteers annotate each subject as an effect of some of those volunteers marking very faint features. Potential false-negative clumps identified by the second constrained run of the facility location algorithm (see Section 5.7) are typically on the threshold of identification by our framework, which normally means that several volunteers have marked them.21 If the fraction of highly optimistic volunteers within the overall cohort is small, then a relatively large number of volunteers must inspect each subject for faint features to reach the threshold where they are considered potential false negatives. The increase in |$N_{i}^{\mathrm{fn}}$| as the number of annotations per subject n → 20 is then an indication that more faint features are reaching, but not surpassing, our framework’s detection threshold. Fig. 15 provides an empirical estimate for the number of clumps per galaxy that are missed when fewer volunteers inspect each subject. Although the mean number of identified clumps per galaxy does increase in the interval 7 < n < 20, the rate of increase is very slow and increasing n from 7 to 20 results in just 0.5 more clumps with individual false positive probabilities |$p_{l}^{\mathrm{fp}} < 0.85$| per galaxy on average. In line with our previous observations regarding volunteer optimism, we suggest that many of these additional clumps may in fact be faint, blue features within the target galaxies. As Fig. 10 illustrates, our comparison with expert labels also suggests that n ∼ 7 provides the best compromise between the completeness and purity of our aggregated clump sample.
Empirically, it seems that at least five volunteers must inspect each subject to obtain a stable solution for the subject labels and that the majority of genuine clumps could be identified by our framework for most subjects using the annotations provided by approximately seven volunteers. Increasing the number of volunteers beyond this threshold seems to introduce more noise into the annotation data and also results in progressively fainter features being identified. Retiring the majority of subjects after inspection by seven volunteers, if it could have been well motivated, would have reduced the volunteer effort required for the Galaxy Zoo: Clump Scout project by a factor >2. Unfortunately, we must acknowledge that the reliability metrics computed by our framework do not seem to converge in a way that is useful to facilitate an early retirement decision. For most subjects, our framework predicts expected numbers of false positives, false negatives and inaccurate true positives that are less than one for any number of annotations (i.e. |$N_{i}^{\mathrm{fp}},N_{i}^{\mathrm{fn}},N_{i}^{\sigma }\ll 1\,\,\forall \, n$|) and so these subjects would have been retired when n < 7 based on the thresholds listed in Table 2. As we show in Fig. 10, retiring subjects this early would yield a lower sample completeness, even for the brighter clumps that experts also identified.
Moreover, while the predicted numbers of subject labels containing false positive or inaccurate clump locations both decrease for n ≳ 7 as n → 20, the predicted number of subject labels that are missing real clumps increases. Under any retirement criterion predicated on |$N_{i}^{\mathrm{fn}}\ll 1$|, considering the annotations from more volunteers would result in more subjects becoming stale in the working batch and therefore requiring inspection by experts. Fortunately, in the case of Galaxy Zoo: Clump Scout, the fraction of subjects for which the estimated number of false negative clumps |$N_{i}^{\mathrm{fn}} > 1$| for any n is <3 per cent of the overall data set (∼2500 subjects), so visual inspection by experts would be feasible.
8 SUMMARY AND CONCLUSION
In this paper, we have presented a software framework that uses a probabilistic model to aggregate multiple annotations that mark two-dimensional locations in images of distant galaxies and derive a consensus label based on those annotations. The annotations themselves were provided via the Galaxy Zoo: Clump Scout citizen science project by non-expert volunteers who were asked to mark the locations of giant star forming clumps within the target galaxies. Among a sample of 85 286 galaxy images that were inspected by volunteers, our software framework identified 44 126 that contained at least one visible clump and detected 128 100 potential clumps overall.
To empirically evaluate the validity of the clumps we identify, we compared our aggregated labels with annotations provided by expert astronomers for a subset of 1000 galaxy images. We found that Galaxy Zoo: Clump Scout volunteers are much more optimistic than experts, and are willing to mark much fainter features as potential clumps, particularly if those features appear blue in colour. However, volunteers also mark the vast majority of bright clumps that experts identify, so although the sample of clumps we identify is ∼50 per cent contaminated with respect to the expert identifications, it is |${\gtrsim}90\ \hbox{per cent}$| complete.
In addition to our empirical evaluation, we have used the statistical model that underpins our framework to compute quantitative metrics for the reliability of the overall aggregated labels that we derive for each image. These metrics suggest that a stable consensus label for most images is achieved after ∼7 volunteers have annotated each one, which is <50 per cent of the 20 annotations that were collected for each image via Galaxy Zoo: Clump Scout and would represent a significant saving in volunteer effort. However, the annotation data are quite noisy, with large variation between the numbers of locations that are marked by different volunteers, and this noise makes it difficult to define a robust ‘early retirement’ criterion that could be used to safely curtail collection of annotations before 20 have been acquired.
We suggest that the noisy annotation data reflect the fact that inexperienced non-experts find the task of identifying clumps difficult, or that the task was not properly explained. In Section 7, we discuss how different approaches to volunteer training could be used to help volunteers better distinguish the visible characteristics of genuine clumps from those of the faint blue features that many ultimately marked. On the other hand, one of the benefits of using citizen science to identify clumps is that it avoids being overly prescriptive regarding the definition of a clump. Galaxy Zoo: Clump Scout represents the first extensive wide-field search for clumpy galaxies in the local Universe and it may be that low-redshift clumps have different properties to their more distant counterparts. Using strict thresholds on brightness or colour might result in an unexpected population of fainter clumps being missed. Moreover, the sample of clumps identified by volunteers appears to be very complete and so, if a subset of bright clumps is required for science analysis, such a sample can be straightforwardly constructed using photometric measurements for each clump (e.g. Adams et al. 2022).
Although our framework was developed to aggregate annotations for a specific citizen science project, its applicability is more general. A large number of projects running on the Zooniverse platform collect two-dimensional image annotations. Many of those projects consider subjects that are more familiar to non-experts and may be less prone to noise. In such cases, our framework may be able to substantially reduce the amount of effort and time taken to reach consensus for each subject.
ACKNOWLEDGEMENTS
HD and SS were partly supported by the ESCAPE project; ESCAPE – The European Science Cluster of Astronomy & Particle Physics ESFRI Research Infrastructures has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement no. 824064. SS also thanks the Science and Technology Facilities Council for financial support under grant ST/P000584/1. MW gratefully acknowledges support from the Alan Turing Institute, grant reference EP/V030302/1. This research is partially supported by the National Science Foundation under grants AST 1716602 and IIS 2006894. This material is based upon work supported by the National Aeronautics and Space Administration (NASA) under Grant No. HST-AR-15792.002-A. This publication uses data generated via the Zooniverse.org platform, development of which is funded by generous support, including a Global Impact Award from Google, and by a grant from the Alfred P. Sloan Foundation.
This research made use of the open-source python scientific computing ecosystem, including NumPy (Harris et al. 2020), Matplotlib (Hunter 2007), and Pandas (McKinney 2010). This research made use of Astropy, a community-developed core Python package for Astronomy (The Astropy Collaboration et al. 2018). This research made use of Numba (Lam, Pitrou & Seibert 2015).
DATA AVAILABILITY
The data underlying this article were used in Adams et al. (2022) and can be obtained as a machine-readable table by downloading the associated article data from https://doi.org/10.3847/1538-4357/ac6512.
Footnotes
A negative response corresponds to selecting the answer ‘Features or disk.’
This minimum size criterion is designed to handle galaxies that have very small, incorrectly measured Petrosian radii.
Even if volunteers are able to submit such nearby marks, our algorithm is designed to only recognize one of them. The choice of which nearby clicks to discard depends on the clicks provided by other volunteers.
Specifying a conjugate prior π(θ) for parameter θ in Bayes’s rule yields a posterior distribution p(θ|z) ∝ π(θ) · p(z|θ) that has the same functional form as the prior itself. Note that in general the conjugate prior depends on both the likelihood model and the parameter of interest. For example, the variance and mean of a Gaussian likelihood function have different conjugate priors.
Although our implementation does not explicitly limit batch sizes in practice, we found that model data storage requirements for batches containing ≳25 thousand classifications exhausted the 32 GB memory capacity of our available hardware.
The chosen algorithm implements approximate computation of the maximum log-likelihood solution and is guaranteed to find a solution for which the log-likelihood is at most 1.61 times the optimal one.
This nomenclature reflects a common application of facility location algorithms to optimize distribution of some essential commodity from facilities located at a small number of locations within a larger network of cities.
Concretely, let |$\mathbf {r}$| be a generalized two-dimensional coordinate and the index m enumerate the corner vertices of a box, beginning in the upper-left and proceeding along the box edges in a clockwise direction, then
Recall that insufficient accuracy implies that the Jaccard distance between the estimated and true clump locations is likely to exceed the value of the hyper-parameter δ.
As described in Section 3.3, volunteer boxes have a side-length equal to twice the FWHM of the subject’s PSF, and may have different absolute pixel dimensions. When computing |$N_{i}^{\mathrm{fn}}$|, we account for this by using normalized image coordinates {x′, y′} ≡ {x/xmax , y/ymax } to define box boundaries when we compute the Jaccard distance between boxes in the global set.
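A minimal sketch of the geometry described here, assuming boxes are axis-aligned and represented as (x_min, y_min, x_max, y_max) tuples in normalized image coordinates. The tuple representation is our assumption for illustration; the text specifies only the side length and the normalization.

```python
def click_to_box(x, y, fwhm):
    """Expand a volunteer click at (x, y) into a square box whose side
    equals twice the subject PSF's FWHM, in normalized coordinates."""
    return (x - fwhm, y - fwhm, x + fwhm, y + fwhm)

def jaccard_distance(a, b):
    """Jaccard distance (1 - intersection over union) between two
    axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 1.0 - inter / union if union > 0.0 else 1.0

# Identical boxes have distance 0; disjoint boxes have distance 1.
box = click_to_box(0.5, 0.5, 0.05)
jaccard_distance(box, box)  # 0.0
```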
Recall (Section 4.2) that we define an annotation |$z_{ij}=\big\lbrace b_{ij}^{k}\big\rbrace _{k=1}^{|B_{ij}|}$| to be the set of box markings provided by a particular volunteer when they inspect a particular subject, so the number of annotations is generally less than the size of the working batch.
Note that the labels for each subject may in principle depend on all annotations in |$\tilde{Z}_{n}$| via those annotations’ influence on the volunteers’ skill parameters.
See Section 3 in Adams et al. (2022) for a detailed explanation of how clump fluxes are computed.
This skewed non-uniform distribution for the number of annotations per volunteer is also seen in many other Zooniverse projects (e.g. Spiers et al. 2019).
Indeed, for this reason Adams et al. (2022) apply a very permissive |$p^{\mathrm{fp}}_{l}$| threshold before filtering further based on observable clump parameters.
The precise number of marks required depends on the skill parameters of the volunteers who provide them.
REFERENCES
APPENDIX A: MODEL PARAMETER PRIORS
In this section, we derive formulae that we use to compute and update the priors for our likelihood model’s volunteer skill and subject difficulty parameters. Crucially for the efficiency of our framework, these formulae can all be factored into terms that depend only on the current working batch and terms that depend only on prior information. This allows us to straightforwardly update the skill parameters of returning volunteers without having to reconsider the annotations they contributed to previous working batches.
A1 Beta priors for pfp and pfn
Our model for volunteer skill assumes that the event in which volunteer j incorrectly provides a false positive clump annotation is Bernoulli |${\rm Bern}\big(p_{j}^{\mathrm{fp}}\big)$|, and similarly the event that a volunteer misses a real clump is |${\rm Bern}\big(p_{j}^{\mathrm{fn}}\big)$|. Note that in general |$p_{j}^{\mathrm{fp}} \ne 1 - p_{j}^{\mathrm{fn}}$|.
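Because the Beta distribution is conjugate to the Bernoulli likelihood, the posterior over |$p_{j}^{\mathrm{fp}}$| (or |$p_{j}^{\mathrm{fn}}$|) is again a Beta whose parameters simply accumulate event counts. A minimal sketch; the hyperparameter values and the flat Beta(1, 1) starting prior are chosen purely for illustration.

```python
def beta_update(alpha, beta, n_events, n_trials):
    """Conjugate update for a Bernoulli rate: a Beta(alpha, beta) prior
    combined with n_events occurrences in n_trials yields a
    Beta(alpha + n_events, beta + n_trials - n_events) posterior."""
    return alpha + n_events, beta + n_trials - n_events

def beta_mean(alpha, beta):
    """Posterior mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# e.g. a volunteer who produced 3 false positives over 40 clump marks,
# starting from a flat Beta(1, 1) prior:
a, b = beta_update(1.0, 1.0, 3, 40)   # Beta(4, 38)
p_fp_hat = beta_mean(a, b)            # ~0.095
```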
A2 Scaled inverse χ2 priors for σ2
We use scaled inverse χ2 priors to compute posterior distributions over the variance parameters of our Gaussian models for dj and di, l.
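For a Gaussian likelihood with known (zero) mean, the scaled inverse χ2 prior is conjugate for the variance, so the posterior parameters accumulate the observed squared offsets. The sketch below assumes that simplified zero-mean setting, with illustrative prior hyperparameters; it is not the paper's full factored update.

```python
import numpy as np

def scaled_inv_chi2_update(nu0, sigma0_sq, offsets):
    """Conjugate update for the variance of a zero-mean Gaussian:
    a Scaled-Inv-chi2(nu0, sigma0_sq) prior combined with observed
    offsets d gives nu_n = nu0 + n and
    sigma_n_sq = (nu0 * sigma0_sq + sum(d**2)) / nu_n."""
    d = np.asarray(offsets, dtype=float)
    nu_n = nu0 + d.size
    sigma_n_sq = (nu0 * sigma0_sq + np.sum(d ** 2)) / nu_n
    return nu_n, sigma_n_sq

# Two observed Jaccard-distance offsets update the variance estimate:
nu_n, s_n = scaled_inv_chi2_update(2, 0.1, [0.1, 0.2])  # nu_n = 4, s_n ~ 0.0625
```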
APPENDIX B: DOMAIN-SPECIFIC TERMS
In this section, we provide short definitions for some of the potentially unfamiliar terms that are used in this paper.
Subject A single image of a galaxy that volunteers are shown via the Zooniverse platform and inspect to search for clumps.
Volunteer A member of the public who participated in the Galaxy Zoo: Clump Scout project by inspecting one or more subjects and using the Zooniverse platform interface to search for and annotate the locations of clumps.
Annotation A set of click locations provided by a single volunteer as they inspect a single subject. The click locations are later expanded into a set of square boxes as explained in Section 4.2.
Label A set of zero or more rectangular bounding boxes, derived by our aggregation framework for a single subject image, that estimates the locations of any clumps it contains.
Skill A compound metric, describing a particular volunteer, that estimates the probability that they will mark a spurious clump, the probability that they will miss a real clump, and the accuracy of the locations they provide for any real clumps they mark.
Difficulty A quantitative metric for the degree to which the properties of a single subject image affect the ability of volunteers to perceive and accurately label any clumps it contains.
Risk A metric that is designed to quantify the reliability and scientific utility of a single subject’s consensus label.
Retire Stop collecting annotations for a subject.
APPENDIX C: TABLE OF SYMBOLS
In this section, we provide a reference table for symbols that recur in multiple sections of this paper.
Table of the most commonly recurring symbols used in this paper. We divide the symbols into categories and provide a brief description of how they should be interpreted. Complete descriptions of each symbol are provided in the main text at the point they are first introduced.
Category | Symbol | Description
---|---|---
Object indices | i | Index over subjects. |
j | Index over volunteers. | |
k | Index over a volunteer j’s clump identifications for a single subject. | |
l | Index over aggregated clump locations. | |
Subjects | S | The global set of subject images. |
Sj | The set of subject images inspected by volunteer j. | |
si | A single subject image in S. | |
|$\mathcal {R}_{i}$| | The risk for subject si. | |
|$N_{i}^{\mathrm{fp}}$| | The expected number of spurious clump locations (false positives) in the label for subject i. | |
|$N_{i}^{\mathrm{fn}}$| | The expected number of missed clumps (false negatives) in the label for subject i. | |
|$N_{i}^{\sigma }$| | The expected number of nominally true positive clump locations in the label for subject i that differ from the (unknown) true clump location by a Jaccard distance greater than 0.5. | |
Subject difficulties | |${\sigma _{i}^{l}}^{2}$| | The variance of a Gaussian model for the Jaccard distance offset between the estimated location of the lth detected clump for subject i and its corresponding (unknown) true location. |
|$\mathcal {D}_{i}$| | The difficulty of subject i, defined as the set of |${\sigma _{i}^{l}}^{2}$| values for all detected clumps in the image. | |
Volunteers | V | The global set of volunteers. |
Vi | The subset of volunteers who inspected subject i. | |
Volunteer skills | |$p_{j}^{\mathrm{fp}}$| | The probability that volunteer j will click on a spurious clump. |
|$p_{j}^{\mathrm{fn}}$| | The probability that volunteer j will miss a real clump. | |
|$\sigma _{j}^{2}$| | The variance of a Gaussian model for the Jaccard distance offset between volunteer j’s true positive click locations and the corresponding (unknown) true clump locations, independent of subject. | |
|$\mathcal {S}_{j}$| | The skill of volunteer j defined as the set |$\lbrace p_{j}^{\mathrm{fp}}, p_{j}^{\mathrm{fn}}, \sigma _{j}^{2}\rbrace$|. | |
Annotations | Z | The global set of volunteer annotations. |
Zi | The set of annotations for a single subject image provided by all the volunteers who inspected it. | |
|$\tilde{Z}_{n}$| | A randomly selected subset of Z containing exactly n annotations per subject. | |
zij | A single annotation provided by volunteer j after inspecting subject i. | |
Bij | The set of boxes, corresponding to click locations provided by volunteer j for subject i. | |
Bi | The set of all boxes, corresponding to click locations provided for subject i by all volunteers who inspected it. | |
|$b_{ij}^{k}$| | A single box, corresponding to the location of a single click provided by volunteer j for subject i. | |
|${\sigma _{ij}^{k}}^{2}$| | The variance of a Gaussian model for the Jaccard distance offset between volunteer j’s kth true positive click location for subject i, and its corresponding (unknown) true clump location. | |
|$a_{ij}^{k}$| | An integer value that maps the kth click in volunteer j’s annotation of subject i to a specific clump in that subject’s estimated label (or to the dummy facility if it is deemed to be a false positive). | |
Labels | Y | The global set of subject labels. |
yi | The unknown true label for subject i. | |
|$b_{i}^{l}$| | A single box comprising part of the unknown true label for subject i. | |
|$\hat{y}_{i}$| | The estimated label for subject i that is computed by our framework. | |
|$\hat{b}_{i}^{l}$| | A single box comprising part of the estimated label for subject i. | |
|$p^{\mathrm{fp}}_{l}$| | The probability that the lth clump in the estimated label for a subject is a false positive. | |
|$p^{\sigma }_{l}$| | The probability that the Jaccard distance between the lth clump in the estimated label and the corresponding (unknown) true clump location exceeds 0.5. |
Table of the most commonly recurring symbols used in this paper. We divide the symbols into categories and provide a brief description of how they should be interpreted. Complete descriptions of each symbol are provided in the main text at the point they are first introduced.
APPENDIX D: COMPARISON WITH THE SCIKIT-LEARN MEANSHIFT CLUSTERING ALGORITHM
We emphasize that the aim of this paper is not to present a novel and very complicated clustering algorithm. Indeed, our focus is the likelihood model that we use to estimate the Galaxy Zoo: Clump Scout volunteers' skills, the difficulty of the subjects that they inspect, and the reliability of the consensus labels that we derive. None the less, we recognize that there are many well-established clustering algorithms in the literature and that some of them may outperform our framework's ability to actually detect clumps, even if they cannot provide the same auxiliary information about the final subject labels. Presenting an exhaustive comparison between our framework and every alternative algorithm is beyond the scope of this paper. However, we have tested several of the methods available from the scikit-learn Python package (Pedregosa et al. 2011). In this section, we present a representative comparison between our framework and the scikit-learn MeanShift clustering algorithm. We set the MeanShift algorithm's bandwidth parameter equal to the size of the SDSS imaging PSF for each subject image and left all other parameters at their default values.
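The MeanShift configuration described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the click coordinates are simulated and the PSF width is an assumed placeholder value, whereas the real comparison uses the volunteer clicks and per-subject SDSS PSF size for each image.

```python
# Minimal sketch of running scikit-learn's MeanShift on volunteer click
# locations, with the bandwidth set to the imaging PSF size and all other
# parameters left at their defaults.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(42)

# Simulated volunteer clicks: two tight groups of (x, y) pixel positions,
# standing in for clicks on two clumps in one subject image.
clicks = np.vstack([
    rng.normal(loc=(50.0, 60.0), scale=1.5, size=(15, 2)),
    rng.normal(loc=(120.0, 80.0), scale=1.5, size=(12, 2)),
])

psf_fwhm_pixels = 3.5  # assumed PSF size in pixels for this illustration

# Bandwidth equal to the PSF size; every other parameter at its default.
ms = MeanShift(bandwidth=psf_fwhm_pixels)
labels = ms.fit_predict(clicks)

print(ms.cluster_centers_)  # one consensus (x, y) location per detected group
```

Each row of `cluster_centers_` plays the role of one detected clump location, which is what the per-subject clump counts compared below are based on.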
Fig. D1 shows the distribution of the difference between the number of clumps detected by our framework and the number detected using the MeanShift algorithm for each subject in the Galaxy Zoo: Clump Scout subject set. For the majority of subjects, our framework detects more clumps than the MeanShift algorithm. In Fig. D2, we show some representative subjects for which our framework detects more clumps than the MeanShift algorithm and in Fig. D3, we show subjects for which the reverse is true. It is not obvious from these figures that either algorithm is particularly biased towards detecting clumps with specific properties. There is some evidence that our algorithm detects fainter potential clumps than the MeanShift algorithm, and it seems less vulnerable to misidentifying objects like stars and background galaxies as clumps. Even when such objects are detected by our framework, they tend to be assigned false positive probabilities greater than 0.8. In some cases, our framework fails to detect clumps that many volunteers identify. We speculate that this is a result of a small number of volunteers with very high $p_{j}^{\mathrm{fp}}$ identifying the clump, which causes our framework to deem other volunteers' clicks as false positives as well.
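The per-subject comparison underlying Fig. D1 amounts to a simple tally. The sketch below uses hypothetical subject identifiers and clump counts purely to illustrate the bookkeeping; the real numbers come from the two detection runs over the full subject set.

```python
# Illustrative sketch: per-subject difference between the number of clumps
# found by our framework and by MeanShift, histogrammed as in Fig. D1.
from collections import Counter

# Hypothetical clump counts keyed by subject id (placeholders, not real data).
framework_counts = {"subj-001": 3, "subj-002": 1, "subj-003": 2}
meanshift_counts = {"subj-001": 2, "subj-002": 1, "subj-003": 1}

# Positive values mean our framework detected more clumps than MeanShift
# for that subject; negative values mean the reverse.
differences = {
    sid: framework_counts[sid] - meanshift_counts[sid]
    for sid in framework_counts
}

# Frequency of each count difference across subjects.
histogram = Counter(differences.values())
print(histogram)  # -> Counter({1: 2, 0: 1})
```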

Distribution of the difference between the number of clumps detected by our framework and the number detected using the MeanShift algorithm for each subject in the Galaxy Zoo: Clump Scout subject set.

Examples of clump-hosting galaxies for which our framework detects more clumps than the scikit-learn MeanShift algorithm. The first column shows galaxy images as they were seen by volunteers. The second column overlays all volunteer annotations on a grey-scale image of the same galaxy. The coloured boxes in the third column show the clump locations that our framework identifies. Dashed boxes indicate clumps to which our framework assigns false positive probabilities $p_{l}^{\mathrm{fp}} > 0.8$. Finally, the red circles in the fourth column show the clumps detected by the MeanShift algorithm.

Examples of clump-hosting galaxies for which our framework detects fewer clumps than the scikit-learn MeanShift algorithm. The images, boxes, and circles shown in the various columns have the same meaning as in Fig. D2.