Blind chemical tagging with DBSCAN: prospects for spectroscopic surveys

Uncertainties used to add normally distributed noise to cluster member abundances around the chosen centre. Conservative from Holtzman et al. (2015), optimistic from Leung & Bovy (2019), and theoretical from Ting et al. (2016).

Element	Conservative	Optimistic	Theoretical
[C/H]	3.5 × 10⁻²	3.3 × 10⁻²	5.0 × 10⁻³
[N/H]	6.7 × 10⁻²	3.1 × 10⁻²	1.0 × 10⁻²
[O/H]	5.0 × 10⁻²	2.0 × 10⁻²	1.0 × 10⁻²
[Na/H]	6.4 × 10⁻²	4.9 × 10⁻²	3.8 × 10⁻²
[Mg/H]	5.3 × 10⁻²	1.7 × 10⁻²	7.6 × 10⁻³
[Al/H]	6.7 × 10⁻²	3.3 × 10⁻²	2.0 × 10⁻²
[Si/H]	7.7 × 10⁻²	1.8 × 10⁻²	8.2 × 10⁻³
[S/H]	6.3 × 10⁻²	4.0 × 10⁻²	2.4 × 10⁻²
[K/H]	6.5 × 10⁻²	2.1 × 10⁻²	4.4 × 10⁻²
[Ca/H]	5.9 × 10⁻²	1.5 × 10⁻²	1.6 × 10⁻²
[Ti/H]	7.2 × 10⁻²	1.9 × 10⁻²	1.8 × 10⁻²
[V/H]	8.8 × 10⁻²	3.9 × 10⁻²	6.0 × 10⁻²
[Mn/H]	6.1 × 10⁻²	2.2 × 10⁻²	1.3 × 10⁻²
[Ni/H]	6.0 × 10⁻²	1.7 × 10⁻²	1.0 × 10⁻²
[Fe/H]	5.3 × 10⁻²	1.5 × 10⁻²	4.3 × 10⁻⁴

Element	Conservative	Optimistic	Theoretical
[C/H]	3.5 × 10⁻²	3.3 × 10⁻²	5.0 × 10⁻³
[N/H]	6.7 × 10⁻²	3.1 × 10⁻²	1.0 × 10⁻²
[O/H]	5.0 × 10⁻²	2.0 × 10⁻²	1.0 × 10⁻²
[Na/H]	6.4 × 10⁻²	4.9 × 10⁻²	3.8 × 10⁻²
[Mg/H]	5.3 × 10⁻²	1.7 × 10⁻²	7.6 × 10⁻³
[Al/H]	6.7 × 10⁻²	3.3 × 10⁻²	2.0 × 10⁻²
[Si/H]	7.7 × 10⁻²	1.8 × 10⁻²	8.2 × 10⁻³
[S/H]	6.3 × 10⁻²	4.0 × 10⁻²	2.4 × 10⁻²
[K/H]	6.5 × 10⁻²	2.1 × 10⁻²	4.4 × 10⁻²
[Ca/H]	5.9 × 10⁻²	1.5 × 10⁻²	1.6 × 10⁻²
[Ti/H]	7.2 × 10⁻²	1.9 × 10⁻²	1.8 × 10⁻²
[V/H]	8.8 × 10⁻²	3.9 × 10⁻²	6.0 × 10⁻²
[Mn/H]	6.1 × 10⁻²	2.2 × 10⁻²	1.3 × 10⁻²
[Ni/H]	6.0 × 10⁻²	1.7 × 10⁻²	1.0 × 10⁻²
[Fe/H]	5.3 × 10⁻²	1.5 × 10⁻²	4.3 × 10⁻⁴

Table 1.

Uncertainties used to add normally distributed noise to cluster member abundances around the chosen centre. Conservative from Holtzman et al. (2015), optimistic from Leung & Bovy (2019), and theoretical from Ting et al. (2016).

Element	Conservative	Optimistic	Theoretical
[C/H]	3.5 × 10⁻²	3.3 × 10⁻²	5.0 × 10⁻³
[N/H]	6.7 × 10⁻²	3.1 × 10⁻²	1.0 × 10⁻²
[O/H]	5.0 × 10⁻²	2.0 × 10⁻²	1.0 × 10⁻²
[Na/H]	6.4 × 10⁻²	4.9 × 10⁻²	3.8 × 10⁻²
[Mg/H]	5.3 × 10⁻²	1.7 × 10⁻²	7.6 × 10⁻³
[Al/H]	6.7 × 10⁻²	3.3 × 10⁻²	2.0 × 10⁻²
[Si/H]	7.7 × 10⁻²	1.8 × 10⁻²	8.2 × 10⁻³
[S/H]	6.3 × 10⁻²	4.0 × 10⁻²	2.4 × 10⁻²
[K/H]	6.5 × 10⁻²	2.1 × 10⁻²	4.4 × 10⁻²
[Ca/H]	5.9 × 10⁻²	1.5 × 10⁻²	1.6 × 10⁻²
[Ti/H]	7.2 × 10⁻²	1.9 × 10⁻²	1.8 × 10⁻²
[V/H]	8.8 × 10⁻²	3.9 × 10⁻²	6.0 × 10⁻²
[Mn/H]	6.1 × 10⁻²	2.2 × 10⁻²	1.3 × 10⁻²
[Ni/H]	6.0 × 10⁻²	1.7 × 10⁻²	1.0 × 10⁻²
[Fe/H]	5.3 × 10⁻²	1.5 × 10⁻²	4.3 × 10⁻⁴

Element	Conservative	Optimistic	Theoretical
[C/H]	3.5 × 10⁻²	3.3 × 10⁻²	5.0 × 10⁻³
[N/H]	6.7 × 10⁻²	3.1 × 10⁻²	1.0 × 10⁻²
[O/H]	5.0 × 10⁻²	2.0 × 10⁻²	1.0 × 10⁻²
[Na/H]	6.4 × 10⁻²	4.9 × 10⁻²	3.8 × 10⁻²
[Mg/H]	5.3 × 10⁻²	1.7 × 10⁻²	7.6 × 10⁻³
[Al/H]	6.7 × 10⁻²	3.3 × 10⁻²	2.0 × 10⁻²
[Si/H]	7.7 × 10⁻²	1.8 × 10⁻²	8.2 × 10⁻³
[S/H]	6.3 × 10⁻²	4.0 × 10⁻²	2.4 × 10⁻²
[K/H]	6.5 × 10⁻²	2.1 × 10⁻²	4.4 × 10⁻²
[Ca/H]	5.9 × 10⁻²	1.5 × 10⁻²	1.6 × 10⁻²
[Ti/H]	7.2 × 10⁻²	1.9 × 10⁻²	1.8 × 10⁻²
[V/H]	8.8 × 10⁻²	3.9 × 10⁻²	6.0 × 10⁻²
[Mn/H]	6.1 × 10⁻²	2.2 × 10⁻²	1.3 × 10⁻²
[Ni/H]	6.0 × 10⁻²	1.7 × 10⁻²	1.0 × 10⁻²
[Fe/H]	5.3 × 10⁻²	1.5 × 10⁻²	4.3 × 10⁻⁴

Considered together, the three possible choices of abundances for each star constitute three abundance spaces on which we perform chemical tagging. However, we do not use the member abundances to generate a spectrum for each star, as the relationship between uncertainty in stellar abundance and uncertainty in flux measurement is somewhat obscure. Our process for generating member spectra is described in the following section.

2.2.3 Synthetic spectra

Separate from the assignment of abundances, we create spectra for each of the member stars in a cluster, assuming that all members truly share the abundances of their cluster centre. In this case, spread in the abundance measurements within a cluster is a combination of uncertainty in the spectra and uncertainty in the method with which abundances are derived. Assuming that each member has the abundances of the cluster centre, we select a T_eff and log g for each member star by assigning it the properties of a random star from the APOGEE sample described in Section 2.2 above. This method allows us to preserve the relationship between T_eff and log g observed in real stars. Stellar spectra are created using the polynomial spectral model developed in Rix et al. (2016), as this allows us to quickly generate large stellar samples. The model produces spectra that emulate the spectra produced using spectral synthesis with model atmospheres. We add normally distributed noise to the spectra, at the order of one-hundredth of the signal. This noise represents the observational uncertainties in each spectrum.

Since these synthetic spectra have non chemical photospheric information (T_eff and log g), they cannot be used in raw form for chemical tagging. To remove the influence of the photospheric properties on the spectra, we employ the same polynomial fitting technique used in Price-Jones & Bovy (2018). In this approach, we correct for T_eff and log g at each wavelength independently. Once a wavelength is selected, we use the T_eff and log g assigned when we created our sample in a quadratic fit to the flux at that wavelength across all stars in the sample. We then subtract that fit from the flux and repeat with the next wavelength until the flux values at all wavelengths are corrected. The residuals of the fits constitute a corrected set of spectra where photospheric effects have been removed, ideally leaving only variations due to the differences in stellar chemistry and the noise we inserted. We refer to the fit residuals as the unmodified spectra henceforth, since they receive no further processing.

2.2.4 Projecting the spectra

The spectra are high-dimensional, spanning over 7000 pixels, and many of these pixels are dominated by continuum flux and observational noise instead of chemical information. To emphasize the contribution of pixels with the greatest variation between spectra, we create a new chemical space by projecting the spectra into fewer dimensions before implementing a cluster finding algorithm. We do this using PCA (Jolliffe 2002, Ivezić et al. 2014). PCA decomposes the spectra into a linear combination of principal components (PCs) by solving for the eigenvectors of the covariance matrix of the data.

We choose to use 30 PCs to reduce the dimensionality of the spectra. Fig. 1 provides the rationale for this choice. A full PCA reduction will solve for PCs up to the minimum of the number of data dimensions and the number of observables – with that many PCs, the data are perfectly reproducible. We computed the full suite of PCs for a sample of 30 000 stars, as well as their associated explained variance ratio (EVR), given by

\begin{eqnarray*} \mathrm{EVR}(i) = \frac{\sigma ^2_{\rm model}(i)}{\sigma ^2_{\rm data}} \end{eqnarray*}

(3)

where |$\sigma ^2_{\rm model}(i)$| is the variance in the model for the data constructed of the i’th PC and |$\sigma ^2_{\rm data}$| is the variance in the original data. The PCs are ordered by the amount of variance they account for, with low i PCs accounting for most of the variance, and so this quantity decreases with increasing i.

Figure 1.

Top panel: cumulative EVR as a function of the number of PCs used to model a sample of 30 000 spectra. Bottom panel: gradient of the cumulative EVR as a function of the number of PCs. Beyond 30 PCs, adding additional PCs explains noise rather than variance in the spectra, as can be see by the abrupt levelling off of the gradient at that point.

The cumulative distribution of the EVR and its gradient are shown in the top and bottom panels of Fig. 1, respectively. At the maximum possible number of PCs, the cumulative EVR goes to 1, indicating that, as expected, all variance can be explained with this many PCs. However, it is not necessary to use the full possible suite of PCs to get a sufficiently accurate model of the data, especially since we construct the data to include both noise that we have no desire to model and continuum pixels with no chemical information. In Fig. 1, there is a plateau in both the cumulative EVR and its gradient at 30 PCs. Beyond this point, significant improvements in the variance explained require an order of magnitude increase in the number of PCs. Additional PCs beyond the first 30 are responsible for adding variations to the model that will only emulate noise or continuum variations in our synthetic spectra. Therefore, we choose to use 30 PCs to reduce the spectra into chemistry-relevant directions, reducing the influence of the noise and continuum emission pixels in this new chemical space.

We refer to the case where spectra are projected along their first 30 PCs as principal components. This projection is a data-driven approach to the challenge of reducing the importance of noise when clustering stars according to their spectra.

3 IDENTIFYING CLUSTERS

The goal of this study is to identify a clustering algorithm that can find groups of stars in high-dimensional chemical space without needing information about how many groups are expected. This makes DBSCAN (Ester et al. 1996) a logical choice of clustering algorithm, as it relies only on the density of stars in chemical space to locate groups, without requiring any prior information on the number of groups present.

In general, we describe a star s as having a position in chemical space x_s, where x_s may be an list of abundances (for an abundance-based chemical space) or a list of fluxes (for spectrum-based chemical spaces). Stars are indexed from 1 to N_survey, and the function used to compute pairwise distances between any two stars is denoted with |$\mathcal {D}$|⁠. While any distance metric can be chosen, we use a Euclidean metric. We summarize these and other relevant symbols in Table 2 in the order in which they first appear in this paper.

Table 2.

Symbols used in our methods and their meaning, presented in the order in which they first appear in this paper.

Symbol	Definition	Section
		and equation
V	Volume of physical space surveyed	Section 2.1
N_survey	Number of stars in the chemical space	Section 2.1
α	Index of the cluster mass function	Section 2.1 – (2)
\|$\boldsymbol {x}_\mathrm{ s}$\|	Position of star s in chemical space	Section 3
\|$\mathcal {D}$\|	Function that computes pairwise distances	Section 3
	between stars in chemical space
ϵ	Radius of a star’s ‘ϵ-neighbourhood’ in	Section 3.1
	chemical space
N_pts	Minimum number of stars needed in a	Section 3.1
	star’s ϵ-neighbourhood for it to be
	considered a ‘core star’
\|$\tilde{\epsilon }$\|	The normalized ϵ factor that	Section 3.1 – (4)
	converts the median pairwise distance
	between stars into ϵ
g_i	The i’th group identified by DBSCAN	Section 3.2
o_j	The j’th original cluster in the initial data set	Section 3.2
H_i	Homogeneity of g_i when compared to the	Section 3.2 – (5)
	original cluster that contributed the majority
	of its members
C_i	Completeness of g_i when compared to the	Section 3.2– (6)
	original cluster that contributed the majority
	of its members
RF	Fraction of clusters recovered with threshold	Section 3.2.1 – (7)
	for minimum homogeneity and completeness
S_i	Silhouette coefficient for group or cluster i	Section 3.3 – (10)

Symbol	Definition	Section
		and equation
V	Volume of physical space surveyed	Section 2.1
N_survey	Number of stars in the chemical space	Section 2.1
α	Index of the cluster mass function	Section 2.1 – (2)
\|$\boldsymbol {x}_\mathrm{ s}$\|	Position of star s in chemical space	Section 3
\|$\mathcal {D}$\|	Function that computes pairwise distances	Section 3
	between stars in chemical space
ϵ	Radius of a star’s ‘ϵ-neighbourhood’ in	Section 3.1
	chemical space
N_pts	Minimum number of stars needed in a	Section 3.1
	star’s ϵ-neighbourhood for it to be
	considered a ‘core star’
\|$\tilde{\epsilon }$\|	The normalized ϵ factor that	Section 3.1 – (4)
	converts the median pairwise distance
	between stars into ϵ
g_i	The i’th group identified by DBSCAN	Section 3.2
o_j	The j’th original cluster in the initial data set	Section 3.2
H_i	Homogeneity of g_i when compared to the	Section 3.2 – (5)
	original cluster that contributed the majority
	of its members
C_i	Completeness of g_i when compared to the	Section 3.2– (6)
	original cluster that contributed the majority
	of its members
RF	Fraction of clusters recovered with threshold	Section 3.2.1 – (7)
	for minimum homogeneity and completeness
S_i	Silhouette coefficient for group or cluster i	Section 3.3 – (10)

Table 2.

Symbols used in our methods and their meaning, presented in the order in which they first appear in this paper.

Symbol	Definition	Section
		and equation
V	Volume of physical space surveyed	Section 2.1
N_survey	Number of stars in the chemical space	Section 2.1
α	Index of the cluster mass function	Section 2.1 – (2)
\|$\boldsymbol {x}_\mathrm{ s}$\|	Position of star s in chemical space	Section 3
\|$\mathcal {D}$\|	Function that computes pairwise distances	Section 3
	between stars in chemical space
ϵ	Radius of a star’s ‘ϵ-neighbourhood’ in	Section 3.1
	chemical space
N_pts	Minimum number of stars needed in a	Section 3.1
	star’s ϵ-neighbourhood for it to be
	considered a ‘core star’
\|$\tilde{\epsilon }$\|	The normalized ϵ factor that	Section 3.1 – (4)
	converts the median pairwise distance
	between stars into ϵ
g_i	The i’th group identified by DBSCAN	Section 3.2
o_j	The j’th original cluster in the initial data set	Section 3.2
H_i	Homogeneity of g_i when compared to the	Section 3.2 – (5)
	original cluster that contributed the majority
	of its members
C_i	Completeness of g_i when compared to the	Section 3.2– (6)
	original cluster that contributed the majority
	of its members
RF	Fraction of clusters recovered with threshold	Section 3.2.1 – (7)
	for minimum homogeneity and completeness
S_i	Silhouette coefficient for group or cluster i	Section 3.3 – (10)

Symbol	Definition	Section
		and equation
V	Volume of physical space surveyed	Section 2.1
N_survey	Number of stars in the chemical space	Section 2.1
α	Index of the cluster mass function	Section 2.1 – (2)
\|$\boldsymbol {x}_\mathrm{ s}$\|	Position of star s in chemical space	Section 3
\|$\mathcal {D}$\|	Function that computes pairwise distances	Section 3
	between stars in chemical space
ϵ	Radius of a star’s ‘ϵ-neighbourhood’ in	Section 3.1
	chemical space
N_pts	Minimum number of stars needed in a	Section 3.1
	star’s ϵ-neighbourhood for it to be
	considered a ‘core star’
\|$\tilde{\epsilon }$\|	The normalized ϵ factor that	Section 3.1 – (4)
	converts the median pairwise distance
	between stars into ϵ
g_i	The i’th group identified by DBSCAN	Section 3.2
o_j	The j’th original cluster in the initial data set	Section 3.2
H_i	Homogeneity of g_i when compared to the	Section 3.2 – (5)
	original cluster that contributed the majority
	of its members
C_i	Completeness of g_i when compared to the	Section 3.2– (6)
	original cluster that contributed the majority
	of its members
RF	Fraction of clusters recovered with threshold	Section 3.2.1 – (7)
	for minimum homogeneity and completeness
S_i	Silhouette coefficient for group or cluster i	Section 3.3 – (10)

To differentiate them from our synthetic clusters, we henceforth refer to the stars classified as belonging together by DBSCAN as ‘groups’, and save the designation ‘clusters’ to refer to the known membership of the stars.

3.1 DBSCAN

DBSCAN classifies stars as ‘core stars’, ‘border stars’, and ‘noise stars’, where the former two types are members of groups but the latter is a label for outlying stars. Each star’s type depends on the input choice of ϵ and N_pts, the quantities that parametrize DBSCAN. The region of chemical space that is within distance ϵ of a star is that star’s ‘ϵ-neighbourhood’. If a star has N_pts stars within its ϵ-neighbourhood, it is considered a core star. Any point that is within a core star’s ϵ-neighbourhood but is not itself a core star, is labelled a border star. Any star that is not in the ϵ-neighbourhood of any core star is a noise star.

The DBSCAN algorithm begins by assuming all stars are noise stars. The algorithm steps through each star, and for a given choice of ϵ and N_pts, checks whether it is a core star. Once the core stars are identified, the algorithm organizes the stars into groups, by considering each star sequentially. If the star is a core star, and is in the ϵ-neighbourhood of another core star with a group label, that star is given the same group label. If it is not, the core star is given a new group label. If the star is not a core star, the algorithm determines whether it is a border star (in which case it is assigned to the group its neighbouring core point belongs to) or a noise star. The method is summarized below as algorithm 1.

Algorithm 1:

Summary of DBSCAN as applied to stellar chemical space.

Note that because DBSCAN considers the points in the order they are given, changing the order of the data set can lead to different group assignments. For core points, only the label of the cluster will change, while membership remains consistent. However border points may be assigned to different clusters depending on data order.

In this work, we use the DBSCAN algorithm implemented in the scikit-learn python package (Pedregosa et al. 2011). We also define ϵ in terms of what we call the normalized ϵ, |$\tilde{\epsilon }$|⁠:

\begin{eqnarray*} \epsilon = \tilde{\epsilon } \cdot \mathrm{median}\left(\overset{N_{\rm survey}}{\underset{s,t}{\mathcal {D}}}(\boldsymbol {x}_\mathrm{ s},\boldsymbol {x}_\mathrm{ t})\right). \end{eqnarray*}

(4)

where |$\tilde{\epsilon }$| is a factor between 0 and 1 that scales the median of all pairwise distances between stars in chemical space.

3.2 External cluster validation

When clusters are known, we assess DBSCAN’s performance with homogeneity and completeness scores for each group g_i it finds when compared to each original cluster o_j. Homogeneity for group g_i is defined as

\begin{eqnarray*} H_\mathrm{ i}^j = \frac{\mathrm{number\, of\, stars\, in\, }g_\mathrm{ i}\mathrm{\, from\, }o_\mathrm{ j}}{\mathrm{number\, of\, stars\, in\, }g_\mathrm{ i}} \end{eqnarray*}

(5)

If group g_i contains only members of a single original cluster o_j, then |$H_\mathrm{ i}^j$| gives a perfect homogeneity score of 1.

This score alone is not enough; if DBSCAN splits large clusters into small groups, each might have a corresponding o_j that gives a perfect homogeneity score, but the groups would poorly represent the input clusters. In addition to homogeneity scores, we also compute completeness scores for each group, defined as

\begin{eqnarray*} C_\mathrm{ i}^j = \frac{\mathrm{number\, of\, stars\, in\, }g_\mathrm{ i}\mathrm{\, from\, }o_\mathrm{ j}}{\mathrm{number\, of\, stars \, in\, }o_\mathrm{ j}} \end{eqnarray*}

(6)

If all members of o_j are assigned to the same g_i, then |$C_\mathrm{ i}^j$| gives a perfect completeness score of 1.

For each group g_i, we assign a single homogeneity H_i and completeness C_i score by choosing the compared original cluster to be the one that contributes the most stars to g_i. We call this original cluster the ‘matched cluster’. This method permits the same original cluster to be matched to multiple groups, and has the effect of ensuring the homogeneity score for each group takes on a maximal value. However, as the number of groups matched with the same original cluster increases, the completeness score for each group will decrease.

3.2.1 Recovery fraction

Unlike the metrics discussed above, the recovery fraction for a given iteration of DBSCAN is not computed on a per cluster basis, but is a way to assess the overall performance of the algorithm across all input data for a given parametrization. While homogeneity and completeness provide good overall estimates of the ability of DBSCAN to reproduce the initial clustering, we need a combined score to ensure that individual clusters are identified properly. To compute this combined score, we first randomly choose a set of the original clusters that meet a size threshold O = {o_j}. We typically choose 10 clusters that meet a size threshold of 15 members. These numbers approximate the number and membership of open clusters we might use to validate clustering in observed data (e.g. Frinchaboy et al. 2013). We then find the set of DBSCAN identified groups that these clusters were matched to, {g_i}.

We are only interested in the fraction of clusters recovered successfully. To assess the quality of recovery, we place lower limits on the homogeneity and completeness scores for the groups. A cluster is considered to be ‘recovered’ if the group it is matched to exceeds our homogeneity and completeness thresholds, and so we form a subset of {g_i} called R that consists only of ‘recovered’ clusters. Therefore, our recovery fraction is

\begin{eqnarray*} \mathrm{RF} = \frac{\mathrm{number\, of\, groups\, in\, }R}{\mathrm{number\, of\, clusters\, in\, }O}, \end{eqnarray*}

(7)

and in this work we typically set our homogeneity and completeness thresholds to be 0.7. To get robust values for RF, we make many random selections for O, find the matching R, and compute the corresponding RF, taking the median of these as our overall recovery fraction.

Increasing the homogeneity and completeness thresholds results in only considering groups with very high fidelity to the original cluster as true recovery, and decreases RF. By making appropriate threshold sizes, we can test how well clusters of a given size might be found in observed data.

3.3 Internal cluster validation

In addition to these external methods, we need a validation criterion that can evaluate the performance of DBSCAN even when true clusters are unknown, because this is the situation we face when using real data. Criteria of this kind often rely on comparing the distances between groups members to the distances between groups. One well-known choice for this is the Silhouette coefficient (Rousseeuw 1987). This metric compares the pairwise distances between group members (internal distance) to the pairwise distances between group and nearby non-group members (external distance).

The stars in our sample have positions |$\lbrace \boldsymbol {x}_\mathrm{ s}\rbrace$| in spectral or chemical space. We compute the internal distance as the typical distance between members of the same group. For the i’th group g_i, this internal distance is

\begin{eqnarray*} d_{\rm int}^i = \mathrm{median}\left(\underset{s,t\in g_i,s\ne t}{\mathcal {D}}\left({\boldsymbol x}_\mathrm{ s},{\boldsymbol x}_\mathrm{ t}\right)\right) \end{eqnarray*}

(8)

where |$\mathcal {D}$| is the operator that calculates the distance between each pair of points and s and t index the members of g_i.

To calculate the overall external distance for group g_i, we find the set of distances between the members of g_i and members of nearby group g_j, repeating for other nearby g_j. For our computation, of |$d_{\mathrm{ext}}^i$|⁠, we use as the g_j’s the 20 groups with median positions closest to the median position of g_i. The median of these distances is the external distance.

\begin{eqnarray*} d_{\rm ext}^{i} = \mathrm{median}\left(\underset{s\in g_\mathrm{ i}, t\in g_\mathrm{ j}}{\mathcal {D}}\left({\boldsymbol x}_\mathrm{ s},{\boldsymbol x}_\mathrm{ t}\right)\right)_{\mathrm{nearby}\, g_\mathrm{ j}} \end{eqnarray*}

(9)

We define the Silhouette coefficient for the i’th group as

\begin{eqnarray*} S_\mathrm{ i} = \frac{d^{i}_{\rm ext} - d^{i}_{\rm int}}{\max \left(d^{i}_{\rm ext},d^{i}_{\rm int}\right)}. \end{eqnarray*}

(10)

For a very distinct group, |$d_{\rm int}^{i}$| will be much less than |$d_{\rm ext}^{i}$|⁠, and the Silhouette coefficient reaches a maximum value of 1. We show the typical distribution of Silhouette coefficients for our input data in Fig. 2. As group size increases, so too does the chance that groups will overlap, leading to a decrease in Silhouette coefficient with group size. In the following section, we investigate the success with which DBSCAN can identify these groups.

Figure 2.

The distribution of Silhouette coefficients S_i (equation 10) for our input clusters in PC space as a function of the number of stars in the input clusters. Three open clusters identified in by OCCAM (Donor et al. 2018) are marked with coloured symbols for comparison. At low sizes, S_i can take on higher values, since it is easier for a small cluster with normally distributed members to be compact relative to its background (note that all single-membered clusters have S_i of exactly one). The open clusters are typically more distinct from their surroundings than our simulated clusters of their size.

Note that although we have defined the Silhouette coefficient for a group, it can just as easily be computed for clusters in the initial dataset.

In Table 2, we summarize the symbols and quantities defined in the preceding sections. In the following section, we describe the results of applying DBSCAN to our synthetic data, using our metrics to assess the success of the algorithm.

4 RESULTS

Throughout this section, we explore a variety of chemical spaces (described in Section 2.2), which we list below for reference.

Spectra:
- Unmodified – synthetic spectra after the subtraction of a fit to each pixel in T_eff, log g and [Fe/H].
- Principal components – spectra projected on to the first 30 PCs derived from PCA.
Abundances (from Table 1):
- Conservative – from Holtzman et al. (2015).
- Optimistic – from Leung & Bovy (2019).
- Theoretical – from Ting et al. (2016).

We begin by explaining our parameter choices to create an APOGEE-like survey, followed by a description of our parameter choice for DBSCAN. We summarize the results of our assessment statistics before describing the recovery fraction in more detail.

4.1 Creating an APOGEE-like simulation

We have outlined our general approach to creating synthetic data in Section 2.2, and we describe here the specific choices we make to generate APOGEE-like synthetic data. The APOGEE survey releases all data publicly, and so we mimic the survey’s derived abundances and stellar spectra. Note however that the intent of our simulation is not to exactly reproduce the APOGEE survey, but to use the same abundances and spectral window.

Our simplified approach does not account for the APOGEE selection function (see Section 5.1 for more discussion of this), and we assume that stars observed by the survey are drawn from birth clusters that form in an annulus that contains the Sun and is centred on the Galactic Centre. This choice of volume allows our study to determine the capabilities of chemical tagging in the Milky Way disc. The annulus from which stars are drawn has a total height of 1 kpc, and we vary its width to observe the effects on clustering success, but use a 6 kpc width as a fiducial choice. For each choice of annulus width, we observe a sample of 5 × 10⁴ stars, or roughly 3 × 10⁴M_⊙. Our simplifying assumptions on Milky Way density imply that there is 1.5 × 10¹⁰M_⊙ of stellar mass in our fiducial survey volume, so in that case our sampling rate is 2 × 10⁻⁶.

Using this sampling rate and assuming the CMF (cumulative distribution function) has a power-law index of α = 2.1 (see equation 2), we apply the process described in Section 2.1. We determine the number of members observed from each cluster in the volume until we have enough clusters to reach our target number of stars. For our fiducial sample of 5 × 10⁴ stars, this results in roughly 1.5 × 10⁴ clusters.

4.2 Choosing DBSCAN parameters

As described in Section 3.1, DBSCAN relies on two parameters to define a high-density region: the radius of the region ϵ, and the required number of points within the region N_pts. Different choices of these parameters result in different groupings of stars. We choose ϵ as the product of the median pairwise distance between stars and the normalized ϵ (⁠|$\tilde{\epsilon }$| – equation 4). We vary |$\tilde{\epsilon }$| and N_pts and examine typical clustering success. For all of the data types listed at the beginning of this section, we find that evaluation metrics change in continuous ways with both parameters.

An example of how the number of clusters found by DBSCAN and the recovery fraction vary with |$\tilde{\epsilon }$| (and thus vary with ϵ), as well as with N_pts is shown in Fig. 3. There is obviously a preferred value for |$\tilde{\epsilon }$| that maximizes RF, typically between 0.1–0.2. In addition, the choice of |$\tilde{\epsilon }$| that maximizes RF is valid for a range of choices in N_pts. The drop off in both the number of groups found and the recovery fraction with increasing N_pts begins where N_pts is approximately the 95th percentile of cluster sizes. This is quite sensible: when the vast majority of input clusters are smaller than the chosen N_pts, DBSCAN will be unable to identify them and this will consequently lower the fraction of clusters that can be recovered.

$Example contour plots showing the number of groups found (top panel) and the recovery fraction (bottom panel – see Section 3.2.1), as a function of the normalized ϵ ($\tilde{\epsilon }$) and the number of points in a neighbourhood (Npts) for our fiducial mock data sample. Note that by definition, Npts is only allowed integer values, so values between points are interpolated with Delaunay triangulation (Okabe, Boots & Sugihara 1992). While a narrow region in $\tilde{\epsilon }$ is preferred to maximize recovery fraction, a range of Npts are permitted. However, a lower Npts maximizes the recovery fraction. This example is the result of using DBSCAN on theoretical abundance space, which has a low level of chemical space spread within clusters.$

Figure 3.

Example contour plots showing the number of groups found (top panel) and the recovery fraction (bottom panel – see Section 3.2.1), as a function of the normalized ϵ (⁠|$\tilde{\epsilon }$|⁠) and the number of points in a neighbourhood (N_pts) for our fiducial mock data sample. Note that by definition, N_pts is only allowed integer values, so values between points are interpolated with Delaunay triangulation (Okabe, Boots & Sugihara 1992). While a narrow region in |$\tilde{\epsilon }$| is preferred to maximize recovery fraction, a range of N_pts are permitted. However, a lower N_pts maximizes the recovery fraction. This example is the result of using DBSCAN on theoretical abundance space, which has a low level of chemical space spread within clusters.

Fig. 3 shows results for our theoretical abundance space, but we see similar results for all the chemical spaces we consider. In particular, we find that decreasing N_pts increases the total number of groups found, which is expected given the nature of DBSCAN. However, it is also at these lower choices for N_pts where recovery fraction is maximized. In light of this, we keep our range in |$\tilde{\epsilon }$| for subsequent runs, but fix N_pts at 3.

4.3 Summary statistics

In Section 3.2, we defined homogeneity and completeness as ways to validate the groups found by DBSCAN. To summarize the performance of DBSCAN across several parameter choices, we compute the quartiles of those quantities across all groups with more than 15 members identified by DBSCAN. Figs 4 (the chemical spaces created with spectra) and 5 (the chemical spaces created with abundances) show the results as a function of normalized ϵ. Distances in all chemical spaces are computed with a Euclidean metric, and we do not apply normalization to any dimension. All dimensions in any given chemical space are measured in the same units, so variation along each dimension is of roughly the same order. However, our decision not to normalize the span of each dimension preserves the fact that some elements may be more informative than others when distinguishing between clusters.

$Top panel: the number of groups with more than 15 members found for the different spectra-based chemical spaces when using different values of $\tilde{\epsilon }$ in DBSCAN, where $\tilde{\epsilon }$ is the DBSCAN parameter that defines the neighbourhood of each star. Middle panel: the median homogeneity (equation 5) of groups in the top panel. Bottom panel: the median completeness (equation 6) of groups in the top panel. There is a particular value of $\tilde{\epsilon }$ that maximizes the number of clusters recovered, which corresponds to high median homogeneity and completeness.$

Figure 4.

Top panel: the number of groups with more than 15 members found for the different spectra-based chemical spaces when using different values of |$\tilde{\epsilon }$| in DBSCAN, where |$\tilde{\epsilon }$| is the DBSCAN parameter that defines the neighbourhood of each star. Middle panel: the median homogeneity (equation 5) of groups in the top panel. Bottom panel: the median completeness (equation 6) of groups in the top panel. There is a particular value of |$\tilde{\epsilon }$| that maximizes the number of clusters recovered, which corresponds to high median homogeneity and completeness.

$Like Fig. 4, but for chemical spaces based in abundances. As in Fig. 4, there is a preferred choice of $\tilde{\epsilon }$ that maximizes the number of clusters recovered, and this $\tilde{\epsilon }$ corresponds to high median homogeneity and completeness. Also as in Fig. 4, the median homogeneity and completeness are roughly inversely proportional.$

Figure 5.

Like Fig. 4, but for chemical spaces based in abundances. As in Fig. 4, there is a preferred choice of |$\tilde{\epsilon }$| that maximizes the number of clusters recovered, and this |$\tilde{\epsilon }$| corresponds to high median homogeneity and completeness. Also as in Fig. 4, the median homogeneity and completeness are roughly inversely proportional.

The top panel in each figure shows the total number of groups with more than 15 members found by DBSCAN for a given choice of |$\tilde{\epsilon }$| as defined in equation (4). The middle and bottom panels show the median homogeneity and completeness, respectively, while error bars mark the positions of first and third quartiles. Points for a particular data type are missing when that choice of |$\tilde{\epsilon }$| did not return more than five groups.

We see that there is a preferred choice of |$\tilde{\epsilon }$| for each type of data considered, with a clear relation between |$\tilde{\epsilon }$| and the number of clusters recovered. Each data type also demonstrates the inverse relation between the median homogeneity and completeness scores; as |$\tilde{\epsilon }$| increases, homogeneity decreases and completeness increases. This is because expanding the region considered when determining core points by increasing |$\tilde{\epsilon }$| makes it more likely that a lower density cluster will be correctly identified (improving completeness), while simultaneously allowing more non-member interlopers (reducing homogeneity).

4.4 Maximizing recovery fraction

We compute a recovery fraction for each choice of |$\tilde{\epsilon }$|⁠, shown in Fig. 6. We randomly select 10 clusters with more than 15 members from the input data, as this is the order of the number of known open clusters we might expect in a real data set. We compute the recovery fraction (equation 7), imposing homogeneity and completeness thresholds of 0.7. We repeat the random selection 100 times and use the first and third quartiles in the resulting distribution of recovery fractions to produce an estimate of the scatter around the median resulting from using only 10 clusters. Comparing the left-hand panel with Figs 4 and 5 shows that while median homogeneity and completeness scores may be high for a given value of |$\tilde{\epsilon }$|⁠, this does not necessarily imply a high recovery fraction. However, the peak in the recovery fraction with |$\tilde{\epsilon }$| does correspond to the peak in the total number of clusters recovered, as shown in the right-hand panel of Fig. 6.

$Recovery fraction (equation 7) as a function of normalized ϵ ($\tilde{\epsilon }$ – left-hand panel) and the number of groups identified by DBSCAN (right-hand panel). We show only results for chemical spaces that achieved non-zero recovery fraction: the PC projection of the spectra and all three of the abundance chemical spaces. As with the number of clusters recovered in Figs 4 and 5, there is a preferred value of $\tilde{\epsilon }$ that maximizes the recovery fraction. The right-hand panel demonstrates the clear relationship between the number of clusters recovered and the recovery fraction, and so applications to real data need only choose the $\tilde{\epsilon }$ that returns the most clusters.$

Figure 6.

Recovery fraction (equation 7) as a function of normalized ϵ (⁠|$\tilde{\epsilon }$| – left-hand panel) and the number of groups identified by DBSCAN (right-hand panel). We show only results for chemical spaces that achieved non-zero recovery fraction: the PC projection of the spectra and all three of the abundance chemical spaces. As with the number of clusters recovered in Figs 4 and 5, there is a preferred value of |$\tilde{\epsilon }$| that maximizes the recovery fraction. The right-hand panel demonstrates the clear relationship between the number of clusters recovered and the recovery fraction, and so applications to real data need only choose the |$\tilde{\epsilon }$| that returns the most clusters.

To confirm that our statistics are not sensitive to the specifics of a particular simulation, we create dozens of simulated cluster samples with the same properties and run DBSCAN on each data type with the parameter choice that maximizes the recovery fraction for that type. The median homogeneity and completeness of the resulting groups are shown in Fig. 7 and are relatively consistent across the 61 runs used. The different data types are split in the right-hand plot of median completeness, clearly demonstrating that using DBSCAN on the abundance space with the largest intracluster spread has, as expected, the poorest performance. However, choosing the parameter set that maximizes recovery fraction for a given data type consistently provides excellent homogeneity and completeness scores.

$Median homogeneity and completeness of groups found by DBSCAN with fixed Npts = 3 and the best $\tilde{\epsilon }$ choice (maximized recovery fraction) for each type of data listed at the top of Section 4. Error bars mark the locations of the first and third quartiles. The same parameters were used for 61 runs of DBSCAN for different mock stellar samples, and we find consistent results for these statistics over all runs.$

Figure 7.

Median homogeneity and completeness of groups found by DBSCAN with fixed N_pts = 3 and the best |$\tilde{\epsilon }$| choice (maximized recovery fraction) for each type of data listed at the top of Section 4. Error bars mark the locations of the first and third quartiles. The same parameters were used for 61 runs of DBSCAN for different mock stellar samples, and we find consistent results for these statistics over all runs.

4.4.1 Recovery fraction for observed clusters

In observations, it will not be possible to compute a recovery fraction given that true cluster membership is unknown. However, it may be possible to compute a representative recovery fraction using known open clusters. Open clusters have the chemical properties we assume for our synthetic clusters, and so their recovery fraction should be related to the recovery fraction in the sample as a whole. However, creating a sample of known open clusters with good abundances is challenging. The Open Cluster Chemical Abundances and Mapping survey (OCCAM – Frinchaboy et al. 2013), has recently released updated membership lists in Donor et al. (2018) for APOGEE stars in open clusters. We apply quality cuts to ensure that all stars under consideration have good quality measurements for the 15 chemical abundances we generate for our simulated stars (Section 2.2).

We also make use of the APOGEE spectra for the open cluster members. Observed spectra suffer from contamination of the underlying signal due to factors like atmospheric absorption lines, which are flagged in the APOGEE bitmask (Holtzman et al. 2015). Preparing these open cluster spectra for our algorithm is a useful way to prototype how our algorithm will work on larger observed samples. DBSCAN requires continuous data, with measurements in every dimension (pixel) for each star, but bitmask flagging indicates that some APOGEE pixels cannot be trusted. To apply DBSCAN to real spectra, it is necessary to pre-compute the distances between spectra, for which there are three reasonable options. (1) Compute distances between spectra by comparing only pixels that were unmasked in both of the pair of spectra, then normalize those distances by the total number of pixels considered. (2) Compute distances on a subspace by limiting the number of pixels to ones where all spectra had good observations, reducing the total number of pixels but hopefully retaining sufficient chemical signal to distinguish clusters. (3) Compute the distances after interpolating over the missing data. As the third approach is fast and the amount of missing data fairly minimal for these open clusters after our quality cuts, we use it for our test. We perform a linear spline interpolation in T_eff at each pixel across all stars to replace flux values flagged in the bitmask. However, option (3) is not reasonable for all samples or in all wavelength bands, and we anticipate that option (1) will be most useful when tagging larger samples of real observations.

After cuts to open cluster members with both good abundances and spectra, we are left with only three open clusters of sufficient size to test. The largest is NGC 6819, with 17 members in the OCCAM catalogue that survive our quality cuts. NGC 2682 and NGC 7789 each have 11 members, with the remaining clusters in the OCCAM catalogue having less than 10 members each. We insert the stars from the three largest open clusters into our set of stars from synthetic clusters.

Of the three open clusters, only NGC 6819 is recovered, and only then if the homogeneity threshold is lowered to 0.5, which implies that the group matched to NGC 6819 based on majority membership is nearly half full of members of other clusters. Further investigation reveals that while only one other cluster tends to contribute to the group that is matched to NGC 6819, it is still a major source of contamination.

The technique of extrapolating overall recovery fraction from the recovery fraction of open clusters holds promise for characterizing the success of DBSCAN in a way that is independent of knowledge of the true cluster membership. However, we need large homogeneous chemical surveys of open clusters to make this a robust statistic. While open clusters have been targeted extensively by spectroscopic surveys (e.g. Gaia-ESO – Gilmore et al. 2012) they should remain a focus in surveys that aim for significant Milky Way coverage on a single instrument.

4.5 Comparison of chemical spaces

Each of the chemical spaces we consider yields high homogeneity and completeness scores for the choices of |$\tilde{\epsilon }$| that maximize the number of clusters recovered (Fig. 7). However, it is in abundance based chemical space that we achieve the highest recovery fraction; tagging the optimistic and theoretical abundances spaces recovers more than 30 per cent of the input clusters with high homogeneity and completeness (Fig. 6).

While abundances may be too noisy with our conservative choice for intracluster chemical spreads to achieve a high recovery fraction, we expect better performance from the spectra. Spreads in abundances may be inflated by the post-processing required to derive abundances from the spectra, depending on the method used (e.g. Holtzman et al. 2015; Nissen & Gustafsson 2018), while the spectra are closer to the original observation, with less sources of noise to propagate. However, one challenge in constructing a chemical space with high-resolution spectra is the high dimensionality of that space. When distances are computed between spectra, there are many pixels that do not inform about chemistry and only contribute noise or continuum information. Projecting spectra along their PCs reduces the amount of non-chemical information in the data, which gives a non-zero recovery fraction of about 0.1.

To assess the impact of the noise we add to the spectra, we gradually lower the spectra noise and track improvements in recovery fraction. The recovery fraction reaches a maximal value of about 0.5 even when noise in the spectra is assumed to be zero. This is a consequence of the fact that spectra of stars are generated based on different T_eff and log g for each star. Our approach to removing the influence of these properties on the spectrum (Section 2.2.3) does not totally eliminate their effects, and we are left with non-chemical information in the spectra that prevents improvement of the recovery fraction. As expected, generating spectra for every star using the same T_eff and log g does allow the recovery fraction to climb to 1 for spectra-based chemical spaces, as in that case all variation between stars is due to chemistry. To obtain better recovery fractions using spectra directly, better methods to compute the distance between spectra will be necessary (e.g. Reis, Baron & Shahaf 2019).

Distortion of abundances due to differences in T_eff and log g is not explicitly accounted for in our abundance space simulations. However, these properties affect the absorption lines from which chemical abundances are derived, and so their bulk influence will be represented in the uncertainty in abundances we use to define the allowed chemical space spread in members of the same cluster. As the spread within a cluster decreases, the recovery fraction increases, mirroring the change in recovery fraction we observe when we reduce the contribution of noise pixels in the spectral space. As expected, the abundance space with the lowest chemical space spread within members of a cluster is the space that yields the highest recovery fraction. This further emphasizes the importance of improving the accuracy of and reducing abundance measurement uncertainty.

4.6 Properties of successfully identified clusters

We expect clusters that are successfully recovered by DBSCAN to share some characteristics: perhaps the largest clusters are found more reliably, or clusters more distinct from the centre of the chemical space distribution are easier for DBSCAN to group.

To test the former hypothesis, we colour code a plot of Silhouette coefficient versus group size and matched cluster size according to the completeness of the group (see Fig. 8). We show both the Silhouette coefficient of the found group and of the cluster that was matched to it. We find a population of low completeness groups, marked in yellow, that have high found Silhouette coefficient (top right panel), but low matched Silhouette coefficient (bottom right panel). This indicates that DBSCAN can pick out the cores of large clusters, finding them as tightly bound groups but missing their outlying members. As a consequence of choosing each member’s chemistry from a normal distribution in high-dimensional space, there are many more outlying members than one might expect based on intuition about a 2D normal distribution, and so the completeness is driven down dramatically.

$The Silhouette coefficient Si (equation 10) as a function of the number of stars in the group (left-hand column) and the number of stars in the cluster that the group was matched to (right-hand column). The top row shows the Si computed for the groups found by DBSCAN, while the bottom row shows the Si for the true clusters that were matched to DBSCAN groups, where each Si computed relative to the full set of input clusters. At large matched size, there are both very high and very low completeness groups. The high completeness set is the set of large clusters dense enough that all members were placed in the same DBSCAN group. The low completeness set is a set of large clusters for which only their dense cores were correctly identified as groups by DBSCAN. This result is for a DBSCAN run on spectra projected into PCs, with $\tilde{\epsilon }=0.12$ and Npts = 3, limited to groups identified to have more than five members.$

Figure 8.

The Silhouette coefficient S_i (equation 10) as a function of the number of stars in the group (left-hand column) and the number of stars in the cluster that the group was matched to (right-hand column). The top row shows the S_i computed for the groups found by DBSCAN, while the bottom row shows the S_i for the true clusters that were matched to DBSCAN groups, where each S_i computed relative to the full set of input clusters. At large matched size, there are both very high and very low completeness groups. The high completeness set is the set of large clusters dense enough that all members were placed in the same DBSCAN group. The low completeness set is a set of large clusters for which only their dense cores were correctly identified as groups by DBSCAN. This result is for a DBSCAN run on spectra projected into PCs, with |$\tilde{\epsilon }=0.12$| and N_pts = 3, limited to groups identified to have more than five members.

To better understand why some clusters are fully identified while others are limited to their cores, we consider the location of the clusters in chemical space. We show an example of 2D chemical space in Fig. 9, where we plot the abundance α-element Mg relative to Fe against [Fe/H] for each cluster centre. The underlying distribution shown in a 2D histogram is the typical pattern for abundances of α-elements in the Galaxy (e.g. Hayden et al. 2015). With orange circles, we plot the locations of clusters that are matched to groups in DBSCAN with high (>0.7) homogeneity and completeness. Blue squares mark the location of of open clusters from the OCCAM survey discussed in Section 4.4.1.

$One 2D projection of the 15D abundance chemical space, showing the abundances of Mg versus Fe. The distribution of true cluster centres is shown as the background histogram. Cluster centres matched to groups in DBSCAN (PCs, $\tilde{\epsilon }=0.12$, Npts = 3) with homogeneity and completeness greater than 0.7 are marked with orange circles. Blue squares mark the median abundances of cluster members of OCCAM open clusters. Note that the distribution of the matches does not trace the underlying distribution; there is an overabundance of matches at high [Mg/Fe]. We quantify this observation in Fig. 10.$

Figure 9.

One 2D projection of the 15D abundance chemical space, showing the abundances of Mg versus Fe. The distribution of true cluster centres is shown as the background histogram. Cluster centres matched to groups in DBSCAN (PCs, |$\tilde{\epsilon }=0.12$|⁠, N_pts = 3) with homogeneity and completeness greater than 0.7 are marked with orange circles. Blue squares mark the median abundances of cluster members of OCCAM open clusters. Note that the distribution of the matches does not trace the underlying distribution; there is an overabundance of matches at high [Mg/Fe]. We quantify this observation in Fig. 10.

The density of recovered clusters centres in Fig. 9 does not trace the density of all cluster centres. Compared to the background, clusters are preferentially recovered at higher [Mg/Fe], where the underlying density of cluster centres is lower. We quantify this assessment by using a kernel density estimator to model the density of all cluster centres (ρ_all) and the density of recovered cluster centres (ρ_recovered) in [Mg/Fe] versus [Fe/H] chemical space. The ratio of the two, ρ_recovered/ρ_all, is shown in Fig. 10 as a function of ρ_all for different chemical spaces. For all chemical spaces for which the recovery fraction was non-zero, we find that ρ_recovered/ρ_all is largest at small ρ_all, indicating that DBSCAN has most success recovering clusters separated from their fellows in chemical space. While separations may seem small in our 2D projection, their distance from the bulk of cluster centres is more significant when all 15 abundance dimensions are considered.

Figure 10.

The ratio of the density of group centres recovered by DBSCAN (ρ_recovered) to the density of all cluster centres (ρ_all), as a function of the density of all cluster centres. Density is computed in the 2D chemical space of [Mg/Fe] versus [Fe/H]. DBSCAN recovers more clusters where the total density of cluster centres was low for all types of chemical spaces. Thus cluster finding is more efficient in low-density regions of abundance space.

4.7 Modifying the simulated clusters

As per the description in Section 2.2, we have three ways to modify the simulated clusters on which we run DBSCAN. We can change the following properties:

the power-law index of the cluster mass function, α
the volume of the annulus, V
the number of stars observed, N_survey

Parameters (i) and (ii) determine the total number of clusters included in the simulation. Specifying parameters (ii) and (iii) sets the sampling fraction, which governs how many member stars are observed from a given cluster. To explore how changes in the total number of clusters observed and the sampling fraction change the output of DBSCAN, we simulate a range of choices. Since our fiducial choices were α = 2.1 and |$V=300\, \mathrm{kpc}^3$|⁠, we vary α between 0 and 2.6, and V between 30 and 1000 kpc³.

As expected, the best choice for |$\tilde{\epsilon }$| (the one that maximizes recovery fraction) remains consistent within each data type across all combinations of V and α. Our preliminary tests suggest that median completeness will decrease with decreasing α, while median homogeneity should increase. As the CMF becomes flatter, there are more large clusters, and it is easier for them to dominate groups. But by the same token, it becomes more difficult to ensure all cluster members are in the same group when there are more of them per cluster. Median completeness should also correlate positively with survey volume; as the volume increases for a fixed N_survey, fewer stars are sampled per cluster, making it less likely to sample an outlying star that would not be grouped with its fellows.

Changes to the sampling rate might occur as part of differing survey plans choosing different numbers of stars to observe or accessing different portions of the Galaxy. However, our changes in volume also illustrate the impact of radial mixing on chemical tagging success. In our simulations, we assume that all members of each cluster remain inside the annulus where they are born, with a chance to be observed as they pass through the survey volume. In reality, clusters disperse after they form (Lada & Lada 2003), and their members migrate away from their birth radius and height in the Galactic disc (Sellwood & Binney 2002). This migration impacts the chemical structure of the Galaxy (e.g. Schönrich & Binney 2009; Hayden et al. 2015) but will also impact our ability to recover birth clusters. When radial migration is significant, it moves stars into and out of the surveyed annulus. Moving stars out reduces the number available with which to detect a cluster, and moving stars into the region contributes to the background of stars who are the sole representative of their cluster.

As radial migration grows stronger, a survey with fixed volume actually observes stars drawn from a larger effective volume, thanks to interlopers born outside that have migrated into the survey region. Thus, our experimentation with changing volume emulates the effect various strengths of radial migration. Our largest possible volume of 1000 kpc³ is similar to that of the entire Milky Way’s disc. Even in this extreme case, we recover clusters with high homogeneity and completeness using our fiducial sample size.

5 DISCUSSION

The results we have presented in the previous section show that a density-based clustering technique can successfully recover input clusters when applied to a realistic dataset. We discuss in more detail some of our assumptions to create that data set in the following section. In spite of these simplifying assumptions on the underlying data, we have explored a range of possible constraints on cluster homogeneity without making any requirements on cluster uniqueness. Since we choose random APOGEE stars to provide the central abundances for our clusters, no part of our process requires that the cluster centres be more distinct from each other than current observations allow, and yet we enjoy remarkable success in recovering those clusters. Clusters are easiest to recover if they are in an area of chemical space with lower density, but can still be found in the presence of significant background. Our experiments varying survey volume in Section 4.7 indicate that sufficiently large clusters are recovered well even when the sampling rate is decreased, which has the effect of increasing the background of single star clusters. This highlights one of DBSCAN’s major advantages; the inclusion of the noise point classification makes it easy to discard any single-member clusters.

Our work with DBSCAN does have some limitations; we have worked primarily with a sample of only 50 000 stars. This is about a quarter of the size of APOGEE, and only a 20th of the final GALAH survey. The 50 000 stars are presently treated democratically by DBSCAN, and while this is a viable approach when working with simulations, real data might necessitate some favouritism. Adding weighting to favour more trustworthy data would require modifying the process by which chemical space distances are computed. DBSCAN also does best when finding clusters of consistent density, and so multiple runs with different choices of |$\tilde{\epsilon }$| may be needed to identify potential hierarchical structure.

Despite these potential challenges, our results have clearly demonstrated the remarkable power of DBSCAN to tease out structure that seems completely blended in 2D projection (e.g. Fig. 9). That the method does not require foreknowledge of the number of clusters in the sample is a great advantage over slightly faster techniques like k-means. A single DBSCAN run operating on our 50 000 star sample takes less than 30 s when pairwise distances between stars are pre-computed, a process which only takes an additional minute. In addition, DBSCAN can identify non-spherical structure in chemical space – while this was not much exploited in our work with normally distributed clusters, it may prove an invaluable asset in real data.

We outline below the primary assumptions of our simulation, with descriptions of how they might be relaxed in future work, followed by a discussion of the challenges to be overcome when applying DBSCAN to observations.

5.1 Assumptions used to generate synthetic data

The work presented here relies on several assumptions, which we describe below.

5.1.1 Clusters are perfectly homogeneous

The nature of chemical tagging is such that it assumes the presence of homogeneous clusters. We have taken that to an extreme limit here, assuming in our analysis that all differences between the chemistry of stars in the same birth clusters are due to measurement uncertainty. Especially in the case of the spectra we have failed to account for differences that might arise from variations within the cloud in which the stars formed, such as pollution from massive stars that might go supernova before star formation is complete. This may be relevant in large star-forming clouds (Bland-Hawthorn et al. 2010). It would be useful to separate the noise into observational and intrinsic to examine how changes in either influence the resulting groups found by DBSCAN. However, DBSCAN recovers clusters well as long as they are at least as intrinsically homogeneous as our optimistic case for measurement uncertainties, and observed open clusters do seem to be that homogeneous (Bovy 2016).

5.1.2 Clusters have similar densities

DBSCAN finds clusters according to a density defined by our choice of ϵ and N_pts. This means it naturally finds clusters with density greater than N_pts/ϵ^d, where d is the dimensionality of the chemical space. However, our method of cluster generation through normally distributing the chemical properties of member stars for each clusters means that larger clusters tend to appear less dense with respect to their neighbours (see Fig. 2). One way to circumvent this would be a hierarchical approach, which would explore several possibilities for ϵ and use the resulting groups to merge together smaller groups (e.g. the hierarchical DBSCAN algorithm, Campello et al. 2015; McInnes & Healy 2017).

5.1.3 Simplified mass in survey volume

In our setup of the synthetic clusters described in Section 2, we make a few simplifying assumptions about the survey volume. We first assume a constant density for our Galaxy, which could be improved by incorporating a realistic density profile in the region of the Sun. We also assume that any members of clusters born in the survey annulus remain in the survey annulus. Future work could include the effects of radial migration moving cluster members into and out of the survey volume, which would have the effect of reducing the overall sampling rate by introducing more clusters in the same volume. Our preliminary experimentation in Section 4.7 indicates that this should overall increase completeness of recovered clusters, at the expense of lowering the homogeneity.

We make no assumptions about what stars we could reasonably observe from a given cluster, using only the sampling rate to limit the number of observed members. A truly realistic simulation would incorporate the selection function of the survey it attempts to emulate, which requires setting not just chemistry for each star, but also stellar masses, ages, and distances from the observer. We are able to make our simulations relatively straightforward by ignoring observational preference for closer or brighter stars, but true surveys are not as random as our simulation.

5.2 Applications to current and upcoming surveys

In this work, we use the APOGEE survey as a template for the structure of our chemical spaces. More recent iterations of the ASPCAP now provide more than 20 abundances from APOGEE spectra (Holtzman et al. 2018), and surveys like GALAH measure even more abundances for each star. Incorporating additional abundances will give DBSCAN greater leverage to distinguish clusters; when we reduce the dimensionality of our abundance space to 5, overall homogeneity and completeness also declines. As long as additional abundances are not correlated with those already measured, the new information they provide will help distinguish clusters that might otherwise overlap.

However, switching from a survey like APOGEE to one like GALAH with an increased number of abundances will also impact the results of working with spectra. One potential difference in surveys not touched on in our exploratory analysis in Section 4.7 is the result of changing the wavelength range in which we observe our stars. Different spectroscopic bands will be sensitive to different chemical species, offering alternative ways to distinguish stars. However, different bands also imply different targeting plans. APOGEE’s focus on red giants is part of what made it a useful survey to emulate; while stars in the same birth cluster may have the same intrinsic abundances, it is the observed abundances that are important for chemical tagging. Stars in the same evolutionary phase are less likely to have surface abundance differences due to processes like atomic diffusion (Dotter et al. 2017), and so will lead to more successful chemical tagging. Even within a particular evolutionary phase, the contributions of some elements may be untrustworthy due to differences expected with different stellar mass (e.g. C and N; Masseron & Gilmore 2015). Therefore, future surveys will need to balance the parallel goals of measuring chemistry for a sufficiently large sample of stars and ensuring those stars are close enough in evolutionary phase that they can be reasonably compared.

6 CONCLUSIONS

In this work, we have tested prospects for chemical tagging of spectroscopic observations with an efficient clustering algorithm, DBSCAN. We create synthetic chemical spaces that emulate observations, providing each star with 15 chemical abundances and a continuum-normalized spectrum, ensuring that stars in the same cluster have the same chemical compositions, with normally distributed noise introduced to mimic observational uncertainties of varying magnitudes. We take the median abundances for each cluster as those of random red giant stars observed in APOGEE, so while we assume some varied levels of effective cluster homogeneity (depending on the amount of noise introduced in each star), we impose no constraint on the uniqueness of cluster chemical signatures. We define two categories chemical spaces; those based on spectra (with more than 7000 dimensions, each corresponding to a wavelength) and those based on the 15 derived abundances. We show that the DBSCAN algorithm can effectively recover input groups from these high-dimensional chemical spaces when they are populated by tens of thousands of stars. The density-based approach of this algorithm and its ‘noise star’ classification allows it to ignore the chemical space background of stars which are the sole representative of their cluster in the survey. We use our multiple chemical spaces to compare the efficacy of using spectra with using derived abundances for clustering.

While spectra offer more detailed chemical information than derived abundances, this information is diluted by the many continuum pixels that are not significant for chemically distinguishing stars. When using a Euclidean distance metric, these chemically insignificant pixels can overwhelm the differences between spectra introduced by variations in their intrinsic abundances. We reduce the influence of the continuum pixels by using PCA to identify the most significant pixels for distinguishing stars. By projecting spectra along their PCs, we achieve higher homogeneity in the found groups and recover 10 per cent of the larger clusters from our input sample with homogeneity and completeness of at least 70 per cent. However, using abundance space with our ‘optimistic’ requirements for effective cluster homogeneity (Leung & Bovy 2019) increases the percentage of clusters recovered to 30 per cent with homogeneity and completeness of at least 70 per cent, indicating that decreasing our observational uncertainties in derived abundances may offer the most reliable approach to evaluating whether chemical tagging is viable in the Milky Way.

These results are somewhat sensitive to the parameters of our simulated survey, and we perform tests of the success of chemical tagging with changing our cluster mass function, survey volume, and total number of stars observed. These suggest that chemical tagging should recover clusters at sampling rates even lower than that used for our fiducial sample, although this is sensitive to the underlying cluster mass function. Furthermore, our experimentation with changing simulation parameters suggests that significant radial migration is not likely to prevent the success of chemical tagging. Even when considering a sample where stars are drawn from a volume equivalent to that of the Milky Way’s entire disc, DBSCAN was still able to recover clusters with high homogeneity and completeness.

Changing the survey volume modifies the sampling rate, allowing us to make predictions for different spectroscopic surveys. At our current simulation size limit, the sampling fraction is only 2 × 10⁻⁶, making the detection of clusters with fewer than ∼500 000 members unlikely. While we detect clusters smaller than this in our simulations, we do so only because we add a degree of randomness to the sampling rate for each cluster. Using the full APOGEE survey would increase that sampling fraction to 8 × 10⁻⁶ and thus reduce the minimum required members for detection to ∼125 000. Making use of the million spectra in the final GALAH set would bring that minimum requirement down to ∼25 000. Future surveys like the Milky Way Mapper (MWM) of SDSS V (Kollmeier et al. 2017) will increase the number of spectroscopically surveyed stars to over 4 million. While this survey is designed to have lower S/N than its predecessor, APOGEE, the MWM could lower the minimum number of members required to detect a cluster to 6250. Other planned surveys like the Mauna Kea Spectroscopic Explorer (Zhang et al. 2016) and the 4-metre Multi-Object Spectroscopic Telescope (de Jong et al. 2016) will further increase the available sample of spectroscopically observed stars. This increased sampling rate will lead to more clusters sampled with more than 10 members. While only 535 clusters were sampled at that level in our fiducial simulation, our choice of cluster mass function indicates that we would expect to observe at least 10 members for more than 1700 clusters in APOGEE and nearly 14 000 in GALAH, given our fiducial model for the Milky Way. If the MWM reduces abundances uncertainties from its target (close to our ‘conservative’ case) to our ‘optimistic’ case, its greater spatial coverage means it might sample 54 000 clusters with more than 10 members. Our predicted recovery fraction (Fig. 6) implies we might expect DBSCAN to find a tenth of these clusters as having more than 10 members who are 70 per cent homogeneous when using a PC reduction of the raw spectra. With sufficiently precise and accurate derived abundances (e.g. Leung & Bovy 2019), the percentage likely to be found with DBSCAN with this membership homogeneity rises above 30 per cent. The increase in chemical space resolution that comes of decreasing the uncertainty in a given star’s abundances or spectrum is a crucial component of improving our ability to chemically tag birth clusters. This has been found in previous work with abundances (e.g. Ting et al. 2015a), but the importance of decreasing chemical space uncertainties cannot be overstated.

The true recovery fraction for observations from future surveys should be higher than the fraction we recover in our sample of 50 000 stars. As increasing the number of stars drives up the sampling rate, the overall homogeneity of the found groups should also increase. This increase will ensure that more clusters meet the homogeneity threshold they need to be considered ‘recovered’. Thus, these larger sample sizes and reduced chemical space uncertainties are essential for the birth cluster population studies that would allow us to learn the details of the star formation and chemical evolution history of our Galaxy.

ACKNOWLEDGEMENTS

We thank the anonymous reviewer for their careful reading of this work and for their thoughtful comments and suggestions.

NPJ is supported by an Alexander Graham Bell Canada Graduate Scholarship-Doctoral from the Natural Sciences and Engineering Research Council of Canada. NPJ and JB received support from the Natural Sciences and Engineering Research Council of Canada (NSERC; funding reference number RGPIN-2015-05235) and from an Ontario Early Researcher Award (ER16-12-061).

Funding for Sloan Digital Sky Survey (SDSS) III has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, and the U.S. Department of Energy Office of Science. The SDSS III web site is www.sdss3.org.

Funding for the SDSS IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS web site is www.sdss.org.

REFERENCES

Abolfathi

B.

et al. .,

2018

,

ApJS

,

235

,

42

10.3847/1538-4365/aa9e8a

10.1088/0067-0049/219/1/12

Alam

S.

et al. .,

2015

,

ApJS

,

219

,

12

10.1051/0004-6361/201322631

Bensby

T.

,

Feltzing

S.

,

Oey

M. S.

,

2014

,

A&A

,

562

,

A71

10.1051/0004-6361/201425232

Blanco-Cuaresma

S.

et al. .,

2015

,

A&A

,

577

,

A47

10.1051/0004-6361/201832815

Blanco-Cuaresma

S.

,

Fraix-Burnet

D.

,

2018

,

A&A

,

618

,

A65

Blanco-Cuaresma

S.

,

Soubiran

C.

,

2016

, in

Reylé

C.

,

Richard

J.

,

Cambrésy

L.

,

Deleuil

M.

,

Pécontal

E.

,

Tresse

L.

,

Vauglin

I.

, eds,

SF2A-2016: Proceedings of the Annual Meeting of the French Society of Astronomy and Astrophysics

.

Societe Francaise d'Astronomie et d'Astrophysique

,

Lyon, France

, p.

333

10.1088/0004-637X/713/1/166

Bland-Hawthorn

J.

,

Krumholz

M. R.

,

Freeman

K.

,

2010

,

ApJ

,

713

,

166

Blanton

M. R.

et al. .,

2017

,

AJ

,

154

,

28

10.3847/1538-3881/aa7567

10.3847/0004-637X/817/1/49

Bovy

J.

,

2016

,

ApJ

,

817

,

49

Bovy

J.

,

2017

,

MNRAS

,

470

,

1360

10.1093/mnras/stx1277

Campello

R. J. G. B.

,

Moulavi

D.

,

Zimek

A.

,

Sander

J.

,

2015

,

ACM Trans. Knowl. Discov. Data

,

10

,

5:1

Casey

A. R.

,

Hogg

D. W.

,

Ness

M.

,

Rix

H.-W.

,

Ho

A. Q.

,

Gilmore

G.

,

2016

,

preprint (

arXiv:1603.03040)

Chen

B.

,

D’Onghia

E.

,

Pardy

S. A.

,

Pasquali

A.

,

Bertelli Motta

C.

,

Hanlon

B.

,

Grebel

E. K.

,

2018

,

ApJ

,

860

,

70

de Jong

R. S.

et al. .,

2016

,

Proc. SPIE

,

9908

,

99081O

De Silva

G. M.

et al. .,

2015

,

MNRAS

,

449

,

2604

10.1093/mnras/stv327

De Silva

G. M.

,

Sneden

C.

,

Paulson

D. B.

,

Asplund

M.

,

Bland-Hawthorn

J.

,

Bessell

M. S.

,

Freeman

K. C.

,

2006

,

AJ

,

131

,

455

10.1086/497968

De Silva

G. M.

,

Freeman

K. C.

,

Asplund

M.

,

Bland-Hawthorn

J.

,

Bessell

M. S.

,

Collet

R.

,

2007

,

AJ

,

133

,

1161

10.1086/511182

Dempster

A. P.

,

Laird

N. M.

,

Rubin

D. B.

,

1977

,

J. R. Stat. Soc.

,

39

,

1

Donor

J.

et al. .,

2018

,

AJ

,

156

,

142

10.3847/1538-3881/aad635

Dotter

A.

,

Conroy

C.

,

Cargile

P.

,

Asplund

M.

,

2017

,

ApJ

,

840

,

99

10.3847/1538-4357/aa6d10

10.1088/0004-6256/142/3/72

Eisenstein

D. J.

et al. .,

2011

,

AJ

,

142

,

72

Ester

M.

,

Kriegel

H.-P.

,

Sander

J.

,

Xu

X.

,

1996

, in

Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96)

.

AAAI Press

,

Portland, OR, USA

, p.

226

10.1146/annurev.astro.40.060401.093840

Feng

Y.

,

Krumholz

M. R.

,

2014

,

Nature

,

513

,

523

Freeman

K.

,

Bland-Hawthorn

J.

,

2002

,

ARA&A

,

40

,

487

10.1088/2041-8205/777/1/L1

Frinchaboy

P. M.

et al. .,

2013

,

ApJ

,

777

,

L1

10.1051/0004-6361/201629272

Gaia Collaboration

et al. .,

2016

,

A&A

,

595

,

A1

10.3847/0004-6256/151/6/144

García Pérez

A. E.

et al. .,

2016a

,

AJ

,

151

,

144

10.3847/0004-6256/151/6/144

García Pérez

A. E.

et al. .,

2016b

,

AJ

,

151

,

144

Gilmore

G.

et al. .,

2012

,

The Messenger

,

147

,

25

Hasselquist

S.

et al. .,

2017

,

ApJ

,

845

,

162

10.3847/1538-4357/aa7ddc

Hawkins

K.

,

Jofré

P.

,

Masseron

T.

,

Gilmore

G.

,

2015

,

MNRAS

,

453

,

758

10.1093/mnras/stv1586

10.1088/0004-637X/808/2/132

Hayden

M. R.

et al. .,

2015

,

ApJ

,

808

,

132

Hayes

C. R.

et al. .,

2018

,

ApJ

,

852

,

49

10.3847/1538-4357/aa9cec

10.3847/1538-4357/833/2/262

Hogg

D. W.

et al. .,

2016

,

ApJ

,

833

,

262

10.1088/0004-6256/150/5/148

Holtzman

J. A.

et al. .,

2015

,

AJ

,

150

,

148

Holtzman

J. A.

et al. .,

2018

,

AJ

,

156

,

125

10.3847/1538-3881/aad4f9

Ivezić

Ž.

,

Connelly

A. J.

,

VanderPlas

J. T.

,

Gray

A.

,

2014

,

Statistics, Data Mining, and Machine Learning in Astronomy

.

Princeton University Press

,

Princeton, NJ

Jofré

P.

,

Das

P.

,

Bertranpetit

J.

,

Foley

R.

,

2017

,

MNRAS

,

467

,

1140

10.1093/mnras/stx075

Jofré

P.

,

Heiter

U.

,

Soubiran

C.

,

2018

,

preprint (

arXiv:1811.08041)

Jolliffe

I. T.

,

2002

,

Principal Component Analysis

.

Springer Series in Statistics, Springer

,

New York

Kaderali

S.

,

Hunt

J. A. S.

,

Webb

J. J.

,

Price-Jones

N.

,

Carlberg

R.

,

2019

,

MNRAS

,

484

,

L114

10.1093/mnrasl/slz015

Kollmeier

J. A.

et al. .,

2017

,

preprint (

arXiv:1711.03234)

Kraft

R. P.

,

1994

,

PASP

,

106

,

553

10.1086/133416

10.1046/j.1365-8711.2001.04022.x

Kroupa

P.

,

2001

,

MNRAS

,

322

,

231

10.1146/annurev.astro.41.011802.094844

Lada

C. J.

,

Lada

E. A.

,

2003

,

ARA&A

,

41

,

57

Leung

H. W.

,

Bovy

J.

,

2019

,

MNRAS

,

483

,

3255

10.1093/mnras/sty3217

Liu

F.

,

Yong

D.

,

Asplund

M.

,

Ramírez

I.

,

Meléndez

J.

,

2016a

,

MNRAS

,

457

,

3934

10.1093/mnras/stw247

Liu

F.

,

Asplund

M.

,

Yong

D.

,

Meléndez

J.

,

Ramírez

I.

,

Karakas

A. I.

,

Carlos

M.

,

Marino

A. F.

,

2016b

,

MNRAS

,

463

,

696

10.1093/mnras/stw2045

Majewski

S. R.

et al. .,

2017

,

AJ

,

154

,

94

10.3847/1538-3881/aa784d

10.1051/0004-6361/201014135

Martell

S. L.

,

Grebel

E. K.

,

2010

,

A&A

,

519

,

A14

Masseron

T.

,

Gilmore

G.

,

2015

,

MNRAS

,

453

,

1855

10.1093/mnras/stv1731

McInnes

L.

,

Healy

J.

,

2017

,

preprint (

arXiv:1705.07321)

Mitschang

A. W.

,

De Silva

G.

,

Zucker

D. B.

,

Anguiano

B.

,

Bensby

T.

,

Feltzing

S.

,

2014

,

MNRAS

,

438

,

2753

10.1093/mnras/stt2320

Ness

M.

et al. .,

2018

,

ApJ

,

853

,

198

10.3847/1538-4357/aa9d8e

10.1088/0004-637X/808/1/16

Ness

M.

,

Hogg

D. W.

,

Rix

H.-W.

,

Ho

A. Y. Q.

,

Zasowski

G.

,

2015

,

ApJ

,

808

,

16

10.1007/s00159-018-0111-3

Nissen

P. E.

,

Gustafsson

B.

,

2018

,

A&AR

,

26

,

6

Okabe

A.

,

Boots

B. N.

,

Sugihara

K.

,

1992

,

Spatial Tessellations: Concepts and Applications of Voronoi Diagrams

.

Wiley

,

New York

Pedregosa

F.

et al. .,

2011

,

J. Mach. Learn. Res.

,

12

,

2825

Price-Jones

N.

,

Bovy

J.

,

2018

,

MNRAS

,

475

,

1410

10.1093/mnras/stx3198

10.1051/0004-6361/201630220

Recio-Blanco

A.

et al. .,

2017

,

A&A

,

602

,

L14

Reis

I.

,

Baron

D.

,

Shahaf

S.

,

2019

,

AJ

,

157

,

16

10.3847/1538-3881/aaf101

10.3847/2041-8205/826/2/L25

Rix

H.-W.

,

Ting

Y.-S.

,

Conroy

C.

,

Hogg

D. W.

,

2016

,

ApJ

,

826

,

L25

Rousseeuw

P. J.

,

1987

,

J. Comput. Appl. Math.

,

20

,

53

Roweis

S.

,

1997

,

EM Algorithms for PCA and SPCA, Proceeding of the 1997 Conference on Advances in Neural Information Processing Systems 10 (NIPS’97), Cambridge, MA, USA

.

MIT Press

,

Denver, CO, USA

, p.

626

Schiavon

R. P.

et al. .,

2017a

,

MNRAS

,

465

,

501

10.1093/mnras/stw2162

Schiavon

R. P.

et al. .,

2017b

,

MNRAS

,

466

,

1010

10.1093/mnras/stw3093

10.1111/j.1365-2966.2009.14750.x

Schönrich

R.

,

Binney

J.

,

2009

,

MNRAS

,

396

,

203

10.1046/j.1365-8711.2002.05806.x

Sellwood

J. A.

,

Binney

J. J.

,

2002

,

MNRAS

,

336

,

785

10.1051/0004-6361/201423937

Smiljanic

R.

et al. .,

2014

,

A&A

,

570

,

A122

Steinmetz

M.

et al. .,

2006

,

AJ

,

132

,

1645

10.1086/506564

Tang

B.

et al. .,

2017

,

MNRAS

,

465

,

19

10.1093/mnras/stw2739

10.1088/0004-637X/807/1/104

Ting

Y.-S.

,

Conroy

C.

,

Goodman

A.

,

2015a

,

ApJ

,

807

,

104

10.1088/0004-637X/807/1/104

Ting

Y.-S.

,

Conroy

C.

,

Goodman

A.

,

2015b

,

ApJ

,

807

,

104

10.3847/0004-637X/826/1/83

Ting

Y.-S.

,

Conroy

C.

,

Rix

H.-W.

,

2016

,

ApJ

,

826

,

83

10.3847/1538-4365/228/2/24

Traven

G.

et al. .,

2017

,

ApJS

,

228

,

24

Weiss

A.

,

Denissenkov

P. A.

,

Charbonnel

C.

,

2000

,

A&A

,

356

,

181

Wojno

J.

et al. .,

2016

,

MNRAS

,

461

,

4246

10.1093/mnras/stw1633

10.1088/1674-4527/16/9/140

Zhang

C.-X.

,

Yuan

Y.

,

Zhang

H.-W.

,

Shuai

Y.

,

Tan

H.-P.

,

2016

,

Res. Astron. Astrophys.

,

16

,

140

10.1088/1674-4527/12/7/002

Zhao

G.

,

Zhao

Y.-H.

,

Chu

Y.-Q.

,

Jing

Y.-P.

,

Deng

L.-C.

,

2012

,

Res. Astron. Astrophys.

,

12

,

723