Abstract

Improving race and ethnicity (hereafter, race/ethnicity) data quality is imperative to ensure underserved populations are represented in data sets used to identify health disparities and inform health care policy. We performed a scoping review of methods that retrospectively improve race/ethnicity classification in secondary data sets. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, searches were conducted in the MEDLINE, Embase, and Web of Science Core Collection databases in July 2022. A total of 2 441 abstracts were dually screened, 453 full-text articles were reviewed, and 120 articles were included. Study characteristics were extracted and described in a narrative analysis. Six main method types for improving race/ethnicity data were identified: expert review (n = 9, 8%), name lists (n = 27, 23%), name algorithms (n = 55, 46%), machine learning (n = 14, 12%), data linkage (n = 9, 8%), and other (n = 6, 5%). The main racial/ethnic groups targeted for classification were Asian (n = 56, 47%) and White (n = 51, 43%). Some form of validation evaluation was included in 86 articles (72%). We discuss the strengths and limitations of different method types and potential harms of identified methods. Innovative methods are needed to better identify racial/ethnic subgroups, and further validation studies are needed. Accurately collecting and reporting disaggregated data by race/ethnicity are critical to address the systematic missingness of relevant demographic data that can erroneously guide policymaking and hinder the effectiveness of health care practices and intervention.

Introduction

Big data sources, such as electronic health records (EHRs) and insurance claims data, are increasingly being used to monitor disease, identify health disparities, and guide health policy development in the United States.1,4 Of concern is the quality of race and ethnicity (hereafter, race/ethnicity) data collected and reported in these big data sources and secondary health care data sets, including national administrative surveys and disease surveillance systems.5

Poor data infrastructure for collecting and reporting race/ethnicity is a form of systemic racism that perpetuates inequitable access to key health resources. Two primary factors that drive low-quality race/ethnicity data are (1) data aggregation and (2) high levels of missing data. These 2 elements are illustrated through COVID-19 data collection and reporting throughout the pandemic, which underscore the crucial role of collecting relevant demographic data (e.g., race/ethnicity) in responding to population health needs as well as key gaps in public health data infrastructure.6,8

Race/ethnicity data aggregation conceals disparities among smaller race/ethnicity groups that may have little in common with one another.5 For example, during the first year of the pandemic, a review of COVID-19 academic and gray literature found there was limited to no COVID-19 outcome information for 3 of the 6 largest Asian American subgroups (i.e., Asian Indian, Korean, and Japanese Americans) because their data were primarily collected and reported under the aggregate racial category of Asian American.9 The limited availability of disaggregated race/ethnicity data applies not only to Asian Americans but to all racial/ethnic subgroups (e.g., Caribbean vs African Black, Latinx subgroups, Arab Americans).5,10

High levels of missing race/ethnicity data are especially problematic because previous research has shown racial/ethnic minority communities are more likely to be categorized as “missing” or “other” in EHRs and administrative data sets. Howland and Tsao11 found Asian, Pacific Islander, and Hispanic patients were often misclassified as “other” or “unknown” in an evaluation of New York State hospital discharge data. Similarly, Klinger et al.12 found that race/ethnicity for Black and Hispanic patients was frequently underreported in primary clinic EHR data. During the COVID-19 pandemic, race/ethnicity data were missing for 50% of COVID-19 cases and 22% of COVID-19 deaths at the national level in November 2020.13 As of September 2022, the proportions of missing race/ethnicity data were still high at 35% for COVID-19 cases and 15% for COVID-19 deaths13 and may be contributing to an underestimation of the COVID-19 burden among racial/ethnic minority groups.

Improving race/ethnicity data quality is imperative to ensure populations are appropriately represented in data sets used to identify and monitor health disparities and inform complex ethical decisions about health care, policy, and resource allocation. A systematic review by Mateos14 determined that name-based methods to derive and improve race/ethnicity data were highly reliable and valid. New methods capitalizing on advances in computing (e.g., machine learning) and the availability of measures relevant to race/ethnicity identification (e.g., geocoded addresses) represent potential solutions. A similar review by Golder et al.15 recently looked exclusively at methods predicting Twitter (now called X) users’ race/ethnicity, finding minimal guidelines on when and how to best use these methods. The purpose of this scoping review is to provide an update on the available methods to predict the race/ethnicity of individuals and populations in secondary data sources. The results from this review can guide best practices for retrospective classification as well as new data collection efforts related to race/ethnicity.

Methods

We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for scoping reviews.16 Whereas a traditional systematic review draws from a narrow range of quality-assessed studies to answer a precise question, a scoping review is a broader systematic method that can leverage multiple types of study designs to assess the literature and gaps in research on a given topic or question.17 For this scoping review, we aimed to identify studies describing methods that improve classification of race/ethnicity in secondary data sets. Methods for improving race/ethnicity classification were defined as any way of analyzing previously collected data that provided additional race/ethnicity information for an individual or population.

In July 2022, a trained medical librarian (T.R.) performed searches for studies without language or date restrictions in the MEDLINE, Embase, and the Web of Science Core Collection databases (Ovid Medline search strategy available in Appendix S1). Articles identified from our search were uploaded to Covidence systematic review management software. Abstracts and full-text articles were dually screened by coauthors (M.K.C., L.D., E.H., S.P., L.F., K.Y.K.), with each article receiving 2 votes, and conflicts resolved through consensus. All geographies and races/ethnicities, including components of race/ethnicity (e.g., nationality, preferred language, religion), were included. Validation studies that evaluated methods described by other articles in the review were also included.

Additional studies were identified via citations found in the introduction and discussion sections of included studies in the review as well as via citations found in studies excluded from the review for having an insufficient methods description. Additional citations were also identified via a directory website of relevant studies mentioned in an included article and through suggested citations from an expert in the field. Figure 1 shows our full PRISMA search and review flowchart. See Appendix S2 for exclusion criteria.

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow chart. A total of 2 441 abstracts were identified via database search, and 108 articles were determined to be eligible for extraction and analysis. An additional 133 articles identified via websites and citations were assessed for eligibility and, from these, 12 more articles were added. Thus, a total of 120 articles were included in the review.
Figure 1


Data analysis

Data were extracted by 6 reviewers. For each article, 2 different reviewers each entered relevant extraction information into a premade REDCap survey form with designated fields of interest. Discrepancies in extraction entries were reviewed and resolved through group discussion. Extraction information included type of method used for race/ethnicity classification; races/ethnicities targeted for improvement; publication year; method inputs; reference population data (if applicable); reference population size; target population data set; target population size; and whether the article included a validation process. If a validation component was included, the gold standard measurement of race/ethnicity was documented, as well as the metrics used to evaluate the method.

Race/ethnicity reporting, which included a category for Middle Eastern or North African (MENA) populations, was informed by the Institute of Medicine’s guide for race/ethnicity standardization as well as by policy recommendations for data collection and data disaggregation, which were funded by the Robert Wood Johnson Foundation and created in partnership with the Asian & Pacific Islander American Health Forum, the Arab Community Center for Economic and Social Services, the National Congress of American Indians, the National Urban League, and Unidos US.18,19

Results

Our initial search identified 2 441 abstracts. After removing duplicates (n = 238) and irrelevant abstracts (n = 1883), 320 full-text articles were assessed for eligibility. We determined that 108 articles were eligible for extraction and analysis. An additional 133 articles identified via websites and citations were assessed for eligibility, and 12 articles from these were added. Thus, a total of 120 articles were included in our review.

Method types

Table 1 describes characteristics of included literature. Method types for improving race/ethnicity classification were organized into 6 main categories: expert review, name lists, name algorithms, machine learning, data linkage, and other. Four of these 6 categories—expert review, name lists, name algorithms, and machine learning—rely on evaluating an individual’s surname and, less frequently, first name, middle name, maiden name, or name morphology, to predict or assign race/ethnicity.

Table 1

Characteristics of included literature (n = 120).

Characteristic | No. of articles | Total articles, %
Method type
 Expert review | 9 | 8
 Name list | 27 | 23
 Name algorithm | 55 | 46
 Machine learning | 14 | 12
 Data linkage | 9 | 8
 Other | 6 | 5
Race/ethnicity improved^a
 American Indian/Alaska Native | 14 | 12
 Asian | 56 | 47
 Black | 42 | 35
 Hispanic, Latino, or Spanish origin | 49 | 41
 Middle Eastern or North African | 14 | 12
 Native Hawaiian or Pacific Islander | 7 | 6
 White | 51 | 43
 Multiracial | 9 | 8
 Other | 35 | 29
Publication year range
 1970–1979 | 1 | 1
 1980–1989 | 3 | 3
 1990–1999 | 16 | 13
 2000–2009 | 37 | 31
 2010–2019 | 45 | 38
 2020–2022 | 18 | 15
Geographic context
 United States | 65 | 54
 United Kingdom | 16 | 13
 Canada | 13 | 11
 Australia | 5 | 4
 Multiple countries | 10 | 8
 Other | 11 | 9
Reference population size
 0–99 | 0 | 0
 100–999 | 0 | 0
 1000–9999 | 2 | 2
 10 000–99 999 | 8 | 7
 100 000–999 999 | 8 | 7
 >1 000 000 | 10 | 8
 Not applicable | 92 | 77
Target population size
 0–99 | 0 | 0
 100–999 | 11 | 9
 1000–9999 | 17 | 14
 10 000–99 999 | 20 | 17
 100 000–999 999 | 26 | 22
 >1 000 000 | 30 | 25
 Not applicable | 16 | 13
Method validated
 Yes | 86 | 72
 No | 34 | 28

a Not mutually exclusive.


Expert review

Expert-review articles (n = 9, 8%) entail methods whereby a person or group of people with expertise regarding a given race/ethnicity review an individual’s name and determine whether they belong to that race/ethnicity group. Experts frequently self-identified with the given race/ethnicity,20,24 had scholarly expertise of the group, or had experience regularly engaging with the group.25,27

Name list

In name-list articles (n = 27, 23%), authors classified an individual’s race/ethnicity through the use of reference lists that directly translate an individual’s surname to a corresponding race/ethnicity category. In the majority of name-list articles, race/ethnicity was assigned as a 1-to-1 translation of a designated name. Lauderdale and Kestenbaum28 and Choi et al.29 provided different surname lists based on different acceptable sensitivity, specificity, or probability likelihood cutoffs. In several name-list articles, methods were described as name algorithms or programs but fit our review’s definition of a name list.30,32
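The lookup-with-cutoff idea can be sketched in a few lines. This is only an illustration under stated assumptions: the surnames, categories, and proportions in `REFERENCE_LIST` are invented placeholders, and `classify_by_name_list` is a hypothetical helper rather than the procedure of any reviewed article.

```python
from typing import Dict, Optional, Tuple

# Hypothetical reference list (all values invented for illustration):
# surname -> (category, share of reference-population members with that
# surname who belong to the category).
REFERENCE_LIST: Dict[str, Tuple[str, float]] = {
    "ALPHA": ("Group A", 0.95),
    "BRAVO": ("Group B", 0.85),
    "CHARLIE": ("Group B", 0.40),
}


def classify_by_name_list(surname: str, cutoff: float = 0.75) -> Optional[str]:
    """Return the listed category if the surname clears the cutoff, else None."""
    entry = REFERENCE_LIST.get(surname.strip().upper())
    if entry is None:
        return None  # surname not on the list: leave unclassified
    category, share = entry
    return category if share >= cutoff else None
```

Raising `cutoff` keeps only the most predictive surnames, which would be expected to favor specificity; lowering it admits more surnames and would be expected to favor sensitivity.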

Name algorithm

Name-algorithm articles (n = 55, 46%) improved race/ethnicity classification through a multistep sequence of instructions predetermined by the researcher, using not only surname information but other additional secondary data set variables as well. Inputs for name algorithms included first name,33,50 maiden name,43,51 age,36,39,49,52,53 sex,36,49,54 gender,43,53,55 place of birth,34,43,51,54,56,57 parental place of birth,56,58 place of residence,36,39,41,49,52,53,55,59,70 parental surname,56 grandparent ethnic identity,56 place of medical school graduation,71 language preference,39,40,49 and political party.72 Several studies incorporated multiple name elements such as name substrings, name morphology, first and last name character length, name consonant to vowel ratio, or surname phonemes.33,35,38,45,73,75 Other studies used the combination of name, residential address, and conditional probabilities of a race/ethnicity group being within a given geographic boundary to estimate race/ethnicity.39,41,52,53,59,66,68,70,76,77

Although most articles described ways to impute race/ethnicity at an individual level, in a few, authors attempted to describe race/ethnicity at the population level.39,53,55,59,62,64,68,70,76,79 For example, Grofman et al.77 described methods of using the ratio of the name “Garcia” to the name “Anderson” in a city’s telephone directory to estimate the percentage of Hispanic city residents. The Bayesian Improved Surname Geocode (BISG) method59 calculates the probability of an individual belonging to each of 6 race/ethnicity categories and then aggregates those probabilities to describe race/ethnicity distribution at a population level. More advanced evolutions of the BISG method use multinomial logistic regression models to allow for interactions between additional data elements such as age, sex, or insurance.36,39,49,80,81
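A minimal sketch of the Bayesian update behind a BISG-style estimate follows. It assumes that surname and geography are conditionally independent given race/ethnicity; the function names and the two-category toy inputs are hypothetical and are not taken from the cited implementation.

```python
from typing import Dict, List

Probs = Dict[str, float]


def bisg_posterior(p_group_given_surname: Probs, p_geo_given_group: Probs) -> Probs:
    """Bayes update: P(g | surname, geo) is proportional to P(g | surname) * P(geo | g)."""
    unnormalized = {
        g: p_group_given_surname[g] * p_geo_given_group[g]
        for g in p_group_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("inputs give zero probability to every category")
    return {g: value / total for g, value in unnormalized.items()}


def population_distribution(posteriors: List[Probs]) -> Probs:
    """Average individual posteriors to estimate category shares for a population."""
    n = len(posteriors)
    return {g: sum(p[g] for p in posteriors) / n for g in posteriors[0]}


# Toy inputs (invented): the surname alone is uninformative, but the
# geographic unit holds a larger share of category B than of category A.
example = bisg_posterior({"A": 0.5, "B": 0.5}, {"A": 0.002, "B": 0.006})
```

Averaging posteriors, rather than assigning each person to the single most likely category, is what lets this kind of method describe a distribution at the population level.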

Machine learning

Machine-learning articles (n = 14, 12%) reported on use of models that automatically analyzed patterns in data to infer race/ethnicity. Machine-learning methods were typically built on a training data set used to calibrate their model and then applied to a real-world data set for race/ethnicity prediction. A variety of machine-learning methods were encountered, including naïve Bayes classifiers,82,87 C-support vector machines,82 recurrent neural networks,88 decision-tree learning,89 bidirectional long short-term memory networks,90 and network clustering analysis.91,94 All machine-learning models used name information as the primary input variable. Machine-learning models also included name elements such as name character length, name phonemes, and name substrings.82,84,87,95 Two less common inputs included residential address82 and mother tongue.85
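To make the name-element inputs concrete, the sketch below derives character length, a consonant-to-vowel ratio, and character substrings (n-grams) from a name. The function `name_features`, the boundary markers, and the default substring length are assumptions for illustration; the reviewed models define their own features.

```python
from typing import Dict

VOWELS = set("aeiou")


def name_features(name: str, n: int = 3) -> Dict[str, float]:
    """Turn a name into simple numeric features a classifier could consume."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha())
    vowels = sum(ch in VOWELS for ch in cleaned)
    consonants = len(cleaned) - vowels
    features: Dict[str, float] = {
        "length": float(len(cleaned)),
        # Fall back to the raw consonant count when the name has no vowels.
        "consonant_vowel_ratio": consonants / vowels if vowels else float(consonants),
    }
    # "^" and "$" mark the start and end so prefixes and suffixes are distinct.
    padded = f"^{cleaned}$"
    for i in range(len(padded) - n + 1):
        key = f"ngram:{padded[i:i + n]}"
        features[key] = features.get(key, 0.0) + 1.0
    return features
```

A feature dictionary like this could then be passed to any of the classifier families listed above; phoneme features would need a pronunciation resource and are omitted here.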

Data linkage

Data linkage articles (n = 9, 8%) connected individuals from 1 data set with missing race/ethnicity information to another data set with complete information. Data linkage could occur through matching variables such as name, age, gender, date of birth, social security numbers, address, and identification numbers specific to the given data system.96,98 In some articles, probabilistic linkage algorithms calculated match probabilities for individuals with specific cutoffs.99,101
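One common way to score candidate record pairs against a cutoff is a Fellegi-Sunter style match weight, sketched below. This is a swapped-in illustration, not necessarily the algorithm used in the cited articles; the field names, the m and u probabilities, and the cutoff are all hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Record = Dict[str, Optional[str]]

# Hypothetical per-field parameters (invented values):
# m = P(fields agree | same person), u = P(fields agree | different people).
FIELD_PARAMS: Dict[str, Tuple[float, float]] = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "zip_code": (0.90, 0.05),
}


def match_weight(rec_a: Record, rec_b: Record) -> float:
    """Sum log2 agreement and disagreement weights across the linkage fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is not None and a == b:
            total += math.log2(m / u)  # agreement adds evidence for a match
        else:
            # Simplification: a missing value is treated as a disagreement.
            total += math.log2((1 - m) / (1 - u))
    return total


def is_link(rec_a: Record, rec_b: Record, cutoff: float = 5.0) -> bool:
    """Accept the pair as a link when the total weight reaches the cutoff."""
    return match_weight(rec_a, rec_b) >= cutoff
```

Once a pair is accepted as a link, the race/ethnicity value from the complete data set can be carried over to the record that lacks it.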

Other

Other category articles (n = 6, 5%) included 3 articles about studies in which authors used only geocoding to assign race/ethnicity102,104; 2 articles in which multiple imputation was used, where logistic regression modeling is used to predict missing race/ethnicity data105,106; and 1 article reporting on a study in which researchers used mother tongue and birth country to classify race/ethnicity.107

Reference population

Reference population describes the data used to develop the article’s method for predicting race/ethnicity. Table S1 provides individual article data detail. For name-based methods (excluding expert review), the reference population was frequently preexisting administrative or secondary data sets with name and race/ethnicity information.28,29,60,108,112 In some cases, country of birth was used if race/ethnicity information was not available.28,44,58,113,114 Name-based methods were developed through sources such as ethnic telephone directories,108,115,117 published ethnic surname dictionaries,33,35,45 baby-name books,37 and lists provided by community organizations that engaged the targeted race/ethnicity group.37,110 Several articles used name data that were created by authors combining or supplementing previously developed name lists31,37,109,110,115,120; some name lists were cross-checked by expert reviewers with knowledge of the given race/ethnicity.27,109,110,115

There were several reference data sets consistently sourced by multiple studies in this review, including the Spanish Census Surname list, US Census, Social Security Administration data, and Medicare Beneficiary Records. The Spanish Census Surname list, which has been published every decade since 1970, was frequently used or adapted to identify Hispanic race/ethnicity.59,61,119,124 Similarly, a surname list developed by the US Census in 2000 and 2010 for 6 aggregate race/ethnicity groups (Asian, Black, White, American Indian/Alaska Native [AI/AN], Hispanic, multiracial) was commonly cited.36,40,43,46,49,52,55,59,62,64,66,70,76,80,81,125 Last, an Asian surname list for 6 Asian subgroups (Chinese, Asian Indian, Filipino, Vietnamese, Korean, Japanese) developed by Lauderdale and Kestenbaum28 referenced 1995 Social Security Administration and 1998 Medicare Master Beneficiary Records data.39,44,54,58,59,116

The majority of articles (n = 92, 77%) did not document the size of the reference data population. Ten articles (8%) reported on studies with a reference population of more than 1 000 000; 8 studies (7%) had a reference population between 100 000 and 999 999; 8 studies (7%) had a reference population between 10 000 and 99 999; and 2 studies (2%) had reference populations between 1000 and 9999. The mean and median reference populations were 16 808 275 and 302 182, respectively, with a range of 1784 to 222 316 554. There were no studies with reference populations of fewer than 1000 people.

Target population

The target population included description of the data set to which the study’s method for race/ethnicity assignment was applied. The vast majority of target populations were related to public health and included cancer registries24,32,33,43,51,54,57,94,96,99,106,109,111,117,120,126,127; population health survey samples20,21,35,39,45,49,56,63,64,73,104,116,122,124,128,129; surveillance data26,47,68,74,93,105; vital records27,29,31,34,41,81,91,97,119,121,130,132; health care data36,40,42,52,53,55,58,60,61,71,102,108,112; and EHR data.23,37,42,44,65,66,80,98,100,101,115,118,133,134 In several studies, researchers applied methods to internet-based sources such as email databases, Wikipedia, or Twitter.38,50,84,89 Other methods were applied to targeted voter,69,72,90,125,135 economic,22,25,76,83 telephone directory,75,136 or census data.28,82,85,92,137

We reviewed 30 articles (25%) reporting on studies in which classification improvement methods were applied to data set populations of 1 000 000 people or more; 26 articles (22%) reported on populations between 100 000 and 999 999 people; 20 articles (17%) on populations between 10 000 and 99 999 people; 17 articles (14%) on populations between 1000 and 9999 people; and 11 articles (9%) reported on populations between 100 and 999 people. The median and mean target population sizes were 137 632 and 7 276 199, respectively, with a range of 109 to 222 316 554. Sixteen articles (13%) did not describe the target population size.

Race/ethnicity

Among aggregate race/ethnicity categories, the most frequently targeted grouping was Asian (n = 56, 47%), followed by White (n = 51, 43%); Hispanic, Latino, or Spanish origin (n = 49, 41%); Black (n = 42, 35%); Native Hawaiian or Pacific Islander (n = 7, 6%); AI/AN (n = 14, 12%); MENA (n = 14, 12%); multiracial (n = 9, 8%); and other (n = 35, 29%).

Publication year

Articles in the review were published between 1972 and 2022. The largest share of articles was published between 2010 and 2019 (n = 45, 38%), followed by 2000–2009 (n = 37, 31%), 2020–2022 (n = 18, 15%), 1990–1999 (n = 16, 13%), 1980–1989 (n = 3, 3%), and 1970–1979 (n = 1, 1%).

Geographic context

Studies in the reviewed articles predominantly applied improved classification methods to secondary data sets from the United States (n = 65, 54%), United Kingdom (n = 16, 13%), Canada (n = 13, 11%), and Australia (n = 5, 4%). Other data sets were from Germany and the Netherlands (2 studies each [2%]) and from New Zealand, Peru, Kenya, Nepal, Brazil, Malaysia, and France (1 study each [1%]). Ten articles (8%) used data sets sourced from multiple countries.

Validation

Table 3 summarizes characteristics of studies with a validation process. A total of 86 articles (72%) reported some form of validation evaluation in the study method. Among studies that included a validation process, the majority used self-reported race/ethnicity as their gold standard (n = 63, 73%). Other gold standard measures included country of birth (n = 9, 10%), race/ethnicity reported by a second party (n = 8, 9%), parent’s race/ethnicity (n = 4, 5%), parent’s country of birth (n = 3, 3%), preferred language (n = 2, 2%), nationality (n = 2, 2%), self-reported religion (n = 2, 2%), grandparent’s country of origin (n = 1, 1%), or panel review of surname (n = 1, 1%). Three articles (3%) reported use of databases with affiliated race/ethnicity or nationality but were not clear about how those values were assessed. Some articles used more than 1 gold standard measure for validation.
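For readers less familiar with the most commonly reported evaluation metrics, the sketch below computes sensitivity, specificity, and positive and negative predictive values for one target category against a gold standard. The function name and the toy data are illustrative assumptions, not values from any reviewed study.

```python
from typing import Dict, Sequence


def validation_metrics(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Compare predicted membership in one category with the gold standard."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g and p for g, p in pairs)              # true positives
    fn = sum(g and not p for g, p in pairs)          # false negatives
    fp = sum((not g) and p for g, p in pairs)        # false positives
    tn = sum((not g) and (not p) for g, p in pairs)  # true negatives

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data (invented): 3 gold-standard members and 5 non-members.
gold = [True, True, True, False, False, False, False, False]
pred = [True, True, False, True, False, False, False, False]
metrics = validation_metrics(gold, pred)
```

Sensitivity and specificity do not depend on how common the category is in the validation sample, whereas the predictive values do, which is one reason studies often report both pairs.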

Table 2 delineates the 102 unique race/ethnicity categories targeted for classification improvement in the reviewed articles. In most studies on improving classification for Asian race/ethnicity categories, researchers attempted to classify Asian subgroups (n = 44 of 56 articles), whereas studies targeting other aggregate race/ethnicity categories frequently did not classify subgroups. Notably, the majority of unique, targeted, White subgroup categories were from 1 article.115

Table 2

Unique target race/ethnicity categories.^a

Unique target race/ethnicity category | No. of articles^b
American Indian/Alaska Native
 American Indian/Alaska Native | 13
 Quechua | 1
Asian
 Chinese | 24
 South Asian | 17
 Japanese | 11
 Asian | 10
 Indian | 10
 Vietnamese | 9
 Korean | 8
 Filipino | 5
 East Asian | 4
 Pakistani | 4
 Bangladeshi | 3
 Southeast Asian | 3
 Indo-Caucasian | 1
 Malaysian | 1
 Nepali | 1
 Sri Lankan | 1
 Tibeto-Mongolian | 1
 West Asian | 1
Black
 Black | 30
 African | 6
 African American | 2
 Kalenjin | 2
 Kikuyu | 2
 Caribbean | 1
 Kisii | 1
 Luo | 1
Hispanic/Latino/Spanish origin
 Hispanic/Latino | 42
 Cuban | 1
 Mexican | 1
 Portuguese/Spanish | 1
 Puerto Rican | 1
 South and Central American | 1
Middle Eastern/North African
 Arab | 8
 Middle Eastern/North African | 5
 Israeli | 1
 Iranian | 1
Native Hawaiian/Pacific Islander
 Aboriginal | 3
 Indigenous | 1
 Maori | 1
 Moluccan | 1
 Pacific Islander/Native Hawaiian | 1
White
 White | 31
 Italian | 7
 Irish | 5
 German | 5
 Polish | 4
 Russian | 4
 European | 4
 Nordic | 3
 French | 3
 Celtic/English | 3
 British | 2
 Czech | 2
 Dutch | 2
 Estonian | 2
 Greek | 2
 Hungarian | 2
 Latvian | 2
 Lithuanian | 2
 Scottish | 2
 Slovak | 2
 Slovene | 2
 Swedish | 2
 Armenian | 1
 Bosnian/Macedonian | 1
 Bulgarian | 1
 Danish | 1
 Dutch/Flemish | 1
 Finnish | 1
 Iberian | 1
 Norwegian | 1
 Portuguese | 1
 Romanian/Moldavian | 1
 Romany | 1
 Serbian | 1
 Slavic | 1
 Swiss/Romansch/Tyrolean | 1
 Ukrainian | 1
 Welsh | 1
Other
 Asian/Pacific Islander | 19
 Other | 12
 Multiracial | 11
 Turkish | 3
 Jewish | 2
 Surinamese | 1
 Native vs foreign-born^c | 1
Languages
 Punjabi | 3
 Bengali | 2
 Gujarati | 2
 Hebrew | 2
 Hindi | 2
 Urdu | 2
 Ashkenazi/Yiddish | 1
 Spanish | 1
Religion
 Muslim | 7
 Sikh | 5
 Christian | 1
 Ismaili | 1
 Greek Orthodox | 1

a The 77 target nationality categories from Jun et al.88 not included.

b Not mutually exclusive.

c Native vs Foreign Born in the United States.

Table 2

Unique target race/ethnicity categories.a

Unique target race/ethnicity categoryNo. of articlesb
American Indian/Alaska Native
 American Indian/Alaska Native13
 Quechua1
Asian
 Chinese24
 South Asian17
 Japanese11
 Asian10
 Indian10
 Vietnamese9
 Korean8
 Filipino5
 East Asian4
 Pakistani4
 Bangladeshi3
 Southeast Asian3
 Indo-Caucasian1
 Malaysian1
 Nepali1
 Sri Lankan1
 Tibeto-Mongolian1
 West Asian1
Black
 Black30
 African6
 African American2
 Kalenjin2
 Kikuyu2
 Caribbean1
 Kisii1
 Luo1
Hispanic/Latino/Spanish origin
 Hispanic/Latino42
 Cuban1
 Mexican1
 Portuguese/Spanish1
 Puerto Rican1
 South and Central American1
Middle Eastern/North African
 Arab8
 Middle Eastern/North African5
 Israeli1
 Iranian1
Native Hawaiian/Pacific Islander
 Aboriginal3
 Indigenous1
 Maori1
 Moluccan1
 Pacific Islander/Native Hawaiian1
White
 White31
 Italian7
 Irish5
 German5
 Polish4
 Russian4
 European4
 Nordic3
 French3
 Celtic/English3
 British2
 Czech2
 Dutch2
 Estonian2
 Greek2
 Hungarian2
 Latvian2
 Lithuanian2
 Scottish2
 Slovak2
 Slovene2
 Swedish2
 Armenian1
 Bosnian/Macedonian1
 Bulgarian1
 Danish1
 Dutch/Flemish1
 Finnish1
 Iberian1
 Norwegian1
 Portuguese1
 Romanian/Moldavian1
 Romany1
 Serbian1
 Slavic1
 Swiss/Romansch/Tyrolean1
 Ukrainian1
 Welsh1
Other
 Asian/Pacific Islander19
 Other12
 Multiracial11
 Turkish3
 Jewish2
 Surinamese1
 Native vs foreign-bornc1
Languages
 Punjabi3
 Bengali2
 Gujarati2
 Hebrew2
 Hindi2
 Urdu2
 Ashkenazi/Yiddish1
 Spanish1
Religion
 Muslim7
 Sikh5
 Christian1
 Ismaili1
 Greek Orthodox1

a The 77 target nationality categories from Jun et al.88 were not included.

b Not mutually exclusive.

c Native vs Foreign Born in the United States.

Table 3

Validation studies (n = 86).

Validation characteristic                                     No. of articles  Total validation articles, %
Gold standard characteristic (a)
 Race/ethnicity (self-report)                                 63               73
 Country of birth                                             9                10
 Parents’ country of birth                                    3                3
 Race/ethnicity (second-party report)                         8                9
 Language                                                     2                2
 Other                                                        13               15
Evaluation metric (a)
 Sensitivity                                                  54               63
 Specificity                                                  46               53
 Positive predictive value                                    41               48
 Negative predictive value                                    23               27
 Race/ethnicity distribution comparison                       14               16
 Area under the curve for receiver operating characteristic   12               14
 κ Statistic                                                  13               15
 Correlation coefficient                                      10               12
 F score                                                      5                6
 Accuracy                                                     9                10
 Other                                                        12               14

a Not mutually exclusive.

The most common evaluation metric for validation studies was sensitivity (n = 54, 63%), followed by specificity (n = 46, 53%), positive predictive value (n = 41, 48%), and negative predictive value (n = 23, 27%). In 14 articles (16%), researchers compared population distributions between method-assigned race/ethnicity and gold standard race/ethnicity. In 12 articles (14%), researchers calculated the area under the receiver operating characteristic curve, and in 13 articles (15%), κ statistics were calculated. In 10 articles (12%), the average correlation coefficient between predicted race/ethnicity probability and self-reported race/ethnicity was measured. Bias was calculated in 5 articles (6%). Five articles (6%) reported F-score statistics and another 9 articles (10%) reported accuracy. Less common evaluation metrics included average squared error ratios (n = 1, 1%), precision (n = 5, 6%), recall (n = 5, 6%), false-positive rate (n = 1, 1%), false-negative rate (n = 1, 1%), and positive likelihood ratio (n = 1, 1%).
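
To illustrate how the 4 most commonly reported metrics relate to one another, the following minimal sketch computes sensitivity, specificity, positive predictive value, and negative predictive value for a single target group, treating one label source as the gold standard. The function name, the labels "target" and "other", and the toy data are hypothetical and are not taken from any reviewed article.

```python
def validation_metrics(gold, predicted, target):
    """Compare method-assigned labels with gold standard labels for one target group."""
    pairs = list(zip(gold, predicted))
    # Cells of the 2x2 table for the target group versus everyone else.
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    tn = sum(1 for g, p in pairs if g != target and p != target)

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a margin of the table is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # gold-standard members correctly identified
        "specificity": ratio(tn, tn + fp),  # nonmembers correctly excluded
        "ppv": ratio(tp, tp + fp),          # assigned members who are true members
        "npv": ratio(tn, tn + fn),          # assigned nonmembers who are true nonmembers
    }


# Hypothetical toy data: 3 gold-standard members and 5 nonmembers.
gold = ["target", "target", "target", "other", "other", "other", "other", "other"]
predicted = ["target", "target", "other", "other", "other", "other", "target", "other"]
metrics = validation_metrics(gold, predicted, target="target")
```

In this toy example, 2 of the 3 gold-standard members are identified and 1 of the 5 nonmembers is misassigned, so sensitivity and positive predictive value are both 2/3, and specificity and negative predictive value are both 4/5.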

Discussion

In this scoping review, we investigated the literature published on methods to improve classification of race/ethnicity in secondary data sets. We identified 6 main method types for classifying race/ethnicity data: expert review, name lists, name algorithms, machine learning, data linkage, and other. Each of the 5 named method types identified in our review had advantages and disadvantages (Table 4).

Table 4

Method type strengths and limitations.

Method type; Method type strengths
Strength columns: Development requires minimal technical expertise; Appropriate for large data sets; Targets smaller race/ethnicity groups; Transferable to other populations; Targets multiple race/ethnicity groups
Expert review+ab+
Name lists+++
Name algorithms+++
Machine learning+++
Data linkage+++

a + The described Strength is applicable to the given Method Type.

b – The described Strength is not applicable to the given Method Type.

Strengths and limitations by method

Expert review

Expert review requires minimal technical expertise to apply to data sets, which may make this approach ideal for smaller race/ethnicity groups that lack validated name lists or name algorithms and for projects with limited funding to develop their own name list or algorithm. Expert-review methods have also shown moderate to high sensitivity in validation studies.20,21,45 A major disadvantage is that expert review requires time for individual coders to review the population of interest, so it may be appropriate only for small or mid-sized data sets.

Name list

Name-list methods range from the simple compilation of names from ethnic phone books and baby-name dictionaries to the more complex calculation of each listed name’s predicted race/ethnicity probability.28,29 Name lists can be programmed to assign race/ethnicity for large data sets and require fewer resources to develop than more advanced name algorithms. With a few exceptions,28,113,115 a disadvantage of name-list methods is that they are often designed to improve race/ethnicity classification for only 1 target race/ethnicity.

Name algorithms

Name algorithms frequently require more technical expertise because they incorporate additional variables when predicting race/ethnicity. Name algorithms that use Bayes' theorem to estimate the conditional probability of a race/ethnicity, given name and geocoded location data, achieve higher accuracy and better coverage than either variable alone.59,60 Evidence suggests that using smaller geocode units in name algorithms, such as census blocks rather than zip codes, improves race/ethnicity prediction accuracy.70 Notably, Bayesian name algorithms that impute race/ethnicity probabilistically at the population level, rather than deterministically at the individual level, are more accurate and less biased.46,60,70,138 The BISG method is a Bayesian name algorithm endorsed by the National Academy of Medicine, the National Committee for Quality Assurance, and the National Quality Forum, and it is used by health plans such as Kaiser Permanente, Cigna, and Aetna and by the Centers for Medicare & Medicaid Services.60 When compared with traditional multiple imputation methods, the BISG method further reduces bias in nonrandom missing race/ethnicity data,66 and incorporating first names into the BISG improves prediction accuracy.76 A few studies have found that BISG-informed multinomial logistic regression models that use additional variables such as age, sex, and insurance perform better than BISG alone.36,39,80,81 There is also some evidence that models trained on state-specific name data perform better than models built on nationwide census surname lists.80 A limitation of the BISG method is that it targets only 6 aggregate race/ethnicity groups (Asian/Pacific Islander, Black, White, AI/AN, Hispanic, and multiracial) and must be applied to a large population to accurately predict race/ethnicity distributions.60
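
The Bayesian update underlying this family of algorithms can be sketched as follows. This is a minimal illustration, not the published BISG implementation: it assumes that surname and residential location are conditionally independent given race/ethnicity, so that the posterior is proportional to P(race/ethnicity | surname) multiplied by P(location | race/ethnicity). The function name, group labels, and all probabilities are hypothetical.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Bayes update: P(race | surname, geo) is proportional to
    P(race | surname) * P(geo | race), normalized across groups."""
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no race/ethnicity group has nonzero probability")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs for one person.
# Surname-based prior, e.g. looked up from a surname list.
surname_prior = {"group_1": 0.60, "group_2": 0.30, "group_3": 0.10}
# Share of each group's population living in this person's geographic unit.
geo_likelihood = {"group_1": 0.001, "group_2": 0.004, "group_3": 0.001}

posterior = bisg_posterior(surname_prior, geo_likelihood)
```

With these made-up inputs, the location information shifts the most probable group from group_1 (prior 0.60) to group_2 (posterior 12/19, about 0.63), which shows how combining the 2 variables can change an estimate based on surname alone.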

Machine learning

In contrast, machine-learning approaches frequently target a more diverse set of race/ethnicity subgroups. For example, Jun et al.88 trained their algorithm to impute data for 77 nationalities. Machine-learning methods are frequently developed from and applied to large data sets drawn from multiple countries, providing wider coverage and potentially more accurate prediction of different race/ethnicity groups. Nonetheless, a disadvantage of machine-learning methods is that they are less easily updated than name lists or name algorithms to reflect changing demographic name trends, because their entire structure must be rebuilt for pattern detection.92 Machine-learning approaches may also have limited transferability to other populations if the model is overfit to training data; this is particularly notable because many machine-learning methods are developed from scraped internet data sources, such as Twitter or Wikipedia.84,86,87,89 The applicability of machine-learning methods developed from large web-based sources to more localized data sets remains unclear.

Data linkage

Data linkage methods can connect individual-level data with limited or missing race/ethnicity to gold standard, self-reported race/ethnicity measures. However, these methods are only relevant when a linkable data set with race/ethnicity information is available. Interestingly, 5 of the 8 data linkage articles targeted AI/AN or Aboriginal populations for race/ethnicity improvement.97,99,101,137 This may be related to the structure of Indigenous or Native data systems in the United States and Australia but is notable because name-list and name-algorithm methods have been reported to be poor at predicting race/ethnicity for this population.60,76
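
In its simplest form, the linkage step can be sketched as a lookup on a shared identifier. This sketch assumes exact matching on a hypothetical `person_id` field, whereas real linkage projects often rely on probabilistic matching across several fields; the function name, field names, and records are illustrative only.

```python
def fill_from_linked(records, linked_self_report):
    """Fill missing race/ethnicity using a linked source keyed by a shared identifier."""
    filled = []
    for record in records:
        updated = dict(record)  # copy so the input records are left unchanged
        if updated.get("race_ethnicity") is None:
            # Stays None when the person is absent from the linked data set.
            updated["race_ethnicity"] = linked_self_report.get(updated["person_id"])
        filled.append(updated)
    return filled


# Hypothetical administrative records with missing values.
records = [
    {"person_id": 1, "race_ethnicity": None},
    {"person_id": 2, "race_ethnicity": "group_1"},
    {"person_id": 3, "race_ethnicity": None},
]
# Hypothetical linked source with self-reported race/ethnicity.
linked = {1: "group_2", 2: "group_3"}

filled = fill_from_linked(records, linked)
```

Here only missing values are filled, so person 2 keeps the existing value and person 3 remains missing because no linked record exists. Whether the linked source should overwrite an existing value is a design choice that this sketch does not address.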

Potential harms of reviewed methods

Across all methods for predicting race/ethnicity, there is potential for harm, particularly against communities of color, related to implicit and explicit biases from the reclassification of race/ethnicity. Among name-based methods (i.e., expert review, name lists, name algorithms, machine learning), not all race/ethnicity groups are equally distinguishable from name information alone. In the United States, AI/AN, Black, and multiracial names are particularly challenging to differentiate.60,139 As noted by Kozlowski et al.,46 using name methods that attempt to predict race/ethnicity for multiple groups can lead to underestimation of groups with poor prediction, particularly among Black populations. This is exacerbated for name-based methods assigning race/ethnicity at the individual level using probability threshold cutoffs. Because some names are associated with more than 1 race/ethnicity group, such as “Lee”, deterministically assigning a name to a race/ethnicity above a given probability threshold or to the race/ethnicity group with the highest probability underestimates other race/ethnicity groups that may be associated with that name but to a lesser degree.46,60,138 Methods that indirectly estimate the distribution of multiple race/ethnicity groups at the population level can mitigate bias by aggregating the probabilities of each individual in the given cohort population being classified into each potential race/ethnicity group.46,60,70 Factors such as cultural integration, intermarriage, adoption, and name change can also further confound name-based race/ethnicity predictions.14,28,139
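
The difference between deterministic individual-level assignment and probabilistic population-level estimation can be shown with a small sketch. The cohort, group labels, and probabilities below are hypothetical, and the functions are illustrative rather than any reviewed method's implementation.

```python
def deterministic_counts(probabilities):
    """Assign each person to the single most probable group, then count."""
    counts = {}
    for person in probabilities:
        best = max(person, key=person.get)
        counts[best] = counts.get(best, 0) + 1
    return counts


def probabilistic_counts(probabilities):
    """Sum each group's probability across people to estimate expected counts."""
    totals = {}
    for person in probabilities:
        for group, p in person.items():
            totals[group] = totals.get(group, 0.0) + p
    return totals


# Hypothetical cohort of 10 people who share a name associated with 2 groups.
cohort = [{"group_1": 0.6, "group_2": 0.4}] * 10

deterministic = deterministic_counts(cohort)
probabilistic = probabilistic_counts(cohort)
```

In this example, deterministic assignment places all 10 people in group_1 and none in group_2, whereas summing the probabilities yields expected counts of about 6 and 4, which preserves the less probable group in the population estimate.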

Another source of bias could be related to the dynamic nature of how race/ethnicity categories are defined. At both individual and institutional levels, race/ethnicity groupings are not static. How individuals self-identify their race/ethnicity can change over a 10-year period, particularly among AI/AN, Native Hawaiian, Pacific Islander, and multiracial persons.140 Similarly, the US Census race/ethnicity categories have evolved since the inception of the Census, influenced by political attitudes, advocacy, and immigration, and these categories can influence how individuals choose to self-identify.141 Methods included in this review assume that patterns of racial/ethnic self-identity and classification are consistent across both the reference populations used to develop the method and the target populations to which the method is being applied. Both underestimation and overestimation of predicted race/ethnicity groups can occur depending on how trends of self-identification vary.

The nature of racial/ethnic groups is also rooted in a history of racial/ethnic stratification.142 The categorization of people into US Census race/ethnicity groups, originally developed to scientifically reinforce the dominance of the White racial/ethnic class, carries the potential of reinforcing and perpetuating inequities through stigmatizing and pathologizing racial/ethnic minority groups.142 There is an increasing amount of literature calling on researchers to move beyond the documentation of racial/ethnic health disparities, to antiracist praxis that emphasizes the examination of the effects and pervasiveness of racial structures within oppressive systems like health data systems143 and the operationalization and measurement of structural racism.144 Racial/ethnic disparities in health and health care use manifest as a result of inequitable access to social and economic resources and experiences of racism at the individual and structural levels. Contextualizing health data with respect to these determinants rather than race/ethnicity may better represent the underlying mechanisms that drive disparities and help minimize theories of biological inferiority and discriminatory behavior.

Last, across all methods, identification of race/ethnicity can be sufficient to reveal an individual’s identity, particularly among racial/ethnic minority groups. Researchers must be cautious when reporting race/ethnicity for small population cohorts. Recent computer science research has also raised concerns about how machine-learning models memorize training data.145,146 Findings have suggested that machine-learning models trained on health care data sets can be made to reveal individual-level health information, particularly for patients with outlier characteristics within the training data.145,146

Looking forward

In the United States, the widely adopted BISG algorithm could be further adapted to improve race/ethnicity classification for racial/ethnic subgroups. The release of updated Census surname lists for disaggregated race/ethnicity categories from the 2020 Census provides an opportunity to improve the BISG. However, it is unclear when the Census plans to release the updated disaggregated name list; prior lists have been released 6 to 7 years after a decennial Census. Another method would be to leverage other race/ethnicity name lists such as those for Asian or MENA subgroups.28,30,126 Name lists for these populations have been validated and may be able to provide similar race/ethnicity probability estimates.

Although this review provides guidance regarding the strengths and limitations of extant methods for improving race/ethnicity data quality in secondary data sets, there is still a need for more validation studies to quantitatively assess which methods are best and under what circumstances. There is especially limited research on how name algorithms compare with machine-learning methods. Kandt and Longley92 compared their name algorithm’s performance with the machine-learning project Onomap, analyzing 2011 UK Census data, and found that each performed best for different race/ethnicity groups. Research evaluating the performance of name-algorithm and machine-learning methods could provide insight into the trade-offs between the 2 methods.

This review focused on how to retrospectively improve racial/ethnic data quality; however, future work should also focus on the prospective improvement of data collection and reporting of racial/ethnic and other demographic information at the local, state, and national levels, and should interrogate other measures of personal and structural racism’s influence on health. Racial/ethnic disparities exist even if data do not; improving race/ethnicity data quality, both retrospectively and prospectively, while also exploring the intersectional effects of race/ethnicity, socioeconomic status, and racism, is important.

Study limitations

We did not contact the primary researchers of studies that were excluded for providing limited or no description of their race/ethnicity classification method. As a result, some studies with race/ethnicity classification methods may not have been included. We also included only articles written in English, which may undercount methods to improve race/ethnicity data used in other countries. Additionally, our search did not specifically include “Twitter” in our keywords and did not cover the same articles as the review by Golder et al.15; notably, we did not include articles that inferred race/ethnicity on the basis of an individual’s photograph or physical appearance.

Conclusion

We provide an overview of ways to improve race/ethnicity data quality in administrative and secondary data sets that can aid researchers in selecting the best method based on resource and data input availability. Findings emphasize the need for innovative methods to better identify race/ethnicity subgroups and additional validation research comparing more advanced approaches. This information is critical to addressing the systematic missingness of relevant demographic data and perpetual failure to accurately and efficiently collect and report disaggregated data by race/ethnicity that can erroneously guide policy decisions on funding and resource allocation and hinder the effectiveness of health care practices and intervention.

Acknowledgments

Presented in part at the American Public Health Association 2022 Annual Meeting and Expo, Boston, Massachusetts, November 6–9, 2022.

Supplementary material

Supplementary material is available at Epidemiologic Reviews online.

Funding

The research described in this article is supported by grants U54MD000538 from the National Institutes of Health (NIH) National Institute on Minority Health and Health Disparities; R01HL141427 from the NIH National Heart, Lung and Blood Institute; and NU38OT2020001477 from the Centers for Disease Control and Prevention (CDC) and New York State (NYS).

Conflict of interest

The authors declare no conflicts of interest.

Disclaimer

The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the NIH, the CDC, or NYS.

Data availability statement

The data set is available from the corresponding author.

References

1. Henry J, Pylypchuk Y, Searcy T, et al. Adoption of Electronic Health Record Systems among U.S. Non-Federal Acute Care Hospitals: 2008–2015. Washington, DC: Office of the National Coordinator for Health Information Technology. Available from: https://www.healthit.gov/sites/default/files/briefs/2015_hospital_adoption_db_v17.pdf. Accessed December 10, 2021.

2. Lewis C, Getachew Y, Abrams MK, et al. Changes at Community Health Centers, and How Patients Are Benefiting. Washington, DC: The Commonwealth Fund; 2019. Available from: https://www.commonwealthfund.org/sites/default/files/2019-08/Lewis_changes_at_CHCs_patients_benefiting_FQHC_survey_2013-2018.pdf. Accessed December 10, 2021.

3. National Center for Health Statistics. National Electronic Health Records Survey. 2019. Available from: https://www.cdc.gov/nchs/data/nehrs/2019NEHRS-PUF-weighted-estimates-508.pdf. Accessed December 20, 2021.

4. Lin KJ, Schneeweiss S. Considerations for the analysis of longitudinal electronic health records linked to claims data to study the effectiveness and safety of drugs. Clin Pharmacol Ther. 2016;100(2):147–159. https://doi.org/10.1002/cpt.359.

5. Yi SS, Kwon SC, Suss R, et al. The mutually reinforcing cycle of poor data quality and racialized stereotypes that shapes Asian American health. Health Aff (Millwood). 2022;41(2):296–303. https://doi.org/10.1377/hlthaff.2021.01417.

6. Rivara FP, Bradley SM, Catenacci DV, et al. Structural racism and JAMA Network Open. JAMA Netw Open. 2021;4(6):e2120269. https://doi.org/10.1001/jamanetworkopen.2021.20269.

7. Robert Wood Johnson Foundation. Statement from Richard Besser, MD, on racial injustice, violence, and health in America. June 2, 2020. Available from: https://www.rwjf.org/en/library/articles-and-news/2020/06/statement-from-richard-besser-on-racial-injustice-violence-and-health-in-america.html. Accessed January 12, 2021.

8. Churchwell K, Elkind MSV, Benjamin RM, et al. Call to action: structural racism as a fundamental driver of health disparities: a presidential advisory from the American Heart Association. Circulation. 2020;142(24):e454–e468. https://doi.org/10.1161/CIR.0000000000000936.

9. Chin MK, Đoàn LN, Chong SK, et al. Asian American Subgroups and the COVID-19 Experience: What We Know And Still Don’t Know. Health Affairs Forefront [Web log post]. March 25, 2022. https://www.healthaffairs.org/do/10.1377/forefront.20220323.555023/. Accessed December 15, 2021.

10. Bakkar H. Trying to assess COVID's impact on Arab-American communities is complicated. NPR. April 11, 2021. Available from: https://www.npr.org/2021/04/11/985128948/trying-to-assess-covids-impact-on-arab-american-communities-is-complicated. Accessed June 23, 2021.

11. Howland RE, Tsao TY. Evaluating race and ethnicity reported in hospital discharge data and its impact on the assessment of health disparities. Med Care. 2020;58(3):280–284. https://doi.org/10.1097/MLR.0000000000001259.

12. Klinger EV, Carlini SV, Gonzalez I, et al. Accuracy of race, ethnicity, and language preference in an electronic health record. J Gen Intern Med. 2015;30(6):719–723. https://doi.org/10.1007/s11606-014-3102-8.

13. Centers for Disease Control and Prevention. Demographic trends of COVID-19 cases and deaths in the US reported to CDC. CDC COVID Data Tracker. 2023. Available from: https://covid.cdc.gov/covid-data-tracker/#demographics. Accessed November 30, 2020.

14. Mateos P. A review of name-based ethnicity classification methods and their potential in population studies. Popul Space Place. 2007;13(4):243–263. https://doi.org/10.1002/psp.457.

15. Golder S, Stevens R, O'Connor K, et al. Methods to establish race or ethnicity of Twitter users: scoping review. J Med Internet Res. 2022;24(4):23. https://doi.org/10.2196/35788.

16. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. https://doi.org/10.1136/bmj.n71.

17. Arksey H, O'Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol. 2005;8(1):19–32. https://doi.org/10.1080/1364557032000119616.

18. Asian & Pacific Islander American Health Forum. Policy Recommendations: Health Equity Cannot Be Achieved Without Complete and Transparent Data Collection and the Disaggregation of Data. Washington, DC: Asian & Pacific Islander American Health Forum; 2021. Available from: https://www.apiahf.org/wp-content/uploads/2021/02/APIAHF-Policy-Recommendationas-Health-Equity.pdf. Accessed December 15, 2021.

19. Institute of Medicine. Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement. Washington, DC: The National Academies Press; 2009. https://doi.org/10.17226/12696.

20. Thornton D, Martin TPC, Amin P, et al. Chronic suppurative otitis media in Nepal: ethnicity does not determine whether disease is associated with cholesteatoma or not. J Laryngol Otol. 2011;125(1):22–26. https://doi.org/10.1017/S0022215110001878.

21. Bouwhuis CB, Moll HA. Determination of ethnicity in children in the Netherlands: two methods compared. Eur J Epidemiol. 2003;18(5):385–388. https://doi.org/10.1023/a:1024205226239.

22. Yenkey CB. Fraud and market participation: social relations as a moderator of organizational misconduct. Adm Sci Q. 2018;63(1):43–84. https://doi.org/10.1177/0001839217694359.

23. Boxall N, David M, Schalinski E, et al. Perinatal outcome in women with a Vietnamese migration background - retrospective comparative data analysis of 3000 deliveries. Geburtshilfe Frauenheilkd. 2018;78(7):697–706. https://doi.org/10.1055/a-0636-4224.

24. Virk R, Gill S, Yoshida E, et al. Racial differences in the incidence of colorectal cancer. Can J Gastroenterol Hepatol. 2010;24(1):47–51. https://doi.org/10.1155/2010/565613.

25. Young IP, Castaneda JA. Color of money as compared to color of principals: an assessment of pay for male elementary school principals varying in surname (Hispanic vs. non-Hispanic). Educ Adm Q. 2008;44(5):675–703. https://doi.org/10.1177/0013161X08323791.

26. Sargent JD, Stukel TA, Dalton MA, et al. Iron deficiency in Massachusetts communities: socioeconomic and demographic risk factors among children. Am J Public Health. 1996;86(4):544–550. https://doi.org/10.2105/ajph.86.4.544.

27. Nicoll A, Bassett K, Ulijaszek SJ. What's in a name? Accuracy of using surnames and forenames in ascribing Asian ethnic identity in English populations. J Epidemiol Community Health. 1986;40(4):364–368. https://doi.org/10.1136/jech.40.4.364.

28. Lauderdale DS, Kestenbaum B. Asian American ethnic identification by surname. Popul Res Policy Rev. 2000;19(3):283–300. https://doi.org/10.1023/A:1026582308352.

29. Choi BCK, Hanley AJG, Holowaty EJ, et al. Use of surnames to identify individuals of Chinese ancestry. Am J Epidemiol. 1993;138(9):723–734. https://doi.org/10.1093/oxfordjournals.aje.a116910.

30. El-Sayed AM, Lauderdale DS, Galea S. Validation of an Arab name algorithm in the determination of Arab ancestry for use in health research. Ethn Health. 2010;15(6):639–647. https://doi.org/10.1080/13557858.2010.505979.

31. Schwartz K, Beebani G, Sedki M, et al. Enhancement and validation of an Arab surname database. J Registry Manag. 2013;40(4):176–179.

32. Razum O, Zeeb H, Akgun S. How useful is a name-based algorithm in health research among Turkish migrants in Germany? Trop Med Int Health. 2001;6(8):654–661. https://doi.org/10.1046/j.1365-3156.2001.00760.x.

33. Cummins C, Winter H, Cheng KK, et al. An assessment of the Nam Pehchan computer program for the identification of names of South Asian ethnic origin. J Public Health Med. 1999;21(4):401–406. https://doi.org/10.1093/pubmed/21.4.401.

34. Mangtani P, Maringe C, Rachet B, et al. Cancer mortality in ethnic South Asian migrants in England and Wales (1993-2003): patterns in the overall population and in first and subsequent generations. Br J Cancer. 2010;102(9):1438–1443. https://doi.org/10.1038/sj.bjc.6605645.

35. Macfarlane GJ, Lunt M, Palmer B, et al. Determining aspects of ethnicity amongst persons of South Asian origin: the use of a surname-classification programme (Nam Pehchan). Public Health. 2007;121(3):231–236. https://doi.org/10.1016/j.puhe.2006.07.001.

36. Silva GC, Trivedi AN, Gutman R. Developing and evaluating methods to impute race/ethnicity in an incomplete dataset. Health Serv Outcomes Res Methodol. 2019;19(2–3):175–195. https://doi.org/10.1007/s10742-019-00200-9.

37. Nanchahal K, Mangtani P, Alston M, et al. Development and validation of a computerized South Asian names and group recognition algorithm (SANGRA) for use in British health-related studies. J Public Health Med. 2001;23(4):278–285. https://doi.org/10.1093/pubmed/23.4.278.

38. Karaulova M, Gok A, Shapira P. Identifying author heritage using surname data: an application for Russian surnames. J Assoc Inf Sci Technol. 2019;70(5):488–498. https://doi.org/10.1002/asi.24104.

39. Haas A, Elliott MN, Dembosky JW, et al. Imputation of race/ethnicity to enable measurement of HEDIS performance by race/ethnicity. Health Serv Res. 2019;54(1):13–23. https://doi.org/10.1111/1475-6773.13099.

40. Eicheldinger C, Bonito A. More accurate racial and ethnic codes for Medicare administrative data. Health Care Financ Rev. 2008;29(3):27–42.

41. Elo IT, Turra CM, Kestenbaum B, et al. Mortality among elderly Hispanics in the United States: past evidence and new results. Demography. 2004;41(1):109–128. https://doi.org/10.1353/dem.2004.0001.

42. Nitsch D, Kadalayil L, Mangtani P, et al. Validation and utility of a computerized South Asian names and group recognition algorithm in ascertaining South Asian ethnicity in the national renal registry. QJM. 2009;102(12):865–872. https://doi.org/10.1093/qjmed/hcp142.

43. Atekruse SF, Cosgrove C, Cronin K, et al. Comparing cancer registry abstracted and self-reported data on race and ethnicity. J Registry Manag. 2017;44(1):30–33.

44. Wong EC, Palaniappan LP, Lauderdale DS. Using name lists to infer Asian racial/ethnic subgroups in the healthcare setting. Med Care. 2010;48(6):540–546. https://doi.org/10.1097/MLR.0b013e3181d559e9.

45. Harding S, Dews H, Simpson SL. The potential to identify South Asians using a computerised algorithm to classify names. Popul Trends. 1999;97:46–49.

46. Kozlowski D, Murray DS, Bell A, et al. Avoiding bias when inferring race using name-based approaches. PLoS One. 2022;17(3):e0264270. https://doi.org/10.1371/journal.pone.0264270.

47. Petersen J, Kandt J, Longley PA. Names-based ethnicity enhancement of hospital admissions in England, 1999-2013. Int J Med Inform. 2021;149:104437. https://doi.org/10.1016/j.ijmedinf.2021.104437.

48. Webber R. Using names to segment customers by cultural, ethnic or religious origin. J Direct Data Digit Mark Pract. 2007;8(3):226–242. https://doi.org/10.1057/palgrave.dddmp.4350051.

49. Haas A, Adams JL, Haviland AM, et al. The contribution of first-name information to the accuracy of racial-and-ethnic imputations varies by sex and race-and-ethnicity among Medicare beneficiaries. Med Care. 2022;60(8):556–562. https://doi.org/10.1097/MLR.0000000000001732.

50. Webber R, Roe P, Lewison G. Do scientists with foreign names improve European medical research? A preliminary study of a new methodology. J Scientometr Res. 2021;10(1):1–8. https://doi.org/10.5530/jscires.10.1.1.

51. Boscoe FP, Schymura MJ, Zhang X, et al. Heuristic algorithms for assigning Hispanic ethnicity. PLoS One. 2013;8(2):e55689. https://doi.org/10.1371/journal.pone.0055689.

52. Sandberg TJ, Wilson AR, Rodin H, et al. Improving the imputation of race: evaluating the benefits of stratifying by age. Popul Health Manag. 2009;12(6):325–331. https://doi.org/10.1089/pop.2009.0006.

53. Yee K, Hoopes M, Giebultowicz S, et al. Implications of missingness in self-reported data for estimating racial and ethnic disparities in Medicaid quality measures. Health Serv Res. 2022;57(6):1370–1378. https://doi.org/10.1111/1475-6773.14025.

54. Hsieh MC, Pareti LA, Chen VW. Using NAPIIA to improve the accuracy of Asian race codes in registry data. J Registry Manag. 2011;38(4):190–195.

55. Adjaye-Gbewonyo D, Bednarczyk RA, Davis RL, et al. Using the Bayesian improved surname geocoding method (BISG) to create a working classification of race and ethnicity in a diverse managed care population: a validation study. Health Serv Res. 2014;49(1):268–283. https://doi.org/10.1111/1475-6773.12089.

56. Hazuda HP, Comeaux PJ, Stern MP, et al. A comparison of three indicators for identifying Mexican Americans in epidemiologic research. Methodological findings from the San Antonio Heart Study. Am J Epidemiol. 1986;123(1):96–112. https://doi.org/10.1093/oxfordjournals.aje.a114228.

57. Liu L, Tanjasiri SP, Cockburn M. Challenges in identifying Native Hawaiians and Pacific Islanders in population-based cancer registries in the US. J Immigr Minor Health. 2011;13(5):860–866. https://doi.org/10.1007/s10903-010-9381-1.

58. Lauderdale DS, Kestenbaum B. Mortality rates of elderly Asian American populations based on Medicare and social security data. Demography. 2002;39(3):529–540. https://doi.org/10.1353/dem.2002.0028.

59. Elliott MN, Fremont A, Morrison PA, et al. A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Serv Res. 2008;43(5 Pt 1):1722–1736. https://doi.org/10.1111/j.1475-6773.2008.00854.x.

60. Elliott MN, Morrison PA, Fremont A, et al. Using the Census Bureau's surname list to improve estimates of race/ethnicity and associated disparities. Health Serv Outcomes Res Methodol. 2009;9(2):69–83. https://doi.org/10.1007/s10742-009-0047-1.

61. Palacio AM, Tamariz LJ, Uribe C, et al. Can claims-based data be used to recruit black and Hispanic subjects into clinical trials? Health Serv Res. 2012;47(2):770–782. https://doi.org/10.1111/j.1475-6773.2011.01316.x.

62.

Bykov
K
,
Franklin
JM
,
Toscano
M
, et al.
Evaluating cardiovascular health disparities using estimated race/ethnicity: a validation study
.
Med Care
.
2015
;
53
(
12
):
1050
1057
. https://doi.org/10.1097/MLR.0000000000000438.

63.

Smith
CK
,
Bonauto
DK
.
Improving occupational health disparity research: testing a method to estimate race and ethnicity in a working population
.
Am J Ind Med
.
2018
;
61
(
8
):
640
648
. https://doi.org/10.1002/ajim.22850.

64.

Elliott
MN
,
Klein
DJ
,
Kallaur
P
, et al.
Using predicted Spanish preference to target bilingual mailings in a mail survey with telephone follow-up
.
Health Serv Res
.
2019
;
54
(
1
):
5
12
. https://doi.org/10.1111/1475-6773.13088.

65.

Storey
P
,
Murchison
AP
,
Dai
Y
, et al.
Comparing methodologies for imputing ethnicity in an urban ophthalmology clinic
.
Ophthalmic Epidemiol
.
2014
;
21
(
2
):
106
110
. https://doi.org/10.3109/09286586.2014.884603.

66.

Grundmeier
RW
,
Song
L
,
Ramos
MJ
, et al.
Imputing missing race/ethnicity in pediatric electronic health records: reducing bias with use of U.S. Census location and surname data
.
Health Serv Res
.
2015
;
50
(
4
):
946
960
. https://doi.org/10.1111/1475-6773.12295.

67.

Elliott
MN
,
Becker
K
,
Beckett
MK
, et al.
Using indirect estimates based on name and census tract to improve the efficiency of sampling matched ethnic couples from marriage license data
.
Public Opin Q
.
2013
;
77
(
1
):
375
384
https://doi.org/10.1093/poq/nft007.

68.

Sartin
EB
,
Metzger
KB
,
Pfeiffer
MR
, et al.
Facilitating research on racial and ethnic disparities and inequities in transportation: application and evaluation of the Bayesian improved surname geocoding (BISG) algorithm
.
Traffic Inj Prev
.
2021
;
22
(
sup1
):
S32
S37
. https://doi.org/10.1080/15389588.2021.1955109.

69.

Clark
JT
,
Curiel
JA
,
Steelman
TS
.
Minmaxing of Bayesian improved surname geocoding and geography level ups in predicting race
.
Polit Anal
.
2022
;
30
(
3
):
456
462
. https://doi.org/10.1017/pan.2021.31.

70.

DeLuca
K
,
Curiel
JA
.
Validating the applicability of Bayesian inference with surname and geocoding to congressional redistricting
.
Polit Anal
2023;31(3):465–471. https://doi.org/10.1017/pan.2022.14.

71.

Hayes-Bautista
DE
,
Hsu
P
,
Hayes-Bautista
M
, et al.
Latino physician supply in California: sources, locations, and projections
.
Acad Med
.
2000
;
75
(
7
):
727
736
. https://doi.org/10.1097/00001888-200007000-00018.

72.

Imai
K
,
Khanna
K
.
Improving ecological inference by predicting individual ethnicity from voter registration records
.
Polit Anal
.
2016
;
24
(
2
):
263
272
. https://doi.org/10.1093/pan/mpw001.

73.

Kronzer
VL
,
Leasure
EL
,
Halvorsen
AJ
, et al.
Effect of resident gender and surname origin on clinical load: observational cohort study in an internal medicine continuity clinic
.
J Gen Intern Med
.
2020
;
36
(
5
):
1237
1243
. https://doi.org/10.1007/s11606-020-06296-x.

74.

Komahan
K
,
Reidpath
DD
.
A "Roziah" by any other name: a simple Bayesian method for determining ethnicity from names
.
Am J Epidemiol
.
2014
;
180
(
3
):
325
329
. https://doi.org/10.1093/aje/kwu129.

75.

Tjam
EY
.
How to find Chinese research participants: use of a phonologically based surname search method
.
Can J Public Health
.
2001
;
92
(
2
):
138
142
. https://doi.org/10.1007/BF03404948.

76.

Voicu
I
.
Using first name information to improve race and ethnicity classification
.
Stat Public Foreign Policy
.
2018
;
5
(
1
):
1
13
.

77.

Grofman
B
,
Garcia
J
.
Using Spanish surname ratios to estimate proportion Hispanic in California cities via Bayes theorem
.
Soc Sci Q
.
2015
;
96
(
5
):
1511
1527
.

78.

Grofman
B
,
Garcia
JR
.
Using Spanish surname to estimate Hispanic voting population in voting rights litigation: a model of context effects using Bayes' theorem
.
Elect Law J
.
2014
;
13
(
3
):
375
393
. https://doi.org/10.1089/elj.2013.0190.

79.

Clark
E
,
Fredricks
K
,
Woc-Colburn
L
, et al.
Disproportionate impact of the COVID-19 pandemic on immigrant communities in the United States
.
PLoS Negl Trop Dis
.
2020
;
14
(
7
):e0008484. https://doi.org/10.1371/journal.pntd.0008484.

80.

Zavez
K
,
Harel
O
,
Aseltine
RH
.
Imputing race and ethnicity in healthcare claims databases
.
Health Serv Outcomes Res Methodol
.
2022
;
22
:
493
507
. https://doi.org/10.1007/s10742-022-00273-z.

81.

Xue
Y
,
Harel
O
,
Aseltine
RH
Jr
.
Imputing race and ethnic information in administrative health data
.
Health Serv Res
.
2019
;
54
(
4
):
957
963
. https://doi.org/10.1111/1475-6773.13171.

82.

Wong
KO
,
Zaiane
OR
,
Davis
FG
, et al.
A machine learning approach to predict ethnicity using personal name and census location in Canada
.
PloS One
.
2020
;
15
(
11
):e0241239. https://doi.org/10.1371/journal.pone.0241239.

83.

Monasterio
L
.
Surnames and ancestry in Brazil
.
PloS One
.
2017
;
12
(
5
):e0176890. https://doi.org/10.1371/journal.pone.0176890.

84.

Ye
J
,
Han
S
,
Hu
Y
,
Coskun
B
,
Liu
M
,
Qin
H
, et al. Nationality classification using name embeddings.
Presented at the 2017 ACM on Conference on Information and Knowledge Management, Singapore
.
November 6–10, 2017
.
2017
.

85.

Xu
DF
,
Zhang
YX
.
Identifying ethnic occupational segregation
.
J Popul Econ
.
2022
;
35
(
3
):
1261
1296
. https://doi.org/10.1007/s00148-020-00796-0.

86.

Le
TT
,
Himmelstein
DS
,
Hippen
AA
, et al.
Analysis of scientific society honors reveals disparities
.
Cell Syst
.
2021
;
12
(
9
):
900
6.e5
. https://doi.org/10.1016/j.cels.2021.07.007.

87.

Mazières
A
,
Roth
C
.
Large-scale diversity estimation through surname origin inference
.
Bull Sociol Methodol
.
2018
;
139
(
1
):
59
73
.

88.

Jun
J
,
Mizuno
T
.
Detecting ethnic spatial distribution of business people using recurrent neural networks
. WI '19 Companion: IEEE/WIC/ACM International Conference on Web Intelligence.
2019
;
29
34
https://doi.org/10.1145/3358695.3360925.

89.

Ambekar
A
,
Ward
C
,
Mohammed
J
, et al.
Name-ethnicity classification from open sources
.
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
.
2009
;
49
58
https://doi.org/10.1145/1557019.1557032.

90.

Xie
FZ
.
rethnicity: An R package for predicting ethnicity from names
.
SoftwareX
.
2022
;
17
:100965 https://doi.org/10.1016/j.softx.2021.100965.

91.

Lakha
F
,
Gorman
DR
,
Mateos
P
.
Name analysis to classify populations by ethnicity in public health: validation of Onomap in Scotland
.
Public Health
.
2011
;
125
(
10
):
688
696
.

92.

Kandt
J
,
Longley
PA
.
Ethnicity estimation using family naming practices
.
PloS One
.
2018
;
13
(
8
):e0201774. https://doi.org/10.1371/journal.pone.0201774.

93.

Thomas
DR
,
Orife
O
,
Plimmer
A
, et al.
Ethnic variation in outcome of people hospitalised during the first COVID-19 epidemic wave in Wales (UK): an analysis of national surveillance data using Onomap, a name-based ethnicity classification tool
.
BMJ Open
.
2021
;
11
(
8
):e048335. https://doi.org/10.1136/bmjopen-2020-048335.

94.

Affar
S
,
Morrison
DS
,
Campbell
C
.
Cervical cancer incidence by ethnic group in Scotland from 2008 to 2017: a population-based study
.
Eur J Cancer Care
.
2021
;
30
(
5
):
6
. https://doi.org/10.1111/ecc.13441.

95.

Jun
J
,
Mizuno
T
.
Detecting ethnic spatial distribution of business people using machine learning
.
Information
.
2020
;
11
(
4
):197.

96.

Shaw
C
,
Atkinson
J
,
Blakely
T
.
(Mis)classification of ethnicity on the New Zealand cancer registry: 1981-2004
.
N Z Med J
.
2009
;
122
(
1294
):
10
22
.

97.

Frost
F
,
Tollestrup
K
,
Ross
A
, et al.
Correctness of racial coding of American-Indians and Alaska natives on the Washington-State death certificate
.
Am J Prev Med
.
1994
;
10
(
5
):
290
294
. https://doi.org/10.1016/S0749-3797(18)30581-6.

98.

Baumeister
L
,
Marchi
K
,
Pearl
M
, et al.
The validity of information on "race" and "Hispanic ethnicity" in California birth certificate data
.
Health Serv Res
.
2000
;
35
(
4
):
869
883
.

99.

Johnson
JC
,
Soliman
AS
,
Tadgerson
D
, et al.
Tribal linkage and race data quality for American Indians in a state cancer registry
.
Am J Prev Med
.
2009
;
36
(
6
):
549
554
. https://doi.org/10.1016/j.amepre.2009.01.035.

100.

Gibberd
AJ
,
Simpson
JM
,
Eades
SJ
.
Use of family relationships improved consistency of identification of Aboriginal people in linked administrative data
.
J Clin Epidemiol
.
2017
;
90
:
144
155
. https://doi.org/10.1016/j.jclinepi.2017.06.021.

101.

Draper
GK
,
Somerford
PJ
,
Pilkington
AS
, et al.
What is the impact of missing indigenous status on mortality estimates? An assessment using record linkage in Western Australia
.
Aust N Z J Public Health
.
2009
;
33
(
4
):
325
331
. https://doi.org/10.1111/j.1753-6405.2009.00403.x.

102.

Chen
W
,
Petitti
DB
,
Enger
S
.
Limitations and potential uses of census-based data on ethnicity in a diverse community
.
Ann Epidemiol
.
2004
;
14
(
5
):
339
345
https://doi.org/10.1016/j.annepidem.2003.07.002.

103.

Kwok
RK
,
Yankaskas
BC
.
The use of Census data for determining race and education as SES indicators: a validation study
.
Ann Epidemiol
.
2001
;
11
(
3
):
171
177
. https://doi.org/10.1016/s1047-2797(00)00205-2.

104.

Andjelkovich
DA
,
Richardson
RB
,
Enterline
PE
, et al.
Assigning race to occupational cohorts using census block statistics
.
Am J Epidemiol
.
1990
;
131
(
5
):
928
934
. https://doi.org/10.1093/oxfordjournals.aje.a115582.

105.

Oyetunji
TA
,
Crompton
JG
,
Ehanire
ID
, et al.
Multiple imputation in trauma disparity research
.
J Surg Res
.
2011
;
165
(
1
):
e37
e41
. https://doi.org/10.1016/j.jss.2010.09.025.

106.

Montealegre
JR
,
Zhou
R
,
Amirian
ES
, et al.
Uncovering nativity disparities in cancer patterns: multiple imputation strategy to handle missing nativity data in the Surveillance, Epidemiology, and End Results data file
.
Cancer
.
2014
;
120
(
8
):
1203
1211
. https://doi.org/10.1002/cncr.28533.

107.

Rezai
MR
,
Maclagan
LC
,
Donovan
LR
, et al.
Classification of Canadian immigrants into visible minority groups using country of birth and mother tongue
.
Open Med
.
2013
;
7
(
4
):
e85
e93
.

108.

Shah
BR
,
Chiu
M
,
Amin
S
, et al.
Surname lists to identify south Asian and Chinese ethnicity from secondary data in Ontario, Canada: a validation study
.
BMC Med Res Methodol
.
2010
;
10
:42 https://doi.org/10.1186/1471-2288-10-42.

109.

Schwartz
KL
,
Kulwicki
A
,
Weiss
LK
, et al.
Cancer among Arab Americans in the metropolitan Detroit area
.
Ethn Dis
.
2004
;
14
(
1
):
141
146
.

110.

Taylor
RJ
,
Morrell
SL
,
Mamoon
HA
, et al.
Cervical cancer screening in a Vietnamese nominal cohort
.
Ethn Health
.
2003
;
8
(
3
):
251
261
. https://doi.org/10.1080/1355785032000136443.

111.

Swallen
KC
,
Glaser
SL
,
Stewart
SL
, et al.
Accuracy of racial classification of Vietnamese patients in a population-based cancer registry
.
Ethn Dis
.
1998
;
8
(
2
):
218
227
.

112.

Rosenwaike
I
.
Surname analysis as a means of estimating minority elderly: an application using Asian surnames
.
Res Aging
.
1994
;
16
(
2
):
212
227
. https://doi.org/10.1177/0164027594162005.

113.

Webster
BM
.
Bibliometric analysis of presence and impact of ethnic minority researchers on science in the UK
.
Res Eval
.
2004
;
13
(
1
):
69
76
. https://doi.org/10.3152/147154404781776545.

114.

Cook
D
,
Hewitt
D
,
Milner
J
.
Uses of the surname in epidemiologic research
.
Am J Epidemiol
.
1972
;
95
(
1
):
38
45
. https://doi.org/10.1093/oxfordjournals.aje.a121367.

115.

Page
WF
,
Mack
TM
,
Kurtzke
JF
, et al. Epidemiology of multiple-sclerosis in US veterans. 6. Population ancestry and surname ethnicity as risk-factors for multiple-sclerosis.
Neuroepidemiology
1995
;
14
(
6
):
286
296
. https://doi.org/10.1159/000109804.

116.

Quan
H
,
Wang
F
,
Schopflocher
D
, et al.
Development and validation of a surname list to define Chinese ethnicity
.
Med Care
.
2006
;
44
(
4
):
328
333
. https://doi.org/10.1097/01.mlr.0000204010.81331.a9.

117.

Singh-Carlson
S
,
Wong
F
,
Oshan
G
, et al.
Name recognition to identify patients of South Asian ethnicity within the cancer registry
.
Asia Pac J Oncol Nurs
.
2016
;
3
(
1
):
86
92
. https://doi.org/10.4103/2347-5625.170224.

118.

Quan
H
,
Ghali
WA
,
Dean
S
, et al.
Validity of using surname to define Chinese ethnicity
.
Can J Public Health
.
2004
;
95
(
4
):
314
. https://doi.org/10.1007/BF03542932.

119.

Sorenson
SB
.
Identifying Hispanics in existing databases - effect of three methods on mortality patterns of Hispanics and non-Hispanic whites
.
Eval Rev
.
1998
;
22
(
4
):
520
534
. https://doi.org/10.1177/0193841X9802200405.

120.

Sweeney
C
,
Edwards
SL
,
Baumgartner
KB
, et al.
Recruiting Hispanic women for a population-based study: validity of surname search and characteristics of nonparticipants
.
Am J Epidemiol
.
2007
;
166
(
10
):
1210
1219
. https://doi.org/10.1093/aje/kwm192.

121.

Rosenwaike
I
,
Hempstead
K
,
Rogers
RG
.
Using surname data in U.S. Puerto Rican mortality analysis
.
Demography
.
1991
;
28
(
1
):
175
180
. https://doi.org/10.2307/2061342.

122.

Morgan
RO
,
Wei
II
,
Virnig
BA
.
Improving identification of Hispanic males in Medicare: use of surname matching
.
Med Care
.
2004
;
42
(
8
):
810
816
. https://doi.org/10.1097/01.mlr.0000132392.49176.5a.

123.

Clarke
LC
,
Rull
RP
,
Ayanian
JZ
, et al.
Validity of race, ethnicity, and national origin in population-based cancer registries and rapid case ascertainment enhanced with a Spanish surname list
.
Med Care
.
2016
;
54
(
1
):
E1
E8
. https://doi.org/10.1097/MLR.0b013e3182a30350.

124.

Winkleby
MA
,
Rockhill
B
.
Comparability of self-reported Hispanic ethnicity and Spanish surname coding
.
Hisp J Behav Sci
.
1992
;
14
(
4
):
487
495
. https://doi.org/10.1177/07399863920144006.

125.

Harris
JA
.
What's in a name? A method for extracting information about ethnicity from names
.
Polit Anal
.
2015
;
23
(
2
):
212
224
. https://doi.org/10.1093/pan/mpu038.

126.

Nasseri
K
,
Mills
PK
,
Allan
M
.
Cancer incidence in the middle eastern population of California, 1988-2004
.
Asian Pac J Cancer Prev
.
2007
;
8
(
3
):
405
411
.

127.

Pinheiro
PS
,
Sherman
R
,
Fleming
LE
, et al.
Validation of ethnicity in cancer data: which Hispanics are we misclassifying?
J Registry Manag
.
2009
;
36
(
2
):
42
46
.

128.

Gonzales
GF
,
Villena
A
,
Ubilluz
M
.
Age at menarche in Peruvian girls at sea level and at high altitude: effect of ethnic background and socioeconomic status
.
Am J Hum Biol
.
1996
;
8
(
4
):
457
463
. https://doi.org/10.1002/(SICI)1520-6300(1996)8:4<457::AID-AJHB5>3.0.CO;2-V.

129.

Dhaliwal
J
,
Tuna
M
,
Shah
BR
, et al.
Incidence of inflammatory bowel disease in South Asian and Chinese people: a population-based cohort study from Ontario, Canada
.
Clin Epidemiol
.
2021
;2021(
13)
:
1109
1118
. https://doi.org/10.2147/CLEP.S336517

130.

Bodewes
AJ
,
Agyemang
C
,
Kunst
AE
.
All-cause mortality among three generations of Moluccans in the Netherlands
.
Eur J Public Health
.
2019
;
29
(
3
):
463
467
. https://doi.org/10.1093/eurpub/cky255.

131.

Sheth
T
,
Nargundkar
M
,
Chagani
K
, et al.
Classifying ethnicity utilizing the Canadian Mortality Data Base
.
Ethn Health
.
1997
;
2
(
4
):
287
295
. https://doi.org/10.1080/13557858.1997.9961837.

132.

Coldman
AJ
,
Braun
T
,
Gallagher
RP
.
The classification of ethnic status using name information
.
J Epidemiol Community Health
.
1988
;
42
(
4
):
390
395
. https://doi.org/10.1136/jech.42.4.390.

133.

Rosales
M
,
Smith
SA
,
Stallones
L
.
Newspaper coverage of injuries affecting the Spanish surname population in two counties in Colorado
.
Psychol Rep
.
2006
;
99
(
2
):
651
658
. https://doi.org/10.2466/pr0.99.2.651-658.

134.

Chan
C
.
The quality of life of women of Chinese origin
.
Health Soc Care Community
.
2000
;
8
(
3
):
212
222
. https://doi.org/10.1046/j.1365-2524.2000.00243.x.

135.

Chaudhry
S
,
Fink
A
,
Gelberg
L
, et al.
Utilization of Papanicolaou smears by South Asian women living in the United States
.
J Gen Intern Med
.
2003
;
18
(
5
):
377
384
. https://doi.org/10.1046/j.1525-1497.2003.20427.x.

136.

Hage
BH-H
,
Oliver
RG
,
Powles
JW
, et al.
Telephone directory listings of presumptive Chinese surnames: an appropriate sampling frame for a dispersed population with characteristic surnames
.
Epidemiology
.
1990
;
1
(
5
):
405
407
.

137.

Gialamas
A
,
Pilkington
R
,
Berry
J
, et al.
Identification of aboriginal children using linked administrative data: consequences for measuring inequalities
.
J Paediatr Child Health
.
2016
;
52
(
5
):
534
540
. https://doi.org/10.1111/jpc.13132.

138.

Fremont
A
,
Weissman
JS
,
Hoch
E
, et al.
When race/ethnicity data are lacking: using advanced indirect estimation methods to measure disparities
.
Rand Health Q
.
2016
;
6
(
1
):
16
.

139.

Fiscella
K
,
Fremont
AM
.
Use of geocoding and surname analysis to estimate race and ethnicity
.
Health Serv Res
.
2006
;
41
(
4 Pt 1
):
1482
1500
. https://doi.org/10.1111/j.1475-6773.2006.00551.x.

140.

Liebler
CA
,
Porter
SR
,
Fernandez
LE
, et al.
America's churning races: race and ethnicity response changes between census 2000 and the 2010 census
.
Demography
.
2017
;
54
(
1
):
259
284
. https://doi.org/10.1007/s13524-016-0544-0.

141.

Brown A.

The changing categories the U.S. census has used to measure race
.”
Washington, DC
:
Pew Research Center
February 25, 2020
.
Available from
: https://www.pewresearch.org/fact-tank/2020/02/25/the-changing-categories-the-u-s-has-used-to-measure-race/.
Accessed December 15, 2021
.

142.

Zuberi
T
.
Thicker Than Blood: How Racial Statistics Lie
.
Minneapolis, MN
:
University of Minnesota Press
;
2003
.

143.

Yellow Horse
AJ
,
Patterson
SE
.
Greater inclusion of Asian Americans in aging research on family caregiving for better understanding of racial health inequities
.
Gerontologist
.
2021
;
62
(
5
):
704
710
. https://doi.org/10.1093/geront/gnab156.

144.

Adkins-Jackson
PB
,
Chantarat
T
,
Bailey
ZD
, et al.
Measuring structural racism: a guide for epidemiologists and other health researchers
.
Am J Epidemiol
.
2021
;
191
(
4
):
539
547
. https://doi.org/10.1093/aje/kwab239.

145.

Feldman
V
,
Zhang
C
.
What neural networks memorize and why: discovering the long tail via influence estimation
.
Adv Neural Inf Process Syst
.
2020
;
33
:
2881
2891
https://doi.org/10.48550/arXiv.2008.03703.

146.

Rinberg
R
,
Cummings
R
,
Bellovin
S
.
How do machine learning models memorize training data
.
Presented at
.
New York, NY
:
Columbia University Data Science Day
;
April 6, 2022
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data