Abstract

Both UNESCO and OECD have recognized the public policy benefit of publicizing information on linguistic diversity on the Internet. However, the published methodologies for estimating “linguistic diversity” or “Internet statistics (by language)” do so with different interpretations of these key terms. This article creates a new taxonomy, defining and contrasting user activity, user profile, web presence, and diversity index to distinguish among the various indicators used to estimate language usage on the Internet. This taxonomy facilitates comparisons of the available methodologies, whose limitations are then critiqued. It also helps to resolve the apparent paradox as to whether the use of English on the Internet has declined rapidly or has remained fairly stable. The study concludes that the best estimates of web presence can be achieved by direct measurement: randomly addressing and analyzing a representative sample of all public websites. However, this approach will only suffice if the language detection software used is progressively extended to recognize all the world’s written languages.

Introduction

In recent years, both the Organization for Economic Co-operation and Development (OECD) and the United Nations Educational, Scientific, and Cultural Organization (UNESCO) have published data on the relative use of the world’s major languages on the Internet. Initially, the OECD, in its biennial Communications Outlook, only published data on the online use of languages on websites devoted to e-commerce (OECD, 2003, 2005), but in 2006 it compared “references to secure servers by language” with “Internet users by language” (OECD, 2006, p. 22). However, these estimates measure quite different indicators of language use, the first being of web presence and the second of potential Internet usage. Neither estimate measures actual Internet activity by users in different languages.

UNESCO, at its 2005 World Summit on the Information Society in Tunis, committed itself “to promote the inclusion of all peoples in the Information Society through the development and use of local and/or indigenous languages in ICTs” (UNESCO, 2005, n.p.). As an input document for the Tunis Summit, UNESCO sponsored the production of a report entitled Measuring Linguistic Diversity on the Internet, comprising articles by several researchers (Paolillo, Pimienta, & Prado, 2005), which added substantially to the published research on this topic.

In January 2006, UNESCO’s Communications and Information Sector’s In Focus web column asked: “Is it reasonable to define and direct linguistic policies in digital space without having sufficient, accurate, and precise indicators on the situation of languages and their progress?” (UNESCO, 2006, n.p.). The column deplored the dominance of marketing companies, which often use concealed methodologies, in providing figures for the mainstream media on the relative use of languages on the Internet:

Disorder and confusion regarding the state of languages on the Internet has been the result, which can lead to disinformation. … It is urgent that the academic world regains its role in this area along with public institutions, both national and international. (UNESCO Portal, 2006)

The aim of the present article is to provide conceptual clarity in distinguishing among the different methodologies available to estimate the relative usage of different human languages on the Internet. A taxonomy is proposed that enables much “disorder and confusion”—the paradox of sometimes wildly conflicting estimates—to be resolved, and some pitfalls in interpretation to be avoided. As such, it has potential benefits for academics, public institutions, and marketing companies. The article then applies the taxonomy as a framework by which to compare and critique the major methodologies, drawing attention to their limitations in measurement techniques and errors in source data.

A Simple Taxonomy of Methodologies

Ideally, Internet usage would be measured by user activity on the Internet, which potentially includes not just visiting web pages but using the full range of Internet services: email, Voice over IP, downloading software tools, playing multi-party computer games, downloading audio and video streams, etc. However, it is simply not technically feasible to measure the aggregate Internet activity in any given language directly, other than by tapping and analyzing the data streams on some major Internet backbone traffic routes. The only organizations known to have the resources to do this are the national security agencies, and they have yet to publish their results.

In practice, two distinct, broad approaches have been applied most often in estimating the use of different human languages on the Internet. These can be categorized as user profile (the number or proportion of active Internet users in each language group) and web presence (the number or proportion of web pages written in each language group). Despite the fact that these two approaches to estimating Internet linguistic diversity are quite distinct, both are often published as “Internet Statistics by Language” (see, e.g., Global Reach, 2006). This can lead to major discrepancies, as noted by Pimienta (2005):

Disorder and confusion regarding the state of languages on the Internet has been the result, which can lead to disinformation. Therefore, when the proportion of English-language speakers who use the Internet has gone from more than 80% in the year the Web was born to 35% today, the figures circulating in the media, against all evidence, are reported as stable between 70% and 80%! (pp. 27–28)

A likely source of the view that English-language use of the Internet has been “stable between 70% and 80%” is an OCLC study (O’Neill, Lavoie, & Bennett, 2003), which shows the same figure of 72% for the use of English on public Internet websites in 1999 and 2002. This study was based on measuring the web presence of different languages using a statistical sample of public websites, whereas the Global Reach estimates of user profile show a 35% figure for English in September 2004, down from an estimate of 80% in 1996 (Global Reach, 2006). It is quite possible for both estimates to be correct without being inconsistent, because they estimate different aspects of language usage. However, there are good reasons for treating both sets of estimates with caution, as discussed further below.

The three indicators of linguistic diversity discussed so far are compared schematically in Figure 1, together with a fourth, more abstract indicator to be discussed below, the diversity index.

Figure 1

A taxonomy of different methodologies used to estimate language diversity on the Internet

The left hand column of Figure 1 contrasts four distinct types of indicators that have been used to estimate linguistic diversity, actual or potential, on the Internet.

User profile (the number or proportion of active Internet users in each language group) is an indicator of the potential use of a language on the Internet by speakers (or writers) of a particular language. Of course, the fact that a sizable group of people speak a given language as their native tongue does not mean that they will necessarily write in that language. It is perfectly legitimate to use this indicator to reflect a policy target, such as everyone being facilitated to use their first language on the Internet. However, user profile should not be confused with user activity or web presence.

National censuses sometimes capture crude data on literacy, although not necessarily literacy in a specific language. In contrast, regional or national governments that have a policy interest in supporting minority languages often conduct censuses on their citizens’ abilities to speak, write, read, and understand a spoken language over the range of languages of interest (e.g., Generalitat de Catalunya, 2003). Organizations such as SIL International, publishers of Ethnologue (e.g., 2000, 2005), and Global Reach (2006) publish aggregate data on user profiles across all nations, as is discussed in some detail below.

User activity is an absolute or relative measure of the actual use of a language on the Internet. This indicator is most easily measured on a specific communication technology such as email or postings on a discussion website in order to research the linguistic behavior of an identified community. Examples of such studies are Climent et al. (2003), Durham (2003), and Wodak and Wright (2006). It is also possible, in principle, to monitor the use of key words or phrases, either in digitalized speech or digitalized text, occurring in aggregated streams of Internet traffic, and from these data to deduce crudely the relative frequency of different languages used.

Web presence (the number or proportion of web pages written in each language group) is a measure of the availability of text written in a given language that can be accessed on the public World Wide Web. Of course, a given web page might never be read by anyone other than its author. Thus, web presence is also a measure of potential user activity in a given language when reading and acting on offered choices (e.g., downloading, or hyperlinking away) from a given web page. Web presence was measured in the period 1997–2002 by direct language analysis of randomly addressed web pages, but in the more recent period 2001–2006, it was measured more commonly using search engines.

A diversity index is a statistical measure of the diversity of a set of language data; it can be applied to data concerning user profile, user activity, or web presence. Paolillo has proposed two diversity indexes for different applications (Paolillo, 2005; Paolillo & Das, 2006); the first of these is critiqued below.
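As a rough illustration of what such an index does, the following sketch computes two generic diversity measures over a set of language shares: Shannon entropy, and the Simpson-style probability that two independently drawn items are in different languages. These measures are chosen here purely for illustration and are not claimed to be the indexes Paolillo proposes; the function names and the page counts are hypothetical.

```python
import math

def normalize(counts):
    """Convert raw counts per language into proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {lang: n / total for lang, n in counts.items()}

def shannon_entropy(shares):
    """Shannon entropy in nats: 0 for one language, ln(k) for k equal shares."""
    return -sum(p * math.log(p) for p in shares.values() if p > 0)

def simpson_diversity(shares):
    """Probability that two independent draws are in different languages."""
    return 1.0 - sum(p * p for p in shares.values())

# Hypothetical page counts for four languages (illustration only).
concentrated = normalize({"A": 97, "B": 1, "C": 1, "D": 1})
even = normalize({"A": 25, "B": 25, "C": 25, "D": 25})

print(shannon_entropy(concentrated), shannon_entropy(even))
print(simpson_diversity(concentrated), simpson_diversity(even))
```

Either measure can be fed with user profile, user activity, or web presence data; what differs is the meaning of the shares, not the arithmetic.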

The second column in Figure 1 differentiates the language mode used (spoken, written, combined, etc.) in the corpus that is being measured or analyzed. Some published results fail to note that the speakers of a language might not use the language when writing or reading on the Internet; indeed many languages are rarely or never written by their native speakers.

The third and fourth columns of Figure 1 distinguish between studies of local populations, e.g., microstudies or national censuses, and truly global studies of worldwide language usage. The former are sometimes aggregated to produce the latter, e.g., in the Ethnologue (2000, 2005).

The final, right hand column provides examples of published work that has used methodologies falling into the indicated classification. In the case of web presence, a further distinction can be made between the methodologies of using random sampling versus search engines.

Table 1 extracts several of these examples and illustrates how they are differentiated by the taxonomy.

Table 1

Examples of studies and results classified according to the taxonomy

| Study | Classification |
|---|---|
| Global Reach (2006) | User profile/spoken/global |
| Mikami et al. (2005) | User profile/written/global |
| Climent et al. (2003) | User activity/written/local |
| Babel Project (1997) | Web presence/written/global/random sampling |
| OECD (2003) | Web presence/written/global/search engine |

Paolillo (2005) applies a new diversity index to compare results classified as “User profile/spoken/local and global” from the Ethnologue (2000) with results classified as “Web presence/written/global/random sampling” from O’Neill et al. (2003).

In the following sections, the strengths and weaknesses of each of the above methodologies as they have been applied in the published literature are appraised.

Critique of Existing Methodologies

Estimating Linguistic Diversity by User Profiles

The dominant practitioner in this area has been the marketing organization Global Reach, whose website has been offering “Global Internet Statistics (by Language)” updated bi-annually since 1996 (Global Reach, 2006). Global Reach’s results have been quoted by widely published authors such as Crystal (2006), as well as by the OECD (2005) and UNESCO (2005).

There are good reasons for doubting the accuracy of the Global Reach estimates, which appear to be calculated using a simple linear formula of the following kind, deduced from the tabulated information supplied on the Global Reach website (http://www.global-reach.biz/globstats/index.php3):
$$i_{xy} = \sum_{z} s_{xyz} \cdot \frac{a_{yz}}{p_{yz}} \qquad (1)$$

where $i_{xy}$ = number of Internet users using language $x$ in year $y$, $s_{xyz}$ = number of first-language speakers of language $x$ in country $z$ in year $y$, $a_{yz}$ = number of Internet individual access services (dial-up, cable, DSL, WiMax, etc.) in country $z$ in year $y$, $p_{yz}$ = population of country $z$ in year $y$, and $\sum_{z}$ is the summation operator that sums the terms to its right, for given fixed values of language $x$ and year $y$, over all values of $z$, the country identifier.

The data points $s_{xyz}$ for language speakers in different countries are obtained by Global Reach from the latest available edition of the Ethnologue (e.g., 2000 or 2005), a collection of data on the world’s languages. The data points $a_{yz}$ for Internet access services are obtained from miscellaneous marketing organizations and the International Telecommunication Union (ITU). The sources of the national population figures $p_{yz}$ are not given, but can be calculated by summing the individual estimates $s_{xyz}$ for all first languages spoken in each country.
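The linear formula described above, as inferred from the variable definitions, can be sketched for a single year as follows. This is a minimal illustration and not Global Reach's own code: the function name, the data layout, and all of the figures are hypothetical.

```python
def estimate_users_by_language(speakers, access, population):
    """Estimate Internet users per language for one year.

    speakers[(language, country)] -> first-language speakers (s_xyz)
    access[country]               -> Internet access services (a_yz)
    population[country]           -> national population (p_yz)
    """
    users = {}
    for (language, country), s in speakers.items():
        # National penetration a_yz / p_yz is applied to every language group.
        penetration = access[country] / population[country]
        users[language] = users.get(language, 0.0) + s * penetration
    return users

# Hypothetical figures for two countries (illustration only).
speakers = {
    ("Spanish", "Country1"): 40_000_000,
    ("Catalan", "Country1"): 10_000_000,
    ("Spanish", "Country2"): 80_000_000,
}
access = {"Country1": 20_000_000, "Country2": 16_000_000}
# Population taken as the sum of first-language speakers in each country.
population = {"Country1": 50_000_000, "Country2": 80_000_000}

estimates = estimate_users_by_language(speakers, access, population)
print(estimates)
```

Because the speaker counts enter the sum unchanged from year to year until a new source edition appears, any year-on-year movement in such an estimate comes from the access and population terms alone.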

Several problems with the reliability of the Global Reach calculations can be identified. The first is the asynchronicity of the data: the Internet access data $a_{yz}$, based on marketing sources, are relatively up to date; the population data $p_{yz}$ are updated much less frequently; and the language census data $s_{xyz}$ are updated least frequently of all, being based in many cases on national censuses carried out (at best) at five-yearly intervals.

There is also major asynchronicity within any given edition of the Ethnologue between the years in which language-speaker data points $s_{xyz}$ were obtained, not only between different countries but also within some individual countries (where the source is census data obtained by different regional governments). The editors of the Ethnologue have carried out the heroic task since 1951 of assembling and analyzing data on the identities and locations of the world’s languages and dialects, as well as the numbers of speakers (generally first-language speakers) of these languages and dialects. However, they are dependent on the results of censuses, which may be carried out (by governments, in the main) infrequently and which are not synchronized across governments or regions.

The Ethnologue appears to lack the resources to obtain the latest census results for many national languages. For example, both the 2000 and 2005 editions show 28.2 million first-language Spanish speakers in Spain, based on the 1986 census, and 86.2 million Spanish speakers in Mexico (the largest cohort of Spanish speakers in any one country), based on 1995 data (Ethnologue, 2000, 2005). These figures have evidently been used by Global Reach as data for each of the years 1996 to the present and will continue to be used until the next edition of the Ethnologue appears, perhaps in 2010. (The most recent Spanish census before Ethnologue 2005 was conducted in 2001.)

For Spain’s major regional languages, the entries in both the 2000 and 2005 editions of the Ethnologue on Catalan-Valencian-Balear (i.e., Catalan) use census data on language speakers from 1996; the data on Basque are from 1991, and the data on Galician and Spanish (Castilian) are from 1986.

In the case of native English speakers, Ethnologue 2005 still depends on 1984 census data for the U.S. and the U.K., 1987 data for Australia and New Zealand, 1996 data for South Africa, and 1998 data for Canada. The Ethnologue’s estimate of 11 million second-language English speakers in India is based on India’s 1961 census!

In an unpublished report, Paolillo and Das (2006) provide a detailed analysis of the accuracy of the Ethnologue’s language statistics in its year 2000 and 2005 editions. While praising the merits of the Ethnologue—its comprehensiveness, its organization and display of information, and the documentation of many of its sources—the report concludes that “many aspects of the Ethnologue are slow to see updates” (p. 48). Paolillo and Das reveal that in a random sample of 2,001 entries for population data from the 15th edition (Ethnologue, 2005), 2.1% of the entries were from source years 1920 to 1975, and 52.4%—over half of the entries—had sources before 1996.

Thus, much of the Ethnologue’s base data on spoken languages were obtained 10 to 50 years before the beginning of the World Wide Web, yet they have been used by Global Reach to produce estimates of different language users online ($i_{xy}$) for each successive year since 1996. Global Reach (2006) indicates that it also uses more up-to-date data on web usage from marketing organizations such as Nielsen Ratings, but it is far from transparent what methodologies have been used by those marketing organizations or how reliable their data are.

The second problem lies with Global Reach’s assumptions concerning bilingualism and multilingualism. In general, Global Reach assumes that Internet users will use the Internet in their first language. This assumption tends to skew the estimates towards overestimating the use of minority languages, especially those languages requiring scripts that are not commonly available on personal computers. Global Reach indicates that it has compensated for the use of English as a second language on the Internet, albeit apparently only in the United States and Canada. Global Reach’s methodology also ignores the increasingly prevalent phenomenon of speakers who are fluently bilingual in non-English languages (e.g., Catalan/Spanish in Barcelona, German/French in Zurich), who also regularly access English language websites on the Internet for business or personal reasons. These simplifications effectively underestimate the use of English compared to other languages on the web.

The third problem is that Global Reach’s methodology assumes that the population of Internet users in any country has the same spread of language usage on the Internet as the distribution of “first language” speakers in the country as a whole. For example, if 12% of the U.S. population is Hispanic, then it assumes 12% of U.S. Internet users will use Spanish on the Internet. (In equation (1) above, Global Reach assumes that for any given year $y$, the Internet penetration factor for each separate first-language group $x$ in each country $z$ is the same as the national average, $a_{yz}/p_{yz}$, rather than varying with the language variable $x$.)

This crude assumption will overestimate the use of minority languages on the Internet for any language community that is less able to afford home Internet access. Conversely, it will underestimate the use of economically and culturally dominant languages such as English. In many parts of the world, English is a person’s third or fourth language, but would be used on the Internet. Hence Global Reach’s (2006) estimate that by 2005 English was used only by 27% of Internet users is likely to be a major underestimate of actual Internet activity in terms of email and particularly web browsing in English.
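The size of this effect can be shown with a small worked example. The sketch below compares the share of Internet users attributed to a minority language group when the national penetration rate is applied uniformly with the share obtained when each group has its own rate. The 12% minority echoes the example above, but the population size and all penetration rates are invented for illustration.

```python
def users_uniform(group_sizes, national_penetration):
    """Apply the same national penetration rate to every language group."""
    return {g: n * national_penetration for g, n in group_sizes.items()}

def users_by_group(group_sizes, group_penetration):
    """Apply a separate penetration rate to each language group."""
    return {g: n * group_penetration[g] for g, n in group_sizes.items()}

def shares(users):
    """Convert user counts into proportions of all Internet users."""
    total = sum(users.values())
    return {g: u / total for g, u in users.items()}

# Hypothetical country of 100 million people with 50 million Internet users.
group_sizes = {"majority": 88_000_000, "minority": 12_000_000}
national_penetration = 0.50
# Assumed group rates, chosen so that the national total is still 50 million.
group_penetration = {"majority": 47 / 88, "minority": 0.25}

uniform_share = shares(users_uniform(group_sizes, national_penetration))
group_share = shares(users_by_group(group_sizes, group_penetration))
print(uniform_share["minority"], group_share["minority"])
```

Under these assumed rates the uniform method credits the minority language with 12% of users, whereas the group-specific rates give 6%, with the difference transferred to the majority language.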

Another major weakness, already identified above, is the validity of the often outdated Ethnologue data that are used to provide estimates of current first- or second-language speakers. As a result, the yearly changes in the Global Reach graphs and tables of world online language use depend almost entirely on changes in the estimates of Internet penetration in the countries where these languages are most spoken. But these latter estimates are in part taken from private marketing organizations whose own methodologies are opaque. Despite these limitations, for lack of more authoritative sources, Global Reach’s estimates of the relative usage of different languages on the Internet have been widely quoted, as noted by Paolillo (2005).

The Ethnologue’s data on language populations have also been used in the UNESCO-funded Language Observatory Project to estimate the global “distribution of user population by major script categories” (Mikami et al., 2005, Table 1). As there is an implicit assumption that all the native speakers of each language will potentially use the traditional script of that language, this methodology clearly also belongs to the user profile approach.

Estimation of Linguistic Diversity by Web Presence

Automated Language Testing of Randomly Sampled Websites

The first reported study of linguistic diversity according to web presence appears to be the 1997 study carried out by the Canadian company Alis Technologies in collaboration with the Internet Society as part of their Babel Project “to internationalize the Internet” (Babel, 1997).

Using the ICMP (ping) protocol, the Babel team sent more than 30 million messages to randomly generated IP addresses to identify if computers existed at those addresses, and identified close to 60,000 IP-addressable machines. Of those, some 8,000 responded as being http (i.e., web) servers. From this sample, the Babel team calculated that the addressable, public World Wide Web consisted of some 1,007,000 servers as of June 1997 (Babel, 1997).
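The logic of this kind of survey can be sketched in two steps: drawing addresses uniformly at random, then scaling the observed hit rate up to the whole address space. The code below is an illustration only and performs no network probing. The function names are hypothetical, the filter on public addresses is an assumption rather than a documented feature of the Babel study, and the address space the Babel team assumed is not stated here.

```python
import ipaddress
import random

def random_public_ipv4(rng):
    """Draw IPv4 addresses uniformly at random until one is publicly routable."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:  # skip private, loopback, and other special ranges
            return addr

def extrapolate(hits, probes, address_space):
    """Scale a sample hit rate up to the whole address space."""
    return hits / probes * address_space

rng = random.Random(1997)  # fixed seed so that the sample is repeatable
sample = [random_public_ipv4(rng) for _ in range(5)]
print(sample)

# Rounded figures quoted in the text, applied to the full 32-bit space.
print(extrapolate(8_000, 30_000_000, 2**32))
```

With the rounded inputs quoted above, the extrapolation gives a figure on the order of 1.1 million servers, the same order of magnitude as the 1,007,000 reported; the exact counts and address space used by the Babel team would be needed to reproduce their figure.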

The SILC language detection software they used only allowed them to detect up to 17 different languages, as shown in Table 2. Note that this list includes only three languages that make use of non-Roman scripts (Chinese, Japanese, and Russian) and excludes such major languages as Arabic, Greek, Hebrew, and Korean.
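The consequence of a limited detection repertoire can be illustrated with a toy detector. The sketch below is not the SILC algorithm, which is not described here; it simply scores a text against short, hand-picked lists of function words for three arbitrarily chosen languages. Any text in a language outside the repertoire falls into an "unknown" category, however widely that language is used.

```python
import re

# Tiny, hand-picked function-word lists; a real detector would use far
# larger models. The three languages covered are an arbitrary choice.
FUNCTION_WORDS = {
    "English": {"the", "and", "of", "to", "is", "in"},
    "Spanish": {"el", "la", "de", "que", "y", "en", "los"},
    "German": {"der", "die", "und", "das", "ist", "nicht"},
}

def detect_language(text, min_hits=2):
    """Return the language whose function words occur most often, or 'unknown'."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else "unknown"

print(detect_language("The cat is in the garden and the dog is out."))
print(detect_language("El perro de la casa y los gatos en el jardín"))
print(detect_language("Mae'r gath yn yr ardd"))  # Welsh: outside the repertoire
```

Extending such a detector to a new language means adding a model for it; until then, pages in that language are invisible to the survey.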

Table 2

Languages handled by the SILC detection software

| No. | Language | No. | Language | No. | Language |
|---|---|---|---|---|---|
| 1 | German | 7 | French | 13 | Portuguese |
| 2 | English | 8 | Italian | 14 | Russian |
| 3 | Chinese | 9 | Japanese | 15 | Serbo-Croatian |
| 4 | Danish | 10 | Malay | 16 | Swedish |
| 5 | Spanish | 11 | Dutch | 17 | Czech |
| 6 | Finnish | 12 | Norwegian | | |

Source: Babel (1997)

Restricting their analysis to the 3,239 homepages found to contain at least 500 characters each (presumably to ensure accuracy in automated language detection), the Babel team came up with the results shown in Table 3.

Table 3

The Babel Project’s ‘Preliminary Hit Parade’ (June 1997)

| Ranking | Language | Number of pages | Percentage | Corrected percentage | Estimated number of servers |
|---|---|---|---|---|---|
| 1 | English | 2,722 | 84.0% | 82.3% | 332,778 |
| 2 | German | 147 | 4.5% | 4.0% | 17,971 |
| 3 | Japanese | 101 | 3.1% | 1.6% | 12,348 |
| 4 | French | 59 | 1.8% | 1.5% | 7,213 |
| 5 | Spanish | 38 | 1.2% | 1.1% | 4,646 |
| 6 | Swedish | 35 | 1.1% | 0.6% | 4,279 |
| 7 | Italian | 31 | 1.0% | 0.8% | 3,790 |
| 8 | Portuguese | 21 | 0.7% | 0.7% | 2,567 |
| 9 | Dutch | 20 | 0.6% | 0.4% | 2,445 |
| 10 | Norwegian | 19 | 0.6% | 0.3% | 2,323 |
| 11 | Finnish | 14 | 0.4% | 0.3% | 1,712 |
| 12 | Czech | 11 | 0.3% | 0.3% | 1,345 |
| 13 | Danish | 9 | 0.3% | 0.3% | 1,100 |
| 14 | Russian | 8 | 0.3% | 0.1% | 978 |
| 15 | Malay | 4 | 0.1% | 0.1% | 489 |
| | None or unknown (correction) | | | 5.6% | |
| | Total | 3,239 | 100% | 100% | 395,984 |

Source: Babel (1997)

While limited in the number of languages its detection software can identify, the Babel methodology has the merit of directly measuring websites from a randomly generated sample, which can then be generalized. It found that in June 1997, English accounted for 82.3% of all websites (corrected percentage), with a large gap to the next most common language, German, at 4.0%, and Spanish, the fifth most popular, at only 1.1%. Minority languages as well as many major languages are presumably included in the residual 5.6%.

From 1998 to 2002, Lavoie and his colleagues at OCLC performed sampling studies (Lavoie & O’Neill, 1999; O’Neill et al., 2003) using a methodology similar to the Babel Project but with detection software capable of distinguishing 29 languages (albeit no minority languages). Their results, shown in Figures 2 and 3 below, show English maintaining a level of 72% of all single-language websites, German following at 7%, and Spanish creeping up from fifth to fourth position (slightly above French), while holding steady at about 3%.

Figure 2

Relative frequency of languages, 1999 (percent of public sites)

Source: O’Neill et al. (2003)

Figure 3

Relative frequency of languages, 2002 (percent of public sites)

Source: O’Neill et al. (2003)

O’Neill et al. comment that 7% of all sampled websites in 1999, declining to 5% in 2002, were found to be multilingual, and all these cases included English. These sites are not included in the scores for English shown in Figures 2 and 3 above. In fact, one can deduce from the above charts, given that the y-axis scores for the 12 languages displayed add up to 100% (subject to very small round-up error), that any websites not accepted by the detection software as being predominantly one of the detectable 29 languages, whether multilingual or not, have been excluded from the final score. Moreover, the scores for all but the first 12 out of the 29 detectable languages must have been negligible, as they collectively sum to less than 0.5%.

The Babel and OCLC reports are the only published studies in which direct analysis of randomly addressed websites (using language detection software) has been employed to produce estimates of web presence. All other published studies of web presence have used search engines to produce estimates of the numbers of web pages in different languages.

Limitations of Search Engines

A search engine is a program designed to help find information stored on a computer system such as a personal computer or the World Wide Web; it allows one to ask for content meeting specific criteria (typically containing a given word or phrase) and retrieves a list of references that match those criteria (Wikipedia, 2006). Search engines work in two stages: Automated software (variously called a web crawler, spider, or bot) works continuously in the background to assemble a database of ranked, indexed links to key words and phrases found across the target computer system; and a browser-interfacing search facility interrogates that database for occurrences of nominated words or phrases (with other nominated or default search criteria).

Limited sampling of the web. The current mainstream search engines store huge databases (Yahoo! claims to hold over 20 billion “web objects”), yet “overlap studies show that about half of the pages in any search engine database exist only in that database” (UC Berkeley, 2006, n.p.). Why are their web crawls so different? Paolillo (2005) points out that web spiders produce what is known as a “snowball sample”: the spiders discover new web pages by following all the links found in a given set of web pages. The starting point of each “snowball” is therefore crucial.
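The dependence of a snowball sample on its starting point can be demonstrated on a toy link graph. The sketch below is a plain breadth-first traversal, not any search engine's crawler, and the page names and links are hypothetical.

```python
from collections import deque

def snowball_crawl(links, seeds):
    """Return every page reachable from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical link graph: two clusters joined by a single one-way link,
# plus one page that nothing links to.
links = {
    "en1": ["en2", "en3"],
    "en2": ["en1"],
    "en3": [],
    "es1": ["es2", "en1"],
    "es2": ["es1"],
    "cy1": [],
}

from_english_seed = snowball_crawl(links, ["en1"])
from_spanish_seed = snowball_crawl(links, ["es1"])
print(sorted(from_english_seed))
print(sorted(from_spanish_seed))
```

In this toy graph a crawl seeded in the first cluster never reaches the second, a crawl seeded in the second reaches both, and the unlinked page is found by neither, so the language mix of each resulting index reflects the seed as much as the web itself.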

Readers can prove to themselves that each of the major search engines searches a widely different set of indexed pages by carrying out the following test on two or more search engines. If one sets the “language preference” to the same language choice in each search engine, then searches for the same word in each, major inconsistencies can be observed in the total number of web pages reported to contain that word. The following results were obtained on May 4, 2006, comparing Google, Yahoo!, and Teoma (since renamed Ask™). Table 4 presents the results for searching separately for “man,” “woman” (English only) and “hombre,” “mujer” (Spanish only):

Table 4

Comparing search engine results for the same language settings

| Language preference set | Target word | Web pages reported by Google | Web pages reported by Yahoo! | Ratio Google/Yahoo! | Web pages reported by Teoma | Ratio Google/Teoma |
|---|---|---|---|---|---|---|
| English | man | 1,510 M | 598 M | 2.52 | 192 M | 7.9 |
| English | woman | 592 M | 260 M | 2.28 | 79.4 M | 7.5 |
| Spanish | hombre | 46.3 M | 16.8 M | 2.76 | 0.673 M | 69 |
| Spanish | mujer | 58.5 M | 18.7 M | 3.13 | 0.602 M | 97 |

The variation is so great across page counts for the same word and same language choice that one can have no confidence that these three search engines have crawled the same parts of the web. Nor can one have confidence that any of them have crawled over more than a fraction of the public web, nor that their webcrawling, indexing, and search methodologies are consistent, given that the search engines do not publish their algorithms, presumably for reasons of commercial confidence (Lawrence & Giles, 1999; Van Couvering, 2007).
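The inconsistency can be quantified directly from the page counts in Table 4. The sketch below recomputes the Google/Yahoo! and Google/Teoma ratios and then measures how far each ratio varies across the four searches; the helper names are hypothetical, and because the counts are the rounded values from the table, the last digit of a recomputed ratio may differ from the published one.

```python
# Page counts in millions, as reported in Table 4 (May 4, 2006).
COUNTS = {
    ("English", "man"): {"Google": 1510.0, "Yahoo!": 598.0, "Teoma": 192.0},
    ("English", "woman"): {"Google": 592.0, "Yahoo!": 260.0, "Teoma": 79.4},
    ("Spanish", "hombre"): {"Google": 46.3, "Yahoo!": 16.8, "Teoma": 0.673},
    ("Spanish", "mujer"): {"Google": 58.5, "Yahoo!": 18.7, "Teoma": 0.602},
}

def ratios(counts, numerator="Google"):
    """Ratio of the numerator engine's count to each other engine's count."""
    out = {}
    for key, by_engine in counts.items():
        out[key] = {
            engine: by_engine[numerator] / n
            for engine, n in by_engine.items()
            if engine != numerator
        }
    return out

def spread(values):
    """Largest value divided by the smallest; 1.0 means a constant ratio."""
    return max(values) / min(values)

table = ratios(COUNTS)
yahoo_spread = spread([r["Yahoo!"] for r in table.values()])
teoma_spread = spread([r["Teoma"] for r in table.values()])
print(yahoo_spread, teoma_spread)
```

If two engines indexed the same pages in the same way, each ratio would be close to 1 and its spread close to 1.0. A spread well above 1.0 between the English and Spanish searches indicates that the engines differ not only in overall index size but also in their relative coverage of the two languages.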

Undisclosed indexing algorithms

In his UNESCO report on linguistic diversity on the Internet, Paolillo (2005) observes that:

Search engines typically employ a variety of proprietary indexing methods that are not open to inspection, and these may bias the page counts returned in ways that cannot be corrected or even reckoned. A word need not appear in the page at all for it to be included in the count, and pages containing the word might also be dropped from the count. (p. 59)

Nevertheless, a number of institutions have used search engine results to estimate the usage of different human languages on the Internet. These include FUNREDES, an NGO promoting the use of ICTs in Latin America (Pimienta, 2005; Pimienta & Lamey, 2001; Pimienta, Lamey, Prado, & Sztrum, 2001), and the OECD (2003, p. 130; 2006, p. 22). FUNREDES’s methodology is vulnerable to error, as it depends on choosing “a sample of keywords in each of the languages under study … with attention paid to providing the best semantic and syntactic equivalence among these” (Pimienta, 2005, p. 31). The problem is that in order to produce good estimates of the total number of web pages in several languages, one needs to choose search words whose average frequency among all web pages in their respective languages is the same, not search words that are necessarily semantically equivalent. One also needs to avoid words that belong to more than one language, since the “language preference” settings are far from perfect as constraints on searches.
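The frequency argument can be made concrete. If a keyword occurs on a fraction f of all pages in its language and a search engine reports h pages containing it, the implied total number of pages in that language is h / f. The Python sketch below uses invented figures to show how two semantically equivalent keywords with different page frequencies distort a comparison; the function name and all numbers are assumptions for illustration only.

```python
def estimate_total_pages(hits, page_frequency):
    """Estimate the total pages in a language from one keyword's hit count.

    hits           -- pages the engine reports as containing the keyword
    page_frequency -- assumed fraction of all pages in that language that
                      contain the keyword (0 < page_frequency <= 1)
    """
    if not 0 < page_frequency <= 1:
        raise ValueError("page_frequency must be in the interval (0, 1]")
    return hits / page_frequency


# Hypothetical figures, invented for illustration only.
hits_a, freq_a = 2_000_000, 0.02   # keyword in language A
hits_b, freq_b = 1_000_000, 0.01   # its translation in language B

# Ratio suggested by raw hit counts, ignoring keyword frequency.
naive_ratio = hits_a / hits_b
# Ratio of the implied totals once each keyword's frequency is accounted for.
corrected_ratio = (estimate_total_pages(hits_a, freq_a)
                   / estimate_total_pages(hits_b, freq_b))

print(f"ratio implied by raw hits:        {naive_ratio:.1f}")
print(f"ratio after frequency correction: {corrected_ratio:.1f}")
```

In this invented case the raw hit counts suggest that language A has twice as many pages as language B, while the two languages are in fact the same size; the apparent difference comes entirely from the keyword in A being twice as common.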

In contrast to FUNREDES, the OECD employs a culturally neutral technique: it searches on the protocol prefix “https://” in order to estimate the number of secure socket layer (SSL) servers associated with web pages in each target language (OECD, 2003, 2005). This provides a rough estimate of the relative use of different languages in e-commerce, which is the OECD’s focus, since SSL servers are normally used only for e-commerce purposes.

Language search restrictions

Unfortunately for the estimation of web presence of minority languages, the sale of the AllTheWeb search engine to Overture in 2003 (and then to Yahoo!) has reduced its language indexing repertoire from the original 46 (including six European minority languages: Catalan, Basque, Galician, Welsh, Faroese, and Friesian) to 36, of which the only minority language remaining is Catalan (www.alltheweb.com, 2006). This makes it impossible to repeat the experiments of Guinovart (2003) or Mas (2003) in estimating the web presence of the other European minority languages.

The current mainstream search engines Google, Yahoo!, and Ask (previously Teoma) offer their general users only 41, 37, and six language preferences, respectively. Google and Yahoo! include Catalan but no other minority languages among their search preferences; Ask/Teoma does not include languages with non-Roman scripts.

Estimation of Linguistic Diversity by Measurement of User Activity

Two of the few published studies comparing actual user activity in different languages on the Internet appeared in the November 2003 issue of the Journal of Computer-Mediated Communication. By coincidence, both examined the emails used by university students on closed email lists in research into modern diglossia, as defined by Fishman (1967). Durham (2003) compared the relative use of English, German, and French in emails by Swiss medical students, and Climent et al. (2003) studied the relative use of Catalan and Spanish by students posting to a bulletin board system of the Open University of Catalonia. Another study of user activity is by Wodak and Wright (2006), who analyzed discourse patterns in multilingual postings on the European Union’s threaded discussion forum Futurum in 2003, and who report some quantitative results on the different languages used in the forum.

These three examples are all intrinsically micro studies, and while it would be possible to apply their methodologies to a wide range of online discussion groups, it would be difficult to generalize their context-dependent results to much larger, more socially diverse populations of users, such as all users of the Internet in a given period.

Paolillo’s Linguistic Diversity Index

A distinct and more abstract methodology is that developed by Paolillo (2005) for measuring linguistic diversity (whether of a region, country, the world, or the Internet) as a single index. He uses the thermodynamic/information theory construct of entropy, a statistical measure of variance, to construct his index, in order to provide the following properties:

Firstly, it must address some unit of analysis, such as a country, a continent, or the Internet. Secondly, linguistic diversity should take into account the probabilities of finding speakers of any particular language. It should have a natural minimum of zero, for a completely homogeneous population, and no fixed maximum value. A greater variety of languages should increase the value of the index, but as the proportion of a language group decreases, its contribution to diversity should also decrease. This way, countries with many language groups of roughly equal size (e.g., Tanzania; Mafu, 2004) will show relatively high linguistic diversity, whereas countries with comparable numbers of languages, but with one or two dominant languages (e.g., the U.S.) will show relatively lower linguistic diversity. (Paolillo, 2005, p. 31)
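The properties listed in this passage are those of an entropy measure. The following Python sketch uses plain Shannon entropy with natural logarithms, which is an assumption made here: Paolillo's published index may apply a different scaling, and the speaker counts below are invented for illustration.

```python
import math


def diversity_index(speaker_counts):
    """Shannon entropy (natural log) of a language population distribution.

    speaker_counts -- iterable of non-negative speaker numbers, one per language.
    Returns 0.0 for a completely homogeneous population and has no fixed
    maximum: it grows as languages become more numerous and more even in size.
    """
    counts = [c for c in speaker_counts if c > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one language must have speakers")
    shares = [c / total for c in counts]
    return sum(p * math.log(1 / p) for p in shares)


# Hypothetical populations (millions of speakers), for illustration only.
examples = {
    "homogeneous": [100],
    "one dominant": [97, 1, 1, 1],
    "evenly split": [25, 25, 25, 25],
}

for name, counts in examples.items():
    print(f"{name:13s} {diversity_index(counts):.4f}")
```

The two four-language populations have the same number of languages, yet the evenly split one scores far higher, which matches the contrast drawn in the quotation between countries with language groups of roughly equal size and countries with one or two dominant languages.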

Paolillo applies his linguistic diversity index to the nine major regions of the world, plus the U.S. by itself for comparison, based on Ethnologue (2000) data on the numbers of first-language speakers in each country. His results are shown in Table 5:

Table 5

Linguistic diversity index scores by region

| Region | Languages | Diversity index | Proportion of world total |
| --- | --- | --- | --- |
| USA | 170 | 0.7809 | 0.0020 |
| N America (incl. USA) | 248 | 3.3843 | 0.0086 |
| E Asia | 200 | 4.4514 | 0.0112 |
| W Asia | 159 | 26.1539 | 0.0659 |
| SC Asia | 661 | 29.8093 | 0.0752 |
| S America | 930 | 30.5007 | 0.0769 |
| Europe | 364 | 32.4369 | 0.0818 |
| SE Asia | 1317 | 37.6615 | 0.0949 |
| Oceania | 1322 | 46.5653 | 0.1174 |
| Africa | 2390 | 185.6836 | 0.4681 |

Source: Paolillo (2005), using base data from Ethnologue (2000)

Paolillo then applies his metric to the data obtained from the OCLC studies of web presence (Lavoie & O’Neill, 1999; O’Neill et al., 2003) to demonstrate that their Year 1999 sample of 2,229 random websites has a diversity index of 2.47—larger than that of the U.S., but much smaller than the diversity index of any of the eight world regions shown in Table 5 above. Paolillo concludes: “Hence, linguistic diversity of the worldwide Web, while it approaches that of many multilingual countries, is a poor representation of linguistic diversity worldwide” (2005, p. 57).

Paolillo’s conclusion is clearly correct, for a more fundamental reason. His data source, Ethnologue 2000, recorded the existence of 6,809 spoken languages and dialects (Ethnologue 2005 increased this number to 6,912). Of these, fewer than 1,000 (at most about one in seven) are written by their native speakers (SIL, 2000). As only 2,229 websites were sampled, it is not very likely that all written languages (say 1,000) were actually present on the sampled websites, even if all could be separately identified. Paolillo’s diversity index thus must register a much lower value for their Internet web presence than for the Ethnologue user profile distribution of close to seven times as many spoken languages.
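The sampling argument can be quantified. If a language accounts for a share p of all public websites, the chance that a simple random sample of n sites contains at least one site in that language is 1 - (1 - p)^n. The Python sketch below evaluates this for n = 2,229; the shares are invented for illustration, and independent random draws are assumed.

```python
SAMPLE_SIZE = 2229  # websites in the 1999 OCLC sample cited in the text


def prob_in_sample(share, n=SAMPLE_SIZE):
    """Probability that at least one of n random sites is in a language
    holding the given share of all sites (independent draws assumed)."""
    return 1.0 - (1.0 - share) ** n


# Hypothetical web-presence shares, for illustration only.
for share in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"share {share:.5%}: P(appears in sample) = {prob_in_sample(share):.3f}")
```

On these assumptions, a language holding one ten-thousandth of all websites would appear in only about one such sample in five, so a sample of this size cannot be expected to register most of the smaller written languages.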

As noted by Paolillo (2005), the OCLC studies used software capable of distinguishing no more than 29 languages (O’Neill et al., 2003), even if it encountered 200. Because the OCLC language recognition software filters the potential 1,000 written languages down to fewer than 30, Paolillo’s index, when applied to these data, can only estimate the variance of about 3% of all human languages written somewhere on the web. Thus this index naturally computes as being much less than the index of the entire Ethnologue database of close to 7,000 spoken languages.

Paolillo’s diversity index is versatile: It can be applied to any set of language population statistics, whether of user profile, web presence, or user activity in different languages. However, this index cannot provide, and is not intended to provide, an independent estimate of the relative use of any individual language.

Results and Trends in Internet Usage for Major Languages

Global Reach’s estimates of Internet language populations for the 10 years from 1996 to 2005 are tabulated in Table 6, re-arranged in order of size for the most recent year:

Table 6

Estimated online global language populations (millions)

| End of year: | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| English | 40 | 72 | 91 | 148 | 192 | 231 | 233.8 | 288 | 280 | 300 |
| Chinese | 0.1 | 1.2 | 2 | 10 | 31 | 48 | 78 | 103 | 160 | 220 |
| Japanese | 2 | 7 | 9 | 20 | 39 | 48 | 61 | 70 | 85 | 105 |
| Spanish | 0.2 | 0.8 | 1.8 | 13 | 21 | 35 | 50 | 66 | 70 | 80 |
| German | 0.5 | 3.5 | 6.3 | 14 | 22 | 37 | 43 | 53 | 62 | 71 |
| French | 0.2 | 2 | 3.4 | 9.9 | 17 | 18 | 23 | 28 | 40 | 49 |
| Italian | 0.1 | 0.5 | 1.8 | 9.7 | 12 | 20 | 24 | 24 | 35 | 42 |
| Scandinavian | 2 | 2.2 | 3.2 | 7.7 | 9 | 11 | 14 | 15 | 16.3 | 17 |
| Korean | 0.01 | 0.05 | 0.8 | 5 | 17 | 25 | 28 | 30 | 35 | 40 |
| Portuguese | 0.02 | 0.2 | 1.2 | 4 | 11 | 14 | 19 | 26 | 32 | 38 |
| Dutch | 0.05 | 1 | 2 | 5.8 | 7 | 11 | 13 | 12 | 13.5 | 15 |
| Other Non-English: | 7 | 11.4 | 15.1 | 6.4 | 28.8 | 41 | 64 | 89 | 129.2 | 142 |
| TOTAL: | 50 | 117 | 151 | 245 | 391 | 529 | 626.8 | 729 | 941 | 1100 |

Source of data: Global Reach (2006)

The trends for the top 10 language groups by size are graphed in Figure 4 (in millions) and in Figure 5 (by % of total Internet users).

Figure 4

Estimated language populations of Internet users

Source of data: Global Reach (2006)

Figure 5

Distribution of languages among Internet users

Source of data: Global Reach (2006)

Over this decade, the proportion of English speakers online has been estimated by Global Reach to drop from 80% to 27%—while at the same time increasing from 40 to 300 million in absolute numbers—largely due to the rise in online Chinese (from 0.1 to 220 million), Japanese (2 to 105 million), and Spanish speakers (0.2 to 80 million). (See Table 6.)
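The percentages quoted here follow from the first and last columns of Table 6. A short Python sketch of the arithmetic, with the figures copied from that table and the function name chosen for illustration:

```python
# Online populations in millions for selected rows of Table 6 (Global Reach, 2006).
ONLINE_M = {
    1996: {"English": 40, "Chinese": 0.1, "Japanese": 2, "Spanish": 0.2, "TOTAL": 50},
    2005: {"English": 300, "Chinese": 220, "Japanese": 105, "Spanish": 80, "TOTAL": 1100},
}


def share_percent(year, language):
    """Return a language's share of that year's TOTAL row, in percent."""
    row = ONLINE_M[year]
    return 100.0 * row[language] / row["TOTAL"]


for language in ("English", "Chinese", "Japanese", "Spanish"):
    start = share_percent(1996, language)
    end = share_percent(2005, language)
    print(f"{language:9s} {start:5.1f}% -> {end:5.1f}%")
```

English falls from 80% of the estimated online population to about 27% even though its absolute number rises more than sevenfold, because the total grows from 50 to 1,100 million over the same period.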

Again it must be emphasized that the Global Reach estimates are based on slowly moving and largely out-of-date Ethnologue data (1996, 2000) on the numbers of first-language speakers in each country. These are multiplied by the faster-moving and more up-to-date ITU and marketing figures on Internet penetration in that country and aggregated over all countries for each language. Global Reach then did the same calculation for a small and unspecified number of second languages and added these results to the first-language set (Global Reach, 2006). In both cases, the most critical factors causing changes in the graphed figures from year to year are the annually changing Internet penetration rates in each country.
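The first-language part of the calculation described in this paragraph can be sketched as follows. This is a minimal Python illustration of the multiply-and-aggregate step only; the function name, the data layout, and every number are hypothetical, and Global Reach's own procedure may differ in detail.

```python
def online_speakers_m(countries):
    """Projected online speakers (millions) of one language.

    countries -- list of (first_language_speakers_m, internet_penetration)
                 pairs, one per country, with penetration between 0 and 1.
    """
    return sum(speakers_m * penetration for speakers_m, penetration in countries)


# Hypothetical language spoken in two countries. Speaker counts are held
# fixed, as with infrequently updated census data, while penetration rises.
year_1 = [(50.0, 0.10), (10.0, 0.40)]
year_2 = [(50.0, 0.30), (10.0, 0.50)]

print(f"year 1: {online_speakers_m(year_1):.1f} M online")
print(f"year 2: {online_speakers_m(year_2):.1f} M online")
```

With the speaker numbers unchanged, the projected online population more than doubles between the two hypothetical years, purely because the penetration rates changed. This is the sense in which penetration rates drive the year-to-year movement in the graphed figures.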

The reduced growth rate of the online English language population after 2001 can be explained in terms of the market saturation effect in Internet penetration in the “first world” as it reached 50% and higher in the U.S., U.K., Canada, Australia, and New Zealand in 2001. The same saturation effect applied to the Japanese Internet market in 2001-2002; this explains the levelling off of the Japanese online population after 2000, causing it to drop from second to third place. (The move from dial-up to broadband Internet access, a key feature of Internet penetration statistics in the developed countries from 2001 to 2006, is not reflected directly in the Global Reach estimates, which do not reflect actual Internet usage. The exception is the extent to which the availability of cheaper broadband in the U.S. and elsewhere from 2002 helped increase the overall uptake of the Internet, leading to an upward spike in the 2003 data point for English.) Chinese and Spanish, starting with low Internet penetration rates in the 1990s, moved ahead to positions two and three by 2003 once their large native speaker numbers were multiplied by the higher Internet penetration rates arising in the 2000s.

The “saturation” or “levelling out” effect will occur in future statistics with populations whose total birth rate is modest but whose average Internet penetration rate is already high, such as with Scandinavian (88%), Japanese (84%), Dutch (75%), Italian (74%), and German speakers (72%), as shown in Table 7. In this table, the right hand column shows the percentage of speakers online out of the total Ethnologue-sourced figure for speakers of that language worldwide. It is not clear if Global Reach is including second language English speakers in India (or China) among its global English population; if so, the rapid growth of a large English-speaking middle class in Asia would be expected to boost the English figures in the next few years.

Table 7

Online versus total native speakers for major languages in 2005

| | Online (M) | Total (M) | % online |
| --- | --- | --- | --- |
| English | 300 | 508 | 59% |
| Chinese | 220 | 885 | 25% |
| Japanese | 105 | 125 | 84% |
| Spanish | 80 | 332 | 24% |
| German | 71 | 98 | 72% |
| French | 49 | 72 | 68% |
| Italian | 42 | 57 | 74% |
| Scandinavian | 17 | 19 | 88% |
| Korean | 40 | 75 | 53% |
| Portuguese | 38 | 170 | 22% |
| Dutch | 15 | 20 | 75% |
| Other Non-English: | 142 | 3,431 | 4% |
| TOTAL: | 1,100 | 5,792 | 19% |

Source of data: Global Reach (2006)
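The right-hand column of Table 7 is the online figure divided by the total figure. The Python sketch below recomputes it from the table's own numbers and lists any row where the result differs from the published percentage by more than one point; that threshold is an arbitrary choice made for this illustration.

```python
# (online speakers in millions, total speakers in millions, published % online),
# copied from Table 7.
TABLE_7 = {
    "English": (300, 508, 59),
    "Chinese": (220, 885, 25),
    "Japanese": (105, 125, 84),
    "Spanish": (80, 332, 24),
    "German": (71, 98, 72),
    "French": (49, 72, 68),
    "Italian": (42, 57, 74),
    "Scandinavian": (17, 19, 88),
    "Korean": (40, 75, 53),
    "Portuguese": (38, 170, 22),
    "Dutch": (15, 20, 75),
    "Other Non-English": (142, 3431, 4),
    "TOTAL": (1100, 5792, 19),
}

TOLERANCE = 1.0  # percentage points; arbitrary threshold for this sketch


def percent_online(online_m, total_m):
    """Return online speakers as a percentage of total speakers."""
    return 100.0 * online_m / total_m


mismatches = []
for language, (online_m, total_m, published) in TABLE_7.items():
    computed = percent_online(online_m, total_m)
    if abs(computed - published) > TOLERANCE:
        mismatches.append(language)
    print(f"{language:18s} computed {computed:5.1f}%   published {published}%")

print("Rows differing by more than", TOLERANCE, "points:", mismatches)
```

Only the Scandinavian row differs by more than a point (about 89% computed against 88% published), which may simply reflect rounding in the underlying counts.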

The OECD in its biennial Communications Outlook (OECD, 2003, 2005) used the Google search engine, setting language preferences, to evaluate the numbers of web pages in each of the world’s major languages on secure socket layer servers. This provides a measure of the relative web presence of the different languages for serious e-commerce websites, for which use of the secure socket layer is virtually essential.

While the OECD’s individual e-commerce web presence estimates, shown in Figure 6A, do not match the Global Reach user profile percentages for the same languages, shown in Figure 6B, the top rankings are similar, with the exception that Chinese does not appear at that date in the googled e-commerce websites: English dominates, and Japanese, German, and Spanish appear in the top five.

Figure 6

Language of e-commerce (A) and language populations on the Internet (B) (OECD, 2006)

Table 8 shows the trend in the OECD e-commerce web presence estimates between August 2002 (OECD, 2003) and February 2005 (OECD, 2006). As with the more general web presence figures, English still dominates and is well ahead of the next most common language (Japanese). The absence of Chinese secure server figures is puzzling.

Table 8

OECD estimates for secure servers in major languages

| date: | Aug-02 | Feb-05 |
| --- | --- | --- |
| English | 57.6% | 45.2% |
| Japanese | 9.9% | 10.7% |
| German | 10.5% | 10.1% |
| French | 4.1% | 4.5% |
| Spanish | 5.3% | 4.1% |

Sources: (OECD 2003, 2005)

Results and Trends for Minority Languages

Global Reach has not published trend data for minority languages on its website, although its methodology is equally applicable to such languages in principle. As with the major languages, however, the datedness and asynchronicity of the Ethnologue census data for minority languages is a serious problem when applying the Global Reach methodology.

The most comprehensive estimates of minority language web presence were made independently by Xavier Gómez Guinovart (Guinovart, 2003) of the University of Vigo, in March 2002, and by Jordi Mas i Hernàndez (Mas, 2003) of the Catalan-language translation support group SoftCatalà, in August 2003. Both chose the AllTheWeb search engine so as to include as many languages as possible, and both counted the number of web pages it indexed under each language preference setting. Guinovart’s article concerns the state of Galician on the Internet, and Mas’s the state of Catalan. During that period, AllTheWeb indexed up to 48 separate languages, including six European minority languages (seven, if one counts Latin).

Guinovart’s and Mas’s results are compared in Table 9, where I have translated their tables (from Galician and Catalan respectively) into English, and presented their results in a common format so that they can be compared side by side. The overall consistency of results is striking, despite the three-fold increase in total web pages indexed by AllTheWeb over the 17-month period between their measurements.

Table 9

Web presence in order of frequency: Comparison of search engine results by Guinovart (2003) in March 2002 (left hand columns) and Mas (2003) in August 2003 (right hand columns), using AllTheWeb

| Order | Language | Web pages (M) | % total | Order | Language | Web pages (M) | % total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | English | 442 | 60.73 | 1 | English | 1,280 | 60.42 |
| 2 | German | 51.2 | 7.035 | 2 | German | 182.0 | 8.591 |
| 3 | Japanese | 43.2 | 5.926 | 3 | French | 99.7 | 4.708 |
| 4 | Chinese | 26.2 | 3.600 | 4 | Japanese | 69.7 | 3.291 |
| 5 | French | 24.6 | 3.379 | 5 | Spanish | 65.8 | 3.107 |
| 6 | Korean | 20.4 | 2.799 | 6 | Chinese | 65.7 | 3.103 |
| 7 | Russian | 19.6 | 2.689 | 7 | Korean | 64.6 | 3.050 |
| 8 | Spanish | 16.4 | 2.254 | 8 | Russian | 42.3 | 1.996 |
| 9 | Italian | 15.1 | 2.077 | 9 | Italian | 41.8 | 1.975 |
| 10 | Portuguese | 12.5 | 1.718 | 10 | Dutch | 41.1 | 1.941 |
| 11 | Dutch | 11.2 | 1.533 | 11 | Portuguese | 37.7 | 1.779 |
| 12 | Polish | 7.46 | 1.024 | 12 | Polish | 22.2 | 1.046 |
| 13 | Swedish | 6.56 | 0.900 | 13 | Czech | 15.6 | 0.735 |
| 14 | Czech | 5.94 | 0.815 | 14 | Swedish | 14.9 | 0.703 |
| 15 | Danish | 5.03 | 0.690 | 15 | Danish | 12.1 | 0.571 |
| 16 | Norwegian | 3.64 | 0.499 | 16 | Hungarian | 8.54 | 0.403 |
| 17 | Finnish | 3.02 | 0.414 | 17 | Norwegian | 8.12 | 0.383 |
| 18 | Slovakian | 1.97 | 0.270 | 18 | Finnish | 5.68 | 0.268 |
| 19 | Hungarian | 1.91 | 0.262 | 19 | Slovakian | 5.08 | 0.240 |
| 20 | Turkish | 1.51 | 0.208 | 20 | Hebrew | 4.79 | 0.226 |
| 21 | Thai | 0.855 | 0.117 | 21 | Turkish | 4.70 | 0.222 |
| 22 | Estonian | 0.811 | 0.111 | 22 | Thai | 3.12 | 0.147 |
| 23 | Catalan | 0.681 | 0.094 | 23 | Catalan | 2.93 | 0.138 |
| 24 | Slovenian | 0.681 | 0.093 | 24 | Arabic | 2.47 | 0.117 |
| 25 | Greek | 0.648 | 0.089 | 25 | Greek | 2.37 | 0.112 |
| 26 | Indonesian | 0.597 | 0.082 | 26 | Romanian | 2.05 | 0.097 |
| 27 | Ukrainian | 0.588 | 0.081 | 27 | Slovenian | 1.69 | 0.080 |
| 28 | Croatian | 0.521 | 0.071 | 28 | Croatian | 1.67 | 0.079 |
| 29 | Hebrew | 0.514 | 0.071 | 29 | Estonian | 1.46 | 0.069 |
| 30 | Icelandic | 0.441 | 0.061 | 30 | Icelandic | 1.39 | 0.066 |
| 31 | Romanian | 0.419 | 0.058 | 31 | Bulgarian | 1.12 | 0.053 |
| 32 | Arabic | 0.328 | 0.052 | 32 | Lithuanian | 1.08 | 0.051 |
| 33 | Lithuanian | 0.328 | 0.045 | 33 | Indonesian | 1.04 | 0.049 |
| 34 | Bulgarian | 0.319 | 0.044 | 34 | Ukrainian | 1.01 | 0.048 |
| 35 | Malay | 0.194 | 0.027 | 35 | Latvian | 0.560 | 0.026 |
| 36 | Latvian | 0.156 | 0.021 | 36 | Byelorussian | 0.536 | 0.025 |
| 37 | Galician | 0.099 | 0.014 | 37 | Vietnamese | 0.390 | 0.018 |
| 38 | Basque | 0.080 | 0.011 | 38 | Malay | 0.328 | 0.015 |
| 39 | Afrikaans | 0.072 | 0.010 | 39 | Galician | 0.274 | 0.013 |
| 40 | Vietnamese | 0.048 | 0.007 | 40 | Basque | 0.155 | 0.007 |
| 41 | Byelorussian | 0.044 | 0.006 | 41 | Latin | 0.137 | 0.006 |
| 42 | Welsh | 0.043 | 0.006 | 42 | Afrikaans | 0.116 | 0.005 |
| 43 | Faroese | 0.037 | 0.005 | 43 | Welsh | 0.093 | 0.004 |
| 44 | Albanian | 0.033 | 0.005 | 44 | Faroese | 0.066 | 0.003 |
| 45 | Friesian | 0.021 | 0.003 | 45 | Friesian | 0.063 | 0.003 |
| Total: | | 727.85 M | 100.00% | 46 | Albanian | 0.053 | 0.003 |
| | | | | 47 | Serbian | 0.043 | 0.002 |
| | | | | 48 | Swahili | 0.014 | 0.001 |
| | | | | Totals: | | 2,118.5 M | 100.00% |

On both dates, Catalan was ranked 23rd, but Galician and Basque slipped slightly in the intervening 17 months, from 37th and 38th to 39th and 40th, respectively, being overtaken by Byelorussian and Vietnamese. Despite the fallibility of search engines, the results for the Spanish minority languages are very consistent in their ranking. So are the results for three other European minority languages (Welsh, Frisian, and Faroese), which were ranked between 42nd and 45th in each list, if one omits Latin as a dead language, as Guinovart did.

How representative are these estimates of language presence across the entire public web? To treat these results as being typical of presence in the public web, one has to assume that the web pages stored by AllTheWeb in March 2002 and August 2003 were both good samples of the public web in terms of their linguistic distribution. This is an untestable assumption, now that the AllTheWeb database has been merged into Yahoo!’s, and the search engine no longer offers the user the ability to set more than 36 different language preferences. Moreover, the fallibility of search engine-based estimates has been described at length above. Nonetheless, faute de mieux, Guinovart’s and Mas’s are the most comprehensive estimates we have of language distribution, in terms of web presence, across the public web, and it is reassuring that they provide a high level of consistency in their ranking of languages.

It is interesting to compare the web presence of the dominant language, English, in the most recent published results from search engines and from random sampling. The English web presence of approximately 60.5% found by Guinovart in March 2002 and by Mas in August 2003 using the AllTheWeb search engine is about 11.5 percentage points lower (16% lower in relative terms) than the 72% found by the OCLC studies in 1999 and 2002 using random direct testing, although neither methodology showed any significant decline in the relative English language web presence.
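The gap between the two estimates can be expressed either in percentage points or relative to the OCLC figure. This short Python sketch computes both from the two percentages quoted in this paragraph; the variable names are illustrative.

```python
search_engine_share = 60.5   # % English, AllTheWeb page counts (Guinovart; Mas)
random_sample_share = 72.0   # % English, OCLC random sampling

# Absolute gap, in percentage points.
point_gap = random_sample_share - search_engine_share
# The same gap as a percentage of the OCLC estimate.
relative_gap = 100.0 * point_gap / random_sample_share

print(f"gap in percentage points: {point_gap:.1f}")
print(f"gap relative to OCLC:     {relative_gap:.0f}%")
```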

How can this major discrepancy in estimating English language web presence be explained? Did the Norwegian-invented AllTheWeb database under-represent English because it snowballed from initial non-English language websites? Or did the OCLC software reject too many non-English web pages because they contained insufficient words to allow for automatic language detection? Are there a large number of public web pages, predominantly in English, that are virtually never visited by web crawlers, but that would be represented in any randomly addressed sample? This is a useful question for future research.

Conclusions

The principal aim of this article has been to provide conceptual clarity in distinguishing among the different methodologies available to estimate the relative usage of different human languages on the Internet. A new taxonomy is offered that distinguishes among user profile, user activity, web presence, and diversity index as separate indicators of language diversity on the Internet, and further distinguishes between spoken and written languages. These distinctions help avoid pitfalls in the interpretation of earlier published results.

The taxonomy also helps to resolve the paradox about whether English usage has declined on the Internet. The estimates that show English remaining dominant (and not shrinking) are based on web presence, whereas the estimates showing English in decline are based on user profile. In the most statistically representative results available for web presence, from the OCLC studies, the relative number of web pages in English in the public web remained fairly constant at about 72%, at least from 1998 to 2002. Beyond 2003 we have no reliable published information on web presence. The much-quoted pie-charts and graphs on the Global Reach web site, purporting to show “Global Internet Statistics (by Language)” and which indicate a dramatic decline in the use of English on the Internet over the period 1996–2006, are based on estimates of user profile—of first language speaker populations—and not on actual measurements of either web presence or user activity.

This article has also critiqued the strengths and weaknesses of the relevant published methodologies. It reaches the following conclusions on the best “means for assessing the usage level of each language in cyberspace”—to quote from the key objective of the UNESCO-funded Language Observatory Project (Mikami et al., 2005):

  • User profile projections are useful, when accurate, but only to compare the potential with the actual use of languages on the Internet.

  • Sociolinguistic measurement of user activity is likely to remain limited to micro-studies of targeted groups. Attempts at truly global measurement of user activity can give rise to ethical concerns about privacy (if private communications such as email or VoIP are monitored) and would require resources beyond the means of most research organizations.

  • This leaves web presence as the most practical indicator for estimating actual language use in cyberspace. Current commercial search engines are not reliable tools for this purpose, despite their great user convenience, because of the lack of transparency of their algorithms for web crawling and indexing, their inability to index the whole public web, and (since 2003) their limited range of language search settings.

The best estimates of web presence may be achieved by broadly following the methodology used in the OCLC project from 1998 to 2002, which randomly addressed and analyzed a representative sample of all public web sites (O’Neill et al., 2003). However, this approach will only suffice if the range of the language detection software is greatly extended, in practical stages, to be able eventually to recognize all of the world’s computer-mediated written languages. To cover at least the 365 languages currently used to translate the United Nations’ Universal Declaration of Human Rights (UNHCHR, 2006) would be a worthy aim.
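The random-addressing idea can be illustrated with a minimal sketch. It is not the OCLC implementation, whose details are not reproduced here: the function names sample_public_ipv4 and language_shares are hypothetical, the sketch assumes IPv4 addressing, and it assumes that a separate language detector (not shown) has already assigned one language label to each sampled site.

```python
import ipaddress
import random
from collections import Counter


def sample_public_ipv4(n, seed=None):
    """Draw n distinct random IPv4 addresses from publicly routable space.

    Hypothetical helper: addresses in private, reserved, or multicast
    ranges are discarded, since no public web site can live there.
    """
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global and not addr.is_multicast:
            chosen.add(addr)
    return sorted(chosen)


def language_shares(labels):
    """Convert per-site language labels into relative shares of the sample.

    `labels` is assumed to come from a language detector run on each
    sampled site that actually hosted a public web site.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: count / total for lang, count in counts.items()}


if __name__ == "__main__":
    # Candidate addresses to probe for a public web server (probing not shown).
    for addr in sample_public_ipv4(5, seed=42):
        print(addr)
    # Illustrative labels only, not measured data.
    print(language_shares(["en", "en", "en", "ca"]))
```

Because every candidate address has the same chance of selection, the resulting sample does not depend on which sites a crawler happened to reach by following links. The estimate remains only as good as the language detector applied to the pages found.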

Acknowledgment

This article is based on a chapter of the author’s Ph.D. thesis. The author is grateful for the encouragement and guidance of his thesis supervisors, Professor Roy C. Boland and Associate Professor Peter B. White, both of La Trobe University, Australia; and to the JCMC reviewers, including Editor-in-Chief Susan Herring and Associate Editor John Paolillo, for suggestions leading to the paper’s further improvement.

References

Babel. (1997). Web languages hit parade - June 1997. Retrieved October 11, 2006 from http://alis.isoc.org/palmares.en.html#liste_langues

Climent, S., Moré, J., Oliver, A., Salvatierra, M., Sànchez, I., Taulé, M., & Vallmanya, L. (2003). Bilingual newsgroups in Catalonia: A challenge for machine translation. Journal of Computer-Mediated Communication, 9(1). Retrieved June 30, 2007 from http://jcmc.indiana.edu/vol9/issue1/climent.html

Crystal, D. (2006). Language and the Internet (2nd ed.). Cambridge, UK: Cambridge University Press.

Durham, M. (2003). Language choice on a Swiss mailing list. Journal of Computer-Mediated Communication, 9(1). Retrieved June 30, 2007 from http://jcmc.indiana.edu/vol9/issue1/durham.html

Ethnologue. (1996). Vol. 1. Languages of the World (13th ed.). Dallas, TX: SIL International.

Ethnologue. (2000). Vol. 1. Languages of the World (14th ed.). Dallas, TX: SIL International.

Ethnologue. (2005). Languages of the World (15th ed.). Dallas, TX: SIL International.

Fishman, J. (1967). Bilingualism with and without diglossia; Diglossia with and without bilingualism. Journal of Social Issues, 23, 29–38.

Generalitat de Catalunya. (2003). Cens Lingüistic 2001 [Linguistic Census 2001]. Generalitat de Catalunya, Barcelona.

Global Reach. (2006). Global Internet Statistics (by Language). Retrieved October 11, 2006 from http://www.global-reach.biz/globstats/index.php3

Guinovart, X. G. (2003). A lingua galega en Internet [The Galician language on the Internet]. In A. Bringas & B. Martín (Eds.), Nacionalismo e Globalización: Lingua, Cultura e Identidade (pp. 71–88). Vigo, Spain: Universidade de Vigo.

Lavoie, B. F., & O’Neill, E. T. (1999). How ‘World Wide’ is the Web? Trends in the internationalization of web sites. OCLC Annual Review of Research 1999. Retrieved October 11, 2006 from http://digitalarchive.oclc.org/da/ViewObjectMain.jsp?fileid=0000002655:000000059202&reqid=21527&frame=false

Mas i Hernàndez, J. (2003). La salut del català a Internet [The state of health of Catalan on the Internet]. Retrieved October 11, 2006 from http://www.softcatala.org/articles/article26.htm

Mikami, Y., Zaki abu Bakar, A., Soniert-Iamvanich, V., Vikas, O., Pavol, Z., Zaidi Abdul Rozan, M., Nagy János, G., & Takahashi, T. (2005). Language diversity on the Internet: An Asian view. In Measuring Linguistic Diversity on the Internet (pp. 91–103). Paris: UNESCO Report 142186.

O’Neill, E. T., Lavoie, B. F., & Bennett, R. (2003). Trends in the evolution of the public Web: 1998–2002. D-Lib Magazine, 9(4). Retrieved October 11, 2006 from http://www.dlib.org/dlib/april03/lavoie/04lavoie.html

OECD. (2003). Communications Outlook 2003. Paris: Organisation for Economic Co-Operation and Development.

OECD. (2005). Communications Outlook 2005. Paris: Organisation for Economic Co-Operation and Development.

OECD. (2006, April). Input to the UN Working Group on Internet Governance. Retrieved October 11, 2006 from http://www.oecd.org/dataoecd/1/46/36779934.pdf

Paolillo, J. C. (2005). Language diversity on the Internet. In Measuring Linguistic Diversity on the Internet (pp. 43–89). Paris: UNESCO Report 142186.

Paolillo, J. C., Pimienta, D., & Prado, D. (2005). Measuring Linguistic Diversity on the Internet. Paris: UNESCO Report 142186.

Paolillo, J. C., & Das, A. (2006, March). Evaluating Language Statistics: The Ethnologue and Beyond. Report prepared for the UNESCO Institute for Statistics. Retrieved December 19, 2006 from http://ella.slis.indiana.edu/~paolillo/research/u_lg_rept.pdf

Pimienta, D., Lamey, B., Prado, D., & Sztrum, M. (2001). L5: The Fifth Study on Languages and the Internet. Retrieved October 11, 2006 from http://funredes.org/lc2005/english/L5/L5authors.html

Pimienta, D., & Lamey, B. (2001). Lengua española y culturas hispánicas en la Internet. Comparación con el inglés y el francés [The Spanish language and Hispanic cultures in the Internet. Comparison with English and French]. Valladolid, Spain: II Congreso Internacional de la Lengua Española. Retrieved October 11, 2006 from http://www.funredes.org/lc2005/L5/valladolid.html#_Toc527796430

Pimienta, D. (2005). Linguistic diversity in Cyberspace: Models for development and measurement. In Measuring Linguistic Diversity on the Internet (pp. 13–34). Paris: UNESCO Report 142186.

SIL. (2000). Overview of translation in SIL International (formerly Summer Institute of Linguistics). Retrieved October 11, 2006 from http://www.sil.org/translation/TransinSIL.htm

U. C. Berkeley. (2006). Types of search tools. Retrieved October 11, 2006 from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/ToolsTables.html [no longer available].

UNESCO Portal. (2006). In focus: Measures and indicators. Retrieved October 11, 2006 from http://portal.unesco.org/ci/en/ev.php-URL_ID=20973&URL_DO=DO_TOPIC&URL_SECTION=201.html

UNHCHR. (2006). The Universal Declaration of Human Rights. Retrieved December 19, 2006 from http://www.unhchr.ch/udhr/

Van Couvering, E. (2007). Is relevance relevant? Market, science, and war: Discourses of search engine quality. Journal of Computer-Mediated Communication, 12(3), article 6. http://jcmc.indiana.edu/vol12/issue3/vancouvering.html

Wikipedia. (2006). Search engine. Retrieved October 11, 2006 from http://en.wikipedia.org/wiki/Search_engine

Wodak, R., & Wright, S. (2006). The European Union in cyberspace: Multilingual democratic participation in a virtual public sphere? Journal of Language and Politics, 5(2), 251–275.

About the Author

Peter Gerrand is a doctoral candidate within the Spanish/Catalan/Galician and Media Studies programs at La Trobe University and is also a Professorial Fellow in telecommunications in the Engineering Faculty of the University of Melbourne. His doctoral research examines the promotion of minority languages on the Internet, with primary focus on the regional languages of Spain.

Address: Faculty of Engineering, University of Melbourne, Vic. 3010, Australia