Here’s why language coding needs an uplift

(Credit: Unsplash)

This article is brought to you thanks to the collaboration of The European Sting with the World Economic Forum.

Author: Erik Vogt, Vice President Enterprise Solutions, Appen


  • Languages have been standardized through the ISO 639-1 codes, but the growth of digital content makes more precise coding urgent.
  • ISO 639-3 moves from two-letter language codes to a three-letter standard that takes into account more regional dialects and language variants, e.g. Mandarin and Cantonese rather than just Chinese.
  • For even more precise language coding, ISO 639-3 should be used with the ISO 3166 country codes, which is particularly important for metadata.

The case for standardization is well established. Not only do standards assist regulation but they also help effectively join up systems across a globalized economy. Hundreds of functions are subject to codes or standardization and languages are one of them.

Internationally recognized codes representing each language, language family and dialect help country systems and organizations identify and manage data correctly. They are used for bibliographic purposes within libraries, information management systems, databases and websites, and to ensure that machine learning training data covers the language it is intended for. What’s more, correctly aligning language variants is not only efficient, it is also convenient and protects your brand.

So which code should you look for to get a precise language designation? The comprehensive ISO 639-3 standard, used together with ISO 3166 country codes, unambiguously identifies almost all known languages in the world.

Pre ISO 639-3

Before ISO 639-3, however, there was ISO 639-1, which used a two-letter designation. But as the digital world has grown, so has the demand for more precise language support. For instance, “zh” for Chinese under ISO 639-1 is “zho” under ISO 639-3, with around 16 additional language codes for different dialects, e.g. “cdo” for Min Dong Chinese, “cmn” for Mandarin Chinese and “hak” for Hakka Chinese.
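The relationship described above, where one broad code expands into several more specific ones, can be sketched as a simple lookup. This is a minimal illustration rather than an official table: the dictionary holds only the codes named in the text plus Yue Chinese (“yue”, which covers Cantonese), and the names MACROLANGUAGE_MEMBERS and individual_languages are hypothetical.

```python
# Illustrative subset of ISO 639-3 macrolanguage membership.
# A production system would load the full published code tables
# instead of hard-coding a handful of entries like this.
MACROLANGUAGE_MEMBERS = {
    "zho": {
        "cdo": "Min Dong Chinese",
        "cmn": "Mandarin Chinese",
        "hak": "Hakka Chinese",
        "yue": "Yue Chinese",
    },
}


def individual_languages(code: str) -> list[str]:
    """Return the individual language codes covered by `code`.

    A code that is not a known macrolanguage is returned unchanged,
    wrapped in a list, because it already names a single language.
    """
    members = MACROLANGUAGE_MEMBERS.get(code)
    if members is None:
        return [code]
    return sorted(members)


print(individual_languages("zho"))  # ['cdo', 'cmn', 'hak', 'yue']
print(individual_languages("eng"))  # ['eng']
```

A request tagged only “zho” is therefore ambiguous: it could refer to any of the member languages, which is exactly the ambiguity the three-letter codes are meant to remove.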

In the spirit of the 2022 World Cup in Qatar, we could take the English language example of “football” to understand the importance of differentiating language variants. US English speakers understand football as a completely different game from speakers in, for example, the UK, where the word refers to what Americans call soccer. However, ISO 639 does not actually differentiate between US and UK English; that distinction is made through the country codes.

Even with all the various forms of English around the world, ISO 639 does not count English as a macrolanguage. The other English codes are mostly Creole or Pidgin variants, such as Jamaican Creole English. That contrast perhaps underlines how inappropriate it is to collate Arabic variants under one code when, for example, Egyptian Arabic (arz) is differentiated from Standard Arabic (arb).

When it comes to lexicons, training data and data management solutions, differentiating languages is crucial to avoid messy results. The best practice for most applications is to combine ISO 639-3 and ISO 3166 to identify the specific language and region you intend to target.

“Delivering increasingly personalized solutions to end customers means precise language ID is a must so applications can align with the end user expectations in each region and their spoken language.” (Erik Vogt, Vice President Enterprise Solutions, Appen Limited)

ISO standards for languages

The International Organization for Standardization (ISO) has released five parts of its language identification standard, ISO 639, which establishes internationally recognized codes (of two, three or four letters) for representing languages or language families.

Part 1 – ISO 639-1 – is the oldest standard, representing the majority of languages using a two-letter code. It covers the most commonly spoken languages but doesn’t account for variations within languages. Parts 2-5 use three-letter codes and provide more local combinations to account for all known natural languages, living or extinct. ISO 639-3 extends much further than ISO 639-2 to cover over 7,000 languages and is intended for use as metadata codes. It is commonly used in computer and information systems, such as the web and SaaS applications, to support many different languages.

Delivering increasingly personalized solutions to end customers means precise language ID is a must so applications can align with the end user expectations in each region and their spoken language. The 3-letter ISO 639-3 and ISO 3166 codes provide the ability to differentiate these unique languages. Ethnologue, one of the largest and most comprehensive language databases available today, uses the 3-letter ISO system.

Yet there is still a surprising number of requests for training data in undifferentiated languages: requests that either name a language not defined by ISO 639 or expect outputs that include two or more variants sharing an ISO 639-1 code. The later the migration to ISO 639-3 happens, the more ambiguity will build up in systems where language classification is necessary, with a higher risk of cross-variant contamination and a higher cost.

Once you start working with languages that have more than one variant, it’s essential to migrate to the 3-letter code system. While a one-to-one mapping exists from every 2-letter code to a 3-letter code, the reverse is not true: most 3-letter codes have no 2-letter equivalent. Updating procedures to the ISO 639-3 standard early is therefore a future-proof move.
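The asymmetry between the two code systems can be shown with a short sketch. It assumes a tiny hand-written subset of the ISO 639-1 to ISO 639-3 mapping; the names TWO_TO_THREE, to_three_letter and to_two_letter are hypothetical and chosen only for this illustration.

```python
from typing import Optional

# Illustrative subset of the ISO 639-1 -> ISO 639-3 mapping.
TWO_TO_THREE = {"ar": "ara", "en": "eng", "zh": "zho"}

# Inverting the mapping only recovers the three-letter codes that
# had a two-letter form in the first place.
THREE_TO_TWO = {three: two for two, three in TWO_TO_THREE.items()}


def to_three_letter(code: str) -> str:
    """Upgrade a two-letter code; raises KeyError if it is not in the table."""
    return TWO_TO_THREE[code.lower()]


def to_two_letter(code: str) -> Optional[str]:
    """Downgrade a three-letter code, or return None when this table
    holds no two-letter equivalent for it."""
    return THREE_TO_TWO.get(code.lower())


print(to_three_letter("zh"))  # zho
print(to_two_letter("zho"))   # zh
print(to_two_letter("cmn"))   # None
print(to_two_letter("arz"))   # None
```

Upgrading stored data is a plain lookup, whereas a system that has already recorded “cmn” or “arz” cannot fall back to a two-letter code without losing information.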

Benefits of ISO standards

Training natural language processing (NLP) models for spoken languages requires detail and accuracy to be effective. The best combination is ISO 639-3 language codes and ISO 3166 country codes. For example, English (eng) can be divided into American English (eng-USA), British English (eng-GBR), Canadian English (eng-CAN), Australian English (eng-AUS), South African English (eng-ZAF), etc. A voice assistant designed to recognize speech should be able to identify the English dialect to understand the request accurately and produce the correct output.
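As a rough sketch of how such combined identifiers might be handled in a data pipeline, the snippet below parses and validates tags in the “language-COUNTRY” form used in the examples above. The class LanguageTag, the function parse_tag and the two small code sets are assumptions made for illustration; a real system would load the complete ISO 639-3 and ISO 3166 code lists.

```python
from dataclasses import dataclass

# Illustrative subsets only, not the full standards.
KNOWN_LANGUAGES = {"eng", "cmn", "arz"}
KNOWN_COUNTRIES = {"USA", "GBR", "CAN", "AUS", "ZAF"}


@dataclass(frozen=True)
class LanguageTag:
    language: str  # ISO 639-3 code, lower case
    country: str   # ISO 3166-1 alpha-3 code, upper case

    def __str__(self) -> str:
        return f"{self.language}-{self.country}"


def parse_tag(text: str) -> LanguageTag:
    """Parse 'eng-USA' style text into a normalized, validated tag."""
    parts = text.strip().split("-")
    if len(parts) != 2:
        raise ValueError(f"expected 'language-COUNTRY', got {text!r}")
    language, country = parts[0].lower(), parts[1].upper()
    if language not in KNOWN_LANGUAGES:
        raise ValueError(f"unknown language code: {language!r}")
    if country not in KNOWN_COUNTRIES:
        raise ValueError(f"unknown country code: {country!r}")
    return LanguageTag(language, country)


us = parse_tag("eng-USA")
uk = parse_tag("ENG-gbr")  # case is normalized on input
print(us, uk)    # eng-USA eng-GBR
print(us == uk)  # False
```

Because the two variants become distinct values, training data for American and British English can be stored and filtered separately rather than being mixed under a single “English” label.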

The two key benefits of consistently applying these ISO standards throughout a system are:

  • Ability to accurately identify candidates with the correct language skills for every task.
  • Ability to consistently refer to the same language across organizations and applications.

Ignoring the increasing need for precise data leads to inconsistent, incomplete and potentially inaccurate local data. As a result, not all relevant users are reliably identified, which lowers quality and hampers data collection efforts. Fundamentally, however, you’d be ignoring a large subset of your customer base with regional dialects, and there is no better reason to adopt the three-letter ISO code plus country code than being able to deliver the services you offer to all of your end users.
