
Why ChatGPT struggles with Cantonese like a Mandarin speaker


Despite having 85 million speakers, Cantonese remains poorly supported by ChatGPT, whose responses to Cantonese appear to be modeled on a more dominant, officially recognized language. This raises concerns. The linguist Edward Sapir suggested that language shapes how we interact with the world. In the age of AI, what happens when a language cannot speak for itself? Will AI homogenize our perception of Cantonese? Behind AI's half-assed Cantonese is a tug-of-war between linguistic heritage and the allocation of social resources.

Link to the original version: https://theinitium.com/article...

This article was originally published in Chinese. This machine-translated English version is for reference only.

———————————————————————————————————————————————————————

Have you ever heard ChatGPT speak Cantonese?

If you are a native Mandarin speaker, congratulations on unlocking the "proficient in Cantonese" achievement. Cantonese speakers, on the other hand, may be in for a rude awakening - ChatGPT speaks with a peculiar accent, like a foreigner trying to speak Cantonese.

In a September 2023 update, ChatGPT "spoke" for the first time; on May 13, 2024, the latest-generation model, GPT-4o, was released. Although the new voice feature has not yet been officially rolled out and exists only in demos, last year's update already offers a glimpse of ChatGPT's ability to hold multilingual voice conversations.

Many people have noticed that ChatGPT speaks Cantonese with a strong accent; although the voice is natural and lifelike, this "real person" is clearly not a native Cantonese speaker.

Grammatically, the generated text leans toward written Chinese, with only occasional spoken Cantonese expressions. When asked to write Cantonese rap lyrics, suno.ai produced lines with unclear meanings, such as "The neighbors can imitate this, the characteristics of Hong Kong are really beautiful". When we fed this line back to ChatGPT, it noted that "this line seems to be a direct translation from Mandarin, or Mandarin syntax mixed with Cantonese".

By comparison, we found that these errors do not occur when the models use Putonghua (spoken Mandarin). ChatGPT's Cantonese, however, can at best be described as "half salty, half bland" (a Cantonese idiom for unskilled, half-assed speech).

So what is going on: does ChatGPT simply not speak Cantonese? Instead of admitting as much, it tries to imagine the language, and that imagination is plainly based on a stronger, more officially sanctioned one. Is this a problem?

The linguist and anthropologist Edward Sapir argued that spoken language affects the way people interact with the world. What does it mean if a language cannot speak for itself in the age of artificial intelligence? Will we come to share AI's vision of what Cantonese looks like?

Language without Resources

The voice feature ChatGPT introduced last year is a pipeline of three main parts: first, Whisper, an open-source speech recognition system, converts the spoken language to text; then a ChatGPT text model generates the textual response; finally, a text-to-speech (TTS) model generates the audio and fine-tunes the pronunciation.

That is, the content of the conversation is still generated by GPT-3.5, whose training set is the vast amount of text that already exists on the web, not audio data.
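To make that three-stage cascade concrete, here is a minimal sketch of how such a pipeline could be wired up with the openai Python SDK. The model names, the voice, and the exact calls are illustrative assumptions about the public API, not a description of OpenAI's internal system.

```python
# A minimal sketch of a cascaded voice pipeline: speech -> text -> text -> speech.
# Assumes the openai Python SDK (>=1.0) and an OPENAI_API_KEY in the environment;
# the model names ("whisper-1", "gpt-3.5-turbo", "tts-1") are illustrative choices.
from openai import OpenAI

client = OpenAI()

def voice_reply(input_audio_path: str, output_audio_path: str) -> str:
    # 1. Speech recognition: Whisper converts the spoken question into text.
    with open(input_audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Text generation: a chat model writes the textual answer.
    chat = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = chat.choices[0].message.content

    # 3. Text-to-speech: a TTS model reads the answer aloud.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    with open(output_audio_path, "wb") as out:
        out.write(speech.content)
    return answer
```

Nothing in the middle stage ever hears audio: whatever accent the system has is inherited from the text it was trained on and from the TTS model at the end.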

In this respect, Cantonese is at a significant disadvantage because it is largely spoken rather than written. Officially, the written language used in Cantonese-speaking areas is standard written Chinese, which is derived from northern Chinese and is closer to Putonghua than to Cantonese, while Written Cantonese, which is a system of writing that conforms to the grammar and vocabulary of spoken Cantonese, is used mainly in informal settings such as Internet forums.

This usage rarely follows consistent rules. "There are about 30% of Cantonese words I don't know how to write," Frank says. When people hit a word they cannot write in an online chat, they often just type out something with a similar pronunciation instead. For example, "亂噏二十四" (lyun6 up1 jaa6 sei3, meaning to talk nonsense) is often typed as "亂up廿四", with the English letters "up" standing in for "噏". Although this is generally understood, it further complicates existing Cantonese text and leaves it subject to competing standards.

The emergence of large language models has made people aware of how much training data matters for AI, and of the biases it can carry. But the gap in data resources between languages long predates generative AI. Most natural language processing systems are designed and tested on resource-rich languages, and of all the world's living languages, only about 20 are considered "high-resource", among them English, Spanish, Mandarin, French, German, Arabic, Japanese, and Korean.

Cantonese, with its 85 million speakers, is often treated as a low-resource language in natural language processing (NLP). Take Wikipedia, a common starting point for deep learning: the compressed English version is 15.6 GB, the compressed Chinese version (traditional and simplified) is 1.7 GB, and the compressed Cantonese version is only 52 MB, roughly 33 times smaller than the Chinese version.
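A quick back-of-the-envelope check of those ratios, using the dump sizes cited above:

```python
# Rough comparison of compressed Wikipedia dump sizes (figures as cited above, in MB).
sizes_mb = {"English": 15.6 * 1024, "Chinese (trad. + simp.)": 1.7 * 1024, "Cantonese": 52}

for name, size in sizes_mb.items():
    ratio = size / sizes_mb["Cantonese"]
    print(f"{name:>25}: {size:8.0f} MB  ({ratio:.0f}x the Cantonese dump)")
# Chinese / Cantonese is roughly 33x; English / Cantonese is roughly 300x.
```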

Similarly, in Common Voice, which is the largest public speech corpus in existence, there are 1232 hours of speech data for Chinese (China), 141 hours for Chinese (Hong Kong), and 198 hours for Cantonese.

A lack of corpora profoundly affects natural language processing performance; a 2018 study found that with fewer than 13,000 parallel sentences, machine translation cannot reach reasonable quality. It also affects how well a machine can "take dictation": performance tests of Whisper (version V2), the open-source speech recognition model used by ChatGPT's voice feature, show a markedly higher character error rate for Cantonese than for Mandarin.
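The character error rate used in such evaluations is simply the edit distance between the reference transcript and the system's output, divided by the reference length. A minimal illustration (the sentence pair below is invented for the example, not data from any Whisper test):

```python
# Character error rate (CER): Levenshtein edit distance between reference and hypothesis,
# normalized by the reference length. The example strings are invented for illustration.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    prev = list(range(len(hyp) + 1))          # dynamic-programming edit-distance table, row 0
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# 3 substituted characters out of 7 -> CER of about 0.43.
print(cer("佢哋今日唔得閒", "他們今日不得閒"))
</pre>
```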

The model's textual performance reflects Cantonese's lack of resources, but how do pronunciation and intonation, which determine how it sounds, go wrong?

How did the machine learn to talk?

The idea of making machines talk has been around for a long time, dating back to the 17th century. Early attempts used organs or bellows to mechanically pump air through elaborate devices that mimicked the chest, vocal cords, and mouth. The idea was later taken up by an inventor named Joseph Faber, who built a talking dummy dressed in a Turkish costume - but at the time, people didn't see the point.

It wasn't until household appliances became more commonplace that the idea of making machines talk became more interesting.

At the 1939 World's Fair, the Voder, a speech synthesizer invented by Bell Labs engineer Homer Dudley, gave mankind its first "machine voice". The Voder's mechanism was simple enough for the audience to see it all at work: a female operator sat at a machine resembling a toy piano and, by deftly working ten keys, produced vocal effects similar to the friction of the vocal cords. A foot pedal let her change the pitch, simulating a brighter or heavier tone. Meanwhile, a host kept asking the audience for new words, to prove that the Voder's voice was not pre-recorded.

The New York Times described the Voder's voice as "an alien greeting from the depths of the ocean", or like a drunken man slurring out unintelligible words. But at the time, the technology was enough to amaze. During the fair, the Voder drew more than five million visitors from around the world.

Early visions of intelligent robots and alien voices drew much inspiration from these devices. In 1961, scientists at Bell Labs made an IBM 7094 sing the 19th-century English ditty "Daisy Bell", the earliest known song performed by a computer with a synthesized voice. Arthur C. Clarke, the author of 2001: A Space Odyssey, visited Bell Labs to hear the IBM 7094 sing, and made "Daisy Bell" the first song learned by the HAL 9000 supercomputer in his novel. In the film version, as HAL 9000 is being shut down, it begins to sing "Daisy Bell", its agile, anthropomorphic voice gradually degrading into a mechanical growl.

Speech synthesis has since evolved over decades. Before the neural network technology of the AI era matured, concatenative synthesis and formant synthesis were the most common methods; in fact, many everyday speech functions, such as screen readers, are still built on these two. Of the two, formant synthesis dominated in the early days. Its principle of sound generation is similar to the Voder's: a controlled combination of parameters such as fundamental frequency, formant frequencies, and voicing can generate an unlimited range of sounds. This has the advantage of working for any language: as early as 1939, the Voder could speak French.
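Under heavy simplification, a formant synthesizer can be sketched as a source-filter model: a pulse train at the fundamental frequency (the "voice source") is shaped by a few resonances (the formants). The sketch below uses only numpy; the formant values are rough textbook figures for an /a/-like vowel, chosen for illustration rather than taken from any real system.

```python
# A toy formant synthesizer: an impulse train at the fundamental frequency is convolved
# with decaying sinusoids at a few formant frequencies. Formant values are rough
# textbook-style figures for an /a/-like vowel, used purely for illustration.
import numpy as np

def synth_vowel(f0=120.0, formants=(800, 1200, 2500), sr=16000, dur=0.5):
    n = int(sr * dur)
    # Voice source: one impulse per glottal period (pitch = f0).
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0
    # Vocal-tract filter: each formant modeled as an exponentially decaying cosine.
    t = np.arange(int(0.03 * sr)) / sr
    out = np.zeros(n)
    for f in formants:
        impulse_response = np.exp(-200 * t) * np.cos(2 * np.pi * f * t)
        out += np.convolve(source, impulse_response)[:n]
    return out / np.max(np.abs(out))   # normalize to [-1, 1]

vowel = synth_vowel()   # changing f0 changes the pitch; changing the formants changes the vowel
```

Because everything is driven by a handful of parameters, the same machinery can in principle voice any language - which is exactly the property that made eSpeak attractive.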

In 2006, Huang Guanneng, a Guangzhou native studying for a master's degree in computer software theory at Sun Yat-sen University, was planning his thesis project - a Linux browser for the visually impaired - when he came across eSpeak, an open-source speech synthesizer based on formant synthesis. Because of this multilingual advantage, eSpeak was quickly put to practical use: in 2010, Google Translate began adding read-aloud capabilities for a large number of languages, including Mandarin, Finnish, and Indonesian, powered by eSpeak.

Huang Guanneng decided to add support for his native language, Cantonese, to eSpeak. However, eSpeak's synthesized pronunciation sounds obviously artificial, "just like learning Chinese not through Hanyu Pinyin but through English phonetic symbols," Huang said.

That is why he created Ekho TTS, a speech synthesizer that now supports Cantonese, Mandarin, and even lesser-resourced languages such as Hakka, Tibetan, Yayan, and Taishanese. Ekho uses concatenative synthesis, or put simply, collage: human vocalizations are recorded in advance and then spliced together whenever the machine "speaks". This makes the pronunciation of individual syllables more standardized, and common phrases recorded in their entirety sound more natural. Huang compiled a Cantonese pronunciation list of 5,005 sounds, which took two to three hours to record from start to finish.
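The concatenative approach can be pictured as lookup-and-splice: each syllable maps to a pre-recorded clip, and an utterance is those clips joined end to end. In the toy sketch below, short sine beeps stand in for human recordings so the example runs on its own; the tiny two-syllable inventory is invented for illustration and is not Ekho's actual data layout.

```python
# A toy concatenative synthesizer: each syllable maps to a pre-recorded clip, and an
# utterance is the clips spliced end to end. Sine beeps stand in for real recordings
# so the sketch is self-contained; a real system (like Ekho) stores human recordings.
import numpy as np

SR = 16000

def fake_recording(freq, dur=0.25):
    # Stand-in for one recorded syllable; Ekho's list has 5,005 such recordings.
    t = np.arange(int(SR * dur)) / SR
    return 0.5 * np.sin(2 * np.pi * freq * t)

# Hypothetical inventory keyed by Jyutping syllable-with-tone.
INVENTORY = {"nei5": fake_recording(220), "hou2": fake_recording(330)}

def synth(syllables, pause=0.05):
    silence = np.zeros(int(SR * pause))
    pieces = []
    for syl in syllables:
        pieces += [INVENTORY[syl], silence]   # splice clip + a short pause
    return np.concatenate(pieces)

audio = synth(["nei5", "hou2"])   # "nei5 hou2" = 你好 ("hello")
```

The trade-off is obvious: pronunciation is only as good as the recordings, and every new language or speaker means recording a whole inventory again.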

The advent of deep learning has revolutionized the field. Deep-learning speech synthesis learns the mapping between text and speech features from large speech databases, without relying on predefined linguistic rules or recorded speech units. The technique is a huge leap in naturalness: machines now often sound as good as real humans, and can clone a person's timbre and speech habits from just a few seconds of audio. This is the approach used in ChatGPT's TTS module.

Compared with concatenative and formant synthesis, such systems save a great deal of upfront labor, but they demand far more paired text and speech data. For example, Tacotron, an end-to-end model introduced by Google in 2017, needs more than 10 hours of training data to achieve good speech quality.

To cope with the resource scarcity of many languages, researchers have in recent years turned to transfer learning: first train a general model on datasets of high-resource languages, then transfer what it has learned to synthesize low-resource languages. To some extent, the transferred patterns still carry features of the original data, just as someone learning a new language brings along knowledge of their native tongue. In 2019, the Tacotron team proposed a model that could clone the same speaker's voice across languages. In the demo, native English speakers "speak" Mandarin with a very distinct foreign accent, even though the pronunciation is standard.
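In outline, the transfer-learning recipe is: start from weights learned on a high-resource language, freeze most of the network, and fine-tune the remainder on the small low-resource dataset. The PyTorch sketch below is generic and uses placeholder modules and random tensors in place of real data; it is not the Tacotron team's code.

```python
# Generic transfer-learning outline in PyTorch: reuse weights learned on a high-resource
# language, freeze most of the network, and fine-tune the rest on a small low-resource set.
# "TTSModel" is a placeholder stand-in; random tensors below stand in for real data.
import torch
import torch.nn as nn

class TTSModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(256, 256)   # text/phoneme encoder (greatly simplified)
        self.decoder = nn.Linear(256, 80)    # acoustic decoder, e.g. predicting mel frames

    def forward(self, x):
        return self.decoder(torch.relu(self.encoder(x)))

model = TTSModel()
# In practice, weights pretrained on a high-resource language would be loaded here, e.g.:
# model.load_state_dict(torch.load("pretrained_mandarin_tts.pt"))  # hypothetical checkpoint

for p in model.encoder.parameters():
    p.requires_grad = False                  # keep the transferred knowledge frozen

optimizer = torch.optim.Adam(model.decoder.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Toy stand-in for a small low-resource (e.g. Cantonese) text-to-speech dataset.
loader = [(torch.randn(8, 256), torch.randn(8, 80)) for _ in range(10)]

for text_feats, mel_target in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(text_feats), mel_target)
    loss.backward()
    optimizer.step()
```

Because the frozen part keeps what it learned from the original language, traces of that language leak into the synthesized low-resource speech - the "foreign accent" heard in the Tacotron demo.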

The latest GPT-4o model goes even further in terms of using a single solution for universal problems. As OpenAI explained, they trained an end-to-end model across text, vision, and audio, with all inputs and outputs processed by the same neural network. How the model handles different languages is unclear, but it looks like it's more generalizable across tasks than ever.

For Cantonese and Putonghua, however, this kind of generalization often complicates matters rather than simplifying them.

A commentary in the South China Morning Post pointed out that when Hong Kong people write, they have to use the Standard Chinese word "他們" to be understood by all Chinese speakers. "他們" (taa1 mun4 in Cantonese), meaning "they", is almost never used in spoken Cantonese; the spoken Cantonese word for "they" is "佢哋" (keoi5 dei6).

In linguistics, the concept of "language stratification" or "diglossia" refers to the existence of two closely related languages in a given society, one more prestigious and usually used by the government, and the other often used orally as a dialect.

In the Chinese context, Putonghua is the most prestigious language, used for formal writing, news broadcasting, schooling, and government affairs. Local dialects, such as Cantonese, Minnan (Taiwanese), and Shanghainese, are lower-level languages used mainly for everyday communication in families and local communities.

As a result, in Guangdong, Hong Kong, and Macau, Cantonese is the native language of most of the population, while official writing is usually done in Standard Chinese, which follows Putonghua.

The two share many similarities yet differ in practice, producing many mismatched pairs such as "他們" and "佢哋", which make the transition from Putonghua to Cantonese all the more difficult and confusing.

Increasingly Marginalized Cantonese

At this point, it would seem safe to conclude that the poor performance of Cantonese speech synthesis is simply the technology failing on a low-resource language: deep learning models hallucinate sounds when confronted with unfamiliar words.

But Tan Lee, a professor in the Department of Electrical Engineering at the Chinese University of Hong Kong, disagreed after listening to ChatGPT's speech performance.

Tan Lee has been involved in speech-related research since the early 1990s and has led the development of a number of widely used, Cantonese-centered speech technologies. In 2002, he and his team released CU Corpora, then the largest Cantonese speech corpus of its kind in the world, containing recordings from more than 2,000 speakers. Many companies and research institutes bought the resource when developing Cantonese features, including for Apple's first generation of speech recognition.

In his opinion, ChatGPT's Cantonese voice performance is "not very good, mainly because it is unstable, and the voice quality and pronunciation accuracy are not very satisfactory". This poor performance, however, is not due to technical limitations. In fact, many voice products on the market today with Cantonese capabilities are of much higher quality - so much so that when he first saw ChatGPT's performance in online videos, he could hardly believe it and suspected the clips were faked.

Take the systems developed at the Chinese University of Hong Kong: with the most advanced voice AI, it is already hard to tell real people from synthesized voices. Compared with better-resourced languages such as Mandarin and English, AI Cantonese falls short only in more personal, lifelike scenarios - dialogue between parents and children, psychological counseling, job interviews - where it sounds comparatively cold.

"Strictly speaking, it is not technically difficult, but the key lies in the given social resources," Tan Lee said.

The field of speech synthesis has changed dramatically over the past 20 years; the data in CU Corpora is "probably less than one ten-thousandth" of today's databases. The commercialization of speech technology has turned data into a marketable resource, and data companies can supply large amounts of customized data at any time. And with recent advances in speech recognition, the scarcity of parallel text-speech data for a mostly spoken language is no longer a problem. Today, Tan Lee believes, it is no longer accurate to call Cantonese a "low-resource language".

That's why he believes that the performance of Cantonese in machines on the market does not reflect technological capabilities, but rather market and commercial considerations. "Suppose the whole of China learns Cantonese together now, it's definitely doable; or now that Hong Kong and the mainland are becoming more integrated, and Hong Kong primary and secondary schools can no longer use Cantonese, it's a completely different story," he says.

Deep learning only "spits out what it eats", and its Mandarin accent shows how far Cantonese has been marginalized in the real world.

Huang Guanneng's daughter has just started her second year of kindergarten in Guangzhou. She had spoken only Cantonese since birth, yet after a month of school she had become proficient in Putonghua. These days, even in daily conversation with family and neighbors, she is more comfortable in Putonghua; only with Huang is she still willing to speak Cantonese. In Huang's eyes, ChatGPT behaves much like his daughter speaking Cantonese today: when she cannot remember how to say a word, she substitutes the Putonghua word or guesses the pronunciation from Putonghua.

This is the result of Cantonese having long been slighted in Guangdong, and even excluded from official contexts. A 1981 document from the Guangdong Provincial People's Government stated that "the promotion of Putonghua is a political task", and that, especially in Guangdong, where dialects are complex and there is extensive communication with the outside world, the aim was to "strive to make Putonghua the common language of all public occasions in large and medium-sized cities within three to five years, and to make Putonghua basically universal in all kinds of schools within six years".

Frank, who grew up in Guangzhou, also remembers that in her childhood, films aired on public television channels were dubbed into Mandarin, while other foreign films were shown only with subtitles.

The decline of Cantonese and of its number of speakers, along with the exclusion of Cantonese from school campuses, has sparked heated debate about the survival of the language and the identity tied to it. In 2010, a massive online and offline campaign to "support Cantonese" took place in Guangzhou. Reports that year likened the debate to the French short story "The Last Lesson", arguing that half a century of cultural radicalism had withered the branches of an otherwise lush language.

For Hong Kong, Cantonese is also a key carrier of local culture, with Hong Kong films and music shaping social life here.

In 2014, an article published on the Education Bureau's official website referred to Cantonese as "a Chinese dialect that is not an official language", sparking an outcry that ended with an apology from the bureau. In August 2023, Societas Linguistica Hongkongensis, a Hong Kong group devoted to defending Cantonese, announced its dissolution. In an interview, its founder, Chan Lok-hang, described the current state of affairs: the government has been actively promoting the use of Putonghua to teach the Chinese language subject, and has only "slowed down the pace" because of public concern.

All this shows the importance of Cantonese in the minds of Hong Kong people, but it also shows the long-term pressure on the language, its vulnerability without official status, and the constant tug-of-war between the government and the public.

Unrepresented Voices

These linguistic hallucinations are not unique to Cantonese; users on Reddit and OpenAI's discussion boards report that ChatGPT behaves similarly in other non-English languages:

"It recognizes Italian very well and always understands and expresses itself fluently, like a real person. But strangely, it has a British accent, like an Englishman speaking Italian".

"It has an American accent. I hate it, so I choose not to use it."

"Dutch, too. It's annoying. It's like it's been trained with English phonemes."

Linguistically, an accent is a way of pronouncing words; depending on geography, social class, and other factors, each person makes somewhat different pronunciation choices, reflected in intonation, stress, or vocabulary. Interestingly, many widely discussed accents, such as Singlish, are largely the product of habits people carry over from their native languages as they try to master English; they reflect the diversity of the world's languages. What artificial intelligence demonstrates is the reverse: the invasion of regional languages by mainstream ones.

Technology is amplifying this invasion, as highlighted in a Statista data report from February this year, which found that while only 4.6 percent of the world's population speaks English as a native language, it overwhelmingly accounts for 58.8 percent of online text, meaning it has a much greater influence online than it does in reality. Even if you include all English speakers, these 1.46 billion people represent less than 20% of the world's population, which means that roughly four-fifths of the world's population is unable to understand most of what is happening on the Web. Moreover, it would be difficult for them to get an English-speaking AI to work for them.

Some African computer scientists have noticed that ChatGPT often misinterprets African languages and translates them crudely, giving a "mixed and hilarious" performance in Zulu (a Bantu language with about 9 million speakers worldwide). Questions in Tigrinya (spoken mainly in Eritrea and Ethiopia, with about 8 million speakers worldwide) were met with garbled responses. The findings raised concerns that without AI tools for African languages that can recognize African names and places, Africans will find it harder to participate in the global economy - e-commerce, access to information, automated production - and will be shut out of economic opportunities.

Treating a particular language as the "gold standard" in training also biases AI's judgment. A 2023 Stanford University study found that AI detectors incorrectly flagged a large share of TOEFL essays (written by non-native English speakers) as AI-generated, while almost never flagging essays by native speakers. Another study found that AI speech recognition made nearly twice as many errors with Black speakers as with white speakers, and that the errors stemmed not from grammar but from "phonetic, phonological, or rhythmic features" - in other words, accent.

Even more troubling, in mock-trial experiments, large language models handed down the death penalty to speakers of African American English at a higher rate than to speakers of Standard American English.

Concerned voices have pointed out that deploying existing AI for convenience, without weighing the shortcomings of the underlying technology, can have serious consequences. Some courts, for example, have already begun using automatic speech recognition for transcripts; it is more likely to mistranscribe parties who speak with an accent or in a language other than English, which can contribute to unfavorable verdicts.

Looking further ahead, will people give up or change their accents to be understood by AI? Globalization and socio-economic development have already brought about such changes. Frank's Ghanaian classmates have described to her the current state of language use in African countries: written texts are basically in English, even personal ones such as letters, and spoken language is so peppered with English words that locals are gradually forgetting native words and expressions.

In Tan Lee's view, people are falling into a kind of obsession with machines. "We're trying to talk to machines because they're good at what they do," he says, but he thinks this puts the cart before the horse. "Why do we talk? The purpose of talking is not to turn speech into text or to generate answers. In the real world, we speak to communicate."

He believes technology should be developed to help people communicate better with each other, not just with computers. From that premise, "it's easy to think of a lot of problems that need to be solved - someone who can't hear, maybe because they're deaf, maybe they're too far away, maybe they don't understand the language; maybe an adult can't speak the language of a child, and a child can't speak the language of an adult."

There are a lot of fun language technologies out there today, but do they make us communicate more smoothly? Do they embrace everyone's differences, or do they bring people closer to the mainstream?

While people celebrate ChatGPT's innovative breakthroughs, some basic applications in daily life still aren't benefiting. Tan Lee still hears synthesized voices mispronounce words in airport announcements: "The first thing in communication is accuracy, and it's unacceptable that this hasn't been achieved."

A few years ago, Huang stopped maintaining the Android version of Ekho, but before long users came asking him to revive it. Only then did he learn that no free Cantonese TTS was left for Android.

By today's standards, the Ekho that Huang Guanneng built is thoroughly out of date, yet it remains unique. As a local independent developer, he brought his personal experience into the design. The Cantonese he recorded contains seven tones, the seventh of which does not exist in Jyutping, the Cantonese romanization scheme proposed by the Linguistic Society of Hong Kong. "The character '煙' (cigarette, smoke) is pronounced with different tones in '抽煙' (smoking) and '煙火' (fireworks) - the first and the seventh tone."

When compiling the pronunciation dictionary, he consulted the developers of Jyutping and learned that, over time, the younger generation in Hong Kong could no longer tell the first and seventh tones apart, so the distinction had gradually disappeared. He still chose to include the seventh tone, not because of any recognized standard but out of personal memory: "Native Guangzhou people can hear it, and it is still very common today."

Old Guangzhou people can tell whether you are a local or an outsider just from this one sound.

