Malware. Data theft. Ransomware. Everyone wants to know who was behind the latest audacious attack. Several attempts have been made over the years to use linguistics to identify perpetrators, but when it comes to attribution, there are limitations to using this method.
Linguistic analysis came up recently when analysts at intelligence firm Flashpoint said there was a Chinese link to the WannaCry ransomware. Much of the security research up to that point had pointed to North Korean ties, as the attacks reused infrastructure components associated with the shadowy Lazarus Group. Before that, a Taia Global report suggested The Shadow Brokers’ manifesto was actually written by a native English speaker, despite the broken English.
Linguistic analysis was also used to suggest that Guccifer 2.0, who released documents stolen from the Democratic National Committee, was likely not Romanian as claimed. Back in 2014, Taia Global said linguistic clues pointed to Russian actors behind the Sony breach, not the North Koreans the United States government had blamed.
Attribution is hard enough—and relying on linguistic tools appears to be just adding to the confusion. Was WannaCry the work of the Chinese or the North Koreans? Is Guccifer 2.0 Romanian or Russian? Linguistic analysis will very rarely lead to the smoking gun. At the very least, it will uncover a whole set of clues for researchers to track down, and at best, it will support (or confirm) other pieces of evidence uncovered by technical research and forensics methods. Linguistic analysis is another tool in the arsenal when it comes to attribution.
“Linguistic evidence, to be reliable, must show a consistent pattern of different features pointing in a single direction,” says Shlomo Argamon, a professor at the Illinois Institute of Technology. He was behind Taia’s original analysis of the Sony hackers and that of The Shadow Brokers.
Understanding the analysis
There are two kinds of analysis: one looks at the source code itself, and the other examines the human-readable text used in an attack. In the first kind, the analysis focuses on coding style and patterns to find similarities to other known code samples. Many researchers have relied on this method to link different attacks to a single actor, but this isn’t linguistic analysis.
The second method relies on human language, such as error messages, dialog boxes and messages shown directly to victims. For it to be effective, there needs to be text, and plenty of it. Flashpoint’s WannaCry analysis focused on the ransom notes that victims were shown. Argamon analyzed The Shadow Brokers’ rambling manifesto. In the case of Guccifer 2.0, Argamon looked at the interview Motherboard’s Lorenzo Franceschi-Bicchierai conducted with Guccifer 2.0 over Twitter. In some cases there is text in the code itself—such as comments—but that is typically considered too little to be useful. “You have to have enough text,” says Argamon.
Ransomware is particularly well suited to linguistic analysis because the attack relies on delivering ransom notes victims can read and understand, says Jon Condra, director of Asia-Pacific research at Flashpoint. Most malware campaigns, even spear phishing, don’t lend themselves to this kind of scrutiny because the lure is carefully crafted to look legitimate and resemble something else.
The analysis starts off by collecting all possible text. Restricting certain data sets can lead the analysis down an unexpected path, so it is important to include everything available. For example, the team at Taia looked at 20 messages reported in the media and posted to Pastebin, allegedly from the Sony hackers. Even then, the team noted in the report that the amount of data—fewer than 2,000 words in total—was small.
Mistakes in grammar, spelling, punctuation, tense and even word choice can provide clues. In English, there are certain grammatical errors that native (US-English) speakers typically don’t make, such as leaving out the definite and indefinite articles “the” and “a,” or omitting words such as “to,” “should,” “must” or “will” from sentences. Another clue is mishandling the “-ing” ending, as in “they are go” instead of “they are going.” With these clues, the analyst can generate a suspect list of five possible languages, and then compare each “oddity” to see which language it aligns with most closely. For example, if “the” is being dropped, that is a good indicator the writer is a native speaker of Russian or another Slavic language. Argamon said the fact that Guccifer 2.0 kept dropping articles during the Twitter interview was evidence the speaker was more likely Russian than Romanian, since Romanian has definite and indefinite articles.
The more errors or language features that can be identified, the more thorough the analysis. The Sony report showed 25 different elements.
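To make the comparison step concrete, here is a minimal sketch of how an analyst’s checklist might be scored against candidate native languages. The feature names, language profiles and scoring rule are invented for illustration and are not drawn from the Taia or Flashpoint reports.

```python
# Toy sketch: score candidate native languages against error features
# observed in a piece of English text. The feature list and language
# profiles below are illustrative assumptions, not published findings.

OBSERVED_FEATURES = {
    "drops_definite_articles",   # "send money to address" vs. "to the address"
    "drops_modal_verbs",         # omitting "should", "must", "will"
    "bare_verb_after_be",        # "they are go" instead of "they are going"
}

# Hypothetical interference profiles: errors a native speaker of each
# language is stereotypically prone to when writing English.
LANGUAGE_PROFILES = {
    "Russian":  {"drops_definite_articles", "drops_indefinite_articles",
                 "bare_verb_after_be"},
    "Romanian": {"false_friend_vocabulary"},   # has articles, so unlikely to drop them
    "Mandarin": {"drops_definite_articles", "no_plural_marking",
                 "drops_modal_verbs"},
    "Korean":   {"drops_definite_articles", "subject_omission"},
}

def rank_candidates(observed, profiles):
    """Return candidate languages sorted by how many observed errors they explain."""
    scores = {
        lang: len(observed & expected) / len(observed)
        for lang, expected in profiles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for lang, score in rank_candidates(OBSERVED_FEATURES, LANGUAGE_PROFILES):
        print(f"{lang:10s} explains {score:.0%} of observed features")
```

Even in this toy version, Russian and Mandarin explain the same share of the observed errors, which is exactly why more features, and outside context, are needed before drawing any conclusion.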
Not cut-and-dried
However, analyzing language isn’t all that straightforward, since people can speak multiple languages with varying levels of proficiency. In the case of someone who is a native speaker of Mandarin Chinese but learned to hack from the Russians—picking up Russian along the way—and carried out an attack in English, there may be “more features from the L2 [the second language learned] than the L1 [the native language] leaking through in writing the L3 [the third language being used],” Argamon says.
Context matters. The clues may point to a Russian speaker, but if there’s reason to believe the attackers are Chinese, then it could very well be a Chinese attacker who had been trained by the Russians. Linguistic analysis helps triangulate the evidence from other research paths, such as source code analysis and network forensics.
“By itself, it [linguistic analysis] means very little. No one should be using linguistics for attribution on its own,” says Argamon.
Flashpoint analyzed 28 different ransom notes used in the WannaCry attacks, written in 27 different languages, and concluded a Chinese-speaking author was behind the original ransom message. Two of the notes, written in Chinese (one with simplified characters, the other with traditional characters), appeared to be written directly, while the others, in Bulgarian, French, German, Italian, Japanese, Korean, Russian, Spanish, Vietnamese and other languages, were translated from a note originally written in English.
The Chinese notes included several colloquialisms that were not in other notes, suggesting the writer was very comfortable with the language.
The English note appeared to be written by someone with a strong command of the language, with correct spelling and capitalization, but the incorrect use of “couldn’t” where “can’t” was needed suggested the author wasn’t a native English speaker. The Korean ransom note, by contrast, was riddled with basic errors and incorrect grammar.
Flashpoint acknowledged the limitations of its analysis, noting that the patterns could have been inserted intentionally as a false flag, in the way the Korean note is thought to be intentionally poorly written to obscure links to Lazarus. While the analysts had “high confidence” the authors were fluent in Chinese and familiar with English, and “moderate confidence” the English note was based on the Chinese notes, the team stopped short of making definitive conclusions. “This alone is not enough to determine the nationality of the author(s),” the researchers wrote in their report.
Notably, Flashpoint did not rule out a Lazarus link to WannaCry when it said the ransom notes were written by a Chinese speaker. There’s a lot that is unknown about the North Korean attack group, and it may well have native Chinese speakers as members. Many North Koreans speak Chinese, and many North Korean attackers are believed to have been trained in, and to operate out of, China, especially the country’s northeast. Researchers have also noted in the past that many North Korean attackers actually operate outside of North Korea.
Why linguistics has value
It's also just as likely the attackers employed subcontractors and outsourced the work for different elements. Cybercrime is a global economy, and translations could have been done by one party and the malware purchased from an entirely different party. “This is part of the reason why in our WannaCry linguistic analysis, we made the differentiation between the authoring of the malware and the ransom notes--they could have been done by different people/actor sets entirely under a variety of scenarios,” Condra says.
Language is idiosyncratic, and analyzing the words used gives researchers additional insight into the identity, mindset and motivations of attackers, says Leroy Terrelonge III, director of research at Flashpoint. For example, a Spanish-speaking threat actor may use the word “grifo” to mean “gas station.” Since the word is used almost exclusively in Peru, a researcher can assess with a high degree of confidence that the actor is Peruvian or has extensive ties to a Peruvian community. A threat actor’s use of a derogatory term for a minority group can likewise give insight into that person’s or group’s racial and gender identity, as well as their political ideology.
False flags are also possible. Just as The Shadow Brokers appear to have intentionally inserted grammatical errors to make it seem like they didn’t speak English well, attackers can deliberately insert specific phrases or errors to disguise their nationality and throw off law enforcement and security researchers. This is why having a lot of text to analyze is important—it is harder to consistently make the same kinds of errors, and very difficult to sustain the deception over time.
A recent study by New York University professor Damon McCoy showed how studying linguistic style can be used reliably and accurately to identify individuals in an underground community, even when they use different aliases and accounts. Many attackers may overlook the language used in their strings, comments and notes, and not even realize that these items can and will be analyzed by researchers.
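As a rough illustration of the kind of stylometric comparison such studies rely on, the sketch below builds character-trigram frequency profiles for two short writing samples and measures their cosine similarity. The samples, the trigram features and the interpretation are invented for the example; real studies use richer feature sets and far more text.

```python
# Toy stylometry sketch: compare two writing samples by character-trigram
# frequencies. High similarity is only a hint of shared authorship, not proof.

from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Normalized character-trigram frequency profile of a text."""
    text = " ".join(text.lower().split())          # collapse whitespace
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    shared = set(p) & set(q)
    dot = sum(p[g] * q[g] for g in shared)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Invented forum posts attributed to two different aliases.
alias_a = "i am selling fresh dumps, price is negotiable, contact me in private"
alias_b = "selling fresh dumps again, price negotiable, contact in private only"

similarity = cosine_similarity(trigram_profile(alias_a), trigram_profile(alias_b))
print(f"trigram similarity: {similarity:.2f}")   # higher values suggest a shared author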
“Linguistic analysis is not definitive proof of attribution, but can be used in conjunction in certain circumstances with technical evidence to link malicious actors to attacks,” says Condra.