Language detection, a usability enhancer?

May 14, 2006 • Emil Stenström • Other

Many people consider natural language to have very little structure. What many don't know is that parts of it are easy to analyse. This article explains a simple but very effective algorithm for detecting what language a text is written in, and then discusses possible applications.

More and more of the world's population is coming online, and far from everyone wants to write in English. In fact, most of the content online is not in English. This leads to a need for more tools that handle different languages: tools that translate, find synonyms and so on. This article deals with the Latin alphabet, simply because it's the only one I know. I'm sure you can apply the same ideas to other alphabets as well.

How to detect language

Languages differ in many ways: words, grammar, sentence length, number of syllables… we could go on forever. People skilled in both languages and algorithms quickly found a very reliable way to tell languages apart: bigram statistics. A bigram is just a pair of adjacent letters, like "xt" or "ty".

It turns out that the bigram statistics for a language stay fairly constant from text to text (there are always few "tp" in English), and sometimes you don't need more than a sentence to determine the language. This is how the algorithm works:

1. Remove or translate characters you don't want to count. Decide how to handle characters that are special to certain languages, such as É and Ü. Be sure not to remove the space character; it gives good statistics on which letters words start and end with.
2. Split the text into bigrams. Take every bigram in each word, so that each letter is included twice. Example: "hello" contains "_h", "he", "el", "ll", "lo", "o_" ("_" means space). The more text, the more reliable the result will be.
3. Count the occurrences of each bigram.
4. Convert the counts into statistics over how often each bigram occurred, for example each bigram's share of the total, so that texts of different lengths can be compared.
5. Compare with precalculated lists for different languages. You need to compile these lists yourself, but it's easy: use Google Advanced Search and set the language you want from there.
6. Pick the language that has the most similar list.
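The first four steps could be sketched in Python roughly like this. It is only a minimal sketch under a few assumptions of mine: the function name bigram_frequencies is made up, everything that is not a letter is treated as a space, and accented letters such as é and ü are kept as they are rather than translated.

```python
from collections import Counter


def bigram_frequencies(text):
    """Return a dict mapping each bigram to its share of all bigrams in text."""
    # Step 1 (assumption): lowercase the text and turn everything that is
    # not a letter into a space. str.isalpha() keeps letters like é and ü.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())

    # Steps 2 and 3: pad each word with "_" (standing in for the space)
    # and count every pair of adjacent characters.
    counts = Counter()
    for word in cleaned.split():
        padded = "_" + word + "_"
        for i in range(len(padded) - 1):
            counts[padded[i:i + 2]] += 1

    # Step 4: turn the counts into relative frequencies.
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


print(sorted(bigram_frequencies("hello")))
# prints ['_h', 'el', 'he', 'll', 'lo', 'o_']
```

Because each count is divided by the total, a text repeated twice ends up with the same frequencies as the original text, which is what makes texts of different lengths comparable.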

The only hard part to program is comparing two lists. For this we will use the Euclidean distance between the two lists of numbers. In essence, this is how Euclidean distance works:

1. Pick the values for the first bigram from the two lists you want to compare.
2. Calculate the difference between the two values and square it.
3. Repeat 1 and 2 until you have one result for each bigram.
4. Add all those results together and calculate the square root of the sum. That is your distance.

Euclidean distance gives us a single number. The lower the number, the more alike the two lists are. So we compare the calculated list with the precompiled list for each language and pick the language with the lowest number.
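The comparison and the final pick could look something like the Python sketch below. The function names are my own, and I assume that a bigram missing from one of the lists simply counts as frequency 0 there. The two language profiles are made-up toy numbers, not real statistics.

```python
import math


def euclidean_distance(freqs_a, freqs_b):
    """Euclidean distance between two bigram -> frequency dicts."""
    # Assumption: a bigram missing from one list has frequency 0 there.
    all_bigrams = set(freqs_a) | set(freqs_b)
    return math.sqrt(sum(
        (freqs_a.get(b, 0.0) - freqs_b.get(b, 0.0)) ** 2
        for b in all_bigrams
    ))


def closest_language(text_freqs, language_profiles):
    """Return the name of the profile with the lowest distance to the text."""
    return min(
        language_profiles,
        key=lambda lang: euclidean_distance(text_freqs, language_profiles[lang]),
    )


# Toy profiles with invented numbers, only to show the mechanics.
profiles = {
    "english": {"th": 0.5, "he": 0.3, "_t": 0.2},
    "swedish": {"en": 0.4, "er": 0.3, "_d": 0.3},
}
sample = {"th": 0.4, "he": 0.4, "_t": 0.2}
print(closest_language(sample, profiles))  # prints english
```

Real profiles would of course contain hundreds of bigrams each, built from a large amount of ordinary text in each language.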

Real world applications

As you've seen above, detecting language isn't rocket science. Anyone with solid programming skills could implement the algorithm in any programming language. Anyone could also gather the statistics needed to build bigram lists for different languages. So you could expect this to be used all over, right? Wrong. A few examples:

Since my native language is Swedish, I have downloaded an extra dictionary for Thunderbird so that it can check both my Swedish and my English e-mails for errors. It works well except for one thing. Since I write about half of my e-mails in each language, the dictionary is almost always set to the wrong one. Since the language is not shown anywhere, I usually type a few words, think "what, isn't that correct?", and then remember that the wrong dictionary is selected. You can see where the language detection above would fit in, right? They don't even need to see what I'm going to type; the language I reply in will be the same as in the e-mail I'm replying to. That's usability.

Another example. Too few webmasters know about the lang attribute. It can be set on most HTML elements and specifies what language the content is in. The HTML 4 specification lists the following reasons why you should use lang:

Assisting search engines
Assisting speech synthesizers
Helping a user agent select glyph variants for high quality typography
Helping a user agent choose a set of quotation marks
Helping a user agent make decisions about hyphenation, ligatures, and spacing
Assisting spell checkers and grammar checkers

It all boils down to: it's a good thing to say what language a page is written in. The problem is that not many know the lang attribute even exists. The solution? Let the screen reader, browser, or search engine (search engines are already doing it) auto-detect the language and act on it.

A third example is chat rooms. There are rooms all over the place, with topics ranging from poker strategies to horse equipment. What they all have in common is lots and lots of users, and rooms where you can talk in any language you want. Wouldn't it be nice if the chat software automatically detected which language was spoken in a room? Then you could see it even before joining. Or even see which languages a certain user speaks.

I'll call that the end of this article. It moved a bit outside of the webdev area, but I hope you still liked it. Have any more ideas for where this could be useful? Want to show me your JavaScript/Python/Ruby/etc. implementation of it? Leave a comment.

Comments

May 14, 2006
By: Sarven Capadisli (#1)

Concerning:
# Assisting search engines

I am not sure how effective the lang attribute is.

A while back I had a problem with this in fact.
I placed lang="en" on my pages; however, Google still chose the wrong language for one of my articles.

The only reasoning I could come up with was that the article (at the time) was more popular in the German (de) community.

So I had to write this:
Google Deutschland stole my article

Although this is no longer an issue, I am still skeptical about whether search engines (or anything else, in fact) really acknowledge the language of a given document.

May 15, 2006
By: Steve Tucker (#2)

I always try to put some reference to the language in my markup documents; it makes logical sense. At worst it cannot do any harm!

May 15, 2006
By: Jesse Skinner (#3)

Thanks for the algorithm.. I would never have known where to begin to implement something like this.

I think maybe dictionary files would be the perfect place to build up a bigram statistic list. Those are usually (somewhat) easy to find.

May 16, 2006
By: Emil Stenström (#4)

@Jesse Skinner: I wouldn't use a dictionary. Since the texts you will be testing are ordinary texts, you should base your statistics on that kind of text. Say English contains a lot of "in". Then that bigram should have a high percentage, whether the reason is that it's in many different words or that the words it's in are common.

Jun 16, 2006
By: Adam Zakreski (#5)

I find your statement, "In fact, most of the content online is not in English," a little hard to swallow. Even with the citation, the original source seems to refer only to a certain type of blog post. I believe saying, "most blog content online is not in English," would be more accurate (though still a stretch). I do find it a lot easier to believe that the Japanese are more avid bloggers than English speakers, though.

Jun 19, 2006
By: Emil Stenström (#6)

@Adam Zakreski: The link indeed only covers blog content that Technorati indexes. Other sources have estimated the share of English content at 68%, which means you are right there. Well done :)

Global Reach measures the number of people online by language and finds 35% for English, something that tells much about the future of the web.

Aug 27, 2006
By: Dan Pettersson (#7)

Does anyone have an example of code (e.g. in PHP) that would compare two texts and give you a number...

In my own tests squaring and square rooting messes things up. E.g. if I duplicate a text it won't show the same result.

I've solved it by making my own version without the squares, and I think that works fine...

Aug 27, 2006
By: Emil Stenström (#8)

@Dan Pettersson: So your algorithm gives different numbers for the exact same texts? That's some bug in your code. If you post a link to it (syntax highlighted), I'll have a look.

Mar 11, 2007
By: Peter Vigren (#9)

I am also interested in example code for PHP. I find this quite intriguing, maybe because I find programming AND languages so fascinating. :-) But one thing: if you remove or translate characters like é, won't that make the result a bit weird? After all, if one language uses é a lot and others don't, shouldn't that be significant? In any case, I really enjoyed this article. It's the first time I have ever heard of this, so it tickles my brain like crazy! Thank you. :-)

Mar 15, 2007
By: Emil Stenström (#10)

@Peter Vigren: I'll see what I can do, I'm moving atm so not too much extra time available.
