Tamil to English and English to Tamil translation - Google

vasanthakokilam
Posts: 10956
Joined: 03 Feb 2010, 00:01

Post by vasanthakokilam »

http://translate.google.com/#ta|en| (Tamil to English)

(Example: http://translate.google.com/#ta|en|%E0% ... 4%E0%AF%81 )

http://translate.google.com/#en|ta| ( English to Tamil )

( Example: http://translate.google.com/#en|ta|we%2 ... or%20lunch )
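For anyone who wants to generate such links, here is a minimal Python sketch, assuming the `#source|target|text` fragment format seen in the links above (that is just the format from this post, and it may not keep working). The function name `translate_url` and the sample phrases are made up for illustration.

```python
from urllib.parse import quote


def translate_url(source_lang, target_lang, text=""):
    """Build a link in the '#source|target|text' style shown in this post."""
    # quote() percent-encodes the UTF-8 bytes of the text (a space becomes %20).
    return f"http://translate.google.com/#{source_lang}|{target_lang}|{quote(text)}"


print(translate_url("en", "ta", "good morning"))
print(translate_url("ta", "en", "வணக்கம்"))
```

The long runs of %E0%AE... in the Tamil example link are simply the UTF-8 bytes of the Tamil text, escaped for use in a URL.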

It has Tamil text-to-speech and a Tamil keyboard as well.

It is only alpha quality, so it is not perfect. We can expect it to improve over time.

Hindi translation has been around for quite some time (I think), and its Listen link sounds much better (not sure if it is text-to-speech or a recorded voice): http://translate.google.com/#en|hi|

I see Telugu in the list of languages as well.

srkris
Site Admin
Posts: 3497
Joined: 02 Feb 2010, 03:34

Re: Tamil to English and English to Tamil translation - Goog

Post by srkris »

It doesn't improve automatically, but with user input. We are allowed to click on the mistranslated word(s) and supply our corrections in real time, and it learns from such corrections over time.

I did my bit to help correct Google this way but eventually tired of it. Will continue once I get some enthu again.

thanjavooran
Posts: 2972
Joined: 03 Feb 2010, 04:44

Re: Tamil to English and English to Tamil translation - Goog

Post by thanjavooran »

Thanks, Shri Vasanthakokilam Avl, for the link. As Shri srkris correctly puts it, it is a little time-consuming and requires a lot of patience to get the required result.
Thanjavooran 26 06 2011

vasanthakokilam
Posts: 10956
Joined: 03 Feb 2010, 00:01

Re: Tamil to English and English to Tamil translation - Goog

Post by vasanthakokilam »

Sri. Thanjavooran, try English to Tamil translation; that is much less work to check out.

Srkris: Such crowd-sourcing is a favorite technique of Google to improve things over time. In addition, for the translation, they use Statistical Machine Translation where they feed it human translated data and it figures out patterns from it. So, as they find new material of human translated texts, they feed that in to improve it.

Here is Google explaining how statistical machine translation is done: http://www.youtube.com/watch?v=_GdSC1Z1Kzs

Bureaucracies like the U.N. help here since they require the same text to be posted in multiple languages.

Google says that they cannot find enough material for Tamil. The Indian government requires official documents to also appear in Hindi, and they hit pay dirt with that one. Hence the quality of Hindi translation is much better. They looked at Japanese, Russian, Turkish and German to help out with the sentence structure. There was an article in the WSJ "India Realtime" section about this, where they interviewed Sri. Ashish Venugopal, a research scientist at Google who works on the translation project. If I find the link, I will post it.

Here is Google's announcement of the availability of Bengali, Gujarati, Kannada, Tamil and Telugu.
http://googletranslate.blogspot.com/201 ... indic.html

cmlover
Posts: 11498
Joined: 02 Feb 2010, 22:36

Re: Tamil to English and English to Tamil translation - Goog

Post by cmlover »

Machine translation is an exciting frontier: one can directly start reading, in one's own language, what is written in another. Attempts at translating English to Tamil/Sanskrit especially fascinate me. The translation by Google's language tools from English to European languages is excellent. The attempt at Indian languages is still primitive. But I guess Sanskrit, being a structured language, should be worth attempting first, and that should be the route to other Indian languages. I find Sanskrit much easier to understand (grammatically and in word order) when translated into Tamil than into English. It would be nice if the Govt of India allocated more resources to Sanskrit/Tamil (machine translation, to start with), now that the anti-Sanskrit regime has been ousted from power in TN!

vasanthakokilam
Posts: 10956
Joined: 03 Feb 2010, 00:01

Re: Tamil to English and English to Tamil translation - Goog

Post by vasanthakokilam »

CML: Statistical machine translation works without knowing anything deep about the syntax and semantics of the language. Watch that YouTube video from Google on how they do it. It is quite a brute-force method. The main requirement is that they get lots of samples of human-translated work, and the quality is proportional to that. Sanskrit is not in there yet because there is not much English-to-Sanskrit and Sanskrit-to-English material available online.

It is amazing how well this brute force strategy actually works.

vasanthakokilam
Posts: 10956
Joined: 03 Feb 2010, 00:01

Re: Tamil to English and English to Tamil translation - Goog

Post by vasanthakokilam »

This is the WSJ Article from a few days back.
---
June 24, 2011, 3:33 PM IST
By Tripti Lahiri

Google Inc. announced earlier this week that it had extended its translation service to five more languages spoken around the subcontinent.

The service, already available for Hindi and Urdu, can now be used on an experimental basis for Bengali, Gujarati, Kannada, Tamil and Telugu. Bengali is spoken in both India and Bangladesh, while Tamil is also spoken in Sri Lanka. But Google Translate had to look to content in Japanese and Russian to translate some of the structural complexity of these languages.

“The real challenge was the relative lack of data,” said Ashish Venugopal, the Google research scientist who oversaw the translation service’s expansion into the five new languages, in a phone interview on Friday. “There’s not a lot of content being produced in these languages on the Web relative to other languages of Asia.”

Google Translate relies on something called “statistical machine translation.” Watch a video explaining the method: http://www.youtube.com/watch?v=_GdSC1Z1Kzs

Mr. Venugopal offered the example of how a person might understand the meaning of a Hindi word like “malai” by seeing several dishes on a menu with the word in its name consistently translated as “in cream.” Machine translation pretty much works in the same way, using software to trawl the Web for translated content and analyze its patterns.

But if you don’t have enough menus—or enough instances of dishes on the menu—the translation is not going to be very good.
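To make the menu idea concrete, here is a toy sketch in Python. The four menu lines are invented for illustration, and the Dice co-occurrence score is a simple stand-in chosen here, not a description of what Google Translate actually computes.

```python
from collections import Counter

# Hypothetical parallel "menu": (Hindi dish name, English rendering).
menu = [
    ("malai kofta", "kofta in cream"),
    ("malai paneer", "paneer in cream"),
    ("malai murgh", "chicken in cream"),
    ("tandoori murgh", "tandoori chicken"),
]


def top_matches(word, pairs):
    """Return the English words that co-occur most consistently with `word`."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update((s, t) for s in src_words for t in tgt_words)
    # Dice score: 2 * (times seen together) / (times each was seen at all).
    scores = {t: 2 * n / (src_count[s] + tgt_count[t])
              for (s, t), n in joint.items() if s == word}
    if not scores:
        return []
    best = max(scores.values())
    return sorted(t for t, score in scores.items() if score == best)


print(top_matches("malai", menu))      # ['cream', 'in']
print(top_matches("murgh", menu))      # ['chicken']
print(top_matches("malai", menu[:1]))  # ['cream', 'in', 'kofta']
```

With all four lines, "malai" lines up with "in cream". With only the first line, "kofta" scores just as well, which is the too-few-menus problem in miniature.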

Huge bureaucracies, like the United Nations, which require the same content to be posted in several languages, can make the job easier for machine translation.

Google Translate had an easier time with Hindi content in part because the Indian government requires official documents to also appear in Hindi and although the amount of English content that appears on Indian official Web sites is greater than the Hindi content, there are a fair amount of reports and documents that appear in both languages. News organizations can help, too.

“It’s great if a news service is publishing in multiple languages,” said Mr. Venugopal.

For the five new languages, the machines looked outside the subcontinent.

“We really looked at Japanese for the [sentence] order because it follows such a similar subject-object-verb order,” said Mr. Venugopal. “There’s a lot of parallel data from Japanese to English so we have strong models there.”

The main thing the machines needed to learn from Japanese was not to get confused by the different placement of verbs in a sentence so that it would be able to translate a sentence from Tamil as “I kicked the ball” and not “I ball kicked.”
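A tiny Python sketch of the reordering step, assuming each chunk has already been labelled as subject (S), object (O) or verb (V). The labels and the English gloss in Tamil word order are hand-made for illustration; per the article, the real system learns this ordering from parallel data rather than from hand-written rules.

```python
def sov_to_svo(tagged):
    """Reorder (chunk, role) pairs from subject-object-verb to subject-verb-object."""
    order = {"S": 0, "V": 1, "O": 2}
    # sorted() is stable, so chunks that share a role keep their relative order.
    return [chunk for chunk, role in sorted(tagged, key=lambda pair: order[pair[1]])]


# English gloss in subject-object-verb order (hypothetical input).
gloss = [("I", "S"), ("the ball", "O"), ("kicked", "V")]
print(" ".join(sov_to_svo(gloss)))  # I kicked the ball
```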

Another problem was “agglutination,”—where some languages tack prepositions or other words on to a verb rather than keeping them distinct as in English.

“In English, you use all these auxiliary words: ‘I have not kicked the ball,’” said Mr. Venugopal, who also speaks Tamil. “But when you go to Tamil the ‘not’ goes inside the verb and it’s a single word. So the other challenge was how do we redefine the notion of what a word is.”

To figure out how to break down verbs that have been joined up with other words, the machines looked to Russian, Turkish and German, he said.

However other languages can only help so much—and their usefulness is mostly directed towards helping Google Translate figure out sentence structure. When it comes to improving its vocabulary, though, machine translation requires what humans require: More reading material. Until more content translated by human users is available on the Web in these languages for the translation system to consume, Google Translate warns that the quality of the service in the five new languages will vary.

“We always say the best thing you guys can do to help your language enter Google Translate is to use it—and to use it on the Web,” said Mr. Venugopal.

ShrutiLaya
Posts: 225
Joined: 14 Sep 2008, 01:15

Re: Tamil to English and English to Tamil translation - Goog

Post by ShrutiLaya »

On the other hand, here's a fascinating article I saw a few days ago. It seems that as more and more people use Google's tools to create automatic translations, the quality drops because most of what Google reads is its own translation :) Google is now "deprecating" its API, to discourage other websites from polluting the web.

=============
http://www.theatlantic.com/technology/a ... ar/240283/

An 'Economic Burden' Google Can No Longer Bear?
By James Fallows

Jun 12 2011, 4:16 PM ET
This is insider-tech talk, but I think it is very interesting in its implications -- about language, "big data," Google's strategies, and the never-ending recalibration of goods vs bads, "signal to noise," on the internet.

[Brief summary of what follows: Google is dropping an automatic-translation tool, because overuse by spam-bloggers is flooding the internet with sloppily translated text, which in turn is making computerized translation even sloppier.]

There has been a rumble in the tech world about Google's announcement last month that it was "deprecating," and phasing out, its "Translate API." In simplest terms that means that website developers will no longer be able to use code that makes Google's translation algorithms automatically provide material for other sites. The standalone Google Translate site, which allows you to enter text or URLs for translation, will remain (along with some other features that apply Google translations to others' sites). But as an announcement on the Translate API site said:

[Image: GoogleTrans2.png]

For a very, very detailed explication of what this "economic burden" might mean for Google, check this analysis from the eMpTy Pages site on translation technology and related topics. Here is the part of the explanation that, for me, had the marvelous quality of being obvious -- once it's pointed out -- and interesting too:

The intriguing problem is the way that over-use of automatic translation can make it harder for automatic translation ever to improve, and may even be making it worse. As people in the business understand, computerized translation relies heavily on sheer statistical correlation. You take a huge chunk of text in one language; you compare it with a counterpart text in a different language; and you see which words and phrases match up. The computer doesn't have to "understand" either language for this to work. It just notices that the English words "good" or "goods" show up as bon in French in certain uses (ie, as in "opposite of bad"), but as a variety of other French words depending on the context in English -- "dry goods," "I've got the goods," "good grief," etc.

Crucially, this process depends on "big data" for its improvement. The more Rosetta stone-like side-by-side passages the system can compare, the more refined and reliable the correlations will become. Day by day and comparison by comparison, the translation will only get better. So that some day, in principle, we could understand anything written in any language, without knowing that language ourselves.

UNLESS ... the side-by-side texts used to "train" the system aren't any more accurate and nuanced than what the computer already knows. That is the problem with a rapidly increasing volume of machine-translated material. These computerized translations are better than nothing, but at best they are pretty rough. Try it for yourself: Go to the People's Daily Chinese-language home site; plug any story's URL (for instance, this one) into the Google Translate site; and see how closely the result resembles real English. You will get the point of the story, barely. Moreover, since these side-by-side versions reflect the computerized-system's current level of skill, by definition they offer no opportunity for improvement.

That's the problem. The more of this auto-translated material floods onto the world's websites, the smaller the proportion of good translations the computers can learn from. In engineering terms, the signal-to-noise ratio is getting worse. It's getting worse faster in part because of the popularity of Google's Translate API, which allows spam-bloggers and SEO operations to slap up the auto-translated material in large quantities. This is the computer-world equivalent of sloppy overuse of antibiotics creating new strains of drug-resistant bacteria. (Or GIGO -- Garbage In, Garbage Out -- as reader Rick Jones mentioned.) As the eMpTy Pages analysis describes the problem, using another analogy (emphasis added):

>>Polluting Its Own Drinking Water
...An increasing amount of the website data that Google has been gathering has been translated from one language to another using Google's own Translate API. Often, this data has been published online with no human editing or quality checking, and is then represented as high-quality local language content....

It is not easy to determine if local language content has been translated by machine or by humans or perhaps whether it is in its original authored language. By crawling and processing local language web content that has been published without any human proof reading after being translated using the Google Translate API, Google is in reality "polluting its own drinking water."...

The increasing amount of "polluted drinking water" is becoming more statistically relevant. Over time, instead of improving each time more machine learning data is added, the opposite can occur. Errors in the original translation of web content can result in good statistical patterns becoming less relevant, and bad patterns becoming more statistically relevant. Poor translations are feeding back into the learning system, creating software that repeats previous mistakes and can even exaggerate them.<<
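The signal-to-noise point is simple arithmetic, sketched below in Python. The page counts are made-up numbers, not measurements; the only claim is that a fixed pool of human translations becomes a smaller share as machine output piles up.

```python
def human_share(human_pages, machine_pages):
    """Fraction of the available parallel text that was translated by people."""
    return human_pages / (human_pages + machine_pages)


human = 1000  # hypothetical count of human-translated pages
for machine in (0, 1000, 9000):
    print(machine, round(human_share(human, machine), 2))
```

With these numbers the human share falls from 1.0 to 0.5 to 0.1, even though not a single human translation has been lost.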

That's all I have about this story, which I offer because it reveals a problem I hadn't thought of -- and illustrates one more under-anticipated turn in the evolution of the info age. The very tools that were supposed to melt away language barriers may, because of the realities of human nature (ie, blog spam) and the intricacies of language, actually be re-erecting some of those barriers. For the foreseeable future, it's still worth learning other languages.

vasanthakokilam
Posts: 10956
Joined: 03 Feb 2010, 00:01

Re: Tamil to English and English to Tamil translation - Goog

Post by vasanthakokilam »

ShrutiLaya: Thanks for the link. A lot of value added. Yes, that 'polluting its own drinking water' is an interesting phenomenon. The challenge for Google is to watermark the machine-translated text so they do not pull it back in. I am sure they are thinking about it. I wonder if the UTF encoding has any 'don't care' spare bit available that they can use for this purpose. Even without the API, people will still copy and paste it in web sites which can cause pollution if they are not careful.

For getting more human input, Google has CAPTCHA, which they are currently using to have the crowd correct scanned text. Quite ingenious. I hope they have something in their tool chest for human translation also. We will see.

ShrutiLaya
Posts: 225
Joined: 14 Sep 2008, 01:15

Re: Tamil to English and English to Tamil translation - Goog

Post by ShrutiLaya »

vasanthakokilam wrote: Even without the API, people will still copy and paste it in web sites which can cause pollution if they are not careful.
As the (updated) old adage goes, to err is human, but to really *$%^& things up, you need a computer :) I don't think individual actions will change the statistical picture; you need something like the API to generate enough content to matter.

I'm sure the engineers at Google are scratching their heads to come up with a solution to this problem: some form of watermarking that will work even when cut and pasted in text form. Don't know if they can add something to the UTF-8 encoding without breaking half the rendering engines around.
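One way to picture a text-level watermark, purely as a sketch: UTF-8 has no spare bits, since every bit is either a fixed marker or part of the code point, but Unicode does have invisible characters. The Python below tags text with a fixed invisible sequence. The marker `MARK` and the helper names are invented here, nothing suggests Google does this, and such characters can easily be stripped by editors or normalizers.

```python
# Hypothetical marker: WORD JOINER, ZERO WIDTH SPACE, WORD JOINER.
MARK = "\u2060\u200b\u2060"


def add_mark(text):
    """Append the invisible marker to machine-produced text."""
    return text + MARK


def is_marked(text):
    """True if the invisible marker survives somewhere in the text."""
    return MARK in text


def strip_mark(text):
    """Remove the marker, giving back the visible text only."""
    return text.replace(MARK, "")


tagged = add_mark("வணக்கம்")
print(is_marked(tagged), is_marked("வணக்கம்"))
print(strip_mark(tagged) == "வணக்கம்")
```

A crawler could then skip any page where `is_marked` is true, as long as the marker survives the copy and paste.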

But I find Google and its problems of big data fascinating; there is an almost mythical aspect to them (like the good vs. evil metaphor they like to stress). Although completely unrelated to this thread, see today's NY Times http://opinionator.blogs.nytimes.com/20 ... &emc=thab1
for another interesting good vs. evil Google story.

- Sreenadh

vasanthakokilam
Posts: 10956
Joined: 03 Feb 2010, 00:01

Re: Tamil to English and English to Tamil translation - Goog

Post by vasanthakokilam »

What Google can do is, support the API but provide an image bit map as the output, so there is still some functionality provided for automatic use. Also, one curious thing I found is that in the Google announcement they say we need to download Tamil fonts, and they provide a link to them. That may be a way out, in a sub-optimal way. For automatic translations, they can use their own font, which will be the watermark. That is actually going backwards, which is why it is strange that Google says you need language fonts.

ShrutiLaya
Posts: 225
Joined: 14 Sep 2008, 01:15

Re: Tamil to English and English to Tamil translation - Goog

Post by ShrutiLaya »

vasanthakokilam wrote: What Google can do is, support the API but provide an image bit map as the output, so there is still some functionality provided for automatic use.
That's a great idea! Hope Google's watching this board :)


- Sreenadh

cmlover
Posts: 11498
Joined: 02 Feb 2010, 22:36

Re: Tamil to English and English to Tamil translation - Goog

Post by cmlover »

As a starter for us CM buffs: if Govindan's translations of T into Tamil (which are word for word and grammatically accurate) are used as input, we can have an 'engine' that will permit Tamils to appreciate and sing T meaningfully and incidentally learn the rudiments of Telugu. It is a project worth undertaking by one of our competent programmers...

