Language is one of the few remaining barriers on the Internet. The web has rendered time and distance largely irrelevant, but much of it remains fragmented by language. The Worldwide Lexicon, an open source project I have worked on for some time, aims to lower those barriers by enlisting both people and computers to translate and curate web content. We're releasing a new web services API, hosted on Google's cloud computing platform, that makes it easy to embed social translation features in virtually any website or web app. This tutorial explains how to build a variety of translation tools using straightforward web services development techniques. If you're interested in a review of Google's App Engine platform, see my companion post, Why I Started Coding Again (Thanks Guido!).
The Worldwide Lexicon is what's known as a translation memory. While it communicates with machine translation services, it is primarily designed to collect and curate human translations, along with metadata about those translations such as feedback scores, comments, and editor approvals or rejections. Typically we log sentences, their proposed translations into other languages, and the associated metadata. It is an open system that accepts anonymous as well as registered submissions, like Wikipedia, while the sites using the translation memory decide when and how to use the translations they receive (for example, some may filter out anonymous translations while others may encourage them). Because WWL just stores sentences and text strings, and is not concerned with the design or layout of the sites using it, it can be integrated with almost any website or web app. It is also designed around the principle of smart endpoints talking to a relatively dumb network: the apps using WWL decide how to display and filter translations using the metadata we return along with them.
A Polyglot WSAPI
We designed the system so that it could communicate with a broad range of systems, including specialized systems that deal specifically with translation and localization. When you call a WWL service, you can request the response in the following formats:
- JSON : JavaScript Object Notation, the easiest format to consume from AJAX widgets and most scripting languages
- XML : loosely typed XML, a good choice for parsing recordsets in a language that recognizes XML
- RSS : we can make recordsets look like RSS feeds, for example if you want to show the contents of the translation queue on your blog
- PO : the file format used by the GNU gettext localization library; makes it easy to import translations from WWL into other localization/translation packages
- XLIFF : an XML format used for localization and translation projects
- TEXT/CSV : returns results as a table of comma separated values
- Plain Text : for simple calls that do not return structured data.
- HTML : plain old HTML, great for manually testing services from a browser
Sending data into the Worldwide Lexicon API is simple: you just make an HTTP or HTTPS call to a request handler, and you'll receive a response in the format selected with the output=xxx parameter.
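To make this concrete, here is a minimal sketch in Python of how such request URLs can be assembled. The /q handler and its tl, url, and output parameters are taken from the examples later in this post; the helper name is my own.

```python
from urllib.parse import urlencode

# Hosted WWL instance used throughout this post.
BASE = "http://worldwidelexicon.appspot.com"

def wwl_url(handler, **params):
    """Build a WWL request URL; the output parameter selects the
    response format (json, xml, rss, po, xliff, csv, text, html)."""
    return BASE + handler + "?" + urlencode(params)

# The same query, rendered two different ways by the server:
xml_url = wwl_url("/q", tl="es", url="example.com/post.html", output="xml")
po_url = wwl_url("/q", tl="es", url="example.com/post.html", output="po")
print(xml_url)
```

The same builder works for any of the handlers listed in the API index at the end of this post; only the handler path and parameters change.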
A Typical Interaction
A typical interaction with WWL unfolds as follows. Here we describe an AJAX widget that redraws a site in a visitor's preferred language, and also allows visitors to submit or score translations. It does something like the following:
- Auto-detect the visitor's preferred language(s) based on browser settings or a pick list
- Call http://worldwidelexicon.appspot.com/q?tl=es&output=json&url=...... (e.g. give me all translations in Spanish that have been catalogued for the URL xyz/abc.html)
- Receive a recordset in the format of your choice (e.g. JSON), read the results into an array of x=y substitutions
- Search/replace the original text and redraw the page with the translated texts (optionally call /mt to request machine translations for the remaining untranslated texts)
- To edit or score a translation, the user mouses over a sentence to trigger a dialog which writes back to other request handlers: /submit to submit a translation, /comments/submit to post a comment about a translation, or /scores/submit to post a score
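The fetch-and-redraw core of those steps might look like the sketch below, in Python. The /q handler and its tl, url, and output parameters come from the example above, but the 'st' (source text) and 'tt' (translated text) field names in the JSON records are assumptions on my part; check the actual response shape before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def fetch_translations(page_url, lang):
    """Steps 2-3: ask /q for cached translations of a page and read
    them into a substitution map.  The record fields 'st' and 'tt'
    are assumed names, not taken from the official docs."""
    query = urlencode({"tl": lang, "output": "json", "url": page_url})
    records = json.load(
        urlopen("http://worldwidelexicon.appspot.com/q?" + query))
    return {r["st"]: r["tt"] for r in records}

def redraw(page_text, substitutions):
    """Step 4: search/replace source sentences with translations;
    anything without a match is left in the source language."""
    for source, translated in substitutions.items():
        page_text = page_text.replace(source, translated)
    return page_text

# Offline usage example (the network call is skipped here):
page = redraw("Hello world. Goodbye.", {"Hello world.": "Hola mundo."})
```

Untranslated remainders could then be sent to /mt for machine translation, as step 4 suggests.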
That's just one type of query. You can also request translations by username, for an entire domain, or for a sentence or list of sentences (with an option to proxy through to machine translation engines if desired). The web services speak back in your choice of output format, so you can take your pick based on the development tools you're using to build your app.
With the WWL web services interface, currently in testing at worldwidelexicon.appspot.com (docs at /help; a quick synopsis of the API appears at the end of this post), you can easily perform the following tasks:
- Request human translations for a URL, domain, username or sentence into one or more languages (metadata such as scores are also included in results)
- Request combined machine and human translations for a sentence or set of sentences
- Submit human translations to be tagged by url, domain, username and other parameters
- Submit new source texts to be queued for translation by web and mobile users
- Fetch new source texts that are awaiting translation
- Request and submit translations for web applications (simple localization service)
- Submit a score for a translation by a user, get a list of raw scores by url, domain or user.
- Request and submit comments about a translation, or URL or a user
- Notify the system that you're available via IM on one or more popular IM networks, search for other IM users (can be used to broker IM translation sessions etc)
- Add and manage user logins (you can create your own WWL namespace for user accounts)
- Query several popular machine translation services with a single API call
- Bulk submit a large number of translations, scores or comments in a single query (useful for building admin or 'power user' tools)
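As a sketch of the submission side of this list, the helper below assembles a POST to /submit. The endpoint comes from the API index at the end of this post, but the field names (url, sl, tl, st, tt, username) are guesses on my part; consult the /help docs for the real ones.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def submission(page_url, source, translation, sl, tl, username=None):
    """Build the form fields for a /submit call.  Field names are
    illustrative assumptions, not taken from the official docs."""
    fields = {"url": page_url, "sl": sl, "tl": tl,
              "st": source, "tt": translation}
    if username:
        fields["username"] = username  # omit for anonymous submissions
    return fields

def submit(fields):
    """POST the translation to the hosted WWL instance."""
    req = Request("http://worldwidelexicon.appspot.com/submit",
                  data=urlencode(fields).encode("utf-8"))
    return urlopen(req).read()

fields = submission("example.com/post.html", "Hello", "Hola", "en", "es")
```

Comments and scores would follow the same pattern against /comments/submit and /scores/submit.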
So what sorts of things can you build around a platform like this?
Interactive translation viewers and editors
- Sense user's preferred language via browser prefs or pick list
- Call /q web service to request a list of cached human and machine translations for the desired URL in a specific language
- Replace source texts where possible
- Call /mt service to request machine translations for remaining untranslated sentences (if MT is enabled)
- Allow the user to edit, score or comment on translations by mousing over a sentence to trigger a popup window; results are posted back to /submit, /scores/submit or /comments/submit respectively
We also host a special case of our translation memory that is customized for web app localization: SLS (simple localization service). With it you can build and manage a set of tokens that map to texts, e.g. home.greeting = Welcome to Foo.Com, which in turn can be translated into any number of languages via a simple web tool, or interactively via widgets that talk back to the web API.
This service enables you to make queries that return a Spanish text for the token home.greeting, a machine translation for the same token, or, if you need to translate an unregistered text, a human or machine translation for the text itself.
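A sketch of such a query in Python, using the /sls/get handler from the API index at the end of this post; the parameter names (domain, token, tl) and the machine-translation fallback flag are assumptions based on the index entries, so check the SLS docs for the exact syntax.

```python
from urllib.parse import urlencode

def sls_get_url(domain, token, lang, fallback_to_mt=False):
    """Build an /sls/get query for one localization token.
    Parameter names are assumed; see the /sls documentation."""
    params = {"domain": domain, "token": token, "tl": lang}
    if fallback_to_mt:
        params["mt"] = "y"  # assumed flag for machine translation fallback
    return ("http://worldwidelexicon.appspot.com/sls/get?"
            + urlencode(params))

url = sls_get_url("foo.com", "home.greeting", "es")
```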
- Sense user's preferred language via browser prefs or pick list
- Look for specially labeled texts or divs within the page (e.g. Why Hello There!), build an index of tokens to be translated via the localization server
- Call /sls web service to request translations for domain=yoursite.com token=(label name or div name), filter translations per your set of rules
- Replace texts where possible
- Translate site content using the procedure for the interactive translation viewer (you can probably tolerate errors in content translation, but you don't want a malicious user translating 'Delete' as 'Save')
- Allow users to submit translations, but they will not appear unless an editor marks them editor_approved=y
NOTE: you may also want to cache the translations on your server, so you are not dependent on an external system to serve these items.
Content Management System Integration
The translation memory can also be used to render static translations of webpages, such as articles in a content management system. For example, you might build a Drupal module that creates child documents for each parent document (one per desired language). It reads the parent document, parses it into sentences, and requests translations from WWL, building a translated document as it goes. A similar approach is to export translations from the translation memory as PO files, a format supported by the widely used gettext library. A program that generates translated documents derived from parent documents would work as follows:
- Look for new or recently changed parent documents or posts
- For each document, call /q web service to see if there are translations available for that permalink and target language
- If yes, create a child document, and for each sentence in the parent, try to replace it with the translation. If none is available use the source text.
- Include hyperlinks to point to utilities to edit, score or comment on translations (home grown or hosted at WWL).
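The rendering loop in steps 2-3 can be sketched like this; the call to /q is stubbed out as a plain dictionary so the fallback logic is the focus, and the sentence splitter is deliberately naive.

```python
import re

def split_sentences(text):
    """Naive sentence splitter; a real CMS module would use a proper
    tokenizer for the source language."""
    return [s.strip()
            for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def render_child(parent_text, translations):
    """Build the translated child document: use a cached translation
    where one exists, otherwise fall back to the source sentence."""
    sentences = split_sentences(parent_text)
    return " ".join(translations.get(s, s) for s in sentences)

child = render_child("Hello world. Untranslated line.",
                     {"Hello world.": "Hola mundo."})
```

The resulting child document can be stored in the CMS alongside the parent, with edit/score/comment links appended per step 4.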
This approach has the advantage of being fully self contained: once a translation is rendered, the CMS does not depend on WWL to serve it to viewers. The disadvantage is that these documents will probably not be regenerated very often, so user submissions will not appear quickly, which could discourage participation, and readers are more likely to perceive the system as slow to incorporate changes. On the other hand, it performs well and does not depend on a third party system, an important consideration for high traffic sites.
Mobile Translation Utilities
There are many mobile applications that could be built around the platform. One developer is already working on an iPhone application that invites people to submit translations to their favorite websites, one sentence at a time; this app is particularly interesting because of its potential to turn idle mobile users into a workforce for the sites they're passionate about. An easy application to build is an on-demand translator that calls our translation search engine at worldwidelexicon.appspot.com/mt to request a combined list of machine and human translations for a sentence or phrase. In general, mobile applications are simple to build because they typically do one of two things: look up translations, or submit translations for short texts (a way of encouraging people to contribute to their favorite sites while waiting for the bus). To build a translation lookup utility, just call the /mt service to fetch a list of human and machine translations for a sentence (we'll be adding dictionary services soon as well). To submit translations, query the /new service for texts awaiting translation, and /submit to send new translations back.
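The lookup side of such an app reduces to one query against /mt, sketched below in Python. The handler comes from the API index at the end of this post, but the sl, tl, and q parameter names are assumptions of mine; check the /help docs for the real ones.

```python
from urllib.parse import urlencode

def lookup_url(sentence, sl, tl):
    """Build a translation-lookup query against /mt, which returns a
    combined list of machine and human translations for a sentence.
    Parameter names (sl, tl, q) are assumed, not from the docs."""
    params = {"sl": sl, "tl": tl, "q": sentence, "output": "text"}
    return "http://worldwidelexicon.appspot.com/mt?" + urlencode(params)

url = lookup_url("Where is the station?", "en", "ja")
```

The submission side would poll /new for texts awaiting translation and POST results to /submit, following the same request pattern.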
Translation communities. The translation tools industry is almost entirely centered on enterprise users, and apart from a few consumer facing machine translation services (Babelfish, Google Translate, etc), there really hasn't been any focus on creating social translation tools. One of our goals at WWL is to facilitate the creation of translation communities around popular websites, topics and affinity groups. This activity can be channeled through WWL, via community tools in development, or you can simply layer collaborative translation memory onto an existing site or web app, such as a social news site. For example, we've been hearing from country specific sites that want to embed translation memory in the social applications they are building. They know more about their user community than we do, so this is an effective way to make an existing service translatable, thus adding new functionality such as the ability to annotate and translate popular news stories.
Language learning. Using the translation memory as a data source, it is straightforward to build 'flashcard' applications that display texts and prompt users to translate them, and then compare the results against known translations in the translation memory or translations submitted by other users. This is not an area we're especially focused on though, as there are a lot of language learning systems already out there (Rosetta Stone, etc).
So while it is a simple system, you can do quite a bit with it. If you'd like to have a look around, the web services API and docs are both available at worldwidelexicon.appspot.com. You can find source code for the Simple Localization System at code.google.com/p/slsphp (it will be updated to work with the cloud based memory in the near future). We'll be publishing the translation corpora as well, and are currently working out the most efficient way to do this (because of some limitations in the way Google's cloud computing platform processes queries, we'll probably host this on a separate system). Lastly, we'll publish the source for the translation memory later this spring, once we've had time to vet and document everything in detail.
Share Your Utilities
If you build a library or widget around WWL, please share it: we're hosting a collection of utilities at code.google.com/p/worldwidelexicon, starting with libraries and client side utilities. We'll also publish server side code there, starting with the Simple Localization Service, so web developers can self host translation memories for their web applications.
Worldwide Lexicon API Index
The Worldwide Lexicon API is fairly compact; in fact, you can do the bulk of what you need with just a few web services. Full API documentation is at worldwidelexicon.appspot.com/help; see also the SLS documentation. (Note that the web services themselves aren't much to look at, as they're primarily there to talk to other applications, widgets, etc.)
- /q : search the translation memory for translations by url, domain, username and target language
- /mt : request machine translations from one or more popular machine translation services, also returns cached human translations that match the search request
- /new : request a list of newly submitted source texts that are awaiting translation
- /submit : submit a new translation, either anonymously, or as a registered user
- /comments/get : get comments about a translation, URL, domain or translator
- /comments/submit : submit a comment about a translation (also indexed by translator, parent URL and domain)
- /editors/approve : flag a translation as approved/official
- /editors/reject : flag a translation as rejected
- /editors/score : assign a manual score to a translation (co-exists with user submitted scores as a separate field)
- /scores/get : get a list of raw scores for a translation, all translations for a URL or domain, or from a specific user
- /scores/submit : submit a score for a translation, will also be indexed by url, domain and translator
- /sls/add : create new SLS token in your site's home language
- /sls/download : download all translations for a specific language (e.g. fetch a PO file for use by gettext)
- /sls/get : get a translation for a specific SLS object in a specific language, with option to fallback to machine translation
- /sls/translate : submit a translation for an SLS object to a specific language
- /source/submit : submit a new source text to be added to the public translation queue
- /source/parse : parse a sentence into words (for internal use, but publicly accessible)
- /users/add : create a new user account (sends verification email to user)
- /users/get : get user profile and meta data
- /users/login : authenticate a user or session key
- /users/set : set user flags and language preferences
- /users/update : update user profile and meta data
- /users/verify : callback link to verify newly created user account