In this article, I’ll give an overview of some issues to consider when translating or localising your web site. From my experience as a translator and IT specialist, I’ll try to highlight not only various linguistic considerations, but also some subtle practical and technical issues to bear in mind.
Why is a web site different to a “normal” translation project?
In the simplest case, translating a web site may not be significantly different from translating regular documents. You may find you can supply static copy to the translator in a Word file, and then extract and upload the text when you receive it back in the same format.
However, many web sites don’t consist of a few pages of static text, meaning that a web site translation project may require some special consideration and additional skills on the part of the translator:
- you may have pages constructed “on the fly” from a database rather than existing in static files;
- you may have a server application, e.g. for processing form input, which itself generates text visible to the user;
- from a linguistic point of view, it’s rare for web site content to just be about one field: some IT terminology will almost certainly creep in somewhere.
For the first two of these reasons, it’s not uncommon for your web site to involve text in different formats held in different files. You might have some raw HTML files or text that you can easily extract to a text file or Word document from your content management system, plus some data in a database that you may need to extract to a CSV file or SQL dump, plus some properties files used by your back-end server. In the initial stages of getting a quote for the project, tell the translator what file format is most convenient for you to work with (and send a sample) and ask if they can work with that format. (In my case, for example, I’ve seen clients spend time attempting to convert CSV files into Word documents and mangling the text in the process, when I would have been quite happy working with the original CSV files.)
Although most web sites will involve some IT terminology at some stage, this should probably not be the main linguistic issue involved in web site localisation. My reason for saying this is that given the technical issues we’ll look at below, I strongly recommend contracting web site translation to a translator who is IT-savvy in the first place.
An initial linguistic decision, but one which the translator will probably be able to make for you, concerns form of address: as you may know, various languages use different verb forms to address the reader/listener either “informally” or “formally” (e.g. the tu vs vous distinction in French), with some languages even having a three-way distinction. Which form of address is appropriate will depend on your target audience and the customs of the countries you are targeting; the translator may therefore need to consult with you on who your main target audience is and what impression you want to give (do you want your text to sound “serious” or more “hip and trendy”?).
Other linguistic issues arise when translating short items from a database or properties file, where there is sometimes a lack of context. Do you mean a “check” as in a “cheque”, or as in a “verification”? Do you mean “up” as in “higher price” or as in “go to the top of the page”? And in the case of strings that can have parameters (denoted by the sequence {0}, {1} etc in properties files in Java and various other languages), what are the various values that these parameters might have (since they can affect the translation)?
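To illustrate why the possible parameter values matter, here is a minimal sketch using Java’s standard MessageFormat class. The class name, the method name and the message strings are invented for this example; they are not taken from any real site.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class ParameterisedMessage {

    // Substitutes {0}, {1}, ... in the pattern with the supplied arguments,
    // formatting numbers and dates according to the given locale.
    static String render(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        // Hypothetical strings as they might appear in two properties files
        String english = "{1}, you have {0} new messages.";
        String french = "{1}, vous avez {0} nouveaux messages.";

        System.out.println(render(english, Locale.ENGLISH, 3, "Anna"));
        System.out.println(render(french, Locale.FRENCH, 3, "Anna"));

        // With the value 1 the same pattern no longer reads correctly
        // ("1 new messages"), which is why the translator needs to know
        // what values a parameter can take.
        System.out.println(render(english, Locale.ENGLISH, 1, "Anna"));
    }
}
```

Note also that a single quote has a special meaning in MessageFormat patterns, so a translation containing apostrophes (common in French) may need them doubled. This is worth mentioning to the translator up front.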
Sometimes resolving these issues will require you to answer direct questions from the translator about the interpretation of your text. But as a simple measure that can save some time and questions, I recommend using multiple properties files. Let each major area of your site/application have its own properties file. And in particular, let sections of your server/site that address different people have their own properties file. Crucially, if you can possibly avoid it, don’t mix in the same file strings aimed at the web site visitor and strings that are part of your back-end administration system.
Practical and technical issues
When you get the translated material back from the translator (or indeed, ideally beforehand!), there are one or two practical issues you may need to consider. You may have already observed the differences in word count that can occur between one language and another (typically, text in languages derived from Latin such as French and Spanish is about 20-30% longer than its English counterpart). This could have an effect not only on your page layout but also on database field sizes. More subtly, the character count in another language may be similar, but the word count could vary drastically if that language uses compounding more extensively than English (for example, you may find that a translated text in Finnish has a similar character count to the English, but half the number of words). A layout with narrow columns that works on your English page may suddenly look disastrous when applied to the German or Finnish translation.
If your site is interactive, then you have the added issue of accepting the input that users will expect to be able to supply in your web forms etc. This will include, for example, the ability to enter accented characters or a greater range of characters, plus some more subtle changes to your site’s validation. In English, you might have disallowed spaces in the Surname field. But speakers of various other languages typically have multiple surnames and would expect to be able to enter a space in this field.
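As a rough sketch of what more tolerant validation might look like in Java, the following accepts letters from any script, with single spaces, hyphens or apostrophes between name parts. The class name and the exact rule are my own assumptions for illustration; the right rule for your site depends on your audience.

```java
import java.util.regex.Pattern;

public class SurnameValidator {

    // \p{L} matches a letter in any script and \p{M} a combining mark
    // (e.g. a separately encoded accent). Name parts may be separated
    // by a single space, hyphen or apostrophe.
    private static final Pattern SURNAME =
            Pattern.compile("[\\p{L}\\p{M}]+(?:[ '\\-][\\p{L}\\p{M}]+)*");

    static boolean isValid(String input) {
        return input != null && SURNAME.matcher(input.trim()).matches();
    }

    public static void main(String[] args) {
        // "Garc\u00eda M\u00e1rquez" is "García Márquez", written with
        // Unicode escapes so the source file's own encoding doesn't matter
        System.out.println(isValid("Garc\u00eda M\u00e1rquez"));
        System.out.println(isValid("O'Neill"));
        System.out.println(isValid("Smith2"));
    }
}
```

A pattern such as [A-Za-z]+, by contrast, would reject both the space and the accented letters in the first example.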
Two other, sometimes related, issues are character encoding and collation. The first essentially refers to the way in which characters are stored/represented by the computer (how characters are translated into bytes). The second refers to how characters and strings are compared and sorted: for example, whether an e with an acute accent is considered equal to one without the accent for the purpose of searching, and in which order they appear when sorting. These issues don’t usually arise when dealing purely with English, but will typically need to be considered when dealing with text in another language.
Character encoding differs from system to system: common standards include ISO-8859-1 and UTF-8, plus other encodings such as Mac OS Roman. Depending on your web site/application, you may need to ensure you have the correct character encoding configured at various layers:
- when reading in the translated file;
- when reading/writing to your database via JDBC or other application-layer framework;
- when reading data input by the user via the Servlet API etc;
- on the database field definitions themselves, to ensure they can store the range of characters necessary.
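For the first of these layers, a minimal Java sketch might look like the following. It assumes the translator has delivered a properties file saved as UTF-8; the class and method names are hypothetical. The key point is that the encoding is stated explicitly rather than left to the platform default.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class TranslatedProperties {

    // Loads a properties file, decoding its bytes as UTF-8.
    // Assumption: the translated file really was saved as UTF-8;
    // substitute the appropriate charset if it was not.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }
}
```

The other layers (database connection, request handling, column definitions) each have their own configuration settings, which depend on the database and server you are using.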
How do you know if you have the correct character encoding? A tell-tale sign of the wrong character encoding in various Latin-based languages like French and Spanish is if you frequently see sequences of two accented characters next to one another, possibly including a capital letter in the middle of words. (This happens when a file encoded in UTF-8 is incorrectly interpreted as though it were in ISO-8859-1 or Mac OS encoding.)
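The effect is easy to reproduce. The following small Java sketch (the class and method names are hypothetical) deliberately encodes a string as UTF-8 and then decodes the bytes as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes those bytes
    // as ISO-8859-1, reproducing the classic garbled output.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café"; the escape keeps the source file ASCII-only
        String garbled = misread("caf\u00e9");
        // One character has become two, so the string is now 5 long
        System.out.println(garbled.length());
    }
}
```

In UTF-8, “é” is stored as the two bytes C3 and A9. Read as ISO-8859-1, those bytes become the two separate characters “Ã” and “©”, so “café” turns into “cafÃ©”.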
The issue of collation (sorting/matching) may be dealt with at the database layer (most DB systems allow collation modes to be configured for a particular column/table/database). Or it may be dealt with at the application layer (in Java, look at the Collator class as an alternative or extension to the raw Collections.sort() and String.equals() methods).
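As a sketch of the application-layer approach, the following uses the standard Collator class for both sorting and accent-insensitive matching. The class name, method names and sample words are my own, chosen purely for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollationDemo {

    // Returns a copy of the list sorted by the locale's collation rules
    // rather than by raw character codes.
    static List<String> sortFor(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    // PRIMARY strength compares base letters only, so differences
    // in accents (and in case) are ignored.
    static boolean sameIgnoringAccents(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "\u00e9cole" is "école"
        List<String> words = Arrays.asList("zebra", "\u00e9cole", "eau");
        System.out.println(sortFor(Locale.FRENCH, words).indexOf("zebra"));
        System.out.println(sameIgnoringAccents(Locale.FRENCH, "eleve", "\u00e9l\u00e8ve"));
    }
}
```

With the default String ordering, “école” would sort after “zebra”, because the code for “é” is higher than the code for “z”. The French Collator places it between “eau” and “zebra”, as a French reader would expect.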
I hope in this article to have highlighted some of the main areas of concern when localising a web site, and shown that such issues can go well beyond the translation itself. Working with a translator who is aware of these issues could save you time and effort in making your business available in the different countries you wish to target.