Avoiding the Tar Pits of Localization with Jeff Casimir
21 Nov 2012
- The tar pits of localization
- Writing better code
- Fixing magic data
- White labeling your app
Even if you're only going to support one language, internationalization will help you to build a better application.
We're about to embark on a dangerous journey...
As a developer considering internationalization and localization, you're like this mammoth. You're out walking and you think: "I'll walk through this water - how deep could it be?" Pretty soon, you're stuck in the tar pit.
That's how internationalization is. It starts out easy enough, and eventually you'll be drinking tar.
i18n & l10n
Internationalization (i18n) is the one-time process of preparing an application to support more than one locale. In a Rails application, that typically involves going through your controllers, models and views, pulling out all the content strings and replacing them with references to a look-up file or some other means of finding the translated text.
Localization (l10n) is the many-time process of taking an application which has already been internationalized and creating a locale for it - US Spanish or Brazilian Portuguese, for example.
Why businesses care
Localization provides a better user experience. The more users you have, the more customers you have. The more customers you have, the more money you make.
Why developers care
Developers typically try to stay away from localization. The tar pit is scary, and making money is generally somebody else's problem.
The main reason developers care about internationalization is that it results in better code.
Internationalization takes a scalpel to your code. I've written many controller actions like this - something I call magic data.
Magic data leads to several problems, one of which is copy-edit commits. That's where you change a line of code only to fulfil a marketing or copy change demand.
For example, when a "Submit" button changes to "Save", or a flash message changes from "Your article was created" to "Your article was saved".
That kind of stuff pollutes your code. It looks like you're making functionality changes - but you're just changing a bunch of strings.
When you look at a controller action like this, you're just looking for the magic data, the strings.
It's one of the reasons I like to use symbols in Rails whenever possible because it separates functional code from magic data. Whenever you see strings, they should be removed. So how do you remove them?
Establish a look-up dictionary
This is what a YAML-based dictionary in Rails looks like.
With this dictionary established in config/locales in a default Rails application, you can utilise those keys with the t helper.
You can access the YAML hierarchy via a dot notation. So this means a top level key of article having a child key of deleted.
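The dot notation can be sketched in plain Ruby. This is only an illustration of how a dotted key maps onto the YAML hierarchy, not how Rails implements the t helper; the lookup method, the constant names and the message strings are all assumptions made for the example.

```ruby
require "yaml"

# Hypothetical contents of config/locales/en.yml.
EN_FILE = <<~LOCALE_FILE
  en:
    article:
      created: "Your article was created."
      deleted: "Your article was deleted."
LOCALE_FILE

DICTIONARY = YAML.safe_load(EN_FILE)

# Walks the nested hash one key segment at a time.
# Returns nil as soon as a segment is missing.
def lookup(dictionary, locale, key)
  key.split(".").reduce(dictionary[locale.to_s]) do |node, segment|
    node.is_a?(Hash) ? node[segment] : nil
  end
end

puts lookup(DICTIONARY, :en, "article.deleted")
```

In a real Rails view or controller the equivalent call would simply be t("article.deleted").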
The t helper works in a view or a controller context. If you're not in either of those contexts, you can speak directly to the I18n module, call t on it, and pass in your key that way.
However, that's typically a warning sign. If you're calling the translate helper from the model layer, you're probably doing a bad thing because translation is about presentation and models shouldn't be presenting things.
I'm not saying it's wrong to use I18n.t - but do it with hesitation.
Structuring the keys
The first rule is to keep them short - and you do that by avoiding snake-casing the text.
I've often seen the text snippet turned into the key by replacing all the spaces with underscores. This isn't a maintainable way to go. Instead you should focus on the core meaning of the text.
Previously, you saw a key called article.deleted rather than article_has_been_deleted - because has_been is not core to the meaning.
Instead, just focus on the central idea - because that text is very likely to change, but the core meaning of the key is unlikely to change. Also, forget about reusability.
Look-up dictionaries are not programming. Resist the urge to do fancy things with your keys and translations - interpolation, automatic pluralization and singularisation.
The more complex you make your keys and translations, the further down in the tar pit you'll find yourself. As far as possible, use simple keys and simple strings.
If you want to do something more complicated, the Draper gem can be helpful in handling complex logic at the view layer.
I was inspired to write this gem by the Microsoft community. They have this idea of view models, and Draper helps create models that deal with presentation concerns at the view layer.
So, we remove magic data from the controller and use the t helper to replace them with keys. This is what a nicely internationalized controller action looks like.
When you look down at the model layer, even more magic data tends to spring up. These are all spots that could be dealt with using internationalization.
The numbers here are probably better handled in configuration files because they're unlikely to change between different locales.
If they're not going to change across different locales then they belong in a config file.
If they are going to change - like that message in line 3 - those belong in a locale file.
In this case, ActiveRecord can help you out - especially when you're setting a specific validation message.
When an ActiveRecord model fails validation, ActiveRecord will attempt a sequence of look-ups in the current locale.
This is a description of that look-up hierarchy - the most useful level of which is activerecord.errors.messages.
In your YAML file, under the messages child key, you can define keys for each built-in validation and Rails will automatically use those strings.
One gotcha is that you have to define the keys by what the Rails guide calls message rather than what it calls validation.
In this example, your key would be blank, not presence.
Putting it all together, when I structure my keys like this, I can delete the message line from my model, leaving just validates_presence_of :title
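The look-up sequence can be approximated outside Rails. The sketch below follows the order of scopes listed in the Rails i18n guide for validation errors, in simplified form; it is not ActiveRecord's own code, and the method names and the "is required" string are hypothetical.

```ruby
require "yaml"

# Hypothetical locale file that defines only the generic "blank" message.
ERRORS_DICTIONARY = YAML.safe_load(<<~LOCALE_FILE)
  en:
    activerecord:
      errors:
        messages:
          blank: "is required"
LOCALE_FILE

# Candidate keys, most specific first (simplified from the Rails guide).
def candidate_keys(model, attribute, message)
  [
    "activerecord.errors.models.#{model}.attributes.#{attribute}.#{message}",
    "activerecord.errors.models.#{model}.#{message}",
    "activerecord.errors.messages.#{message}",
    "errors.attributes.#{attribute}.#{message}",
    "errors.messages.#{message}"
  ]
end

# Resolves one dotted key against the nested hash, or returns nil.
def translation_for(dictionary, locale, key)
  key.split(".").reduce(dictionary[locale.to_s]) do |node, segment|
    node.is_a?(Hash) ? node[segment] : nil
  end
end

# Returns the first candidate that resolves to a string.
def validation_message(dictionary, locale, model, attribute, message)
  candidate_keys(model, attribute, message)
    .map { |key| translation_for(dictionary, locale, key) }
    .find { |value| value.is_a?(String) }
end

puts validation_message(ERRORS_DICTIONARY, :en, :article, :title, :blank)
```

Asking for :presence instead of :blank finds nothing, which is the gotcha described above.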
A word of warning, this is an example of where you'll be tempted to get fancy. You'll say "I can get the attribute name and dynamically inject it into the message."
Don't do it. You'll regret it.
Just keep it simple. It works fine in English because title is one word. In other languages, it might be two words, you might have a capitalisation problem, and so on.
This leads down a road of pain. Just keep the messages simple and use the string you define in the YAML file.
You can also define the print name of your attributes if you follow this kind of hierarchy:
So, here I've given the article model a string of Title and the comment model a string of Your comment.
If you're using label tags like you should be, along with any other form helpers built in to Rails, they will make use of these strings automatically - you don't have to monkey around with your forms.
In this example, if you're just outputting a label, it will show the string Your comment instead of body on the comment form.
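As a rough sketch of that hierarchy, assuming the standard activerecord.attributes scope: the display_name method below is a hypothetical stand-in for what the Rails form helpers do internally, and its fallback for undefined attributes is a simplification.

```ruby
require "yaml"

# Hypothetical locale file giving attributes human-friendly names.
LABELS = YAML.safe_load(<<~LOCALE_FILE)
  en:
    activerecord:
      attributes:
        article:
          title: "Title"
        comment:
          body: "Your comment"
LOCALE_FILE

# Returns the configured display name, or a capitalised version of the
# attribute name when the dictionary has no entry for it.
def display_name(dictionary, locale, model, attribute)
  names = dictionary.dig(locale.to_s, "activerecord", "attributes", model.to_s) || {}
  names.fetch(attribute.to_s) { attribute.to_s.tr("_", " ").capitalize }
end

puts display_name(LABELS, :en, :comment, :body)
puts display_name(LABELS, :en, :article, :published_at)
```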
By pulling these strings out of my functional code, I have one less reason to change my functional code, which improves its durability, which makes it better.
Even if you're only going to support one language, internationalization will help you to build a better application.
Let's say you have a blogging app, and you want to present the articles in multiple languages. Pretty soon you're going to run into the problems of creating multiple translations, managing the import and export of the content and keeping these translations in sync.
This is where you start to enter the nightmare mode of localization because of this scenario:
- Write a blog post in English
- Send it off for translation to Japanese
- Make a couple of revisions in the English text
- Receive back the Japanese translation of the original English text (minus the revisions)
Now the Japanese translation is a slightly different version than the English one - and we're only talking about two languages. When you're supporting 12 or 20 languages, this becomes a tremendous issue.
The model of CopyCopter is that you keep your keys and strings in a separate app. You run a rake task on your primary app to fetch copy from that remote app. Your translators just interact with the remote app, without needing access to your primary app.
Globalize3 works but it makes you realize what a complicated problem this is, because it is a complicated solution.
You end up creating tables for your fields, so each field in your primary table will have its own sub-table where each row is a different language - so that they can all be time-stamped and versioned. Your database schema will be bananas; you'll have tables proliferating like bunnies.
This third option is my favourite one. Since I discovered this app, I will not use anything else.
To clarify, I have no affiliation with these people other than thinking that they're bad-asses!
Honestly, I don't care how much it costs. Once you try those other options, you will start throwing your money at these people to save you from the tar pit.
How do we determine what locale your user actually wants? The locale_setter gem helps you do this, and there are five broad strategies for deciding which locale to display:
- Geolocation
- Browser preferences
- URL parameters
- Account preferences
- Default settings
1. Geolocation - sucks, don't do it.
Think about when you're travelling - does your current physical location on the planet change the language you want to read on the web? Geo-location is stupid. Please don't tie a person's location to the language they speak.
2. Browser preferences - deep down in your browser preferences there are language settings which you can access. Those preferences are submitted with every request made on the web.
If you go to a Rails view template and use the debug helper to output request.env, you'll see HTTP_ACCEPT_LANGUAGE buried in the output.
One of my favourite debugging or investigation techniques is to call render :text from a controller and pass it some text. It'll spit that raw text back to your browser without any wrapping html.
Here's a slightly more complicated result:
There are three separate comma-separated parameters in this second example.
It means that the user:
- Prefers US English
- Will accept generic English, at a relative preference of 0.8
- Will accept generic Spanish, at a relative preference of 0.2
Because the user submits this information with every request, you can just parse it out with a regex and pull out the locale names.
You can probably get away with assuming that the language they submit first is the one they want most.
LocaleSetter will give you back an ordered list of locales.
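A minimal sketch of that parsing step, which is not LocaleSetter's actual implementation. The requested_locales method name and the header value are assumptions for the example. It trusts the order the browser sent and ignores the q weights, as suggested above.

```ruby
# Extracts locale names from an Accept-Language header, in the order sent.
# Entries that do not start with a language tag (such as "*") are dropped.
def requested_locales(header)
  header.to_s.split(",").map do |part|
    part[/\A\s*([A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*)/, 1]
  end.compact
end

# Hypothetical header: US English first, then generic English, then Spanish.
p requested_locales("en-US,en;q=0.8,es;q=0.2")
```

A missing header (nil) yields an empty list, so the caller can fall through to the next strategy.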
Once you know what language the user wants, you can query the I18n library to see what locales are available.
Then the challenge is to match the user's preferences to the available locales. You find the locales they want and convert them to symbols and compare them to the supported locales (because they're stored as an array of symbols), and then find the first match in order of their preference (not yours).
This seems like a simple problem. This matcher module takes in requested as an array of the symbols the user has asked for, and matches that against your available locales.
The single ampersand operator finds the intersection of two arrays - the elements that are common to both - and orders them by their position in the first array.
In this example requested will take order priority over I18n.available_locales.
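The ordering behaviour is easy to check in plain Ruby. The Matcher module here is a hypothetical reconstruction of the idea, and both arrays are made-up data; in a Rails app the second array would come from I18n.available_locales.

```ruby
module Matcher
  # Returns the first requested locale that the application supports,
  # or nil when nothing matches.
  def self.match(requested, available)
    (requested & available).first
  end
end

requested = [:es, :en, :fr] # hypothetical user preferences, most wanted first
available = [:en, :de, :es] # hypothetical stand-in for I18n.available_locales

p requested & available              # => [:es, :en]
p available & requested              # => [:en, :es]
p Matcher.match(requested, available) # => :es
```

Putting requested on the left is what makes the user's order win over the application's.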
3. URL parameters can also be used to determine which locale to serve.
What's good about using url parameters is that it makes it really easy to switch locales for development and debugging.
It's easy to pull the locale parameter out of the url with a simple before_filter :set_locale, which reads the params hash and looks for the locale key. If it's nil, we leave the current locale unchanged.
There is a danger here. You can ask the Symbol class for all the symbols currently defined on a system (Symbol.all_symbols) - and you'll get back a very large array of hundreds, maybe thousands of symbols.
If you set the locale to some string that the user passed in, such as garbage_from_a_user, and then query for it, you will find that it is now set as a symbol.
The danger is that there's now one more symbol in the symbol table than there used to be. This opens you up to a denial of service attack because it allows users to spam your urls, generate symbols, fill your symbol table and crash your application.
A safer option is to match more intelligently by pulling out the params locale and sending it to the params module which triggers the matcher.
Things get interesting in the matcher where instead of symbolising the input and comparing it to the available locales, we stringify the available locales.
Here in the available method, we map to strings which is safer because strings are garbage collected - and they're our own strings anyway.
We map those strings and compare them to the requested strings.
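A minimal sketch of the string-based comparison, assuming a matcher shaped like the one described. The StringMatcher name and the sample data are hypothetical, not the gem's code. User input is never turned into a symbol unless it already equals one of our own locale names.

```ruby
module StringMatcher
  # Compares as strings so untrusted input never creates new symbols.
  # Converts to a symbol only after a match, when that symbol already exists.
  def self.match(requested, available)
    matched = (requested.map(&:to_s) & available.map(&:to_s)).first
    matched && matched.to_sym
  end
end

available = [:en, :es] # hypothetical stand-in for I18n.available_locales

p StringMatcher.match(["garbage_from_a_user", "es"], available) # => :es
p StringMatcher.match(["garbage_from_a_user"], available)       # => nil
```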
Isn't it wasteful to create these strings?
Yes, but Rails creates thousands of strings - something like 16,000 objects on a Hello World request. So the strings we're creating here are statistically insignificant - don't worry about it.
There's another problem with these locale-embedded links. You embed the locale in the url, click any link on the page, and now the locale is lost to subsequent requests.
Thankfully, Rails provides an easy way to handle this: you just need the default_url_options method.
This allows you to manipulate any url generated from a path helper. If you are a good citizen of your view templates, and you're only ever using path helpers and never writing your own a tags, path helpers will automatically embed the locale in every url.
But you probably have a primary locale which serves the majority of your users - so why bother embedding the locale in urls for this default locale? Here, we don't embed the locale if we're using the default locale.
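The conditional can be sketched with a plain Ruby class standing in for a controller. FakeController, its current_locale accessor and the DEFAULT_LOCALE constant are hypothetical; in a real Rails controller the method would read I18n.locale and I18n.default_locale instead.

```ruby
DEFAULT_LOCALE = :en # hypothetical stand-in for I18n.default_locale

# Hypothetical stand-in for a Rails controller.
class FakeController
  attr_accessor :current_locale

  def initialize(current_locale)
    @current_locale = current_locale
  end

  # Adds the locale to generated urls only when it differs from the default.
  def default_url_options(options = {})
    return options if current_locale == DEFAULT_LOCALE

    { locale: current_locale }.merge(options)
  end
end

p FakeController.new(:es).default_url_options # => {:locale=>:es} (newer Rubies print {locale: :es})
p FakeController.new(:en).default_url_options # => {}
```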
4. User preference - you might store the locale in the users table, and the locale_setter gem checks whether the current user has a preferred locale.
If they have, the gem will pull it out and attempt to match it against the available locales with the matcher module.
Putting it all together. How do we prioritise these locale selection options? Here's my recommended look-up chain:
- URL parameter - this allows you to override your own user preference, despite being logged in, and manipulate the url to control your locale
- User preference - If users are logged-in and they haven't monkeyed with the url, go ahead and use their stored preference.
- Browser settings
- Default locale
- No geolocation
The most interesting part here is the set_locale method which prioritises the params, then the user, then the http and then the default locale.
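That priority order can be sketched as a single function. This is an illustration under assumptions, not the gem's set_locale: the choose_locale name, its keyword arguments and the sample values are all hypothetical.

```ruby
# Picks a locale by checking, in order: url parameter, stored user
# preference, browser header preferences, then the default locale.
# Comparison is done on strings so user input never becomes a symbol.
def choose_locale(params_locale:, user_locale:, header_locales:, available:, default:)
  candidates = [params_locale, user_locale, *header_locales].compact.map(&:to_s)
  matched = (candidates & available.map(&:to_s)).first
  matched ? matched.to_sym : default
end

available = [:en, :es, :ja] # hypothetical supported locales

# The url parameter wins over everything else.
p choose_locale(params_locale: "ja", user_locale: "es",
                header_locales: ["en-US", "en"], available: available, default: :en)

# No url parameter and no stored preference: fall back to the browser header.
p choose_locale(params_locale: nil, user_locale: nil,
                header_locales: ["es", "en"], available: available, default: :en)
```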
4. i18n also makes white boxing (white labeling) easy.
Here's a strategy you could use to get one application serving two different clients:
- Step 1: Internationalize - carve up your code, pull out the magic data, pull out the strings and get them into lookup dictionaries.
- Step 2: Localise your user content. Don't do this on your own, you'll hate life. Just use Locale, it's awesome.
- Step 3: Determine the locale and try out my gem.
- Step 4: Hack i18n beyond translations - anything that has to do with strings, you can do with i18n
Q: Is there anything which will test localization files to ensure that all lookups are defined for all languages?
A: No, but by default the i18n gem will output the name of the key if the translation is missing. This is good because the app doesn't crash, but it's bad because users see garbage. You can override that exception handler, and I recommend you do so - particularly at the CI level. Run CI through each of your locales and set your exception handler to raise an exception for missing translations. You will catch issues that happen because of translations.
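As a complement to the exception handler, a CI step could also compare the key sets of two locale files directly. This is a different technique from the handler override described in the answer, offered only as a rough sketch; the method names and the two dictionaries are hypothetical.

```ruby
# Flattens a nested dictionary into a list of dotted keys.
def flatten_keys(node, prefix = nil)
  node.flat_map do |key, value|
    path = [prefix, key].compact.join(".")
    value.is_a?(Hash) ? flatten_keys(value, path) : [path]
  end
end

# Keys present in the reference locale but absent from the other one.
def missing_keys(reference, other)
  flatten_keys(reference) - flatten_keys(other)
end

# Hypothetical dictionaries, as loaded from two locale files.
english = { "article" => { "created" => "Created", "deleted" => "Deleted" } }
spanish = { "article" => { "created" => "Creado" } }

p missing_keys(english, spanish) # => ["article.deleted"]
```

A CI task could fail the build whenever the returned list is not empty.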
Q: Is there a way to mount a locale using a url holder pattern like /locale/stuff?
A: Yes, it is possible but it might cause some issues. I'm not too wild about using subdomains either.
Q: How do you decide when to keep something in a model as a constant, as opposed to keeping it as a setting?
A: In my opinion it's always wrong to keep it in the model. From a code quality standpoint it is always wrong to have numbers in the middle of your functional code - because that number has a domain meaning and it should have a name. It shouldn't be 6; it should be front_page.articles_limit. The only reason we put that 6 there is because we're lazy. You write that code once, but you read it and debug it many times. You're better off typing the long thing. So, the first step is to go through code and move the integers up to constants. Once you see all the constants, you realise that if you wanted to change them you could change the source code or monkey patch it with an initialiser. That could work, but if you're going to write an initialiser you might as well just write an initialiser. So, take those constants, bring them over to an initialiser and handle configuration data there.
Q: Whether you're changing a constant in the model or data in the initialiser, you're still making code changes?
A: Yes, same repository, but different conceptual area. We're changing the parts of the code that change when configuration changes, but we leave the models alone which reflect business logic.
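The split between configuration and business logic might look something like this. It is a sketch only: the Settings module, the FrontPage class and the file location mentioned in the comment are hypothetical names, and the value 6 is taken from the answer above.

```ruby
# Hypothetical contents of an initialiser such as config/initializers/settings.rb.
# Configuration changes touch only this module.
module Settings
  FRONT_PAGE_ARTICLES_LIMIT = 6
end

# Hypothetical stand-in for a model. It names the limit instead of
# hard-coding a bare 6 in the middle of the logic.
class FrontPage
  def initialize(articles)
    @articles = articles
  end

  def articles
    @articles.first(Settings::FRONT_PAGE_ARTICLES_LIMIT)
  end
end

p FrontPage.new((1..10).to_a).articles # => [1, 2, 3, 4, 5, 6]
```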