This is the first article in a series on best practices for programming BiDi support. The full series can be viewed here.
This series of articles covers different aspects of internationalization (also called i18n) and localization (l10n), with special attention to bi-directional languages (also called BiDi). These are mostly Hebrew and Arabic (but also Farsi, Yiddish and Urdu), and they are unique in that text in these languages is written as a series of "runs", each run having its own direction (either left to right, called LTR, or right to left, called RTL). A typical BiDi paragraph may contain several runs of opposing directions.
Writing support for these languages into programs (i18n) presents unique challenges. Many of these challenges have no automatic solutions, and there is often no better option than to consult someone who understands the language and can tell a good solution from a bad one. This series of articles is an attempt to help programmers understand the issues at hand, give them some tools to cope with these challenges, and offer further reading for anyone wishing to better understand the proper way to prepare a program for BiDi support (i18n), as well as to perform the actual translation (l10n) of a program.
An important note: This article (like the rest of the series) is not meant for people who wish to build platform support for BiDi languages. Most platforms today already support BiDi languages, and writing support for a currently unsupported platform would require a much deeper understanding than what is presented here. The purpose of these articles is to allow people working with platforms that already have such support to make better use of it when creating programs that handle BiDi data and translation. As such, there are some subjects we merely skim through. Wherever possible, we try to provide pointers for further reading to anyone who wishes to gain a better understanding.
As in most documentation relating to BiDi, this document uses examples where lowercase means a left to right character and uppercase means a right to left character. English words are used for both lowercase and uppercase. The examples keep the spirit of "logical text" in that the words are written in the order in which they should be pronounced. This means that the words are readable in logical order, and less so in the correct visual order.
The BiDi Reordering Algorithm
At the heart of BiDi support is the BiDi reordering algorithm. Despite the alternating directions in which BiDi text is written to screen, the best way to store the text in the program's memory and on disk is in "logical order". This means that the letters in the string need to be reordered before they are displayed, so that they are displayed in an order that makes sense to a native speaker of the language.
- Logical order: The order in which letters are stored in memory. This order needs to be simple to sort and to manipulate (add parts, remove parts, concatenate two strings etc.). Ideally, logical order should be the order in which the letters would be spoken. For most BiDi languages, writing text vertically, where each letter is beneath the previous one, cancels all BiDi considerations. As such, vertical text is always written in logical order.
- Visual order: The order in which letters are to be displayed on screen. This can be either monotonically left to right (typical with computers) or monotonically right to left (formerly typical with manual typewriters). Either way, with visual order, some of the text has to be written back to front.
In practice, logical order is almost always an exact record of the order in which the letters were typed by the user. An ideal reordering algorithm is one that produces a visual order that causes a native speaker of the language to automatically scan the letters in logical order. The reader should be aware that there are texts for which even experts disagree on what exactly that visual order should be.
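To make the two orders concrete, here is a toy sketch in Python that uses this article's convention (lowercase is LTR, uppercase is RTL). The name `toy_visual_order` is hypothetical, and the code is not the Unicode algorithm: it assumes an LTR paragraph that contains only letters and single spaces, and it simply reverses each uppercase run.

```python
import re

# A maximal run of "RTL" (uppercase) words, including the single spaces
# between them. This pattern is an assumption of the toy model only.
RTL_RUN = re.compile(r"[A-Z]+(?: [A-Z]+)*")


def toy_visual_order(logical: str) -> str:
    """Return a left-to-right visual order for a toy logical string."""
    # Each RTL run is written back to front; LTR text is left untouched.
    return RTL_RUN.sub(lambda match: match.group(0)[::-1], logical)


logical = "he said HELLO WORLD to me"
print(toy_visual_order(logical))  # he said DLROW OLLEH to me
```

Reading the uppercase run from right to left in the output gives HELLO and then WORLD, which is the logical order a reader expects.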
Unicode's Reordering Algorithm
The algorithm will not be explained here. It is fully documented in Unicode Standard Annex #9 (UAX #9, originally published as Technical Report #9). The most important highlights, however, are:
- Each character is assigned a type from the Unicode database. These types can be hard left to right (such as an English letter), hard right to left (such as a Hebrew letter), neutral etc. The standard uses the term "strong" for what we call "hard" here.
- Each character's type is potentially modified based on the context in which it appears. This can happen to each character more than once.
- Each character is assigned a "direction level" (an "embedding level" in the standard's terms) based on its type and the paragraph direction. Even levels mean a left to right run, odd levels mean a right to left run.
- The characters are actually reordered based on the levels.
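The first step above can be inspected directly from Python, whose standard `unicodedata` module exposes the type assigned to each character by the Unicode database. The short sketch below only looks the types up; it performs none of the later steps. The helper name `bidi_types` and the sample characters are chosen here for illustration.

```python
import unicodedata


def bidi_types(text: str) -> list[str]:
    """Return the Unicode bidirectional type of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


samples = {
    "a": "Latin letter",
    "\u05d0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "1": "European digit",
    " ": "space",
    "(": "opening parenthesis",
}

for ch, description in samples.items():
    # L = strong left to right, R and AL = strong right to left,
    # EN = European number, WS = whitespace, ON = other neutral.
    print(f"U+{ord(ch):04X} {description}: {unicodedata.bidirectional(ch)}")
```

The Latin letter reports `L`, the Hebrew letter `R`, the Arabic letter `AL`, and the digit, space and parenthesis report the weak or neutral types `EN`, `WS` and `ON`, whose final direction depends on context.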
Paragraph Base Direction
When performing the BiDi algorithm, it is important to know the paragraph direction. This is the core direction of the paragraph, and it is not easy to determine automatically. The BiDi algorithm suggests that the first "hard" directional letter in the paragraph dictates the paragraph direction. We suggest that this heuristic be used only as a last resort. Wherever possible, try to know what the paragraph direction is, and dictate it to the rendering engine in some out-of-band way. This fully conforms to the BiDi algorithm definition, which explicitly states that the paragraph direction can be specified in any way.
Some characters have an inherent directionality. Unicode opted to define these characters by their semantic meaning rather than by the shape of their glyph. For example, U+0028 is treated as an opening bracket rather than a left bracket: its Unicode 1.0 name was OPENING PARENTHESIS, and although its current name is LEFT PARENTHESIS, it is flagged as a mirrored character. Since the reading direction in right to left languages is from right to left, an open bracket needs to have its open side point to the left. In other words, in RTL context, an open bracket is a right bracket, not a left one.
To achieve this effect, a mirroring pass is usually performed before displaying the strings. Each mirrored character at an odd BiDi level is assigned a mirrored glyph. Typically, instead of reversing the glyph, a different glyph that already looks mirrored is picked. So an open bracket character (U+0028) at an odd BiDi level is displayed using the glyph generated for the close bracket character (U+0029), which is a right bracket.
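A minimal sketch of such a mirroring pass follows. It substitutes characters as a stand-in for glyph selection, assumes the levels have already been computed, and uses a small hand-written `MIRROR_PAIRS` table (a real implementation would take the pairs from the Unicode Character Database's `BidiMirroring.txt` file). The function name `mirror` is hypothetical; `unicodedata.mirrored` is the standard library's lookup of the mirrored property.

```python
import unicodedata

# Tiny illustrative subset of mirrored pairs; not the complete list.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def mirror(text: str, levels: list[int]) -> str:
    """Swap mirrored characters that sit at an odd (RTL) level."""
    result = []
    for ch, level in zip(text, levels):
        if level % 2 == 1 and unicodedata.mirrored(ch):
            ch = MIRROR_PAIRS.get(ch, ch)
        result.append(ch)
    return "".join(result)


logical = "(HELLO)"          # uppercase stands for RTL letters
levels = [1] * len(logical)  # assumed: the whole run is at level 1
mirrored = mirror(logical, levels)
print(mirrored)              # )HELLO(
print(mirrored[::-1])        # (OLLEH) once the RTL run is reversed
```

Without the mirroring pass, reversing the run would display `)OLLEH(`, with both brackets facing the wrong way.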
In some languages, a single character may have different visual representations, depending on its position inside the word. In Arabic, for example, a single character may be displayed differently depending on whether it stands alone, at the beginning of the word, at the end or in the middle. Unicode tries to define characters by their logical meaning, which means that an Alef is an Alef, regardless of whether it has another letter before it or not. As such, it is up to the rendering engine to figure out the actual glyphs to display when characters appear in a certain order with a certain BiDi level.
One special case worth mentioning is the Lam-Alef combination (ل followed by ا). When these two letters appear in this order, they are combined by the shaping engine into one glyph (لا). This means that the text, after shaping, might contain fewer characters than it did originally.
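The length change can be demonstrated with a deliberately naive sketch. It replaces the Lam-Alef pair with the Unicode compatibility character for the isolated ligature (U+FEFB), which is an assumption made for illustration: real shaping engines generally select glyphs from the font, and the isolated form is only right when the Lam is not joined to a preceding letter. The function name `toy_shape_lam_alef` is hypothetical.

```python
import unicodedata

LAM = "\u0644"
ALEF = "\u0627"
LAM_ALEF_ISOLATED = "\ufefb"  # compatibility code point for the ligature


def toy_shape_lam_alef(text: str) -> str:
    """Naively combine every Lam + Alef pair into a single ligature."""
    return text.replace(LAM + ALEF, LAM_ALEF_ISOLATED)


logical = LAM + ALEF
shaped = toy_shape_lam_alef(logical)

print(len(logical), len(shaped))  # the shaped text is one character shorter
print(unicodedata.name(shaped))

# Compatibility normalization maps the ligature back to the two letters.
print(unicodedata.normalize("NFKC", shaped) == logical)
```

Any code that maps positions in the shaped text back to the stored text (cursor placement, selection, line breaking) has to account for this difference.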
An interesting point is that Hebrew also has a few context sensitive letters. Despite what was said above about characters being defined by their logical meaning, Unicode does define the Hebrew "final letters" as code points distinct from the non-final forms of the same letters. The reason is mostly one of legacy. There are 22 Hebrew letters, five of which have a final form, making a total of 27 letter forms. There are 28 base Arabic letters, most of which may have up to four presentation forms, making a total of over a hundred presentation forms. An Arabic keyboard has only the 28 base forms on it, while a Hebrew keyboard allows directly typing all 27 letter forms.
The main lesson to take away from this is: do not try to fix it. Treat Hebrew letters as non-shaping letters, to be output as typed. Treat Arabic letters as shaping letters, to be shaped before being output. Doing anything else will create usability problems for your users!
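The difference between the two scripts is visible in the Unicode database itself. The sketch below, using only Python's standard `unicodedata` module, counts the Hebrew letter code points and then shows that the four presentation forms of the Arabic letter Beh are compatibility characters that all normalize back to the single base letter. The variable names are chosen here for illustration.

```python
import unicodedata

# Hebrew letters occupy U+05D0 through U+05EA; final forms are ordinary,
# separately encoded letters within that range.
hebrew_letters = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]
hebrew_finals = [ch for ch in hebrew_letters if "FINAL" in unicodedata.name(ch)]
print(len(hebrew_letters), len(hebrew_finals))  # 27 5

# Arabic Beh is stored as one base letter; its isolated, final, initial
# and medial presentation forms live at U+FE8F through U+FE92.
BEH = "\u0628"
beh_forms = [chr(cp) for cp in range(0xFE8F, 0xFE92 + 1)]
for form in beh_forms:
    folds_to_base = unicodedata.normalize("NFKC", form) == BEH
    print(unicodedata.name(form), folds_to_base)
```

Stored Hebrew text therefore already contains the final letters the user typed, while stored Arabic text should contain only base letters such as U+0628, leaving the choice of form to the shaping engine.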
Line breaking should take place after the final BiDi levels have been assigned to each character of the string, but before the reordering takes place. Because shaping might change the length of the output string, it too has to take place before line breaking. The proper order is, therefore:
- Apply the Unicode algorithm and calculate BiDi levels
- Perform mirroring
- Perform shaping
- Calculate line breaks
- Perform reordering of each line individually
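The sketch below illustrates why the last two steps come in that order, using this article's lowercase/uppercase convention. All names are hypothetical. `toy_levels` is a stand-in for the real level calculation and assumes an LTR paragraph with only letters and single spaces; mirroring and shaping are omitted. `reorder_line` follows the reversal rule of the Unicode algorithm (from the highest level down to the lowest odd level, reverse every contiguous sequence at that level or higher), without the standard's special handling of trailing whitespace.

```python
def toy_levels(text: str) -> list[int]:
    """Toy level assignment for an LTR paragraph: uppercase is level 1."""
    levels = [1 if ch.isupper() else 0 for ch in text]
    for i in range(1, len(text) - 1):
        # A single space between two RTL letters joins the RTL run.
        if text[i] == " " and text[i - 1].isupper() and text[i + 1].isupper():
            levels[i] = 1
    return levels


def line_spans(text: str, width: int) -> list[tuple[int, int]]:
    """Greedy word wrap on the logical string; returns (start, end) pairs."""
    spans, start = [], 0
    while len(text) - start > width:
        cut = text.rfind(" ", start, start + width + 1)
        if cut <= start:          # no usable space: hard break
            cut = start + width
            spans.append((start, cut))
            start = cut
        else:                     # break at the space and drop it
            spans.append((start, cut))
            start = cut + 1
    spans.append((start, len(text)))
    return spans


def reorder_line(text: str, levels: list[int]) -> str:
    """Reverse runs from the highest level down to the lowest odd level."""
    cells = list(zip(text, levels))
    odd_levels = [level for level in levels if level % 2 == 1]
    if odd_levels:
        for level in range(max(levels), min(odd_levels) - 1, -1):
            i = 0
            while i < len(cells):
                if cells[i][1] < level:
                    i += 1
                    continue
                j = i
                while j < len(cells) and cells[j][1] >= level:
                    j += 1
                cells[i:j] = cells[i:j][::-1]
                i = j
    return "".join(ch for ch, _ in cells)


def layout(text: str, width: int) -> list[str]:
    """Levels first, then line breaks, then reordering of each line."""
    levels = toy_levels(text)
    return [reorder_line(text[a:b], levels[a:b])
            for a, b in line_spans(text, width)]


text = "say HELLO BIG WORLD now"
print(layout(text, 13))                      # ['say GIB OLLEH', 'DLROW now']
print(reorder_line(text, toy_levels(text)))  # say DLROW GIB OLLEH now
```

Breaking the logical string first keeps HELLO BIG on the first line and WORLD on the second. Reordering the whole paragraph and cutting the result afterwards would instead put DLROW on the first line, so the reader would meet WORLD before HELLO.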