Observations and Recommendations on the Internationalisation of Software Abstract As computer programs enter the lives of more and more people worldwide, it is becoming increasingly unacceptable to assume that software with a user interface designed for an indigenous English speaking market will be acceptable outside its country of origin simply by changing the currency symbol. Developers of software who are serious about expanding sales into new markets must consider many issues when giving thought either to the creation of new software or the modification of existing software to work within the linguistic and cultural constraints of these new markets. The purpose of this paper is to examine the task of preparing software to be used in countries and cultures other than that in which it is created. We do this by reviewing some of the most important localisation issues that have been identified, and some of the tools and practices that are available to the software designer to deal with them. We shall also consider some of the areas of the software development process that are currently less well understood and supported. Our major emphasis is in non-graphical applications targeted at European markets. Part I Cultural Differences 1 Text Strings For developers intending their software to interact with human users in different cultures, the most obvious obstacle to be overcome is one of language. Ideally, all textual communication between the program and the user should take place in his or her own language. 2 CHARACTER SET 2 1.1 Text output by the Program Text output by a program often constitutes a large proportion of the user interface: command prompts, items in a menu, error messages, help, general information, and so on. Such text that would be presented by a program must be identified, and an equivalent translation prepared that can be used as the output of a localised program. This process is often mistakenly thought of as being both relatively simple and the only significant effort required in the migration process. However, it is our contention that this purely `string-handling' view of the task is quite mistaken, and will often actually require little effort when compared with that required to deal with some of the more subtle cultural and dynamic aspects of a program's behaviour. Rather, it is the task of maintaining the cultural elements in a form that will enable a qualified translator to create a satisfactorily localised program that is one of the major hurdles to be overcome in the area of internationalisation. Nevertheless, in later sections we shall discuss various methods for achieving the identification, separation, and translation of program text as a necessary first step. 1.2 Text supplied by the User Where a user interacts with a program via textual responses, handling of such input must be isolated and appropriate translations found, as with output text. On input, there is a slightly different set of considerations; the layout problems associated with output become issues of handling multi-format input, for instance. 2 Character Set Migration to most languages other than English requires the use of a character set other than the commonly used 7-bit ASCII. For the most common European languages it is sufficient to employ the 8-bit ISO-8859-1 (ISO Latin-1) but for migration to other markets, most notably Japan, it will be necessary to consider the implications of multi-byte character sets, such as Kanji. Typically, a localised application will be used with only a single interface at a given site, where keyboards and output devices will be appropriate for that locale and in the rest of this paper we shall concern ourselves with this scenario. However, multi-lingual working environments do exist, and here it may be necessary to make adequate provision for the generation of input and output not typical of that locale { such as generating accented characters from a UK keyboard [2]. 3 Collating Sequence Many applications rely upon the underlying values within a character set to impose an ordering on alphabetic data. This will often fail to take account of the distinctive ways in which some countries properly order such data. Del Galdo [4] shows the collating sequences for English, 4 CHARACTER CLASSIFICATION AND CONVERSION 3 German, and Swedish, noting that `?a', `?o', and `?u' in German are grouped with their unaccented equivalents, whereas Swedish groups `?u' with `y' and places `?a' and `?o' at the end of the alphabet. Sprung [15] adds that sorting `o' before `?o' and vice-versa are both common in German. Some languages sort particular pairs of letters as a singleton, `ch' and `ll' in Spanish after `c' and `l', respectively, for instance. 4 Character Classification and Conversion Many applications need to classify characters into groups, such as upper case, lower case, digits, white space, and so on. Such classifications need to recognise the place of accented characters within such groupings. Different cultures also have different conventions about whether a character retains it accent when converted to upper case [17, p.124], and it cannot be assumed that simply `flipping a bit' will convert between upper and lower case. 5 Hyphenation Writers of text formatters recognise that hyphenation rules are often highly specific to individual languages [10, p.88]. In addition, sometimes changes in spelling are required when a word is broken [15, p.83], [17, p.124]. 6 Monetary Information Currency symbols vary from country to country, as do their lengths and positions, for instance: $23.45p, 12FF, US$13, etc. Formatting of monetary items must take this into account. 7 Numbers Separators for thousands and decimals vary considerably. Del Galdo [4, p.4] suggests allowing for thousands a comma, period, space, apostrophe, and no separator. In addition a period, comma, and centre dot should be allowable as radix characters. These issues affect both the output of information and what the user is allowed to supply as input. It is essential that the conventions are made clear to the user and that sanity checks, feedback, and opportunity for confirmation provide the necessary safeguards against incorrect data. 8 Dates Numeric day/month ordering is a potential source of error, particularly where the day is within the range 1-12. Differences between the UK day/month/year and the US month/day/year orderings have even provided a fertile ground for clues within detective fiction! A third common 9 TIME 4 form of year/month/day offers an acceptable alternative, as long as four digits are used for the year as after the year 2000 two digit years will render this form indistinguishable from the others. Where ambiguity might arise, an alphabetic month with four digit year is always to be preferred on output, and user supplied numeric dates should be confirmed. Further considerations are found in Taylor [17, p.124], who notes that the Western Gregorian calendar is by no means ubiquitous, and in Beyls [3, p.258] who describes some of the unique demands of the Arabic lunar calendar. 9 Time Although the ordering problem associated with dates does not arise with time formats there is a need to clearly distinguish between twelve and twenty-four hour representation; AM and PM strings may be needed, for instance. Where the time displayed is not local time, or the context demands it, reference to a recognised time zone should be included. Del Galdo [4] suggests that typical three letter abbreviations of this information are not appropriate. 10 Units Not all countries regularly use SI (Syst?eme International) units. In the UK, where the metric system has been part of the education system for a long time, the older Imperial system of miles, gallons, pounds, and ounces, still flourishes in every day life. In addition, there are differences to be found between countries with similarly named units { pints and billions between the US and UK, for instance. Failure to agree on the right units could have life threatening consequences, typified by the notorious failure to distinguish between land miles and nautical miles during WWII [9]. 11 Telephone Numbers A great deal of flexibility is required in handling telephone numbers { particularly in recognising what is acceptable as input. Numbers of digits vary, as do the typical separators and lengths of groups within a number. Provision should be made to accept abbreviated local forms, as well as full national and international formats, where appropriate. 12 Names and Addresses As with telephone numbers, no standard format can be assumed to be available. Some countries order the most local information first, others order it last. Names are also an area where care must be exercised, and can be a source of much personal annoyance [14]. Provision should be made for people with single letter names, hyphenated names, no forenames, and so on. 13 THINGS THAT NIGGLE 5 Nielsen[11, page iv] points out that simply initials and surname are insufficient to identify a person in Denmark since three surnames alone account for nearly 25% of the Danish population. 13 Things that Niggle A program with an interface developed in one country and then made available in another where the same language is spoken might not be an obvious candidate for the skills of a translator, but seemingly minor cultural differences can create a sense of unease that may significantly affect a user's perception of whether or not an interface really has been tailored to their needs. Examples between the UK and US markets might be the different spelling of common words (such as colour/color), the use of different terms for the same thing (bonnet/hood), as well as the date ordering problems that have already been mentioned. Part II Support for Internationalisation 1 Where it happens The multiplicity of examples within the literature demonstrates that there is no single accepted method for creating and maintaining a piece of software in a form that will enable its adaptation to national variations [2, 5, 6, 12, 13, 17]. In addition, there are different points within the source or binary form of a program where diversification might take place. Taylor [17] identifies all three phases of program building { compile-time, link-time, run-time { as being potential points for this. To illustrate the essential differences between these points we shall use the program fragment in Figure 1, which is written in ANSI C [1]. The purpose of the function is to prompt the user with a question and return `1' or `0' for a `Yes' or `No' response. Our goal is to write a French language version that will do the same thing. We shall concentrate on the purely textual aspects of the differences. 1.1 Compile-Time With this approach, no significant attempt is made to isolate the language specific elements of the program from the rest. A French version might look like that in Figure 2, therefore. The position usually taken is that the compile time solution is impractical for all but the simplest of applications. Its use results in having to maintain multiple versions of what is, essentially, the same software. Bug fixes and enhancements in one version must be carried over to all others and it is very easy for even just two versions to eventually diverge to the extent that they become different pieces of software. Nielsen, however, does give an example of a commercial application that used this approach at one stage [12, p.160]. 2 EXISTING SUPPORT FOR INTERNATIONALISATION 8 1.2 Link-Time This approach consists of isolating into distinct modules those cultural elements that require translation. When a program is prepared for a new locale, new versions of only these modules need to be prepared. These are substituted for the original locale-specific modules and linked in with the common modules of the application. Following this method, our example must be split into two: a shared part (Figure 3), and a locale-specific part (Figure 4). The locale-specific part is simply a data structure, question strings holding the possible answer strings and the error message. Access to this information is granted to the shared part via the module interface: question.h that is included in both parts (Figure 5). The French version of the language-dependent module is shown in Figure 6. 1.3 Run-Time This method involves the complete removal of the cultural elements from the source of the program to external files. Whereas the other two methods produce a different binary version for each interface, this method produces a single binary whose interface will depend upon the external environment in which it is run. The external files containing the environmental information are variously known by designations such as resource files [12], message files [6], and catalogs [5, 17]. In their simplest form they contain the texts of the different user interfaces. Following the conventions of [5] the message files from which the catalogs of our example are created might look like those in Figure 7 and Figure 8. Catalogs are prepared from these message files, and at run-time, the required catalog is opened for that locale. The language-dependent strings are then retrieved via calls to the catgets function, and our ask a question function would look something like that in Figure 9. The run-time approach is usually associated with a degradation in execution speed, when compared with the other two approaches, because of the need to access external data. This may be an important factor, especially if an existing base of users would find a significant decrease in speed unacceptable [13, p.105]. However, if it is possible to load a heavily used catalog into memory at runtime then the degradation may not be a factor [2]. 2 Existing Support for Internationalisation As we saw earlier, there is a good deal more to localisation than simply translating text from one language to another. Several vendors have addressed many of these issues and provided support for them. DEC Ultrix NLS, Hewlett-Packard HP-UX NLS, IBM and Microsoft's OS/2, and Microsoft's Windows are all examples that do so to a greater or lesser extent. The areas covered typically include the infra-structure facilities listed below. 2 EXISTING SUPPORT FOR INTERNATIONALISATION 9 These provide a good base upon which to develop the higher-level aspects of internationalisation that we shall be dealing with in Part III. 3 Enabling Tools Many software developers will want to take an existing program and convert it for use in other cultures. In order to facilitate this, some tools have been developed to eliminate much of the repetitive nature of part of this task. The Ultrix and HP-UX NLS [5, 8] have a great deal in common, both leaning towards the X/Open Portability Guide [18]. Both provide broadly similar facilities to extract text from existing software into text message files, from which the run-time catalogs are then created. Ultrix NLS provides both an interactive program, extract, and a batch program strextract. These tools make use of directives to determine which strings to match and which to ignore during this process. When a string is matched and placed in the message file, its occurrence in the source can be automatically replaced by the appropriate call to retrieve it from a catalog at run-time. For example the line in Figure 1 above if(samestring(answer,"Yes"))f would be converted automatically to something of the form if(samestring(answer,catgets(catd,set number,item in set,"Yes")))f by the extraction process. The values in the places set number and item in set are simply numbers allocated by the tool. Their purpose is to allow catalogs to be organised into different sets of messages. The final argument to catgets is a default string in case the requested value cannot be retrieved, for some reason. It serves the additional purpose of documenting the code. Peterson's method [13] retains the original message text for this purpose, too, but instead of the set/item principle he uses the original text as an argument to a hashing function in order to retrieve the translated version from an external file. It is important to note that the service offered by such tools as these is simply a cut-and-paste exercise. The tools only save the programmer from a large amount of hand editing. No attempt const char *menu strings[ ] = f Figure 11: Incorrectly enabled command data structure is made to logically group the items into different sets, or provide symbolic names for sets and items, for instance. In addition, the tools contain no understanding of where these replacements are inappropriate. Faced with the data structure in Figure 10 that might represent a menu of commands to be presented to the user, the Ultrix NLS extract tool can only convert this into an equivalent of Figure 11. This is no use because data structures in C may not be initialised in this way. Rather, what is needed is an additional function that performs the run-time initialisation (Figure 12). The reason that existing tools do not handle this situation properly is that, typically they are little more than pattern-match and replacement utilities; they do not take account of the program context in which the text occurs. Part III Exploring the Difficulties Having examined some of the areas for which there is existing support it is now necessary to look at some of the areas that do not yet have ready made solutions. 1 The Goal The ultimate goal of the software developer who is working towards a localised product must be to have the new users feel as comfortable with it as with a home-grown product. Nielsen says, \an interface which is used in another country than the one where it was designed is a new interface : : : translating the text strings and icons in the software is not enough { one has to translate the total user interface including manuals" [11, p.39]. Implementation of such a goal, with a commitment to the design of the total user interface, will require the involvement of a translator who ffl is a native speaker of the intended language, ffl who has a good knowledge of the target user population, and ffl who has a good knowledge of the operation of the program. Such a combination is likely to be hard to find but it is probable that the first two requirements will be more readily met than the third, particularly if localisation actually takes place outside the development company and/or the country of origin. If this is so then the development and enabling process must provide as much information as possible to support its localisation 2 DYNAMIC CONTEXT 15 outwith its basic linguistic and cultural aspects. Unfortunately, this sort of support is completely lacking for the general developer of localised software. 2 Dynamic Context Much of our previous discussion has focussed on the removal of text in order that it might be translated. It would be simplistic to assume that dealing with such text is analogous to the translation of a technical paper. A good accurate translation of anything depends significantly upon contextual information, and as Nielsen again says, \dialogue elements cannot be seen in isolation since they form part of a dynamic interactive context" [11, p.42]. However, the text extraction process that we have looked at does precisely this { removing text to a static external file devoid of all dynamic context. As an example, Nielsen's analysis of an initial Danish version of MacPaint shows how, when seen in context, the translation of eject as aflever (`hand-over') caused a user to think the button's purpose was for saving a file, rather than ejecting a disc. A similar problem can exist when translating text representing toggles; without giving the translator the context, it might not be apparent that two isolated texts should be translated as opposites so that the linkage is clear to the user [11, p.41]. Sukaviriya and Moran [16, p.201] point out that direct translation of some words simply does not make sense when placed in context { box and bar in the Thai version of a word processor, for instance { where the direct translations refer to unrelated physical objects. A translator, therefore, must have available information that covers the general application area, as well as the static, dynamic, and relational contexts of the text to be translated. This will often have to be directly provided by the enabler. 3 Dynamic Text It is likely that not all text output by a program is fixed at the point of binding. Error messages typically include portions of the input deemed to have been responsible for causing the associated errors, for instance. Such dynamic text creates two challenges ffl providing the translator with enough information to make the static components of a parameterised text translatable, ffl providing a mechanism whereby the runtime arguments may be appropriately substituted. The C programming language [1] uses the `%' character to introduce parameter specifications into text strings. The formatted print statement tries to print a grammatically correct message requiring three arguments to be substituted for the parameter specifications: a string (`%s'), an unsigned integer (`%u'), and a single character (`%c'). What the translator is likely to see is simply The %s command requires %u parameter%c Is this sufficient to yield something with the equivalent meaning in the target language? Certainly the translator will need to know what the various parameters mean, and likely values of the runtime arguments. Furthermore, translations of some parameterised texts will require the order of argument substitutions to be changed in order to make good grammatical sense. ANSI C, by itself, does not provide this capability. Hence the HP-UX NLS environment extends the range of parameter specifications to include those of the form `%d$', where d is a digit in the range 1-9 that refers to the required argument's position in the argument list. These allow arguments to be reordered when formatted and so provide greater scope for producing a good translation. A similar feature is found in OS/2's DosGetMessage API [6, p.4], and INFOFLEX [12, p.173]. 4 Format One piece of advice often quoted is the need to allow for translations producing longer text than the original. The recommendation usually involves simply making sure that array sizes defined within the program are large enough to cover this possibility. How much extra space to allow depends on who you read allow up to 50% more when translating English [4] at least 60% more [8] for strings of 0{10 characters allow 100%-200%, for strings of over 70 characters allow 30% (with percentages for intermediate lengths) [7] The INFOFLEX project [12] went further. Believing that fixed layouts lead to \absurd and meaningless abbreviations" they incorporated explicit formatting information along with the text held in their resource files. This enabled them to create WYSIWYG translations of screens and reports using a custom built resource editor to design them. Doubtless their task was made easier by the fact that the application was directed to 24x80 character terminals; nevertheless they successfully tackled an important issue not addressed by most other projects. 5 Menus, Macros, and Accelerators A common feature of menus is that an option may often be selected via an accelerator { typically a keystroke combination involving the first character of its name. This can lead to awkward 6 PROGRAM EVOLUTION 17 naming of items, even in the original version, to keep these characters distinct. A translator needs to be made aware of those places where this feature must be present in the new version. Furthermore, where these accelerator characters differ between the versions there is a danger of thwarting users who switch between them [11, p.40]. A related problem with macros in early versions of Lotus 1-2-3 is noted by Sprung [15, p.81] where US macros could not be used by French users because of different key bindings. Support for developers of software to be shared across cultures needs to be provided, therefore. 6 Program Evolution Successful programs rarely remain static but undergo continuous improvement and development. Established user bases are likely to want to reap the benefits of improvements as soon as a new version is released in the country of origin. This means that the development process must be supported in such a way that the benefits of an original enabling are not lost, and the release of a new version to a wider market only requires the merging of translations of the new features with those that exist already. Tools that support these requirements are few. The ARRIS project built their own [13, p.107], but environments whose enabling tools effectively yield multiple versions of the same program produce potential conflicts: which version to develop { the original or the enabled [8, p.6-55]? The latter, surely, but typically the tools available are most suited for the initial enabling process, and not the long term development and maintenance of a program. Catalogs become out of date as code is changed, yet it may be unsafe to remove redundant messages lest existing translations become unsynchronised. There are no formal mechanisms for tracing the linkage between different translations. Part IV The Way Ahead We have seen that there is far more to the internationalisation of a program than simply finding all embedded text strings and replacing them with alternatives in a new language. Many of the issues to be considered have to do with the infrastructure of the locale { character set, date and numeric formats, collating sequence, etc. The importance of dealing with these at the base language and operating system level has now been recognised by several operating systems manufacturers, and reasonably well catered for. However, there are at least two major higher-level areas still lacking ffl The provision of a framework in which sufficient contextual information is made available to a linguistically competent translator to allow him or her to create an accurate and complete localisation. (It must be borne in mind that such a person is unlikely to be fully versed in the technical implementation of the program.) ffl Support of evolution and maintenance of a program that has already been enabled. REFERENCES 18 The implementation of suitable mechanisms in support of these would permit the removal of the enabling and localising processes out of the ad hoc home-grown arena of current practice into one that is rigorous, professional, and generally applicable. It is our belief that both of these areas can be dealt with by the development of appropriate software tools that build upon the existing infrastructure support to create a programming environment for the internationalisation of software. References [1] American National Standards Institute, `American National Standard for Information Systems { Programming Language { C', ANSI X3.159-1989. [2] David Barnes and Tim Hopkins, `Experience with Adapting Software for use in a Multi-Lingual Workplace', submitted to Information and Software Technology, 1992. [3] Pascal Beyls, `The Arabisation of UNIX', EUUG Conference Proceedings, Autumn 1988, pp 253-264. [4] Elisa del Galdo, `Internationalization and Translation' in Designing User Interfaces for International Use Ed. Jakob Nielsen, Elsevier 1990, pp 1-10. [5] Digital Equipment Corporation, `Guide to Developing International Software', Ultrix Software Development Manual, Volume 1, June 1990. [6] Asmus Freytag and Michael Leu, `Using the OS/2 National Language Support Services to Write International Programs', Microsoft Systems Journal, 5:2, Mar 1990, pp 1-26. [7] William S. Hall, `Adapt Your Program for Worldwide Use with Windows Internationalization Support', Microsoft Systems Journal, 6:6, Nov-Dec 1991, pp 29-58. [8] Hewlett-Packard Company, `Native Language Support: User's Guide. HP 9000 Computers', Jan. 1991. [9] J.E. Johnson, `The Story of Air Fighting', Arrow, London, 1990, page 164-165. [10] Leslie Lamport, `LaTEX User's Guide and Reference Manual', Addison-Wesley, 1986. [11] Jakob Nielsen, `Usability Testing of International Interfaces' in Designing User Interfaces for International Use, Ed. Jakob Nielsen, Elsevier, 1990, pp 39-44. [12] Jakob Peter Nielsen, `International User Interface for Infoflex' in Designing User Interfaces for International Use, Ed. Jakob Nielsen, Elsevier 1990, pp 159-187. [13] M.C. Peterson, `ARRIS: Redesigning a User Interface for International Use' in Designing User Interfaces for International Use, Ed. Jakob Nielsen, Elsevier, 1990, pp 103-109. [14] Gene Spafford, `O, Oh, what a difficult name', in USENET comp.risks, Volume 12, Issue 19, 28 August 1991. [15] Robert C. Sprung, `Two Faces of America: Polyglot and Tongue-Tied' in Designing User Interfaces for International Use, Ed. Jakob Nielsen, Elsevier, 1990, pp 71-102. REFERENCES 19 [16] Piyawadee Sukaviriya and Lucy Moran, `User Interface for Asia' in Designing User Interfaces for International Use, Ed. Jakob Nielsen, Elsevier, 1990, pp 189{218. [17] Dave Taylor, `Creating International Applications, A Hands-On Approach using the Hewlett-Packard NLS Package' in Designing User Interfaces for International Use, Ed. Jakob Nielsen, Elsevier, 1990, pp 123-158. [18] X/Open Consortium, `The X/Open Portability Guide', release 3, April 1989.