Chapter 9: Information Processing Techniques
Perl
Usually described as a scripting language, Perl, developed by Larry Wall, is much, much more than that. Perl's main strengths include rapid development, regular expressions (described later in this chapter), and hashes (associative arrays). It is not so much these individual features that provide Perl with extraordinary text-manipulation capabilities, but rather how these features are intertwined with one another. Other programming languages offer similar features, but there is often no convenient way for them to function together. In Perl, for example, a regular expression can be used to parse text, and at the same time used to store the resulting items into a hash for subsequent lookup.
Perl is the programming language of choice for those who write CGI programs or do other web-related programming (a topic that is discussed at the end of Chapter 13, The World Wide Web), because it is well suited for the task.
Although the current incarnation of Perl has no built-in support for internationalization (to the level that Java currently has), it is something that is being discussed by its developers. There are, however, clever ways to use Perl for handling multiple-byte data, most of which make use of regular expression tricks and techniques. The Perl code examples provided in Appendix W should be studied by any serious Perl programmer. Gisle Aas and Martin Schwartz have been diligently working on some extremely useful Unicode modules for Perl (such as Unicode::String, Unicode::Map8, and Unicode::Map), so you can expect some useful and interesting things to happen in the future. The Unicode::Map module by Martin Schwartz, in particular, already supports code conversion between Unicode and a number of legacy CJKV encodings.
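One such regular expression technique is to describe every legal character of an encoding as an alternation of byte patterns, so that matching always advances one whole character at a time. The following is a minimal sketch of that idea, written in Python rather than Perl and assuming EUC-JP input held as raw bytes; the names EUC_JP_CHAR and split_euc_jp are hypothetical and chosen only for illustration.

```python
import re

# One alternative per EUC-JP code set (assumed byte ranges for this sketch).
EUC_JP_CHAR = re.compile(
    rb"[\x00-\x7f]"          # code set 0: ASCII or JIS-Roman, one byte
    rb"|\x8e[\xa1-\xdf]"     # code set 2: half-width katakana, SS2 + one byte
    rb"|\x8f[\xa1-\xfe]{2}"  # code set 3: JIS X 0212, SS3 + two bytes
    rb"|[\xa1-\xfe]{2}"      # code set 1: JIS X 0208, two bytes
)

def split_euc_jp(data):
    """Split EUC-JP bytes into a list of whole characters.

    Bytes that do not form a legal character are silently skipped.
    """
    return EUC_JP_CHAR.findall(data)

# "A", HIRAGANA LETTER A, and a half-width katakana character.
chars = split_euc_jp(b"A\xa4\xa2\x8e\xb1")
print(chars)
```

Working on the resulting list, rather than on the raw byte string, avoids false matches that straddle the second byte of one character and the first byte of the next.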
Kazumasa Utashiro has developed a useful Japanese-enabling Perl library called jcode.pl, which includes Japanese code conversion routines. Some may find the Japanese version of Perl, called JPerl, to be useful, although I suggest using programming techniques that avoid JPerl for optimal portability. JPerl adds Japanese support to the following features: regular expressions, formats, some built-in functions (chop and split), and the tr/// operator.
The definitive guide to Perl is Programming Perl, Second Edition, by Larry Wall et al. (O'Reilly & Associates, 1996). Tom Christiansen and Nathan Torkington's Perl Cookbook (O'Reilly & Associates, 1998) is also highly recommended as a companion volume to Programming Perl. The comp.lang.perl.misc newsgroup should also be of interest. The best place to find Perl is at CPAN (Comprehensive Perl Archive Network).
Python
Like Perl, Python is also sometimes described as a scripting language. Python was developed by Guido van Rossum, and is a high-level programming language that provides valuable programming features such as hashes and regular expressions.
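As a small illustration of how regular expressions and hashes (called dictionaries in Python) work together, the sketch below parses "key = value" lines with a single pattern and stores the results for later lookup. The sample text and the names PAIR and parse_pairs are hypothetical.

```python
import re

# Hypothetical sample data: one "key = value" pair per line.
TEXT = """\
name = Perl
author = Larry Wall
kind = scripting language
"""

# One pattern captures the key and the value on each line,
# ignoring spaces and tabs around them.
PAIR = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

def parse_pairs(text):
    """Build a dictionary directly from the (key, value) matches."""
    return dict(PAIR.findall(text))

settings = parse_pairs(TEXT)
print(settings["author"])  # prints "Larry Wall"
```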
An excellent guide to Python is Mark Lutz's Programming Python (O'Reilly & Associates, 1996). The comp.lang.python newsgroup should also be of interest if you want to learn about recent Python developments and join discussions. There is also a Python web site from which Python itself is available.
Tcl
Tcl, which stands for Tool Command Language, is a programming language that was originally developed by John Ousterhout while a professor at UC Berkeley. Like Perl and Python, Tcl is considered a high-level scripting language that provides built-in facilities for hashes and regular expressions. John later founded Scriptics Corporation where Tcl is now being advanced.
Some important milestones in Tcl's history include its byte-code compiler, introduced for Version 8.0, and support for Unicode (in the form of UTF-8 encoding) that began with Version 8.1. Tcl will also have a regex package comparable to Perl's by the time you read this. Before Version 8.0, the lack of a byte-code compiler had always kept Tcl slower than Perl.
Tcl is rarely used alone, but rather with its GUI (Graphical User Interface) component called Tk (standing for Tool Kit).
Other Programming Environments
While it is possible to write multiple-byte-enabled programs using all of the programming languages mentioned above, there are some programming environments that have done all this work for you, meaning that you need not worry about multiple-byte enabling your own source code because you depend on a module to do it for you. This may not sound terribly exciting for companies with sufficient resources and multiple-byte expertise, but may be a savior for smaller companies with limited resources.
One example of such a programming environment is Visix's Galaxy Global, a multilingual product based on their Galaxy product. (Visix Software has since gone out of business.)
Perhaps of greater interest is Basis Technology's "Rosette: C++ Library for Unicode," which is a compact, general-purpose Unicode-based source code library. Embedded into an application, this library adds Unicode text processing capabilities that are robust and efficient across a variety of platforms (MacOS, Unix, Windows, and so on). Its functions adhere to the latest Unicode specifications. Major functions include code conversion between major legacy encodings and Unicode encodings, character classification (identification of a character), and character property conversion (such as half- to full-width katakana conversion). Basis Technology also offers a general-purpose code conversion utility, called "Uniconv," built using this library. Also of interest are UniScape's Global C and Global Checker packages, Sybase's Unilib, and Alis Technologies' Batam (their own Tango web browser is an example of this library's usage in a real product).
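To give a feel for what a character property conversion such as half- to full-width katakana involves, here is a minimal sketch that uses Python's standard unicodedata module. It is not the Rosette API. It assumes that Unicode compatibility normalization (NFKC) is an acceptable mapping, and the names HALF_WIDTH_RUN and half_to_full_katakana are hypothetical.

```python
import re
import unicodedata

# Runs of half-width katakana and related punctuation (U+FF61 to U+FF9F).
HALF_WIDTH_RUN = re.compile("[\uff61-\uff9f]+")

def half_to_full_katakana(text):
    """Convert half-width katakana to full-width, leaving other text alone.

    Normalizing a whole run at once lets a base character followed by a
    half-width voiced sound mark combine into one full-width character.
    """
    return HALF_WIDTH_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text
    )

# Half-width KA, TA, KA, NA become their full-width counterparts.
print(half_to_full_katakana("\uff76\uff80\uff76\uff85"))
```

Restricting normalization to the half-width katakana range is a design choice in this sketch: applying NFKC to the whole string would also alter unrelated characters, such as full-width Latin letters.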
Code Conversion Algorithms
It is very important to understand that only the encoding methods for the national character sets are mutually compatible, and work quite well for round-trip conversions. The vendor-defined character sets often include characters that do not map to anything meaningful in the national character set standards. When dealing with the Japanese ISO-2022-JP, Shift-JIS, and EUC-JP encodings, for example, algorithms are used to perform code conversion. This involves mathematical operations that are applied equally to every character represented under an encoding method. This is known as algorithmic conversion....
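As a hedged illustration of algorithmic conversion, the sketch below converts a single two-byte JIS X 0208 character, as it appears in ISO-2022-JP with both bytes in the range 0x21 to 0x7E, into EUC-JP and Shift-JIS using arithmetic alone. The function names are hypothetical, and the sketch deliberately ignores escape sequences, half-width katakana, and vendor-defined characters.

```python
def jis_to_euc(j1, j2):
    """JIS X 0208 to EUC-JP: set the high bit of both bytes."""
    return j1 | 0x80, j2 | 0x80

def euc_to_jis(e1, e2):
    """EUC-JP (code set 1) back to JIS X 0208: clear the high bit."""
    return e1 & 0x7F, e2 & 0x7F

def jis_to_sjis(j1, j2):
    """JIS X 0208 to Shift-JIS: fold two JIS rows into one first byte."""
    row_offset = 112 if j1 < 95 else 176
    if j1 % 2:
        # Odd rows use the lower half of the second-byte range,
        # skipping over the value 0x7F.
        cell_offset = 32 if j2 > 95 else 31
    else:
        # Even rows use the upper half of the second-byte range.
        cell_offset = 126
    return ((j1 + 1) >> 1) + row_offset, j2 + cell_offset

# HIRAGANA LETTER A is 0x2422 in JIS X 0208.
print([hex(b) for b in jis_to_euc(0x24, 0x22)])   # ['0xa4', '0xa2']
print([hex(b) for b in jis_to_sjis(0x24, 0x22)])  # ['0x82', '0xa0']
```

Because the same arithmetic applies to every character, no lookup table is needed, and converting to EUC-JP and back returns the original bytes.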