July 6, 2005

Chinese conversion the wiki way

Creating a unified Chinese-language Wikipedia has been problematic for numerous reasons. The primary one is the difference between Simplified and Traditional Chinese. Simplified Chinese is the official written language in China and Singapore, and is the official writing system taught in Malaysia. Traditional Chinese is also used in Malaysia, as well as Taiwan, Hong Kong, and Macau. There are additional regional differences, as well as the problem of foreign words, which are often transliterated using characters that phonetically mimic the foreign pronunciation. Many different character combinations can achieve the same phonetic rendering, so a single foreign word may have several written forms.

Current natural language processing techniques are not entirely accurate and are too expensive to apply to large-scale projects such as Wikipedia. Additionally, they do not handle regional idioms. The authors have developed a semi-automatic approach for wiki environments that facilitates conversion between Simplified and Traditional Chinese. The automatic portion relies on mapping tables. The wiki portion lets end users alter the mapping tables and manually correct conversions with project-specific markup. The result is text tailored to the user’s language preferences.
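To make the division of labor concrete, here is a minimal sketch of table-driven conversion with a manual override. It is an illustration under stated assumptions, not the authors' implementation: the tiny table `S2T_TABLE`, the function names, and the `-{...}-` delimiter used to protect text from conversion are all chosen here for the example.

```python
import re

# Hypothetical Simplified -> Traditional mapping table. A real table would be
# far larger and, in a wiki setting, editable by users.
S2T_TABLE = {
    "计算机": "電腦",  # phrase-level entry for a regional term ("computer")
    "汉": "漢",
    "语": "語",
    "简": "簡",
    "体": "體",
}

# Assumed markup for this sketch: text inside -{ ... }- is left unconverted.
PROTECTED = re.compile(r"-\{(.*?)\}-")


def convert_segment(text, table):
    """Convert one run of text using greedy longest-match table lookup."""
    max_len = max(map(len, table), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so phrase entries beat single characters.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def convert(text, table=S2T_TABLE):
    """Convert text, passing manually protected spans through untouched."""
    out = []
    pos = 0
    for match in PROTECTED.finditer(text):
        out.append(convert_segment(text[pos:match.start()], table))
        out.append(match.group(1))  # editor's manual override wins
        pos = match.end()
    out.append(convert_segment(text[pos:], table))
    return "".join(out)
```

Matching the longest entry first is what lets a phrase-level entry for a regional term take priority over character-by-character substitution, while the protected span gives editors a way to correct cases the table gets wrong.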

The development team is currently working to modularize this system so it can be applied to other national languages with similar issues, for instance Serbian, which can be written in both Latin and Cyrillic.

