Java Kanji Flashcard 500: Kanji, Java, and the World Wide Web

Stephen Wood Ryner, Jr.
Academic Technology Development, DePaul University

Nobuko Chikamatsu
Department of Modern Languages, DePaul University

Hironari Nozaki
Nagoya City University, Japan

Shoichi Yokoyama
The National Language Research Institute, Japan

Sachio Fukada
Yokohama National University, Japan

The Java Kanji Flashcard 500 project uses Java to teach non-native students of Japanese the 500 most commonly occurring kanji in Japanese publications using a flashcard format over the web. Students can generate vocabulary lists stored either locally or on a web-based Kanji Server, which keeps track of their progress and allows them to customize the appearance of the flashcards. While Java’s Unicode support and distributed nature make it a natural choice for this instructional application, few client browsers can display Japanese kanji correctly. The Java Kanji Renderer works by converting SJIS and Unicode encoded strings into a sequence of individual character images, including optional animated images showing the stroke order for kanji and kana. In addition to displaying characters for the flashcard application, this approach allows any Java-enabled browser to view readable Japanese text from web sites or other sources.

Introduction

Kanji present one of the greatest challenges to students studying Japanese as a foreign language. Many students of Japanese grow discouraged when they are told that 2,000 to 3,000 characters are required to read Japanese newspapers. However, according to a recent study (Chikamatsu, Yokoyama, Nozaki, and Long, in preparation), the 500 most frequently occurring characters cover approximately 80% of total kanji usage in newspapers. Furthermore, the 1,000 most frequent characters cover 95% of total usage, and the remaining 2,000 characters make up only 5% of the total. Thus, if students know these 500 most frequent characters, they should comprehend the gist of most Japanese newspaper articles.

By studying the most common kanji, adult students of Japanese can quickly learn the kanji, compounds, and vocabulary they are most likely to encounter when reading Japanese publications. A computerized flashcard system can help organize, present, and review the hundreds of vocabulary items efficiently. An educational application that takes advantage of the cross-platform nature of Java and web-based instruction can help more students worldwide than a product aimed at a single platform, such as Macintosh or Windows. With this in mind, DePaul University is designing and developing a Java-based kanji flashcard system, using the 500 most frequently occurring characters, which will be accessible anywhere via the World Wide Web.

The most serious obstacle to computer-based Japanese language instruction has traditionally been the lack of support for Japanese characters on computers outside of Japan. While there are tools available, such as the Japanese Language Kit for Macintosh computers or the Kanji Kit for Windows, it is often costly to enable even a single computer for Japanese computing, and schools or individual students can seldom afford it. As a result, while a number of reasonably priced programs have been developed, they are not distributed or used as widely as they could be. Similarly, on the Internet, there are currently several sites available for Japanese language practice (e.g., kana or kanji character exercises). Unfortunately, most of those sites require a Japanese-enabled browser or the installation of system resources such as kanji fonts or Java class libraries.

Design Objectives

From the start, the Java Kanji Flashcard project was designed to be distributed and used via a web browser. Previous Japanese computing projects at DePaul University inconvenienced students who were required to use dedicated machines at specific campus locations and times. The new application therefore had to work on any machine, whether in University-administered campus clusters or on personal computers in dorms and homes.

Our users remain reluctant to install new operating system features such as language support, fonts, or class libraries to run a single application. In the case of the public laboratory machines, such modification of the operating system is often expressly forbidden. The application therefore needed to be independent of any native operating system support for Japanese fonts.

The Java implementation had to be kept simple and lightweight. Students seldom have the most powerful machines or the most recent Java virtual machines. In addition, many users use dial-up connections with limited bandwidth, so after an initial download, communication between the client and server had to be minimized.

Finally, the Flashcard project will include a server-side application. Students logging into a Kanji Server from any location should be able to resume their studies where they left off, allowing them to study when and where they choose without worrying about the computer they sit down to work at, however briefly.

Java Kanji Flashcard 500

The main interface to the Flashcard application is shown in Figure 1. In this example an applet containing an internal bitmapped font has been downloaded and runs within a web browser. The application allows students to select ranges of kanji to drill from, and to select what information is shown on the "front" and "back" of each card. The information available in the current 500 kanji flashcard database includes on and kun readings, English gloss, stroke order animation, stroke, radical, and frequency information. Each entry also includes five compound words showing the target kanji in combination with another kanji from the top 500 list. Some of this information is visible in Figure 1.
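The choice of what appears on each side of a card can be modelled as a set of fields selected per face. The following is a minimal sketch of that idea, not the project's actual code: the names `KanjiCard`, `Field`, and `face` are hypothetical, the five compound words are omitted for brevity, and the sample entry is for illustration only.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of one flashcard entry; not the project's actual classes. */
final class KanjiCard {
    /** Kinds of information listed for each entry in the database. */
    enum Field { KANJI, ON_READING, KUN_READING, GLOSS, STROKES, RADICAL, FREQUENCY_RANK }

    private final Map<Field, String> values = new EnumMap<>(Field.class);

    KanjiCard set(Field field, String value) {
        values.put(field, value);
        return this;
    }

    /** Returns the text of one face: one "FIELD: value" line per selected field. */
    String face(Set<Field> selected) {
        StringBuilder sb = new StringBuilder();
        for (Field f : Field.values()) {               // fixed declaration order
            if (selected.contains(f) && values.containsKey(f)) {
                if (sb.length() > 0) sb.append('\n');
                sb.append(f).append(": ").append(values.get(f));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        KanjiCard card = new KanjiCard()
                .set(Field.KANJI, "\u65e5")
                .set(Field.ON_READING, "nichi, jitsu")
                .set(Field.KUN_READING, "hi, ka")
                .set(Field.GLOSS, "day, sun");
        // Front shows only the character; back shows a reading and the gloss.
        System.out.println(card.face(EnumSet.of(Field.KANJI)));
        System.out.println(card.face(EnumSet.of(Field.ON_READING, Field.GLOSS)));
    }
}
```

Keeping the front and back as sets of fields means a student's preference can be stored as two small sets rather than as separate card layouts.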

The original implementation assumed students would only study the 500 kanji in order, selecting the kanji range to work from each time. Current work with server-side session management will enable students to create and manage accounts with customized drill lists and allow them to stop and resume their study from different client computers. In addition to studying the kanji cards in "drill mode", students can search for specific kanji and compounds and print cards for study away from the computer.
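To make the stop-and-resume idea concrete, the sketch below shows one way a server could hold a student's customized drill list and current position between logins. It is an assumption-laden illustration only: the class name `DrillSession` and the `position|rank,rank,...` text format are invented here and do not describe the Kanji Server's actual protocol.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-student drill state that a server could store between logins. */
final class DrillSession {
    private final List<Integer> ranks;  // frequency ranks (1..500) in the drill list
    private int position;               // index of the next card to show

    DrillSession(List<Integer> ranks, int position) {
        if (position < 0 || position > ranks.size()) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        this.ranks = new ArrayList<>(ranks);
        this.position = position;
    }

    boolean hasNext() {
        return position < ranks.size();
    }

    /** Returns the rank of the next kanji to drill and advances the cursor. */
    int next() {
        if (!hasNext()) throw new IllegalStateException("drill list exhausted");
        return ranks.get(position++);
    }

    /** Encodes the state as "position|rank,rank,..." (an invented format). */
    String save() {
        StringBuilder sb = new StringBuilder().append(position).append('|');
        for (int i = 0; i < ranks.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(ranks.get(i));
        }
        return sb.toString();
    }

    /** Rebuilds a session from the string produced by save(). */
    static DrillSession restore(String saved) {
        int bar = saved.indexOf('|');
        if (bar < 0) throw new IllegalArgumentException("missing '|' in: " + saved);
        int position = Integer.parseInt(saved.substring(0, bar));
        List<Integer> ranks = new ArrayList<>();
        String rest = saved.substring(bar + 1);
        if (!rest.isEmpty()) {
            for (String s : rest.split(",")) ranks.add(Integer.parseInt(s));
        }
        return new DrillSession(ranks, position);
    }
}
```

A state this small can be exchanged in a single short request, which suits the dial-up connections mentioned under Design Objectives.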


Figure 1. Java Kanji flashcard applet demonstrating the SJIS Render method. This applet uses an internal bitmapped font and can display SJIS and Unicode encoded Japanese text on any Java platform. The larger image at the upper left and animated handwriting on the upper right are downloaded from the server as needed.

Selecting Kanji

The kanji used as the basis for the Java Kanji program were selected from research that found 500 kanji accounted for 80% of the characters occurring in a daily newspaper. As part of this study, the researchers built a computer database of kanji frequency and context using Japanese text from one year of articles in the morning and evening editions of the daily Asahi Shinbun. This database is the largest such Japanese corpus to date, and readily identifies the 500 most frequently occurring kanji.
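The coverage figures quoted for the top 500 kanji are cumulative shares of all kanji occurrences in the corpus. As a hedged illustration of that arithmetic (not the researchers' software), the sketch below computes the fraction of total occurrences accounted for by the N most frequent characters; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Sketch: share of all kanji occurrences covered by the N most frequent kanji. */
final class KanjiCoverage {
    /**
     * @param counts occurrence count for each distinct kanji, in any order
     * @param topN   how many of the most frequent kanji to include
     * @return fraction in [0, 1] of all occurrences covered by the top N
     */
    static double coverage(long[] counts, int topN) {
        long[] sorted = counts.clone();
        Arrays.sort(sorted);                              // ascending order
        long total = 0;
        for (long c : sorted) total += c;
        if (total == 0) return 0.0;
        long covered = 0;
        int n = Math.min(topN, sorted.length);
        for (int i = 0; i < n; i++) {
            covered += sorted[sorted.length - 1 - i];     // take from the high end
        }
        return (double) covered / total;
    }
}
```

Given per-kanji counts from a frequency database, `coverage(counts, 500)` is the kind of quantity behind the 80% figure reported above.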

The kanji character frequency database from the same corpus was also used to determine the most common two character compound words for each of the 500 kanji. Using a custom application, the researchers analyzed a month of the newspaper corpus to find candidate kanji compounds where both kanji were from the top 500. These candidate compound words were then sorted by frequency and hand-culled to remove nonsensical compounds, for example non-word strings or part-of-word strings. Each kanji can now be studied in the context of 5 frequently occurring compound words.
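The candidate-generation step can be approximated by counting adjacent character pairs in which both characters belong to the top 500 set, then ordering the pairs by frequency. The sketch below is an assumption about how such a tool might work, not the researchers' custom application; the names `CompoundCandidates`, `countPairs`, and `byFrequency` are hypothetical, and the hand-culling step described above remains manual.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: candidate two-kanji compounds drawn from a set of frequent kanji. */
final class CompoundCandidates {
    /** Counts every adjacent pair of characters in which both belong to topKanji. */
    static Map<String, Integer> countPairs(String text, Set<Character> topKanji) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            char a = text.charAt(i);
            char b = text.charAt(i + 1);
            if (topKanji.contains(a) && topKanji.contains(b)) {
                counts.merge(text.substring(i, i + 2), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the pairs from most to least frequent; ties fall back to string order. */
    static List<String> byFrequency(Map<String, Integer> counts) {
        List<String> pairs = new ArrayList<>(counts.keySet());
        pairs.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return pairs;
    }
}
```

Because this counts raw adjacent pairs, it also produces strings that straddle word boundaries or form only part of a longer word, which is why a manual review pass is still needed.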

Implementation

The Flashcard 500 application uses a custom method called SJISRender to display SJIS encoded kanji using an internal bitmapped font. The entire font is sent to the client with the applet as a single GIF from which the Render method can extract individual kanji images following an algorithmic mapping from the SJIS encoding. The five compound words and readings in Figure 1 are displayed with this Render method, which is a platform independent replacement for the Java drawString method. The larger images in the upper left for each of the top 500 kanji are requested as individual GIFs from the server as needed, along with vector representations that can be used to display animated images in the upper right region of the card.
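One plausible form of such an algorithmic mapping is sketched below. It converts a two-byte Shift JIS code to its JIS X 0208 row and cell using the standard arithmetic, then treats the font image as a grid of 94 columns by 94 rows. The grid layout, the class name `SjisGlyphLocator`, and the method names are assumptions made for illustration; the actual layout of the project's GIF is not described here.

```java
/** Sketch of an SJIS-to-glyph lookup; the sheet layout is assumed, not the project's. */
final class SjisGlyphLocator {
    static final int CELLS_PER_ROW = 94;   // JIS X 0208 has 94 rows of 94 cells

    /** Converts a two-byte Shift JIS code to JIS X 0208 {row, cell}, both 1-based. */
    static int[] toRowCell(int lead, int trail) {
        boolean leadOk = (lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xEF);
        boolean trailOk = (trail >= 0x40 && trail <= 0x7E) || (trail >= 0x80 && trail <= 0xFC);
        if (!leadOk || !trailOk) {
            throw new IllegalArgumentException("not a two-byte Shift JIS code");
        }
        int leadIndex = (lead <= 0x9F) ? lead - 0x81 : lead - 0xC1;      // 0..46
        int trailIndex = (trail <= 0x7E) ? trail - 0x40 : trail - 0x41;  // 0..187
        // Each lead byte spans two JIS rows; the trail byte selects which one.
        int row = leadIndex * 2 + (trailIndex >= CELLS_PER_ROW ? 1 : 0) + 1;
        int cell = trailIndex % CELLS_PER_ROW + 1;
        return new int[] { row, cell };
    }

    /** Returns {x, y} of the glyph's top-left pixel in a 94-by-94 grid sheet. */
    static int[] glyphOrigin(int lead, int trail, int glyphWidth, int glyphHeight) {
        int[] rc = toRowCell(lead, trail);
        return new int[] { (rc[1] - 1) * glyphWidth, (rc[0] - 1) * glyphHeight };
    }
}
```

With the origin and the glyph size known, a renderer can copy that rectangle out of the font image and draw the glyphs side by side in place of a `drawString` call.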

The Render method was originally developed for use in the flashcard program but can be used more generally to convert SJIS or Unicode text streams into readable images on client browsers. In the context of reading Japanese web pages, for example, a "Kanji Proxy" application can retrieve Japanese encoded HTML pages and replace each SJIS or Unicode entity with a GIF. This has the advantage of leaving the formatting of the page and rendering of tables and other HTML elements to the browser. This proxy server can run either on a server or on the client machine. Some interesting applications that seem to use variations of this approach are discussed later in this paper.
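The rewriting step of such a proxy can be sketched as a single pass over the page that leaves markup and ASCII text alone and substitutes an image reference for every other character. This is an illustration under stated assumptions, not the project's implementation: the input is taken to be already decoded into a Java string, the `glyphBaseUrl` naming scheme is invented, and comments, scripts, and `<` inside attribute values are not handled.

```java
/** Sketch of the rewriting step of a "Kanji Proxy"; the URL scheme is invented. */
final class KanjiProxyRewriter {
    /**
     * Replaces each non-ASCII character in the text portions of an HTML string
     * with an img element, leaving tags and ASCII text for the browser to lay out.
     */
    static String rewrite(String html, String glyphBaseUrl) {
        StringBuilder out = new StringBuilder(html.length() * 2);
        boolean inTag = false;
        for (int i = 0; i < html.length(); ) {
            int cp = html.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '<') inTag = true;
            if (inTag || cp < 0x80) {
                out.appendCodePoint(cp);                  // copy markup and ASCII as-is
            } else {
                out.append("<img src=\"").append(glyphBaseUrl)
                   .append(String.format("%04X", cp))     // e.g. 65E5.gif
                   .append(".gif\" alt=\"&#").append(cp).append(";\">");
            }
            if (cp == '>') inTag = false;
        }
        return out.toString();
    }
}
```

For example, `rewrite("<p>\u65e5 A</p>", "/glyph/")` keeps the paragraph tags and the ASCII letter and replaces the single kanji with a reference to `/glyph/65E5.gif`.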

The advantage to the current implementation with Java and internally stored bitmapped font images is portability and swift implementation within various networked applications. The disadvantage is that no bitmapped font can match the quality and speed of a true scalable font. We hope in the future to replace the bitmap-driven Render method with a scalable font rendering that can download font subsets from a server. This will be particularly helpful if Unicode encoded fonts come into widespread use and can be bundled with the application, preferably as an internal resource that does not require intervention by the user.

Bootstrapping into Unicode

The internal representation of the Japanese encoding originally used SJIS strings since the program was developed on Macintoshes using the Japanese Language Kit, a powerful environment familiar to many Japanese language instructors outside Japan. In the interests of porting the flashcard client and server to other languages besides Japanese, the Flashcard 500 now supports Unicode text conversion for Java 1.0x browsers, using code based upon a conversion class available from Sun Microsystems Japan.
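For comparison, later Java releases can perform this conversion through the standard `java.nio.charset` API. The sketch below shows that route; it is not the Sun conversion class mentioned above and would not run on Java 1.0x browsers. It also assumes the runtime ships the extended `Shift_JIS` charset, which most full JDK builds do.

```java
import java.nio.charset.Charset;

/** Sketch: decoding Shift JIS bytes with the charset support in current JDKs. */
final class SjisDecoder {
    /** Decodes Shift JIS bytes into a Java (UTF-16) string. */
    static String decode(byte[] sjisBytes) {
        // Charset.forName throws UnsupportedCharsetException if the runtime
        // does not include the extended charsets.
        return new String(sjisBytes, Charset.forName("Shift_JIS"));
    }
}
```

Once the text is held as Unicode internally, the same flashcard data can be passed to either an SJIS-indexed or a Unicode-indexed renderer.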

The migration to Unicode should allow us to create new Flashcard databases and applications for languages such as Arabic that will take advantage of Unicode encoded fonts. By storing the flashcard database in Unicode and using a Unicode Render method, we will not have to create custom Render methods for every language encoding we encounter. The developers still need access to a native font to create a bitmapped subset of the Unicode family for the language in question, but we hope even this step will be removed with the migration to a scalable embedded Unicode font.
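A language-neutral Render method needs only a way to find a character's glyph within whatever bitmapped subset was shipped. One simple possibility, sketched here as an assumption rather than the project's design, is to store the subset's code points in sorted order and use the position of a code point in that list as its glyph index. The class name `SubsetFontIndex` is hypothetical.

```java
import java.util.Arrays;

/** Sketch: maps Unicode code points to glyph positions in a bitmapped font subset. */
final class SubsetFontIndex {
    private final int[] codePoints;  // sorted ascending; index i is glyph i in the bitmap

    SubsetFontIndex(int[] codePoints) {
        this.codePoints = codePoints.clone();
        Arrays.sort(this.codePoints);
    }

    /** Returns the glyph's index in the bitmap, or -1 if it is not in the subset. */
    int glyphIndex(int codePoint) {
        int i = Arrays.binarySearch(codePoints, codePoint);
        return i >= 0 ? i : -1;
    }
}
```

Because the lookup depends only on code points, the same class would serve a Japanese subset, an Arabic subset, or any other, provided the bitmap is generated in the same sorted order.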

One remaining obstacle to implementing a truly platform and language independent Flashcard program lies in the need for text entry services when authoring the cards. For example, a professor teaching Japanese currently needs native services to enter new data into the Flashcard database, even though their students will be able to freely view the resulting flashcards. Similarly, the lack of a generic front-end processor means there is no easy way for students to create new cards without the aid of native language services. Work such as OMRON Corporation’s JavaWnn romaji to kana conversion application demonstrates that this is certainly possible, but entering arbitrary languages as Unicode encoded text appears to be significantly more difficult than simply displaying it.

Distribution

Students will most often access the Flashcard family of applications from a DePaul University web server and run the program as a Java applet over the Internet. This allows the student to log in from any Java-enabled computer and stop and resume their study at any time, since the Flashcard Server will keep track of their progress.

Students and instructors may download the database and client to run locally at their schools or on their home computers. The client will run much more quickly if saved on the student’s machine, and can save personal data locally on that computer. The server program will also be available for schools to install with their web servers, without the need for Japanese OS support on the server.

The Flashcard 500 program may be distributed on CD-ROM for students and instructors to install on their school or home computers. Since the database, client, and server together can be fairly large, this would be most economical for international students with slow or no Internet access.

Related Work

There are many interesting examples of Java applications that display Japanese encoded text from web pages. The following selection cites some English language examples but is by no means exhaustive.

Jim Breen’s work at Monash University on a computerized Japanese dictionary, Edict, has inspired work by people on many platforms. KanjiFlash, Marshall Ramey’s free Java application for studying Japanese vocabulary, is an excellent example. At the time of writing, KanjiFlash requires the user to install Japanese support classes and a scalable Japanese font.

Several examples cited from the W3C Web Fonts working group use Java to download server-side font information into a generic Java-enabled browser. PCFFont from Sun Lab’s Ken Shirriff is one colorful example, in this case using X11 Portable Compiled Format (PCF) fonts rather than GIFs. Shirriff specifically includes an example that renders JIS and EUC encoded Japanese text. OMRON Corporation’s JavaWnn can intelligently convert romaji and display kana and kanji even if the client has no Japanese support.

Shodouka, the popular application originally developed by Ka-Ping Yee and now available from Mediator Technologies, acts as a proxy server that translates Japanese-encoded web sites into streams of GIF characters for graphical web browser clients. Mediator Technologies has a licensing program that allows organizations to install their own Shodouka server, but it is unclear to what extent this technology depends upon underlying operating system-level support for Japanese.

Nihongo Surfer, a program by Jack Palevich, is a Java application that not only displays Unicode and Japanese-encoded web pages within browsers without native Japanese support, but allows the user to look up words in a dictionary based upon Jim Breen’s Edict database.

Bitstream Corporation has recently announced two promising products that should work well together: a development library for Java-based Extendible Typography (JET) which displays TrueType and Postscript Type 1 fonts, and a freely-available TrueType Unicode compliant font, optimized for web delivery, called Cyberbit.

These examples and recent developments such as Bitstream and Netscape’s dynamic font technology demonstrate how servers can encode and transmit font information for web browser clients displaying pages. Ideally, Java developers will soon have a similar mechanism for requesting high quality fonts from a server for use on a client without violating copyright or licensing agreements.

Conclusion

The goal of the Kanji Flashcard and similar programs is simple: to display textual information as needed without the end user’s awareness of, or intervention in, low-level details such as character encodings or font installation. Widespread acceptance of high-quality scalable font technology for Internet delivery and the increasing availability of Unicode font families for embedding within web pages and applications will make the experience easier for developers and users. In the meantime, Java applications are an excellent medium to leverage the portability and power of Unicode encoded fonts into multinational applications.

References

  1. Chikamatsu, N., Yokoyama, S., Nozaki, H., and Long, E. (in preparation) A Japanese logographic character frequency list for cognitive science research.
  2. Asahi Shinbun Kanji Database (Japanese)
  3. Bitstream Corporation’s JET, Cyberbit, and Dynamic Fonts.
  4. Jim Breen’s Japanese Page
  5. Java Kanji Flashcard 500 Project Page
  6. Lunde, K. (1993). Understanding Japanese Information Processing. Sebastopol, CA: O’Reilly & Associates, Inc.
  7. Jack Palevich’s JJReader and Nihongo Surfer.
  8. OMRON Corporation’s JavaWnn
  9. Purdue University’s Japanese Related Projects
  10. Marshall Ramey’s KanjiFlash
  11. Ken Shirriff’s PCFFont
  12. Shodouka
  13. Sun Microsystems Japan SJIS to Unicode example (JDK 1.02)
  14. W3C Web Fonts and Cascading Style Sheets 2 Fonts Working Drafts
  15. Yokoyama, S. and Nozaki, H. (1996). Compilation of kanji frequency list for psychology (shinrigaku no tame no kanji hindo kijunhyo no sakusei). Proceedings of the 60th Japanese Psychological Association Conference, Tokyo, 599.