Working with Electronic Tibetan Texts

July 03, 2020

Electronic Tibetan texts are very useful: you can keep them on your computer for translation work, study, searching and so on. There are a few things worth knowing when you use them.

To start with, the input was done by humans, and now more and more often by optical character recognition (Google has an excellent Tibetan OCR algorithm that BDRC is now using). Either way, this is yet another generational step away from the original text. Depending on its origin, the electronic version may have been typed in by input operators (who are sometimes not even native Tibetan speakers), and that input might itself have been made from a wood print or from a typed book edition. The wood print might be a reprint of an earlier wood print. Ultimately the text is either a translation from Sanskrit, or was written down by the author or an assistant, or comes from notes taken during teachings. So there are many links in the chain where corruption could happen. There are even cases where the wood carver, editor, input operator or someone else mistakenly "corrected" a passage.

If you translate from electronic versions, always have a wood print version available, if possible, for double-checking dubious words and statements. The closer to the first edition, the better. Best of all is to also have access to an expert, such as a Tibetan scholar, khenpo, geshe or lama who knows the text (and much more), to clarify unclear parts.

If you have a large collection of electronic Tibetan text files, say all of the Kangyur and Tengyur and much more, you can use search tools to find how often a problematic word or statement occurs. If it shows up extremely seldom, there's a good chance it is a mistake, and then you need Tibetan language detective skills to figure out what the right letters were. Note that if you make such a correction in a translation or text reproduction, you should add a footnote or otherwise mark it as a change from the original, rather than just changing the letters with no explanation. This avoids future misunderstandings.
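
As a rough illustration of such a frequency check, here is a minimal Python sketch. It assumes the collection is stored as UTF-8 `.txt` files under one folder; the function names `count_occurrences` and `total_occurrences` are my own and not part of any existing tool. It does a plain substring match, so a short string may also match inside longer words, and the numbers should be read as a hint rather than proof.

```python
from pathlib import Path


def count_occurrences(corpus_dir, phrase):
    """Return {file path: count} for each .txt file under corpus_dir containing phrase.

    Assumption: files are UTF-8 plain text. Undecodable bytes are replaced
    rather than aborting the whole search.
    """
    if not phrase:
        raise ValueError("phrase must not be empty")
    hits = {}
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        n = text.count(phrase)  # non-overlapping substring matches
        if n:
            hits[str(path)] = n
    return hits


def total_occurrences(corpus_dir, phrase):
    """Total number of matches for phrase across the whole collection."""
    return sum(count_occurrences(corpus_dir, phrase).values())
```

One way to use it is to compare the total for the suspicious spelling with the total for the spelling you think was intended. If the first is close to zero and the second is in the hundreds, that supports the correction.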

Also, always keep an original version of the text, if possible as a read-only file, with a copy archived somewhere else. So many things can go wrong: hard disks fail, files or their contents get deleted by mistake, or reformatting rearranges the text and corrupts the file.
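
If you like to script this, the following is a minimal Python sketch of the idea, not a full backup solution. The names `preserve_original` and `sha256_of` are made up for this example. It copies the file to an archive folder, compares checksums to confirm the copy is identical, and then marks both files read-only.

```python
import hashlib
import os
import shutil
import stat
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_original(source, archive_dir):
    """Copy source into archive_dir, verify the copy, and make both read-only."""
    source = Path(source)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    copy_path = archive_dir / source.name
    if copy_path.exists():
        # Never silently overwrite an earlier archived original.
        raise FileExistsError(f"{copy_path} is already archived")
    shutil.copy2(source, copy_path)  # copy2 also keeps timestamps
    if sha256_of(source) != sha256_of(copy_path):
        raise OSError("archived copy does not match the original")
    read_only = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    for path in (source, copy_path):
        os.chmod(path, read_only)
    return copy_path
```

The archive folder should ideally be on a different disk or service than the working copy. A read-only flag protects against accidental edits, not against a failing hard disk.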

Record where the file came from, either inside the file or in another file in the same folder: the URL, the originator, the date and so forth. You might suddenly need to get a new copy, or want to check whether a better version is available. It is also hopeless to figure out what a text is and where it came from with no information whatsoever. Let's say you found and saved the Tibetan text file ten years ago and now suddenly find the text interesting.
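
One possible way to do this is a small "sidecar" file next to the text. The Python sketch below is only an illustration: the function name `write_provenance`, the field names and the `.source.json` suffix are my own choices, not an established convention.

```python
import json
from datetime import date
from pathlib import Path


def write_provenance(text_path, source_url, originator, retrieved=None, notes=""):
    """Write a JSON sidecar file describing where text_path came from.

    All field names are illustrative. retrieved defaults to today's date.
    """
    text_path = Path(text_path)
    record = {
        "file": text_path.name,
        "source_url": source_url,
        "originator": originator,
        "retrieved": (retrieved or date.today()).isoformat(),
        "notes": notes,
    }
    # e.g. sample.txt -> sample.txt.source.json in the same folder
    sidecar = text_path.with_name(text_path.name + ".source.json")
    sidecar.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return sidecar
```

A plain text note with the same information works just as well. The point is that the information travels together with the file.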

From a practical point of view, some of these files are very large, and most word processors don't handle edits to multi-megabyte documents well. Sometimes you need to start from the original file and break it down into separate files of, say, 20 to 50 folios each. Scrolling, zooming and editing in particular can be very slow with huge files, even on modern laptops. Tibetan fonts can also require a lot of layout operations that add to the slowness (compared with plain roman fonts).
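
For plain text files, the splitting can be scripted. The Python sketch below is a simplification: a plain text file has no built-in notion of a folio, so it splits by a fixed number of lines instead, and you would have to pick a `lines_per_part` value that roughly corresponds to 20 to 50 folios in your file. The function name `split_text_file` and the `_001`, `_002` naming scheme are assumptions of this example.

```python
from pathlib import Path


def split_text_file(source, out_dir, lines_per_part=500):
    """Split a UTF-8 text file into numbered parts of lines_per_part lines each.

    The original file is only read, never modified.
    """
    if lines_per_part < 1:
        raise ValueError("lines_per_part must be at least 1")
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # newline="" keeps the original line endings untouched.
    with open(source, encoding="utf-8", newline="") as f:
        lines = f.readlines()

    parts = []
    for start in range(0, len(lines), lines_per_part):
        part = out_dir / f"{source.stem}_{len(parts) + 1:03d}{source.suffix}"
        with open(part, "w", encoding="utf-8", newline="") as out:
            out.writelines(lines[start:start + lines_per_part])
        parts.append(part)
    return parts
```

Joining the parts back together in order gives the original text, so nothing is lost by working on the smaller files.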

Opening large files can take a long time. Just now I have a 550 KB Word file that took 2 minutes to open and is still not usable (the macOS spinning ball is running). The corresponding PDF file opened immediately. PDF files are usually not meant for editing, but for plain reading and searching they work really well. The fonts are also embedded, so you get a 1:1 look with the original text. You can also copy text from the PDF file into your word processor for translation work, and you can mark your place in the PDF file with annotations; most PDF readers have such features today.

Microsoft Word used to be the default format for most translators, scholars and other users. Today many programs can import or export Word format, so there is no need to own Microsoft Word. On the Mac, the free Pages application is actually very efficient and works fine with Tibetan texts. I've also started to use Markdown text files for some projects. These are plain text files with annotations for web publishing, which can be converted to Word format and much more. You can edit Markdown files with any text editor. Text files are also very efficient and fast, especially at large file sizes. Google Docs can also be used for Tibetan texts, even though its editing and formatting capabilities are limited compared with modern word processors.

As for fonts, today's Unicode standard makes it possible to open just about any Unicode Tibetan text file with a Unicode font. If possible, always use the default Unicode encoding, which today is UTF-8 on most platforms. Some fonts, such as some older Unicode fonts, might be missing special characters, but those are used for specific tantric symbolism and are not so frequent. When you copy Tibetan text around, always preserve all formatting, including small type sizes (used for sadhana comment sections or end notes). Platforms either have default Tibetan fonts installed or let you install them via system preference settings, so consult your platform's documentation for more information.
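
If you are unsure whether a file really is UTF-8 Unicode Tibetan, a quick check can help. The Python sketch below relies on the fact that Unicode places Tibetan in the block U+0F00 to U+0FFF; the function names `read_utf8` and `tibetan_share` are invented for this example. A file in an old pre-Unicode font encoding will typically either fail to decode or show a Tibetan share close to zero.

```python
from pathlib import Path


def read_utf8(path):
    """Read a file as strict UTF-8 and return its text.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    The "utf-8-sig" codec also drops a leading byte order mark if present.
    """
    return Path(path).read_bytes().decode("utf-8-sig")


def tibetan_share(text):
    """Fraction of non-whitespace characters that lie in the Tibetan block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for ch in chars if "\u0f00" <= ch <= "\u0fff")
    return tibetan / len(chars)
```

A share well below 1.0 is normal for files that mix Tibetan with a translation or with notes in roman letters, so there is no fixed threshold that fits every file.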

I prefer to work with the Tibetan text sections and the translation as blocks, one after another, in suitable sections. This makes it easier to check and look at the translation later rather than having separate files for Tibetan and English, or all Tibetan up front in the document and the translation at the end.

The commenting tools in word processors are very handy for marking up sections of the text. Those comments should not be published; for publishing, use proper notes.