Friday, March 28, 2008

Issues with ACIP to Unicode Conversion and MacOSX

I'm really having fun with Unicode and Tibetan. Part of this involves taking a lot of ACIP-encoded material, both my own and what is available from AsianClassics.org and similar places, and converting it to Unicode.


I saw that JSkad had a conversion from ACIP to Unicode (text file), so I tried it, but the output didn't look like Unicode at all. I was using Notepad and Pages (latest), but neither showed Tibetan Unicode text from the output, only Roman letters with strange numbers.

Now, it could be operator error: maybe I need to do something with the text file before using it, or something else. If anyone has ideas about what is happening and how to fix it, please post a comment. The same goes if you know of other tools or ways to convert ACIP encoding to Unicode on the Macintosh platform. If I get this working, a lot of really cool Tibetan material will be posted on dharmadictionary and similar places for public access.

5 comments:

Evan Osherow said...

Uh-oh, I just tested to see for myself and got the same gibberish. ACIP>Unicode used to work fine. It doesn't seem like Leopard would have messed up a good thing, but I don't know what else has changed since then. I tried numerous different plain text encodings via TextEdit, but they all had different problems when converted.


That's really a shame. I was banking on ACIP>Uni on a Mac. I'll keep searching for a solution.

Kent Sandvik said...

Thanks. I tried both the tested version of Jskad and last night's build. I suspect that the file header needs some specific information saying that this is a UTF-16 or UTF-8 file, but I'm no expert on Unicode files.

Maybe someone from the Jskad team is reading this...

Anonymous said...

I found the same thing as you when using Jskad.jar, but I did manage to get readable Unicode from an ACIP file in a few steps.

First go to:
Tools→Launch Converter...→ACIP to Wylie (Text->Text)→Convert

Second:
Close the converter dialogue, open the text file that the converter produced, select all, copy, and then paste into Jskad.

Third:
Select all in Jskad, then:
Tools→Convert All→Convert Tibetan Machine Web (non-Unicode) to Unicode.

Fourth:
Select All→Copy and paste into a text file. Save the file (make sure that the encoding is UTF) and you have a Unicode file.

It's a little clumsy, but not too bad, I think.

*Except that the Unicode stackings are far from perfect - at least in Windows. I still haven't tried with Linux. (I left a message about this at http://jigtenmig.blogspot.com/2008/03/tibetan-unicode-fonts-and-this-blog.html)

Daniel said...

I've converted several hundred pages of our project with JSKAD on Mac OS X
(http://www.ittm.org/projects/dataInput/DataInputProject.htm)

If you are familiar with Terminal on Mac OS X, try the following command:

java -Dthdl.acip.to.unicode.conversions.use.0F52.et.cetera=true -cp PATH/lib-vanilla/Jskad.jar org.thdl.tib.input.TibetanConverter --colors no --warning-level None --acip-to-tibetan-warning-and-error-messages long --acip-to-unicode ACIP_file.txt >> UNICODE_file.txt

Replace PATH with the JSKAD path; ACIP_file.txt is the input file and UNICODE_file.txt is the output file.

BTW, JSKAD doesn't include a UTF-8 BOM at the beginning of the file, which in hex is: EF BB BF.
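
If an editor needs that BOM to recognize the file as UTF-8, it can be added afterwards. Here is a minimal sketch, assuming a POSIX-style shell with head, od, and tr available; add_bom is just an illustrative name, not part of JSKAD:

```shell
# Prepend a UTF-8 byte order mark (EF BB BF) to a file, unless one is already there.
# Usage: add_bom INPUT OUTPUT
# add_bom is a hypothetical helper name; INPUT and OUTPUT must be different files.
add_bom() {
  in="$1"
  out="$2"
  # Read the first three bytes as a hex string, e.g. "efbbbf".
  first=$(head -c 3 "$in" | od -An -tx1 | tr -d ' \n')
  if [ "$first" = "efbbbf" ]; then
    # BOM already present: copy the file unchanged.
    cat "$in" > "$out"
  else
    # Write the three BOM bytes (octal escapes), then the original content.
    { printf '\357\273\277'; cat "$in"; } > "$out"
  fi
}
```

For example, add_bom UNICODE_file.txt UNICODE_file_bom.txt writes a copy that starts with the BOM. Whether a given editor really needs the BOM is something to check case by case.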

Hope this helps,
Daniel

Kent Sandvik said...

Thanks, this will be handy, especially for converting a large set of ACIP files using a bash script on the command line.
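
A rough sketch of what such a batch script could look like, wrapping Daniel's command in a loop. The function names, the JSKAD_PATH variable, and the assumption that all ACIP files end in .txt and sit in one folder are mine, so adjust as needed:

```shell
# JSKAD_PATH is an assumed variable name; set it to the JSKAD install path.
# The default "PATH" is only the placeholder from the command above.
JSKAD_PATH="${JSKAD_PATH:-PATH}"

# Convert one ACIP file to Unicode on stdout, using the command from
# Daniel's comment (with no spaces around "=" in the -D option).
acip_to_unicode() {
  java -Dthdl.acip.to.unicode.conversions.use.0F52.et.cetera=true \
    -cp "$JSKAD_PATH/lib-vanilla/Jskad.jar" org.thdl.tib.input.TibetanConverter \
    --colors no --warning-level None \
    --acip-to-tibetan-warning-and-error-messages long \
    --acip-to-unicode "$1"
}

# Convert every .txt file in SRC_DIR, writing same-named files into DST_DIR.
# Usage: convert_dir SRC_DIR DST_DIR
convert_dir() {
  src="$1"
  dst="$2"
  mkdir -p "$dst"
  for f in "$src"/*.txt; do
    # Skip the unexpanded pattern when the folder has no .txt files.
    [ -e "$f" ] || continue
    acip_to_unicode "$f" > "$dst/$(basename "$f")"
  done
}
```

Calling convert_dir acip_files unicode_files would then process the whole folder. Note that this uses > rather than >>, so re-running it overwrites earlier output instead of appending to it.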

Now, for non-programmers, this could all be daunting. There's an option in MacOSX to make an icon that accepts files and triggers bash scripts underneath, so if I ever have more spare time, something like this would be handy for those who don't dare to open up the Terminal app.