Issues with ACIP to Unicode Conversion and MacOSX
I'm really having fun with Unicode and Tibetan. Part of this involves converting a lot of ACIP-encoded material, both what I already have and what is available from AsianClassics.org and similar places.
I saw that JSkad has a conversion from ACIP to Unicode (text file). So I tried this, but the output didn't look like Unicode at all. I was using Notepad and Pages (latest), but neither showed Tibetan Unicode characters from the output, only Roman letters with strange numbers.
Now, it could be operator error: perhaps I need to do something with the text file before using it, or perhaps it is something else. If someone has ideas about what is happening and how to fix it, please post a comment. I'd also welcome other tools or ideas for converting ACIP encoding to Unicode on the Macintosh platform. If I get this working, a lot of really cool Tibetan material will be posted on dharmadictionary and similar places for public access.

5 comments:
Uh-oh, I just tested to see for myself and got the same gibberish. ACIP>Unicode used to work fine. It doesn't seem like Leopard would have messed up a good thing, but I don't know what else has changed since then. I tried numerous different plain text encodings via TextEdit, but they all had different problems when converted.
That's really a shame. I was banking on ACIP>Uni on a Mac. I'll keep searching for a solution.
Thanks. I tried both the tested version of Jskad and last night's build. I suspect that the file header needs some specific information saying that this is a UTF-16 or UTF-8 file, but I'm no expert on Unicode files.
Maybe someone from the Jskad team is reading this...
I found the same thing as you when using Jskad.jar, but I did manage to get readable Unicode from an ACIP file in a multi-step process.
First go to:
Tools→Launch Converter...→ACIP to Wylie (Text->Text)→Convert
Second:
Close the converter dialogue, open the text file that the converter produced, select all, copy, and paste into Jskad.
Third:
Select all in Jskad, then:
Tools→Convert All→Convert Tibetan Machine Web (non-Unicode) to Unicode.
Fourth:
Select All→Copy and paste into a text file. Save the file (make sure that the encoding is UTF) and you have a Unicode file.
It's a little clumsy, but not too bad, I think.
*Except that the Unicode stackings are far from perfect - at least in Windows. I still haven't tried with Linux. (I left a message about this at http://jigtenmig.blogspot.com/2008/03/tibetan-unicode-fonts-and-this-blog.html)
I've converted several hundred pages of our project with JSKAD on Mac OS X
(http://www.ittm.org/projects/dataInput/DataInputProject.htm)
If you are familiar with Terminal on Mac OS X, try the following command:
java -Dthdl.acip.to.unicode.conversions.use.0F52.et.cetera=true -cp PATH/lib-vanilla/Jskad.jar org.thdl.tib.input.TibetanConverter --colors no --warning-level None --acip-to-tibetan-warning-and-error-messages long --acip-to-unicode ACIP_file.txt >> UNICODE_file.txt
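Replace PATH with the JSKAD path, ACIP_file.txt with the input file and UNICODE_file.txt with the output file. Note that there must be no spaces around the = sign, and that >> appends to the output file if it already exists.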
BTW, JSKAD doesn't include a UTF-8 BOM at the beginning of the file, which in hex is: EF BB BF.
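If an editor does not recognize the converted file as UTF-8, prepending the BOM by hand is one thing to try. The following is only a rough sketch, not something Jskad provides: has_bom and add_bom are made-up helper names, and whether the BOM actually cures the display problem described in the post is an assumption that has not been verified here.

```shell
# has_bom FILE: succeeds if FILE starts with the UTF-8 BOM (EF BB BF).
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# add_bom IN OUT: copy IN to OUT, prepending the BOM unless IN already has one.
# IN and OUT must be different files.
add_bom() {
    if has_bom "$1"; then
        cat "$1" > "$2"
    else
        # Octal escapes for EF BB BF; these work in any POSIX printf.
        { printf '\357\273\277'; cat "$1"; } > "$2"
    fi
}
```

For example, after pasting these functions into a Terminal session: add_bom UNICODE_file.txt UNICODE_file_bom.txt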
Hope this helps,
Daniel
Thanks, this will be handy, especially for converting a large set of ACIP files using bash on the command line.
Now, for non-programmers, this could all be daunting. Mac OS X has an option to make an icon that accepts dropped files and triggers bash scripts underneath. If I ever have more spare time, something like that would be handy for those who don't dare to open up the Terminal app.
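A batch run could look roughly like the sketch below. It simply wraps Daniel's command in a loop. The function names, the JSKAD_JAR and CONVERT variables, and the .uni.txt output suffix are all made up for illustration, and PATH still has to be replaced with the real Jskad location.

```shell
# JSKAD_JAR: location of Jskad.jar. PATH is a placeholder, as in the command above.
JSKAD_JAR="${JSKAD_JAR:-PATH/lib-vanilla/Jskad.jar}"

# convert_one IN OUT: run the ACIP-to-Unicode converter on a single file.
# Uses > rather than >> so that re-running overwrites instead of appending.
convert_one() {
    java -Dthdl.acip.to.unicode.conversions.use.0F52.et.cetera=true \
        -cp "$JSKAD_JAR" org.thdl.tib.input.TibetanConverter \
        --colors no --warning-level None \
        --acip-to-tibetan-warning-and-error-messages long \
        --acip-to-unicode "$1" > "$2"
}

# convert_dir SRC DST: convert every .txt file in SRC into DST.
# CONVERT is a hypothetical hook: set it to another command for a dry run.
convert_dir() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for f in "$src"/*.txt; do
        [ -e "$f" ] || continue            # no .txt files matched
        out="$dst/$(basename "$f" .txt).uni.txt"
        "${CONVERT:-convert_one}" "$f" "$out" || echo "failed: $f" >&2
    done
}
```

With the functions loaded, something like convert_dir acip_files unicode_files would write one converted file per input and report any file the converter rejects, without stopping the whole run.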