thaiconv is a command line tool for converting text coding for Thai.
If you have read my Thai email guide then you will know that sending and receiving email in Thai can be beset with problems. To help you, thaiconv will determine the coding of a text file and convert it to a coding that you can read.
People on Un*x systems will probably be using a tool called iconv. I didn't know about iconv until after I had released several versions of thaiconv. However, thaiconv has features that iconv doesn't have: iconv is a general tool, whereas I created thaiconv to convert between Thai codings painlessly.
thaiconv standard features include:
thaiconv assumes that the text file you want to process is in Thai of some form. It is not suitable for processing other languages, although Roman letters without accents in the file are fine.
Updated code based on suggestions from static analysis.
Improved guessing for the text coding.
File | Platform | md5 |
---|---|---|
thaiconv-1_8-ARM.tar.bz2 | Linux ARM (Zaurus) | ef7abcd9e879ab2f20f18cd36991a692 |
thaiconv-1_8-mac.tar.bz2 | MacOS X | 269873120dad1d505237e23238805c5d |
thaiconv-1_8-linux.tar.bz2 | Linux x86 | c165d31faede5f0bbaff6d4a8c71b6d3 |
thaiconv-1_8-win.zip | Windows | 1ebb17f41d59646381fac66dc928f26a |
Linux and Windows binaries are statically linked. If the idea of typing a command into a shell sounds too technical for you, then use Sontana instead.
thaiconv [-h] [-s] [-sq] [-in X] [-out Y] [-noent] [-bom] -r input-filename [-w output-filename]
Using thaiconv is straightforward. Use -h to get comprehensive help information:
    Work:Dev/ThaiConv:> thaiconv -h
    thaiconv: Thai text transcoding tool. Version 1.7, Build 15122013.
    Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] [-noent] [-bom] -r infilename [-w outfilename]
    Convert plain text file encoding for Thai to another encoding.
    ---
    -s      scan file to determine type
    -sq     scan quiet - as above but only output the input file mode number
    -in     input format, see list below
    -out    output format, see list below
    -r      filename to read (required)
    -w      filename to write (default = stdout)
    -noent  don't convert HTML entities when reading
    -bom    write BOM when writing Unicode
    -h      this help
    ---
    Input/Output Formats:
    0 = TIS-620
    1 = UTF-8 Thai
    2 = HTML
    3 = UTF-8 Latin 1 (cross coded Thai)
    ---
    Notes:
    If the input format is not specified then it will be determined automatically.
    If the result is not obvious TIS-620 will be assumed.
    Use scan mode to find automatic result.
    Output format defaults to TIS-620 unless specified.
    For extended information please see <http://www.lyndonhill.com/Projects/thaiconv.html>
To convert a file from UTF-8 to TIS-620:

    Work:Dev/ThaiConv:> thaiconv -r utf8file.txt -out 0 > tis620file.txt

To convert a file from TIS-620 to UTF-8:

    Work:Dev/ThaiConv:> thaiconv -r tis620file.txt -out 1 > utf8file.txt

To convert a file from HTML to TIS-620:

    Work:Dev/ThaiConv:> thaiconv -r htmlfile.txt -out 0 > tis620file.txt
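If you have many files to convert, the commands above can be driven from a script. The following is a minimal Python sketch, assuming thaiconv is on your PATH. The helper names `build_command` and `convert` are hypothetical, and only the -r, -out and -w options from the usage line are used.

```python
import subprocess

# Format numbers as listed in the thaiconv help text.
TIS620, UTF8_THAI, HTML, UTF8_LATIN1 = 0, 1, 2, 3


def build_command(src, dst, out_format):
    """Build the argument list for one conversion (hypothetical helper)."""
    return ["thaiconv", "-r", src, "-out", str(out_format), "-w", dst]


def convert(src, dst, out_format):
    """Run thaiconv on one file.

    check=True raises CalledProcessError if thaiconv exits with a non-zero
    status; whether thaiconv does so on every error is an assumption.
    """
    subprocess.run(build_command(src, dst, out_format), check=True)
```

For example, `convert("tis620file.txt", "utf8file.txt", UTF8_THAI)` would run the same conversion as the TIS-620 to UTF-8 command above, writing with -w instead of redirecting stdout.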
To get thaiconv to tell you about the text file's coding:
    Work:Dev/ThaiConv:> thaiconv -s -r testfile.txt
    thaiconv Scan Report
    --------------------
    12 plain ASCII characters.
    0 extended ASCII characters.
    17 HTML Unicode entities in Thai range.
    29 Total characters.
    File is probably Thai HTML Unicode
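To give a feel for how a coding can be guessed from counts like these, here is a rough Python sketch. It is not thaiconv's actual algorithm, which is not described here; the function names and the decision rules are assumptions made for illustration. The returned numbers follow the format list in the help text.

```python
import re

# Matches decimal (&#NNNN;) and hexadecimal (&#xHHHH;) HTML entities.
ENTITY = re.compile(rb"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def is_thai(code):
    """True if the code point lies in the Unicode Thai block."""
    return 0x0E00 <= code <= 0x0E7F


def count_thai_entities(data):
    """Count HTML entities whose value is in the Thai range."""
    count = 0
    for hex_digits, dec_digits in ENTITY.findall(data):
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if is_thai(code):
            count += 1
    return count


def guess_format(data):
    """Guess a format number (0-3) for raw file bytes; illustrative only."""
    extended = sum(1 for byte in data if byte >= 0x80)
    if count_thai_entities(data) > 0 and extended == 0:
        return 2  # HTML entities in the Thai range, otherwise pure ASCII
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return 0  # 8 bit data that is not valid UTF-8: assume TIS-620
    if any(is_thai(ord(ch)) for ch in text):
        return 1  # real Thai code points encoded as UTF-8
    if any(0xA0 <= ord(ch) <= 0xFF for ch in text):
        return 3  # Latin-1 characters in UTF-8: possibly cross coded Thai
    return 0  # nothing obvious: fall back to TIS-620, as the help text says
```

A real detector has to cope with mixed and damaged input, so treat this only as a picture of the kind of evidence the scan report counts.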
If you want to use thaiconv in a script and just want to know what coding the file is without parsing a lot of output:
    Work:Dev/ThaiConv:> thaiconv -sq -r testfile.txt
    2
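In a script, that number can be mapped back to a format name. The sketch below is one way to do it in Python, assuming thaiconv is on your PATH and that -sq prints nothing but the number on stdout. The names `parse_scan_quiet` and `detect` are hypothetical.

```python
import subprocess

# Format numbers and names as listed in the thaiconv help text.
FORMAT_NAMES = {
    0: "TIS-620",
    1: "UTF-8 Thai",
    2: "HTML",
    3: "UTF-8 Latin 1 (cross coded Thai)",
}


def parse_scan_quiet(output):
    """Turn the single number printed by -sq into a format name."""
    return FORMAT_NAMES[int(output.strip())]


def detect(path):
    """Run 'thaiconv -sq -r path' and return the detected format name."""
    result = subprocess.run(
        ["thaiconv", "-sq", "-r", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_scan_quiet(result.stdout)
```

Keeping the parsing separate from the call to thaiconv makes the mapping easy to check without the binary installed.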
The following table lists formats understood by thaiconv.
Format | Description |
---|---|
Standard 7 bit ASCII | All alphabetical, numeric and punctuation characters used in standard ASCII. No accents, umlauts, ulls, fancy punctuation or graphics. |
TIS-620 | Thai characters are stored in the upper half of the 8 bit range (byte values above 127), which allows ASCII and Thai to co-exist in the same file. |
UTF-8 (Thai range) | The Unicode standard encoded as UTF-8, specifically the block of Thai characters (0xE00 - 0xE7F). |
HTML Unicode (Thai range) | Unicode as represented in HTML: An entity of the form &#NNNN; or &#xHHHH; where NNNN is a decimal number and HHHH is a hexadecimal number. |
Cross coded UTF-8 | TIS-620 that has been converted to UTF-8 as if it were Latin1 (0xA0-0xF0). For example, the Thai character that has the value 233 (0xE9) in TIS-620 has the Latin1 representation é, so this character gets converted to the Unicode for é. This mode is likely to be converted correctly only if the cross coding and decoding occur in the same locale. |
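To make the table concrete, the following Python sketch shows the same three Thai characters in each coding, using only Python's standard codecs rather than thaiconv itself. The function name `repair_cross_coded` is hypothetical, and the cross coding is assumed to have gone through Latin1; other locales would need a different intermediate codec.

```python
# The word "Thai" in Thai script: U+0E44, U+0E17, U+0E22.
text = "\u0e44\u0e17\u0e22"

# TIS-620: one byte per Thai character, in the upper half of the 8 bit range.
tis620 = text.encode("tis-620")  # b'\xe4\xb7\xc2'

# UTF-8 (Thai range): three bytes per Thai character.
utf8 = text.encode("utf-8")

# HTML Unicode: one decimal entity per character.
html = "".join(f"&#{ord(ch)};" for ch in text)  # '&#3652;&#3607;&#3618;'

# Cross coded UTF-8: TIS-620 bytes misread as Latin1, then stored as UTF-8.
cross = tis620.decode("latin-1").encode("utf-8")


def repair_cross_coded(data):
    """Undo the cross coding, assuming Latin1 was the intermediate coding."""
    return data.decode("utf-8").encode("latin-1").decode("tis-620")


print(tis620, len(utf8), html, repair_cross_coded(cross) == text)
```

The repair only works when the intermediate coding is guessed correctly, which is why the cross coded mode depends on where the cross coding took place.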