Cyclone 1.4.1

An Interface for Apple Text Encoding Converter.

“Butterflies stir a breeze
and the ripples flow unceasingly:
far away the cyclones swirl.
It's a whole, connected world.”*

Theory of operation
Text Encoding Converter (called TEC) is a Mac OS engine for handling different languages that use different character sets. It supports many standards, and it is robust and pretty fast. Many applications use it for their internal conversion needs, and that's great, but I could not find a plain converter built on this engine. So here comes the Cyclone.

Highlights revisited
Because Cyclone uses TEC's conversion maps, it will grow with TEC even if the program itself is no longer developed. When more encodings appear in future incarnations of TEC, or any maps are corrected or modified, Cyclone is supposed to use them as if nothing had changed. (OK, one exception: I hard-coded the names of encodings because I was not satisfied with the names returned by TEC. If Cyclone does not find the name of a new encoding in its own resources, it uses the name given by TEC.) TEC does not change line endings properly, so I added this option (any bugs in this field are mine); see the “More details” section for specifications.
Cyclone can convert many files dropped onto it or chosen from the standard file dialog (Navigation Services is needed for multiple selection). The conversion is streamed, so the size of input and output files is not limited, but of course the more memory you give to Cyclone in “Get Info”, the larger the chunks of text it can read in and convert at a time. The speed difference can be significant. Clipboard conversion is limited by the size of Cyclone's memory. For safety reasons, no memory outside Cyclone's own heap is used.
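
To illustrate what "streamed" means here, the following is a minimal Python sketch, not Cyclone's actual code and not the TEC API. It reads the input in fixed-size chunks and converts each chunk with incremental codecs, so a multi-byte character split across two chunks is still handled. The function name convert_stream and the chunk_size parameter are assumptions made for this example.

```python
import codecs
import io

def convert_stream(src, dst, in_enc, out_enc, chunk_size=64 * 1024):
    """Convert bytes read from src and write them to dst, one chunk at a time.

    Hypothetical helper for illustration only; Cyclone itself uses TEC.
    """
    decoder = codecs.getincrementaldecoder(in_enc)()
    encoder = codecs.getincrementalencoder(out_enc)()
    while True:
        chunk = src.read(chunk_size)
        final = not chunk                     # an empty read means end of input
        text = decoder.decode(chunk, final)   # keeps incomplete sequences buffered
        dst.write(encoder.encode(text, final))
        if final:
            break

# A tiny chunk size shows that memory only affects chunking, not the result.
source = io.BytesIO("caf\u00e9 na\u00efve".encode("mac_roman"))
target = io.BytesIO()
convert_stream(source, target, "mac_roman", "utf-8", chunk_size=4)
```

A larger chunk_size means fewer read and convert calls, which is the same trade-off as giving Cyclone more memory.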

Conversions
When you look at the conversion dialog you will see two sets of pop-ups: left for input, right for output. Choose the standard/platform first, then the specific encoding, and lastly the variant (if any variant exists for the given encoding). You may choose whatever you want for the input and output encodings, but be aware that not all conversions make sense: you cannot translate from Chinese to Greek with TEC (not yet :-)). Sometimes you will get an error, but sometimes not. You are responsible for choosing valid encodings for input and output. You may use content sniffers, which can help with the input encoding (see the “More details” section for a description of sniffers), but do not rely on them.

Preferences
I implemented the following options to make my life easier (and hopefully yours too):

— Remember last used encoding settings
(last used input and output settings are remembered while the application is running and saved to the prefs file when it quits)
— Use content sniffers to suggest input encoding
(some content sniffers are built into TEC, some may be provided by third-party developers; see below for details)
— Change line breaks to match output standard
(you may match line endings to your destination standard; see below for details)
— Ask for input file at startup
(with this option checked Cyclone presents a standard file dialog at startup)
— Don't ask for output name (save with '•' at end)
(check this to have output files saved automatically — location the same as input, name with added solid dot)
— Keep partially converted text if an error occurs
(when an error occurs in the middle of conversion, the file may be kept; zero length files are deleted anyway)
— Use custom creator signature for output file
(choose the signature of your output file for easy, double-click opening with your favorite application)

Multiple file settings

— Ask for encodings for each file
(you may choose in-out encodings for each file separately or use one set for all converted files)
— Suppress error messages and generate log
(with this option checked, Cyclone does not show any alerts; it logs errors, if any, to a file named “Cyclone error log”, which is saved at the location of the last input file, even if the conversion of that particular file fails. You may order a 10x100Meg conversion and go for a walk...)

Two little features:

If you want Cyclone to present the preferences dialog at startup, hold down the Command key.
The dialog is also presented at startup when you run Cyclone for the first time.

More details for the very curious

Content sniffers
Content sniffing is a feature offered by TEC and used by Cyclone when checked in preferences.
When this option is active, Cyclone tries to suggest which input encoding is used. Unfortunately, the current TEC version (1.5) can guess content ONLY for Far Eastern languages. So if you use these languages frequently, this option is for you. Otherwise you will be annoyed that Cyclone (or TEC, to be precise) suggests Chinese or Japanese every time you want to convert plain ASCII text.
This option is turned off by default.
Content sniffing is not working correctly.
I do not use it and people seem not to care about it — this is why it is not fixed yet.

Sniffers available in TEC 1.4.3 and 1.5 (in order of appearance):

Macintosh:

Japanese
Chinese Traditional
Chinese Simplified
Korean

Other:

Japanese JIS X0208-90
Simplified Chinese GB 2312-80
Korean KSC 5601-87
Japanese ISO 2022-JP
Simplified Chinese ISO 2022-CN
Korean ISO 2022-KR
Japanese EUC
Simplified Chinese EUC
Traditional Chinese EUC
Korean EUC
Japanese Shift JIS
Traditional Chinese Big-5
Simplified Chinese HZ GB-2312

Line Breaks
As mentioned before, TEC does not change the line breaks to match the output standard. For example, when you convert from Mac to Windows, everything is converted correctly except the line endings, which remain in the Mac standard. So the option to change the line breaks has been added. Here are the rules for the output standards:

— Mac: CR (0x0D)
— ISO: Unix standard LF (0x0A)
— DOS/Windows: CRLF (0x0D, 0x0A)
— Unicode standard (UTF-16): paragraph separator (PS = 0x2029)
— Unicode UTF-8: no change is made (if you need it, let me know); see below for a workaround
— Unicode UTF-7: no change is made (Cyclone does not support it and you are discouraged from using it); see below for a workaround
— Unicode 32-bit is not supported by TEC yet and Cyclone makes no attempt to change breaks in this case
— Other (miscellaneous standards): no change is made. Please inform me if you need any improvements in this field; I don't use these encodings, so I don't even know what the line-break codes should be.
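
The rules above can be sketched as a small lookup table. This is an illustrative Python sketch, not Cyclone's implementation: the keys "mac", "iso", "dos" and "utf-16" and the function name match_line_breaks are made up for the example, and treating every CR, LF, CRLF or PS in the input as a line break is an assumption.

```python
import re

# Output standard -> line break, following the list above.
# Standards that are not listed (UTF-8, UTF-7, 32-bit, Other) are left alone.
LINE_BREAKS = {
    "mac": "\r",         # CR (0x0D)
    "iso": "\n",         # LF (0x0A)
    "dos": "\r\n",       # CRLF (0x0D, 0x0A)
    "utf-16": "\u2029",  # paragraph separator (PS = 0x2029)
}

# CRLF must be tried first so it is replaced as one break, not two.
_ANY_BREAK = re.compile("\r\n|\r|\n|\u2029")

def match_line_breaks(text, standard):
    """Return text with every line break replaced by the one for standard."""
    target = LINE_BREAKS.get(standard)
    if target is None:
        return text      # no change is made for this output standard
    return _ANY_BREAK.sub(target, text)

mac_text = "line one\rline two\rline three"
dos_text = match_line_breaks(mac_text, "dos")
```

The replacement is applied to decoded text, so the same table works whatever the output encoding turns out to be.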

The Unicode UTF-8 paragraph separator (PS) is 0xE280A9, and UTF-7 uses a more complicated encoding. If you are converting from any 8-bit standard to one of these, no change will be made to line endings. But if you are converting from Unicode 16-bit with properly coded line endings (PS = 0x2029), the output will have correctly encoded paragraph separators.
So the workaround for obtaining standard line endings for UTF-8 & UTF-7 is:
1. Turn on the “Change line breaks to match output standard” option in preferences.
2. Convert from your source encoding to Unicode standard.
3. Convert the new file from Unicode standard to Unicode UTF-8 or UTF-7.

You may also try this trick for “Other” encodings, but I do not know if it will work for you.
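
The byte values and the two-step workaround can be checked with a short Python sketch. It uses Python's own codecs rather than TEC, so it only illustrates the encodings involved; the sample text and variable names are invented for the example.

```python
PS = "\u2029"  # Unicode paragraph separator

# The UTF-8 form of PS is the three bytes E2 80 A9.
ps_utf8 = PS.encode("utf-8")

# Step 2 of the workaround: 8-bit Mac text to Unicode standard (UTF-16),
# with the Mac line break (CR) changed to PS along the way.
mac_bytes = b"first line\rsecond line"
as_utf16 = mac_bytes.decode("mac_roman").replace("\r", PS).encode("utf-16")

# Step 3: Unicode standard to UTF-8. The PS is carried over as E2 80 A9.
as_utf8 = as_utf16.decode("utf-16").encode("utf-8")
```

Python's "utf-16" codec writes a byte-order mark when encoding and consumes it when decoding, so the intermediate data round-trips cleanly.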

Unicode and HTML
HTML writers, please note that if you are building a page where most (or all) characters are ASCII, the encoding of choice for you is Unicode UTF-8. If all characters are ASCII, the length of your page will be exactly the same as if no Unicode were used.
To inform a browser that Unicode UTF-8 is used, type:
<META HTTP-EQUIV="content-type" CONTENT="text/html;charset=UTF-8">
between <HEAD> and </HEAD> at the beginning of your file.
Another UTF-8 issue is line breaks (again). I tried UTF-8 text in recent versions of Netscape, Explorer and iCab and found that it works fine provided that you do NOT use the PS (0xE280A9), which is recognized by neither Netscape nor Explorer. So if you are using Cyclone for HTML conversions from any 8-bit encoding to Unicode UTF-8, you are safe: this unwanted line break will not occur. If you are converting from Unicode standard where the PS (= 0x2029) is used, you will get the unwanted breaks.
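
Both points can be verified with a few lines of Python. This is only a sketch: the sample page and the helper name has_paragraph_separator are invented for the example.

```python
page = (
    '<HTML><HEAD>'
    '<META HTTP-EQUIV="content-type" CONTENT="text/html;charset=UTF-8">'
    '</HEAD><BODY>Hello, world</BODY></HTML>'
)

# A pure ASCII page is byte-for-byte identical when saved as UTF-8.
ascii_bytes = page.encode("ascii")
utf8_bytes = page.encode("utf-8")
same_length = len(ascii_bytes) == len(utf8_bytes)

UTF8_PS = b"\xe2\x80\xa9"  # UTF-8 form of the paragraph separator (0x2029)

def has_paragraph_separator(data):
    """Return True if UTF-8 data contains the PS that browsers may not recognize."""
    return UTF8_PS in data

from_8bit = "one\rtwo".encode("utf-8")          # 8-bit source: CR is kept
from_utf16 = "one\u2029two".encode("utf-8")     # Unicode source: PS survives
```

Running has_paragraph_separator on a converted page before publishing it is a quick way to spot the unwanted breaks.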

Unicode standard (16 bit) may also be used for creating HTML pages, but it seems that only Netscape is able to handle it, and only if a byte-order mark is present at the beginning of the file (yes, you guessed it, Cyclone puts the needed mark :-) )

If you convert HTML pages with Cyclone, it would be nice if you gave it credit. You are not obliged to; it is just to spread the word and help Unicode become more popular. You may add something like this:

This page has been converted to Unicode by <A HREF="http://www.ire.pw.edu.pl/~tkukiel/cyclone.html">Cyclone</A>

Unicode and MS Office 98 for Mac
I had a chance to try MS Word's Unicode export feature. Beware: it “eats” some characters, and there is no apparent logic to which characters are affected. At least one character in every hundred is lost. The solution? Save in a plain text format and use Cyclone for conversion. If you are using different languages in one text, you must separate the chunks that use the same encoding and convert them one by one.

More Unicode notes
The registered type for standard Unicode (UTF-16) text is 'utxt' (used for file and clipboard), while plain 8-bit text uses 'TEXT'. You may not be able to see the content of the clipboard or paste it if the application you use does not support Unicode. Unicode UTF-8 and UTF-7 remain 'TEXT'.
Each standard Unicode (UTF-16) text produced by Cyclone has a byte-order mark (0xFEFF) at the beginning to ensure 100% portability.
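
A reader of such a file can use the mark to work out the byte order. The following Python sketch shows the idea; the function name utf16_byte_order is made up, and the big-endian sample is an assumption for illustration, not a statement about which order Cyclone writes.

```python
import codecs

def utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from its byte-order mark (0xFEFF)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return None                                # no mark: the order is unknown

# Sample data: a mark followed by big-endian UTF-16 text.
sample = codecs.BOM_UTF16_BE + "Cyclone".encode("utf-16-be")
order = utf16_byte_order(sample)

# Python's generic "utf-16" codec reads the mark and strips it when decoding.
text = sample.decode("utf-16")
```

Without the mark, a reader has to guess the byte order, which is why the mark matters for portability.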

Scripting
Beginning with version 1.1, “Cyclone” is scriptable via AppleScript. Please see the sample scripts provided in the “Scripting” folder. A document entitled “Encodings Dictionary” contains predefined encoding names which can be used in scripts.
Available AppleScript commands:

convert <file_list> from <encoding> to <encoding>

You may pass a file or list of files for conversion. This command returns a list of converted files. If the conversion fails for some file, the returned file spec is invalid.

convert clipboard from <encoding> to <encoding>

This command returns an error code (zero if no error occurred).

convert text <some_text> from <encoding> to <encoding>

This command returns converted text or null string if an error occurs.
CAUTION: The size of the text is limited by Cyclone's memory.
Available in version 1.3 and higher.

Beginning with version 1.3 you may pass an Internet name for an encoding.
This option is available with any “convert” command: “convert”, “convert text”, “convert clipboard”:

convert some_file from "ISO-8859-1" to "UTF-8"

For a complete list of available encodings and their Internet name equivalents, see the “Encoding Dictionary”.

Setting Options:

set option <an_option>

Options currently available:

NoAlert
NoAlertNoLog
DoAlert
DoAlertNoLog
NoAlertDoLog

The future
Cyclone is quite static now — the software is stable and not very buggy as far as I know. There are some feature requests and I have my own ideas to implement, but I am simply too busy with my regular job to do it now.

Small print
The author gives no warranty for this software and takes no responsibility for any damages that it may cause. If you cannot accept it, please delete your copy.
All trademarks are properties of their owners.

* the quotation is from Peter Hammill (“Gaia”).
