I have converted a number of freely available electronic dictionaries to bedic format so that I can use them with my Zaurus PDA. You can find them at my Zaurus Dictionaries page and there are more on my Thai Dictionaries page.

Note: When you download the scripts, you should rename them to remove the .txt extension.

How to Convert a Dictionary

All the conversions I have done to date use scripts written in Perl. A few features of Perl make it ideal for this use: it is interpreted, has built-in regular expressions, handles Unicode, and lets me write some really BASIC-looking code and test it quickly. Most of the time I cut and paste bits from other scripts, although I do plan how to implement processes and functions, like a good programmer should. The strength of using a script is that if I discover a mistake in my conversion I can fix it and run the script again, and if a new source dictionary file is released I can download it and do the same.

My process is

At this stage I have a basic dictionary, but it probably has duplicated entries. Depending on how good your script writing abilities are, you may find mistakes at various stages. Checking the first and second versions probably means generating the dictionary more than 10 times and checking through the definitions manually to make sure they are correct.

I write out the basic bedic format and run mkbedic. I modified mkbedic so that the error messages give a little more detail about which entries are causing problems.

Tip Summary

  1. Write neat, simple bedic format
  2. Write a conversion script that doesn't create duplicates
  3. Detect that a headword was defined in a previous sense

Scripts

These scripts are actually quite simple but the duplicate fixing one was a headache to write. I'm sure that anyone writing bedic files will appreciate it.

Getting the Basic Format Right

My first tip is to always write neat, simple bedic format, i.e. always start a new sense on a new line. For example:

entry
{s}
First sense
{ss}Sub sense 1{/ss}
{/s}

Don't write entries like this:

entry{s}First sense{/s}{s}second sense{/s}

It's more difficult to read, mistakes are harder to spot, and you can't use my verify dictionary script! Never nest senses, i.e. don't start a new sense before finishing the current one, and don't start a new sub sense before finishing the previous one.

verifydic.pl is a simple script to check the format of a bedic file. It checks for extra blank lines between entries, matches opening and closing sense tags, and looks for spaces on the end of definitions (which interferes with lookup because zbedic matches against "entry " not "entry"). To use this script you must use very basic bedic format.

Just run verifydic.pl mydictionary.dic
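
The checks described above can be sketched roughly as follows. This is written in Python for illustration and is not the actual verifydic.pl. The function name verify is hypothetical, and the sketch assumes the simple layout shown earlier: sense tags written as {s} and {/s}, and entries separated by a single blank line.

```python
import re

def verify(lines):
    """Return a list of (line_number, message) problems found in simple bedic text."""
    problems = []
    open_sense = None   # line number of the currently open {s}, if any
    blank_run = 0       # consecutive blank lines seen so far
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line == "":
            blank_run += 1
            # Assumption: one blank line between entries is normal;
            # a second one is reported once per run.
            if blank_run == 2:
                problems.append((number, "extra blank line"))
            continue
        blank_run = 0
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
        # Matches {s} and {/s} only; {ss} and {/ss} are not checked here.
        for tag in re.findall(r"\{/?s\}", line):
            if tag == "{s}":
                if open_sense is not None:
                    problems.append((number, "nested {s}"))
                open_sense = number
            else:
                if open_sense is None:
                    problems.append((number, "{/s} without {s}"))
                open_sense = None
    if open_sense is not None:
        problems.append((open_sense, "{s} never closed"))
    return problems
```

Calling verify(open("mydictionary.dic", encoding="utf-8")) would return an empty list for a clean file. Header lines and other bedic tags are ignored by this sketch.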

Duplicates

If you have two definitions with the same entry (perhaps from two different senses), mkbedic will complain, and when you check the file in zbedic you will find that only the first definition is correct; the second entry is an exact copy of the first one. This can also be caused by

Assuming you have a duplicate as described above, it can be quite painful to have to find and fix thousands of entries in a >100,000 word dictionary. The following two scripts will find and fix duplicates.

My second tip is to write a conversion script that doesn't create duplicates in the first place. A bit obvious, I know.

checkdups.pl uses a hash to detect duplicates and writes their line numbers to a duplicate list file, so it's quite quick. You can view the duplicate list; it's designed to be human readable.
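
The hash idea can be sketched like this. It is an illustration in Python, not the actual checkdups.pl, and it does not reproduce the real duplicate list format. The function name find_duplicates is hypothetical, and the sketch assumes that an entry is the first line of the file or the first line after a blank line.

```python
def find_duplicates(lines):
    """Map each repeated entry to the line numbers where it appears."""
    seen = {}               # entry text -> list of line numbers (the "hash")
    previous_blank = True   # assumption: the first line of the file starts an entry
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line == "":
            previous_blank = True
            continue
        if previous_blank:
            # First non-blank line after a blank line is treated as an entry.
            seen.setdefault(line, []).append(number)
        previous_blank = False
    # Keep only the entries that occur more than once.
    return {entry: numbers for entry, numbers in seen.items() if len(numbers) > 1}
```

Each entry is looked up once in the dictionary, so the file is read in a single pass instead of comparing every entry with every other entry.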

fixdups.pl reads the duplicate list, generates a script for the ed editor found on most *nix systems and runs it. Fixing duplicates can take a long time for some dictionaries, maybe up to a couple of hours depending on your machine. The first version of this script didn't use a hash and it took a very long time.

These scripts are written to be accurate rather than fast; more experienced Perlmongers may find ways to speed them up. When your duplicate list file reaches a megabyte you know you'll have a long wait. Hence, avoid creating duplicates.

 checkdups.pl > duplicate-list
 fixdups.pl

To use these scripts you'll have to edit a few variables at the start, to set the input and output files. At the end of this page I give an example of the complete procedure.

Headword Fix

According to the bedic format document, headwords may be additionally defined for each sense, which leads one to believe that their scope is limited to the current sense. However, a headword is repeated in the following senses until it is redefined, e.g.

entry
{s}
this is sense 1
{/s}
{s}
{hw}new headword specific to sense 2{/hw}
this is sense 2
{/s}
{s}
this is sense 3
{/s}

Will get rendered as

entry
----
sense 1
----
/new headword specific to sense 2/
sense 2
----
/new headword specific to sense 2/
sense 3

which is probably not what you want.

The script fixhw.pl reshuffles senses so that this is not a problem. It locates the entry and reads in one sense at a time: if the sense has no headword defined, it is output immediately; if it has a headword, it is stored until the end of the entry has been reached, at which point all stored senses are output. In the case of the above example, you will get sense 1, followed by sense 3 and then sense 2.
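
The reshuffle described above might look something like this for a single entry. It is a Python sketch rather than the actual fixhw.pl. The function name reshuffle_entry is hypothetical, and the input is assumed to be well formed simple bedic format with {s} and {/s} on their own lines.

```python
def reshuffle_entry(lines):
    """Reorder the senses of one entry, given as a list of lines without newlines."""
    output = []      # the entry line and the senses without a headword
    deferred = []    # senses that define {hw}, held back until the end
    current = None   # lines of the sense being read, or None outside a sense
    for line in lines:
        if line == "{s}":
            current = [line]
        elif current is None:
            output.append(line)   # the entry line itself
        else:
            current.append(line)
            if line == "{/s}":
                has_headword = any("{hw}" in part for part in current)
                # Senses with a headword wait; the others go out immediately.
                (deferred if has_headword else output).extend(current)
                current = None
    return output + deferred
```

Because the senses that carry a headword come last, no later sense without a headword can inherit one.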

If the senses in your file are in a specific order then you should always mark up the headword. My third tip is to detect that a headword was defined in a previous sense, so you know you should define it for the next one. Blanketly defining headwords for every sense will bloat your dictionary.
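
One way a conversion script could follow the third tip is sketched below in Python. The function name write_entry and the (headword, text) pairs are hypothetical, and the sketch assumes that a sense without its own headword should show the entry word, which may not suit every dictionary.

```python
def write_entry(entry, senses):
    """Build simple bedic lines; senses is a list of (headword or None, text)."""
    lines = [entry]
    carried = entry   # headword that would carry over from the previous sense
    for headword, text in senses:
        # Assumption: a sense with no headword of its own uses the entry word.
        wanted = headword if headword is not None else entry
        lines.append("{s}")
        if wanted != carried:
            # Only write {hw} when the carried-over headword would be wrong.
            lines.append("{hw}" + wanted + "{/hw}")
            carried = wanted
        lines.append(text)
        lines.append("{/s}")
    return lines
```

The senses keep their original order, and a {hw} tag is written only where the headword changes, so the file does not grow more than it has to.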

Inequality Signs

Bedic can handle inequality signs (">" and "<") in entries, but zbedic uses Qt classes to render files. Inequality signs in an entry will not be shown because they are mistaken for HTML. However, if you replace them with the HTML equivalents (&gt; and &lt;), the sort algorithm will compare the HTML text rather than the signs. There is no fix for dictionary writers that I can see; zbedic should parse out inequality signs.
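
A small Python illustration of why the replacement changes the ordering. It assumes a plain character-by-character comparison, which may not be exactly how bedic sorts, and the two sample entries are made up.

```python
import html

words = ["a1", "a<b"]

# Ordering of the entries as written.
raw_order = sorted(words)

# Ordering when "<" has been replaced by "&lt;" before comparing.
escaped_order = sorted(words, key=html.escape)

print(raw_order)      # ['a1', 'a<b']
print(escaped_order)  # ['a<b', 'a1']
```

The "&" of "&lt;" sorts before the digit "1", while "<" sorts after it, so the two entries swap places once the sign is escaped.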

Putting it all Together

As an example, suppose I'm converting an English-Kazakh dictionary:

  1. Convert the source dictionary file to utf-8 using iconv
  2. Run parse-kazakh.pl (my script to generate the bedic file, output = en-ka.bedic; I give simple bedic format files the extension .bedic)
  3. verifydic.pl en-ka.bedic (check the dictionary)
  4. checkdups.pl > duplicate-list (detect duplicates)
  5. fixdups.pl (fix duplicates, output = en-ka.bedic-new)
  6. fixhw.pl (fix headwords, output = en-ka.bedic-hwfix)
  7. mkbedic en-ka.bedic-hwfix en-ka.dic (convert to full bedic format, don't ignore any errors)
  8. dictzip en-ka.dic (compress the dictionary)
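
Steps 3 to 8 could be chained so that the run stops at the first failure. The following is a Python sketch, not something I ship with the scripts. The names STEPS, run_step and run_pipeline are hypothetical, and it assumes the scripts are executable, on your PATH, already edited to use the right input and output files, and that they return a non-zero exit status when something goes wrong.

```python
import subprocess

# (command, file that receives stdout or None), using the file names from the list above.
STEPS = [
    (["verifydic.pl", "en-ka.bedic"], None),
    (["checkdups.pl"], "duplicate-list"),
    (["fixdups.pl"], None),
    (["fixhw.pl"], None),
    (["mkbedic", "en-ka.bedic-hwfix", "en-ka.dic"], None),
    (["dictzip", "en-ka.dic"], None),
]

def run_step(command, stdout_path):
    """Run one command, optionally redirecting stdout; return its exit status."""
    if stdout_path is None:
        return subprocess.run(command).returncode
    with open(stdout_path, "w") as out:
        return subprocess.run(command, stdout=out).returncode

def run_pipeline(steps, runner=run_step):
    """Run the steps in order; return the first failing command, or None."""
    for command, stdout_path in steps:
        # Assumption: a non-zero exit status means the step failed.
        if runner(command, stdout_path) != 0:
            return command
    return None
```

run_pipeline(STEPS) returns None when every step succeeds. Messages printed by verifydic.pl and mkbedic still need to be read, since a script may report problems without failing.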