I have converted a number of freely available electronic dictionaries to bedic format so that I can use them with Frasmodic and ZBedic. You can find them on my dictionaries page.
Note: When you download the scripts, you should rename them to remove the .txt extension.
All the conversions I have done to date use scripts written in Perl. A few features of Perl make it ideal for this use: it's interpreted, it supports regular expressions, it handles Unicode, and I can write some really BASIC-looking code and test it quickly. Most of the time I cut and paste bits from other scripts, although I do plan how to implement processes and functions, like a good programmer should. The strength of using a script is that if I discover a mistake in my conversion, or a new source dictionary file is released, I can fix the problem or download the new file and run the script again.
My process is
At this stage I have a basic dictionary, but it probably has duplicated entries. Depending on how good your script writing abilities are, you may find mistakes at various stages. Checking the first and second versions probably means generating the dictionary more than 10 times and checking through definitions manually making sure they are correct.
I write out the basic bedic format and run mkbedic. I modified mkbedic so that the error messages give a little more detail about which entries are causing problems.
These scripts are actually quite simple but the duplicate fixing one was a headache to write. I'm sure that anyone writing bedic files will appreciate it.
My first tip is to always write neat, simple bedic format, i.e. always start a new sense on a new line, e.g.
entry
{s} First sense {ss}Sub sense 1{/ss} {/s}
Don't write entries like this:
entry{s}First sense{/s}{s}second sense{/s}
It's more difficult to read, it makes mistakes harder to spot, and you can't use my verify dictionary script! Never nest senses, i.e. never start a new sense before finishing the current one, or a new sub sense before finishing the previous one.
verifydic.pl is a simple script to check the format of a bedic file. It checks for extra blank lines between entries, matches opening and closing senses and looks for spaces at the end of definitions (which mess with lookup because zbedic matches against "entry " not "entry"). To use this script you must use very basic bedic format.
Just run verifydic.pl mydictionary.dic
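The three checks described above can be sketched roughly as follows. This is not verifydic.pl itself (which is Perl); it is a minimal Python illustration that assumes a simple layout in which each entry is a headword line followed by its sense lines, with one blank line between entries. The function name verify and the message strings are made up for the example.

```python
import re

# Matches {s}, {/s}, {ss} and {/ss}; group 1 is "/" for a closing tag.
TAG = re.compile(r"\{(/?)(ss?)\}")


def verify(lines):
    """Return a list of (line_number, message) problems (hypothetical helper)."""
    problems = []
    open_tags = []          # stack of sense tags that are currently open
    previous_blank = False

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")

        # Trailing spaces break lookup, so flag them on every line.
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))

        if line.strip() == "":
            # Assumption: exactly one blank line separates entries.
            if previous_blank:
                problems.append((number, "extra blank line"))
            if open_tags:
                problems.append((number, "entry ended with unclosed {%s}" % open_tags[-1]))
                open_tags = []
            previous_blank = True
            continue
        previous_blank = False

        for closing, name in TAG.findall(line):
            if not closing:
                if name in open_tags:
                    problems.append((number, "nested {%s}" % name))
                open_tags.append(name)
            elif open_tags and open_tags[-1] == name:
                open_tags.pop()
            else:
                problems.append((number, "unmatched {/%s}" % name))

    if open_tags:
        problems.append((len(lines), "file ended with unclosed {%s}" % open_tags[-1]))
    return problems
```

Calling verify(open("mydictionary.dic", encoding="utf-8").readlines()) would return an empty list for a clean file under these assumptions.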
If you have two definitions with the same entry (perhaps from two different senses), mkbedic will complain, and when you check the file in zbedic you will find that only the first definition is correct: the second entry is an exact copy of the first one. This can also be caused by search-ignore-characters (which defaults to "-") and char-precedence.
Assuming you have a duplicate as described above, it can be quite painful to have to find and fix thousands of entries in a >100,000 word dictionary. The following two scripts will find and fix duplicates.
My second tip is to write a conversion script that doesn't create duplicates in the first place. A bit obvious, I know.
checkdups.pl uses a hash to detect duplicates and writes their line numbers to a duplicate list file, so it's quite quick. You can view the duplicate list; it's designed to be human readable.
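To illustrate why a hash makes this quick, here is a minimal Python sketch of hash-based duplicate detection. It is not checkdups.pl and does not reproduce its duplicate list format. It assumes the headword is the first non-blank line after a blank line, and it compares headwords as exact strings, so it ignores search-ignore-characters and char-precedence. The name find_duplicates is hypothetical.

```python
def find_duplicates(lines):
    """Map each repeated headword to the line numbers where it appears."""
    seen = {}                 # headword -> list of line numbers (the "hash")
    at_entry_start = True     # next non-blank line is treated as a headword

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line == "":
            at_entry_start = True
        elif at_entry_start:
            # One dictionary lookup per entry, so the whole pass is linear.
            seen.setdefault(line, []).append(number)
            at_entry_start = False

    return {word: where for word, where in seen.items() if len(where) > 1}
```

Each headword costs a single lookup, which is the difference from comparing every entry against every other entry.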
fixdups.pl reads the duplicate list, generates a script for the ed editor found on most *nix systems and runs it. Fixing duplicates can take a long time for some dictionaries, maybe up to a couple of hours depending on your machine. The first version of this script didn't use a hash and it took a very long time.
These scripts are written to be accurate rather than fast; more experienced Perlmongers may find ways to speed them up. When your duplicate list file reaches a megabyte, you know you'll have a long wait. Hence, avoid creating duplicates.
checkdups.pl > duplicate-list
fixdups.pl
To use these scripts you'll have to edit a few variables at the start, to set the input and output files. At the end of this page I give an example of the complete procedure.
According to the bedic format document, headwords may additionally be defined for each sense, which leads one to believe that their scope is limited to the current sense. However, a headword is repeated in the following senses until it is redefined, e.g.
entry
{s} this is sense 1 {/s}
{s} {hw}new headword specific to sense 2{/hw} this is sense 2 {/s}
{s} this is sense 3 {/s}
Will get rendered as
entry
----
sense 1
----
/new headword specific to sense 2/
sense 2
----
/new headword specific to sense 2/
sense 3
which is probably not what you want.
The script fixhw.pl will reshuffle senses so that this is not a problem. It locates the entry and reads in one sense at a time: if the sense has no headword defined, it is output immediately; if it has a headword, it is stored until the end of the entry has been reached, at which point all stored senses are output. In the case of the above example, you will get sense 1, then sense 3, then sense 2.
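The reshuffling can be sketched in a few lines. This is a Python illustration of the idea only, not fixhw.pl; it assumes the senses of one entry have already been split into a list of strings, and the name reshuffle is made up for the example.

```python
def reshuffle(senses):
    """Yield senses without {hw} in their original order, then those with {hw}."""
    held = []                      # senses that define their own headword
    for sense in senses:
        if "{hw}" in sense:
            held.append(sense)     # store until the end of the entry
        else:
            yield sense            # no headword: output immediately
    yield from held                # end of entry: output everything stored
```

Because the senses that define a headword now come last, the headword no longer carries over into senses that were meant to use the plain entry word.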
If the senses in your file are in a specific order then you should always mark up the headword. My third tip is to detect that a headword was defined in a previous sense, so you know you should define it for the next one. Defining headwords for every sense regardless will bloat your dictionary.
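One way a conversion script might apply the third tip is sketched below in Python. It rests on an assumption that is not confirmed by the bedic format document: that a carried-over headword can be reset by redefining it as the entry word itself. The function name mark_up and the (headword, text) input shape are hypothetical.

```python
def mark_up(entry, senses):
    """Build sense lines, emitting {hw} only when the headword changes.

    senses is a list of (headword_or_None, text) pairs in display order.
    """
    lines = []
    current = entry                      # headword currently in effect
    for headword, text in senses:
        wanted = headword or entry       # None means "use the entry word"
        if wanted != current:
            # Assumption: redefining the headword as the entry word resets it.
            lines.append("{s} {hw}%s{/hw} %s {/s}" % (wanted, text))
            current = wanted
        else:
            lines.append("{s} %s {/s}" % text)
    return lines
```

Only the senses where the headword actually changes carry a {hw} tag, which keeps the order intact without marking up every sense.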
Bedic can handle inequality signs (">" and "<") in entries, but zbedic uses Qt classes to render files. Inequality signs in an entry will not be shown because they are confused with HTML; however, if you replace them with the HTML equivalents (&gt; and &lt;) then the sort algorithm will compare the HTML. There is no fix for dictionary writers that I can see; zbedic should parse out inequality signs.
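The effect on ordering can be seen with a small Python illustration. It uses a plain code point sort, which is only a stand-in: bedic's real ordering depends on char-precedence, so the exact positions may differ, but the comparison still sees "&gt;" rather than ">". The sample entries are made up.

```python
import html

entries = ["a=b", "a>b"]

# Escape the inequality signs the way an HTML renderer would expect them.
escaped = [html.escape(e) for e in entries]

raw_order = sorted(entries)          # compares "=" against ">"
escaped_order = sorted(escaped)      # compares "=" against "&"
```

Here raw_order keeps "a=b" first, while escaped_order puts "a&gt;b" first, so the escaped entry would no longer sit where a user searching for "a>b" expects it.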
As an example, suppose I'm converting an English-Kazakh dictionary:
iconv
parse-kazakh.pl (my script to generate the bedic file, output = en-ka.bedic; I give simple bedic format files the extension .bedic)
verifydic.pl en-ka.bedic (check the dictionary)
checkdups.pl > duplicate-list (detect duplicates)
fixdups.pl (fix duplicates, output = en-ka.bedic-new)
fixhw.pl (fix headwords, output = en-ka.bedic-hwfix)
mkbedic en-ka.bedic-hwfix en-ka.dic (convert to full bedic format, don't ignore any errors)
dictzip en-ka.dic (compress the dictionary)