THE GETFROMCELEX UTILITY

GetFromCelex is a simple tool to extract files from the Celex linguistic
database. You can filter the output to select the lemmas or wordforms you
need.

For more information on Celex have a look at the Celex User Guide. At the
CBU this is in the language group directory at location:
//Home/language/Celex2/English/EUG_A4.PS
You will need a PostScript viewer, like GSView, to be able to open and
read the user guide. The guide will tell you all the fieldnames in Celex
and the data they contain.

Not all fields that are mentioned in the Celex manual are actually
available in our version of the database. If you want to know which
fields you can use, there is a file named "FieldListEnglish.txt" that
lists all fields, with some additional information attached.

GetFromCelex will only run on a Windows system (95, 98, NT or 2000),
and has to be called from a (DOS) command prompt like this:

	> GetFromCelex myscript.txt

Here 'myscript.txt' is a GetFromCelex script file that you create yourself,
containing a few simple commands. You can create a script file using any
text editor, such as Notepad.

Alternatively, you can just double-click on the GetFromCelex program in
Windows Explorer. It will launch a file-open dialog, asking you for the
scriptfile. This enables users without DOS skills to still run the program.


An example of a GetFromCelex scriptfile is:


	SetCelexDir L:\Celex2\English

	OutputFile C:\Compounds\output.txt

	Filter FlatSA = "*S*S*"    // Select only compounds
	Filter W:Cob > 500         // Cobuild frequency

	Output Word W:Cob MorphStatus FlatSA W:PhonStrsDISC


The first thing to know is that the first word on each line should be a
valid GetFromCelex command. All commands are case sensitive.
Everything following "//" is a comment, and will be ignored by the program.

The first line, with the SetCelexDir command, tells GetFromCelex where the
Celex files are to be found. The normal location is in the language group
directory. You will have to map a network drive to this directory for
GetFromCelex to work: the program cannot access Celex directly through the
network. Mapping a network drive can be done in Windows Explorer ('Tools' menu).
If you cannot access the language group directories, you can install Celex
on your own computer: just ask me for the Celex CD.

The directory that has to be provided is the one containing all the subdirectories
for the different Celex files. This will normally be the 'ENGLISH' directory, or
the 'DUTCH' or 'GERMAN' directories for the other languages.

The next command is "OutputFile" and should specify the name of the file where
the output will be written. An existing file with the same name will be overwritten!

The next two commands are "Filter" commands. Only entries that satisfy these
are written to the outputfile. If you supply more than one filter, only entries
that satisfy all filters are selected.

Wildcard expressions are accepted: '*' matches any sequence of characters,
including none at all, and '?' matches exactly one character. All other
characters have to be matched literally. Wildcards can only appear on the
right side of an expression and can only be used with the '=' and '!='
operators.
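
The wildcard rules can be illustrated with a small Python sketch. This is not
the code GetFromCelex uses; the function name wildcard_match is made up for
the illustration, and case-sensitive matching is an assumption.

```python
import re

def wildcard_match(value, pattern):
    """Return True if value matches a pattern with '*' and '?' wildcards."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any run of characters, or none
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    regex = re.compile("".join(parts), re.DOTALL)
    # The whole field has to match, not just a part of it.
    return regex.fullmatch(value) is not None
```

With this sketch the pattern "*S*S*" accepts any value that contains at least
two 'S' characters, which is how the compound filter in the example script
works.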

Wildcards and string literals must be in double quotes; everything else will be
interpreted as a numeric value or a CELEX fieldname.

The operators you can use are:
	
	=	Equal, can be used with wildcards
	!=	Not equal, can also be used with wildcards
	<=	Smaller than or equal to
	>=	Greater than or equal to
	<	Smaller than
	>	Greater than

Filters can be combined using an OR operator like this:

	Filter FlectType = "S" OR FlectType = "P" OR Length Word > 5
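
How separate Filter lines and OR interact can be sketched in Python as
follows. This only illustrates the logic described here (alternatives on one
line are OR-ed, separate lines are AND-ed); the dictionary keys and the
function name passes are made up, and it is not the program's own code.

```python
def passes(entry, filter_lines):
    """True if the entry satisfies at least one test on every Filter line."""
    return all(any(test(entry) for test in line) for line in filter_lines)

# Two hypothetical Filter lines: the first has three OR-ed alternatives,
# the second has a single test.
filter_lines = [
    [lambda e: e["FlectType"] == "S",
     lambda e: e["FlectType"] == "P",
     lambda e: len(e["Word"]) > 5],
    [lambda e: e["Cob"] > 500],
]
```

An entry is selected only if it gets through both lines, but for the first
line any one of the three alternatives is enough.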

The last command in the example, "Output", specifies which fields should
be written to the outputfile. Fields will be written in the order you specify,
with '\' characters as field separators like in the original Celex files, or
any other separator specified with the OutputSeparator command.

The names of the fields, like 'FlatSA', are identical to the names used in the
CELEX manual. The only difference is that you will have to prefix some with
"L:" or "W:" to select fields from either the Lemma or the WordForm lexicon.
In the example script "Cob" could refer to the Cobuild frequency of the Lemma
or that of the Wordform. Using "W:Cob" disambiguates this for the program.


		ADVANCED FEATURES

Sometimes you only want information from Celex for a limited number of words.
You can do this by using a 'master' file, like this:

	MasterFile [-Celex] C:\Experiment1\Condition2.txt mstr

The last field ('mstr') is a nickname that you give the file so you can refer to it.
The -Celex option has to be used with masterfiles that have the Celex file format,
with backslashes as field separators and possibly empty fields.

This masterfile can be used in filters so you can limit your output to the words
in the masterfile like this (assuming the words to filter on are in the first
field of your masterfile):

	Filter mstr[1] = Word

As you can see you use the nickname and a fieldposition to refer to fields in your
masterfile. 
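
As a rough illustration of field positions in a Celex-format masterfile (the
-Celex case), here is a Python sketch. It assumes one entry per line with
backslashes between fields; the function names are made up, and the format
of masterfiles without -Celex is not covered.

```python
def master_field(line, position):
    """Return the field at a 1-based position in a backslash-separated line."""
    fields = line.rstrip("\n").split("\\")  # empty fields are kept
    return fields[position - 1]

def master_values(lines, position):
    """Collect the values found at one field position, skipping blank lines."""
    return {master_field(line, position) for line in lines if line.strip()}
```

So mstr[1] corresponds to position 1 in this sketch, and an empty field
still counts as a field when positions are numbered.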

When using a masterfile, GetFromCelex will need much more time to produce the
output: every line in the masterfile will take several seconds (or more if your
computer is slow). So, a masterfile of 500 lines could easily take more than 15
minutes!

	
		FILTERS

There is a filter with which you can select fields on the basis of their length:

	Filter Length Word > 5
	Filter Length Word <= 15
	
The example will select only 'Word' fields that contain strings longer than 
5 characters and shorter than 16.

You can tell GetFromCelex to ignore certain characters in the length count:

	Filter Length L:PhonSylBCLX < 3 Ignore "[]" // number of phonemes

In this case the [ and ] brackets indicating syllables are ignored, resulting
in a count of the number of phonemes in a word.
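
The effect of Ignore on the length count can be sketched in Python like this.
The function name field_length is made up, and the sketch only shows the
counting rule described above.

```python
def field_length(value, ignore=""):
    """Count the characters in value, skipping any character listed in ignore."""
    return sum(1 for ch in value if ch not in ignore)
```

For a made-up field value "[ab][cd]" the plain length is 8, but with "[]"
ignored it is 4.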


There is also a special filter available that can count the number of substrings
in a field. It is used like this:

	Filter Count FlatSA "S" = 2
	Filter Count FlatSA "SA" < 3

In the first case only entries that contain exactly 2 occurrences of the
character "S" will be output. The second filter makes GetFromCelex only
output entries that have no more than two occurrences of "SA" embedded in
the given field.
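
A Python sketch of the counting: the function name is made up, and it assumes
that occurrences are counted without overlap (Python's own behaviour), which
may or may not be what GetFromCelex does for substrings that can overlap
with themselves.

```python
def count_occurrences(value, substring):
    """Count non-overlapping occurrences of substring in value."""
    return value.count(substring)
```

For the made-up value "SASA" this gives 2 for "S" and 2 for "SA", so it would
get through both example filters.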


If you only want to include fields that contain certain characters, or that
do not contain certain characters, you can use these commands:

	Filter CharSet Word "abcdefghijklmnopqrstuvwxyz"
	Filter NotCharSet Word "_&$"
	
The first line will select 'Word' values that consist only of lowercase letters,
and the second one will exclude 'Word' values that contain a '_', '&' or '$'
character.
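
In Python terms the two tests could be sketched as below. The function names
are made up, and treating an empty field as passing both tests is an
assumption.

```python
def charset(value, allowed):
    """True if every character of value is in the allowed set."""
    return all(ch in allowed for ch in value)

def not_charset(value, forbidden):
    """True if no character of value is in the forbidden set."""
    return not any(ch in forbidden for ch in value)
```

Note the difference: CharSet requires every character to be in the set,
while NotCharSet rejects a value as soon as one character is in the set.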


Normally GetFromCelex will only output wordforms for which all filters (if any)
are true. But sometimes you want to see the other wordforms too. If you want to
look for words that have a plural form that is identical to the singular, you
would probably be interested to see if there is another, not identical, plural.
You can make GetFromCelex output all wordforms for each lemma that has at least
one wordform that gets through all filters by using this command:

	OutputAllWordforms

By default, only the wordforms that match the criteria of all filters are
copied to the output.
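
The difference between the two modes can be sketched in Python. This is only
an illustration of the selection rule; the dictionary key "lemma", the
function name and the parameter names are made up.

```python
def select_wordforms(wordforms, passes, output_all=False):
    """Pick the wordforms to output, with or without OutputAllWordforms."""
    if not output_all:
        # Default: only the wordforms that get through the filters.
        return [w for w in wordforms if passes(w)]
    # OutputAllWordforms: every wordform of any lemma that has at least
    # one wordform getting through the filters.
    lemmas = {w["lemma"] for w in wordforms if passes(w)}
    return [w for w in wordforms if w["lemma"] in lemmas]
```

With output_all set, a lemma is either output with all of its wordforms or
not at all.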


You can also change the output field separator. This is set to \ by default,
but can be set to something else, like a single space, with:

	OutputSeparator " "
	
If you want to specify a tab you need to use "\t"; for a backslash use "\\".
You can use multiple characters if you want; the program will just insert
the string as given.
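
How the separator text is turned into the actual separator can be sketched in
Python. The function names are made up, and only the two escapes mentioned
above are handled; how GetFromCelex treats any other backslash sequence is
not known, so the sketch leaves those untouched.

```python
def decode_separator(text):
    """Turn the quoted OutputSeparator text into the string to insert."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "\\t":
            out.append("\t")     # \t becomes a tab
            i += 2
        elif pair == "\\\\":
            out.append("\\")     # \\ becomes a single backslash
            i += 2
        else:
            out.append(text[i])  # anything else is copied as given
            i += 1
    return "".join(out)

def join_fields(fields, separator_text="\\\\"):
    """Join output fields; the default corresponds to a single backslash."""
    return decode_separator(separator_text).join(fields)
```

A multi-character separator such as ", " is simply inserted between the
fields as it stands.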


If you have problems, or want to ask a question, please contact me.


Maarten

Maarten.van-Casteren@mrc-cbu.cam.ac.uk