		THE GETFROMCELEX UTILITY

GetFromCelex is a simple tool to extract files from the Celex linguistic
database. You can filter the output to select the lemmas or wordforms you
need.

For more information on Celex have a look at the Celex User Guide. At the
CBU this is in the language group directory at location:
//Home/language/Celex2/English/EUG_A4.PS
You will need a postscript viewer, like GSView, to be able to open and
read the user guide. The guide will tell you all the fieldnames in Celex
and the data they contain.

Not all fields that are mentioned in the Celex manual are actually
available in our version of the database. If you want to know which
fields you can use, there is a file named "FieldListEnglish.txt" that
lists all fields, with some additional information attached.

GetFromCelex will only run on a Windows system (95, 98, NT or 2000),
and has to be called from a (DOS) command prompt like this:

	> GetFromCelex myscript.txt

Here 'myscript.txt' should be a GetFromCelex script file that you created
yourself, containing a few simple commands. You can create a script file
using any text editor, like Notepad.

Alternatively, you can just double-click on the GetFromCelex program in
Windows Explorer. It will launch a file-open dialog, asking you for the
scriptfile. This enables users without DOS skills to still run the program.


An example of a GetFromCelex scriptfile is:


	SetCelexDir L:\Celex2\English

	OutputFile C:\Compounds\output.txt

	Filter FlatSA = "*S*S*"    // Select only compounds
	Filter W:Cob > 500         // Cobuild frequency

	Output Word W:Cob MorphStatus FlatSA W:PhonStrsDISC


The first thing to know is that the first word on each line should be a
valid GetFromCelex command. All commands are case sensitive.
Everything following "//" is a comment, and will be ignored by the program.

The first line, with the SetCelexDir command, tells GetFromCelex where the
Celex files are to be found. The normal location is in the language group
directory. You will have to map a network drive to this directory for
GetFromCelex to work: the program cannot access Celex directly through the
network. Mapping a network drive can be done in Windows Explorer ('Tools' menu).
If you cannot access the language group directories, you can install Celex
on your own computer: just ask me for the Celex CD.

The directory that has to be provided is the one containing all the subdirectories
for the different Celex files. This will normally be the 'ENGLISH' directory, or
the 'DUTCH' or 'GERMAN' directories for the other languages.

The next command is "OutputFile" and should specify the name of the file where
the output will be written. An existing file with the same name will be overwritten!

The next two commands are "Filter" commands. Only entries that satisfy these
are written to the outputfile. If you supply more than one filter, only entries
that satisfy all filters are selected.

Wildcard expressions are accepted: '*' matches any sequence of characters,
including none at all, and '?' matches precisely one character. All other
characters have to be matched literally. Wildcards can only appear on the
right side of an expression and can only be used with the '=' and '!=' operators.
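
The wildcard rules above can be tried out with a small Python sketch. This is
only an illustration of the rules as described here, not the code GetFromCelex
itself uses: the function names are made up for the example, and it assumes
that matching is case sensitive and that a pattern has to match the whole field.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a '*' / '?' wildcard pattern into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any run of characters, including none
        elif ch == "?":
            parts.append(".")            # precisely one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.compile("".join(parts), re.DOTALL)

def wildcard_match(value: str, pattern: str) -> bool:
    # Assumption: the pattern must cover the whole field, case sensitively.
    return wildcard_to_regex(pattern).fullmatch(value) is not None

if __name__ == "__main__":
    for value in ["SS", "NSAS", "SA"]:
        print(value, wildcard_match(value, "*S*S*"))
```

With the pattern "*S*S*" from the example script, any value containing at least
two S characters matches, while a value with only one S does not.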

Wildcards and string literals must be in double quotes; everything else will be
interpreted as a numeric value or a CELEX fieldname.

The operators you can use are:

	=	Equal, can be used with wildcards
	!=	Not equal, can also be used with wildcards
	<=	Smaller than or equal to
	>=	Greater than or equal to
	<	Smaller than
	>	Greater than

Filters can be combined using an OR operator like this:

	Filter FlectType = "S" OR FlectType = "P" OR Length Word > 5
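
How several filters work together can be sketched in Python as follows. This is
an illustration under assumptions, not the program's actual logic: the entries
are made-up dictionaries, and each Filter line is represented as a list of
alternative tests joined by OR.

```python
def passes(entry, filter_lines):
    """An entry is selected if every Filter line holds.

    Each Filter line is a list of tests; the line holds as soon as one
    of its OR-alternatives is true.
    """
    return all(any(test(entry) for test in line) for line in filter_lines)

# Two Filter lines: the OR example above, plus a frequency filter.
filter_lines = [
    [
        lambda e: e["FlectType"] == "S",
        lambda e: e["FlectType"] == "P",
        lambda e: len(e["Word"]) > 5,
    ],
    [lambda e: e["Cob"] > 500],
]

# Hypothetical entries, invented for this example only.
entries = [
    {"Word": "house", "FlectType": "S", "Cob": 900},
    {"Word": "walked", "FlectType": "X", "Cob": 700},
    {"Word": "ran", "FlectType": "X", "Cob": 800},
    {"Word": "mice", "FlectType": "P", "Cob": 100},
]

if __name__ == "__main__":
    for entry in entries:
        print(entry["Word"], passes(entry, filter_lines))
```

In this sketch "ran" fails the first line (no alternative is true) and "mice"
fails the second, so only "house" and "walked" would be selected.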

The last command in the example, "Output", specifies which fields should
be written to the outputfile. Fields will be written in the order you specify,
with '\' characters as field separators, as in the original Celex files, or
with any other separator specified with the OutputSeparator command.

The names of the fields, like 'FlatSA', are identical to the names used in the
CELEX manual. The only difference is that you will have to prefix some with
"L:" or "W:" to select fields from either the Lemma or the WordForm lexicon.
In the example script "Cob" could refer to the Cobuild frequency of the Lemma
or that of the Wordform. Using "W:Cob" disambiguates this for the program.


		ADVANCED FEATURES

Sometimes you only want information from Celex for a limited number of words.
You can do this by using a 'master' file, like this:

	MasterFile [-Celex] C:\Experiment1\Condition2.txt mstr

The last field ('mstr') is a nickname that you give the file so you can refer to it.
The -Celex option (shown in square brackets because it is optional) has to be used
with masterfiles that have the Celex file format, with backslashes as field
separators and possibly empty fields.

This masterfile can be used in filters so you can limit your output to the words
in the masterfile like this (assuming the words to filter on are in the first
field of your masterfile):

	Filter mstr[1] = Word

As you can see, you use the nickname and a field position to refer to fields in your
masterfile.

When using a masterfile, GetFromCelex will need much more time to produce the
output: every line in the masterfile will take several seconds (or more if your
computer is slow). So, a masterfile of 500 lines could easily take more than 15
minutes!


		FILTERS

There is a filter with which you can select fields on the basis of their length:

	Filter Length Word > 5
	Filter Length Word <= 15

The example will select only 'Word' fields that contain strings longer than
5 characters and shorter than 16.

You can tell GetFromCelex to ignore certain characters in the length count:

	Filter Length L:PhonSylBCLX < 3 Ignore "[]" // number of phonemes

In this case the [ and ] brackets indicating syllables are ignored, resulting
in a count of the number of phonemes in a word.
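
The effect of the Ignore option can be sketched in Python like this. It is a
minimal illustration, not the program's own code; the function name and the
bracketed sample string are invented for the example.

```python
def field_length(value: str, ignore: str = "") -> int:
    """Count the characters in value, skipping any that occur in ignore."""
    return sum(1 for ch in value if ch not in ignore)

if __name__ == "__main__":
    sample = "[ab][cd]"                 # made-up value with syllable brackets
    print(field_length(sample))         # counts every character
    print(field_length(sample, "[]"))   # counts only what is between brackets
```

Each character listed after Ignore is skipped on its own, so "[]" drops both
the opening and the closing brackets from the count.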


There is also a special filter available that can count the number of substrings
in a field. It is used like this:

	Filter Count FlatSA "S" = 2
	Filter Count FlatSA "SA" < 3

In the first case only entries that contain exactly 2 occurrences of the
character "S" will be output. The second filter makes GetFromCelex only
output entries that have no more than two occurrences of "SA" embedded in
the given field.
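
As a rough Python sketch, counting substrings could look like this. Note the
assumption: Python's str.count counts non-overlapping occurrences, and this
manual does not say how GetFromCelex treats overlapping ones. The function
name and the sample values are made up.

```python
def count_occurrences(value: str, substring: str) -> int:
    """Number of non-overlapping occurrences of substring in value."""
    return value.count(substring)

if __name__ == "__main__":
    for value in ["SAS", "SASASA", "N"]:   # invented sample values
        exactly_two_s = count_occurrences(value, "S") == 2
        fewer_than_three_sa = count_occurrences(value, "SA") < 3
        print(value, exactly_two_s, fewer_than_three_sa)
```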


If you only want to include fields that contain certain characters, or that
do not contain certain characters, you can use these commands:

	Filter CharSet Word "abcdefghijklmnopqrstuvwxyz"
	Filter NotCharSet Word "_&$"

The first line will select 'Word' values that consist only of lowercase letters,
and the second one will exclude Words that contain a '_', '&' or '$' character.
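
In Python terms the two tests can be sketched as below. This is just an
illustration with made-up function names; in particular, how GetFromCelex
treats an empty field is not covered by this manual, and the sketch simply
lets an empty value pass both tests.

```python
def char_set(value: str, allowed: str) -> bool:
    """True if every character of value occurs in allowed."""
    return all(ch in allowed for ch in value)

def not_char_set(value: str, forbidden: str) -> bool:
    """True if no character of value occurs in forbidden."""
    return not any(ch in forbidden for ch in value)

if __name__ == "__main__":
    lowercase = "abcdefghijklmnopqrstuvwxyz"
    for word in ["house", "House", "R&D"]:
        print(word, char_set(word, lowercase), not_char_set(word, "_&$"))
```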


Normally GetFromCelex will only output wordforms for which all filters (if any)
are true. But sometimes you want to see the other wordforms too. For example, if
you are looking for words that have a plural form that is identical to the
singular, you would probably be interested to see if there is another, not
identical, plural. You can make GetFromCelex output all wordforms for each lemma
that has at least one wordform that gets through all filters by using this command:

	OutputAllWordforms

Without this command, only the wordforms that match the criteria of all filters
are copied to the output.
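
The difference between the two modes can be sketched in Python as follows. The
data layout, the names and the example words are all invented for illustration;
this is not how GetFromCelex stores or processes its data.

```python
def select_wordforms(wordforms, passes, output_all=False):
    """Return the wordforms to write to the output.

    Default: only wordforms that pass the filters.
    output_all: every wordform of each lemma with at least one passing wordform.
    """
    if not output_all:
        return [wf for wf in wordforms if passes(wf)]
    lemmas_with_hit = {wf["lemma"] for wf in wordforms if passes(wf)}
    return [wf for wf in wordforms if wf["lemma"] in lemmas_with_hit]

# Hypothetical wordforms: lemma, spelling, and S(ingular) / P(lural).
wordforms = [
    {"lemma": "fish", "word": "fish", "type": "S"},
    {"lemma": "fish", "word": "fish", "type": "P"},
    {"lemma": "fish", "word": "fishes", "type": "P"},
    {"lemma": "dog", "word": "dog", "type": "S"},
    {"lemma": "dog", "word": "dogs", "type": "P"},
]

def unchanged_plural(wf):
    # Plural wordform spelled the same as its lemma.
    return wf["type"] == "P" and wf["word"] == wf["lemma"]

if __name__ == "__main__":
    print([wf["word"] for wf in select_wordforms(wordforms, unchanged_plural)])
    print([wf["word"] for wf in select_wordforms(wordforms, unchanged_plural, True)])
```

In the second call the alternative plural "fishes" shows up as well, which is
exactly the kind of wordform the default mode would hide.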


You can also change the output field separator. This is set to \ by default,
but can be set to something else, like a single space, with:

	OutputSeparator " "

If you want to specify a tab you need to use "\t"; for a backslash use "\\".
You can use multiple characters if you want; the program will just insert
the string as given.
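
How the separator string is turned into output can be sketched in Python like
this. It is only an illustration of the two escapes mentioned above; the
function name is made up, and what GetFromCelex does with any other backslash
combination is not described here (the sketch leaves it unchanged).

```python
def decode_separator(spec: str) -> str:
    """Turn the text between the quotes into the separator string."""
    out = []
    i = 0
    while i < len(spec):
        nxt = spec[i + 1] if i + 1 < len(spec) else ""
        if spec[i] == "\\" and nxt == "t":
            out.append("\t")       # \t stands for a tab
            i += 2
        elif spec[i] == "\\" and nxt == "\\":
            out.append("\\")       # \\ stands for one backslash
            i += 2
        else:
            out.append(spec[i])    # anything else is inserted as given
            i += 1
    return "".join(out)

if __name__ == "__main__":
    fields = ["house", "900", "SS"]    # invented field values
    print(decode_separator(" ").join(fields))
    print(decode_separator("\\\\").join(fields))
    print(repr(decode_separator("\\t").join(fields)))
```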


If you have problems, or want to ask a question, please contact me.


Maarten

Maarten.van-Casteren@mrc-cbu.cam.ac.uk