# Dummy variables

(taken from an SPSS mailing list)

Since the values of a categorical variable do not convey numeric information, such a variable should not be entered directly into a regression model as if it were numeric. Instead, each value of the categorical variable can be represented in the model with an indicator variable. An indicator (or dummy) variable contains only the values 1 and 0, with a value of 1 indicating that the associated observation has the given categorical value.

For example, let the variable LANG take on three levels (British, French, and German) that were originally coded as 1, 2, and 3 respectively. To include this categorical variable in a regression model, create an indicator variable for each level of LANG.

In SPSS, you must first create the three new variables and give them a value. In this instance we give the variables each a value of zero as a starting point. Then an IF command replaces these zeros with ones for the appropriate observations. The syntax is:

    COMPUTE british=0.
    COMPUTE french=0.
    COMPUTE german=0.
    IF lang=1 british=1.
    IF lang=2 french=1.
    IF lang=3 german=1.
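For readers who want to see the same recoding outside SPSS, here is a minimal Python sketch of the COMPUTE / IF steps. The sample values in `lang` are made up for illustration; only the coding scheme (1 = British, 2 = French, 3 = German) comes from the text.

```python
# Original codes: 1 = British, 2 = French, 3 = German (made-up sample values).
lang = [1, 2, 3, 1, 3, 2]

# One 0/1 indicator per level, mirroring the COMPUTE / IF steps:
# every case starts at 0 and is set to 1 only where the code matches.
british = [1 if code == 1 else 0 for code in lang]
french = [1 if code == 2 else 0 for code in lang]
german = [1 if code == 3 else 0 for code in lang]

# Each observation belongs to exactly one group, so the indicators sum to 1.
row_sums = [b + f + g for b, f, g in zip(british, french, german)]
print(british, french, german, row_sums)
```

Because the three indicators always sum to 1, any one of them can be reconstructed from the other two.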

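Now any two of the three new variables may be included in the regression model. It doesn't matter which two: once you know who is in any two of the three groups, you know who is in the third. The group whose indicator is left out serves as the reference group. Entering all three together with an intercept would make the predictors perfectly collinear, because the three indicators always sum to 1.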

For example:

    REGRESSION VARIABLES = yvar british french german xv4 xv5
      /DEPENDENT = yvar
      /METHOD = ENTER british french xv4.

The combined sum of squares explained by the set of indicator variables will be the same regardless of which two of the three you enter. However, the individual parameter estimates will differ, because each coefficient is a comparison with whichever group has been left out.
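The claim can be checked numerically. The sketch below is not SPSS output: it fits ordinary least squares in plain Python on a small made-up data set, with no covariates such as xv4, and compares the regression sum of squares across the three possible pairs of indicators. The helper names `solve`, `ols`, and `fit` are hypothetical and exist only for this illustration.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [float(c[i]) for c in columns] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(k)]
    return solve(xtx, xty)


# Made-up data: 1 = British, 2 = French, 3 = German.
lang = [1, 1, 1, 2, 2, 3, 3, 3]
y = [10.0, 12.0, 14.0, 20.0, 22.0, 30.0, 31.0, 35.0]
names = ["british", "french", "german"]
dummies = {name: [1 if code == level else 0 for code in lang]
           for level, name in enumerate(names, start=1)}


def fit(pair):
    """Return (coefficients, regression sum of squares) for a pair of dummies."""
    cols = [dummies[name] for name in pair]
    coef = ols(cols, y)
    fitted = [coef[0] + sum(c * col[i] for c, col in zip(coef[1:], cols))
              for i in range(len(y))]
    ybar = sum(y) / len(y)
    return coef, sum((f - ybar) ** 2 for f in fitted)


pairs = [("british", "french"), ("british", "german"), ("french", "german")]
results = {pair: fit(pair) for pair in pairs}
for pair, (coef, ss) in results.items():
    print(pair, [round(c, 3) for c in coef], round(ss, 3))
```

In this example the regression sum of squares is identical for all three pairs, while the intercept and coefficients change with the omitted group.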

For more information, see the REGRESSION chapter in any SPSS Reference Guide. For information on using dummy variables to make interaction plots, see these slides from Rajeev Dehejia of the National Bureau of Economic Research, Cambridge, MA.

You can equivalently use two dummy variables coded -1/0/1 (often called effect coding) to represent two of the languages, such as British and French. The british dummy variable takes the value 1 when the language is British, 0 when it is French, and -1 when it is German; the french dummy variable takes the value 1 when the language is French, 0 when it is British, and -1 when it is German. This is coded in SPSS using the syntax below.

    COMPUTE british=0.
    COMPUTE french=0.
    IF lang=1 british=1.
    IF lang=2 french=1.
    IF lang=3 british=-1.
    IF lang=3 french=-1.

The combined sum of squares for the two dummy variables will equal that for the 0/1 dummy variables. As before:

intercept + british = expected (mean) British response

intercept + french = expected (mean) French response

The regression coefficients with the above coding need to be combined differently to obtain the German response:

intercept - british - french = expected (mean) German response

as opposed to

intercept = expected (mean) German response (using 0/1 coding)
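Why the intercept plays a different role under the two codings can be seen by writing out the fitted mean for each group. The sketch below assumes the model contains only the two language variables (no covariates such as xv4). The symbols $b_0$, $b_1$, $b_2$ (intercept, british, and french coefficients under -1/0/1 coding) and $\mu$ (group mean) are notation introduced here, not part of the SPSS output.

$$
\begin{aligned}
\mu_{\text{British}} &= b_0 + b_1 \\
\mu_{\text{French}} &= b_0 + b_2 \\
\mu_{\text{German}} &= b_0 - b_1 - b_2
\end{aligned}
$$

Adding the three lines gives $b_0 = (\mu_{\text{British}} + \mu_{\text{French}} + \mu_{\text{German}})/3$. Under -1/0/1 coding the intercept is therefore the unweighted average of the three group means, and each coefficient is that group's deviation from this average. Under 0/1 coding the intercept is the German mean itself.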

So, with either coding, the difference between the British and French means is found by subtracting the french coefficient from the british coefficient. With 0/1 coding, the british and french coefficients each compare that group's mean outcome with the mean of the (German) reference group.
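As a numerical check of the relationships above, the following Python sketch fits both codings to a small made-up data set and compares the coefficients with the group means. It assumes a model with the two language variables only (no covariates); the helper names `solve` and `ols` and the variable names are hypothetical, chosen for this illustration.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [float(c[i]) for c in columns] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(k)]
    return solve(xtx, xty)


# Made-up data: 1 = British, 2 = French, 3 = German.
lang = [1, 1, 1, 2, 2, 3, 3, 3]
y = [10.0, 12.0, 14.0, 20.0, 22.0, 30.0, 31.0, 35.0]

# 0/1 coding with German as the reference group.
british01 = [1 if c == 1 else 0 for c in lang]
french01 = [1 if c == 2 else 0 for c in lang]

# -1/0/1 coding: German cases get -1 on both variables.
british_eff = [1 if c == 1 else -1 if c == 3 else 0 for c in lang]
french_eff = [1 if c == 2 else -1 if c == 3 else 0 for c in lang]

a0, a_brit, a_fr = ols([british01, french01], y)
b0, b_brit, b_fr = ols([british_eff, french_eff], y)

# Observed group means, for comparison with the coefficient combinations.
means = {level: sum(v for c, v in zip(lang, y) if c == level) / lang.count(level)
         for level in (1, 2, 3)}

print("British - French, 0/1 coding:   ", round(a_brit - a_fr, 6))
print("British - French, -1/0/1 coding:", round(b_brit - b_fr, 6))
print("German mean, 0/1 coding:        ", round(a0, 6))
print("German mean, -1/0/1 coding:     ", round(b0 - b_brit - b_fr, 6))
print("Observed group means:           ", means)
```

Both codings give the same British minus French difference and the same German mean; only the way the coefficients are combined differs.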