Diff for "FAQ/Messy" - CBU statistics Wiki
location: Diff for "FAQ/Messy"
Differences between revisions 5 and 6
Revision 5 as of 2011-02-10 11:00:09
Size: 4574
Editor: PeterWatson
Comment:
Revision 6 as of 2011-02-10 11:20:38
Size: 2853
Editor: PeterWatson
Comment:
Deletions are marked like this. Additions are marked like this.
Line 5: Line 5:
||||<50% style="TEXT-ALIGN: center"> '''Date of Birth (DOB)''' || '''Date on System'''||
||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> 17.02.40 || 12.05.00 ||
||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> 02.08.73 || 18.01.05 ||
||||||<50% style="TEXT-ALIGN: center"> '''Date of Birth (DOB)''' || '''Date on System'''|| '''Date Left study'''
||||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> 17.02.40 || 12.05.00 || ||
||||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> 02.08.73 || 18.01.05 || 24.03.08 ||
Line 9: Line 9:
 * [:FAQ/Messy/rates: Obtaining frequencies from string variables] Patient data is often entered in the form of dates. For example date of birth and date that a particular person started a study and, if applicable, left it.
We may wish to work out how long a person was a participant in a study. To do this we first of all need a cut-off date which is either the date when the study ended (studyEdate) for continuing participants or the date of drop out (DateFinished) for people who left the study.
Line 11: Line 12:
Supposing we have a list of particular messages which can be categorised into various types and rates of form with rates abbreviated to W (per week), D (per day), Y (per year) and M (per month) and wish to tabulate frequencies per week of each message type.

||||<50% style="TEXT-ALIGN: center"> '''Type''' || '''Rate'''||
||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> Appointment || 3-W ||
||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> appointment || 2-D ||
||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> chore || 1Y ||
||||<50% style="VERTICAL-ALIGN: top; TEXT-ALIGN: center"> Medication || 2-M ||

Notice that type names start with either capitals or lower case letters (which SPSS would regard as separate categories) and times are expressed both with and without hyphens.

The first thing we need to do is make sure types with the same spelling are all regarded as representing the same group regardless of case. This can be done using the string, substring and upcase commands.
There are various commands for telling SPSS what the number in a column represents a date. The DATE.DMY command defines a date in the spreadsheet.
Line 24: Line 15:
STRING TYPE2 (A10).
COMPUTE TYPE2 = UPCASE(SUBSTR(TYPE,1,5)).
COMPUTE studyEdate = DATE.DMY(30/11/2010).
Line 29: Line 19:
We then need to create a single column representing the number of times a particular message was sent in a week. This is done by decoupling the number from each rate and saving this as a numerical variable. Existing columns of data which represent dates may be flagged to SPSS using various formats (found by clicking the variable view tab and 'Type' next to the column of interest. Data may also be defined as dates using syntax.
Line 32: Line 22:
STRING COUNT(A1).
COMPUTE COUNT=SUBSTR(FREQUENCY,1,1).
EXE.
FORMATS dob DateFinished DateonSystem studyEdate (DATE8).
EXECUTE.
}}}
Line 36: Line 26:
RECODE COUNT (CONVERT) INTO COUNTN. We can the work our duration time in the study using the DATEDIF command which works out the difference in days or months between two dates. We firstly thought need to tell SPSS to use the study end date for people who did not drop out or the date they finished if they did. Since these two dates are entered in separate columns we can do this using the SYSMIS command which tells SPSS if there is no date finished date to use the study end date as the cut-off date. We can now compute the number of days in the study for dropouts and non-dropouts.

{{{
COMPUTE DFMISS = SYSMIS(DATEFINISHED).
IF (DFMISS EQ 1) usertime = DATEDIF(studyEdate, DateonSystem, "months")/12.
IF (DFMISS EQ 1) usertime2 = DATEDIF(studyEdate, DateonSystem, "days").

IF (DFMISS EQ 0) usfintime = DATEDIF(DateFinished, DateonSystem, "months")/12.
IF (DFMISS EQ 0) usfintime2 = DATEDIF(DateFinished, DateonSystem, "days").
}}}

We can also work out the ages of the participants when they entered the study in a similar way using the TRUNC function to round down.

{{{
COMPUTE age = TRUNC(DATEDIF(DateonSystem, DOB, "days")/365.25).
COMPUTE yearsuse = usertime2/365.25.
Line 40: Line 45:
The number of times each particular message is sent per week can then be computed by looking for a particular character string and multiplying the number we have just obtained by a quantity dependent on this string. For example a message signified by '4-W' = 4x7 = 28 messages sent per week.

This is achieved using the DO IF command to create the number of times each message is sent per week (NPW) .
We can now obtain summary statistics for durations of (non-)dropouts and age.
Line 45: Line 48:
DO IF INDEX(UPCASE(FREQUENCY),"D ")>0.
COMPUTE NPW = COUNTN*7.
COMPUTE DAILY =1.
ELSE IF INDEX(UPCASE(FREQUENCY),"W ")>0.
COMPUTE NPW = COUNTN.
COMPUTE DAILY=0.
ELSE IF INDEX(UPCASE(FREQUENCY),"-Y ")>0.
COMPUTE NPW = COUNTN/52.
COMPUTE DAILY=0.
ELSE IF INDEX(UPCASE(FREQUENCY),"YEAR ")>0.
COMPUTE NPW = COUNTN/52.
COMPUTE DAILY=0.
ELSE IF INDEX(UPCASE(FREQUENCY),"M ")>0.
COMPUTE NPW = COUNTN/4.
COMPUTE DAILY=0.
ELSE IF INDEX(UPCASE(FREQUENCY),"OFF")>0.
COMPUTE NPW = COUNTN/52.
COMPUTE DAILY=0.
END IF.
EXE.
DESCRIPTIVES
  VARIABLES=usertime2 USFINTIME2 AGE
  /STATISTICS=MEAN STDDEV MIN MAX .
Line 67: Line 53:
There may be a few cases where a rate has been expressed in a non-standard way (such as 2-D Mon-Thurs) has been entered and these will need to be entered manually. These can be identified by using the SORT CASES command which will sort the number per week (NPW) variable such that the cases at the top of the file will represent the non-standard rates.

{{{
SORT CASES BY NPW.
EXE.
}}}

We need to do one final thing before we can obtain frequencies for each message type. If we ask for frequencies we will only obtain the number of different messages of a particular type. To take into account repeated messages we use the weight command to tell SPSS how many times per week each message is repeated.

{{{
weight by NPW.
frequencies vars=type2.
exe.
}}}

We can also obtain the number of messages sent per week per client in the same way.
{{{
weight by NPW.
frequencies vars=Client_ID.
exe.
}}}
When we have finished we swithc off the weight so that each row now returns to representing just one frequency.
{{{
WEIGHT OFF.
}}}

* DEFINE PAGER OR MOBILE MESSAGE USER

We can also obtain two-way frequency classifications for subgroups of types of data (e.g. for mobile users and pager users) using the SORT CASES and SPLIT FILE commands which reformat the data so that it duplicates whatever syntax commands we issue. As witht her WEIGHT command we also need to switch this facility off when we have finished using it.

{{{
SORT CASES BY PAGER_CLIENT.
EXE.
SPLIT FILE BY PAGER_CLIENT.

weight by NPW.
frequencies vars=type2.
exe.

WEIGHT OFF.
SPLIT FILE OFF.
}}}
 * [:FAQ/Messy/rates: Obtaining frequencies from string variables]

How do I handle messy data in SPSS to produce duration times from dates and frequencies from strings?

Supposing we have a list of individual dates of form below

||||||<50% style="TEXT-ALIGN: center"> Date of Birth (DOB) || Date on System|| Date Left study

17.02.40

12.05.00

02.08.73

18.01.05

24.03.08

Patient data is often entered in the form of dates. For example date of birth and date that a particular person started a study and, if applicable, left it. We may wish to work out how long a person was a participant in a study. To do this we first of all need a cut-off date which is either the date when the study ended (studyEdate) for continuing participants or the date of drop out (DateFinished) for people who left the study.

There are various commands for telling SPSS what the number in a column represents a date. The DATE.DMY command defines a date in the spreadsheet.

COMPUTE studyEdate = DATE.DMY(30/11/2010).
EXE.

Existing columns of data which represent dates may be flagged to SPSS using various formats (found by clicking the variable view tab and 'Type' next to the column of interest. Data may also be defined as dates using syntax.

FORMATS dob DateFinished DateonSystem studyEdate (DATE8).
EXECUTE.

We can the work our duration time in the study using the DATEDIF command which works out the difference in days or months between two dates. We firstly thought need to tell SPSS to use the study end date for people who did not drop out or the date they finished if they did. Since these two dates are entered in separate columns we can do this using the SYSMIS command which tells SPSS if there is no date finished date to use the study end date as the cut-off date. We can now compute the number of days in the study for dropouts and non-dropouts.

COMPUTE DFMISS = SYSMIS(DATEFINISHED).
IF (DFMISS EQ 1) usertime = DATEDIF(studyEdate, DateonSystem, "months")/12.
IF (DFMISS EQ 1) usertime2 = DATEDIF(studyEdate, DateonSystem, "days").

IF (DFMISS EQ 0) usfintime = DATEDIF(DateFinished, DateonSystem, "months")/12.
IF (DFMISS EQ 0) usfintime2 = DATEDIF(DateFinished, DateonSystem, "days").

We can also work out the ages of the participants when they entered the study in a similar way using the TRUNC function to round down.

COMPUTE age = TRUNC(DATEDIF(DateonSystem, DOB, "days")/365.25).
COMPUTE yearsuse = usertime2/365.25.
EXE.

We can now obtain summary statistics for durations of (non-)dropouts and age.

DESCRIPTIVES
  VARIABLES=usertime2 USFINTIME2 AGE
  /STATISTICS=MEAN STDDEV MIN MAX .
  • [:FAQ/Messy/rates: Obtaining frequencies from string variables]

None: FAQ/Messy (last edited 2013-03-08 10:17:24 by localhost)