Sunday, 2 August 2015

Preparing ESM data using SPSS syntax

I'm currently figuring out how to use SPSS syntax (rather than Excel) to pre-process my experience sampling data, and thought I'd share some useful (and basic!) pieces of syntax I've learned along the way. 

For those unfamiliar with experience sampling, it involves multiple participants reporting on their experiences on several occasions. In my study, it involves ~60 participants answering questionnaires on their phones 6 times each day for 7 days, which means 42 observations per participant and 2,520 observations in total (assuming 100% compliance), or more realistically, assuming 75% compliance, around 1,900 reports. That's a lot of data - a good sign that syntax will help streamline things.

I will be using MPlus to run multilevel modelling analyses, but before I can do that, I need to get the data into shape using SPSS.

1. How to Clean Data

You can compute any number of variables to flag various problems that may crop up in your data. Then, you can compute a dichotomous "valid" variable to filter out all problematic responses. I adapted this basic procedure from McCabe, Mack and Fleeson's (2012) chapter, and cobbled together other bits of syntax I hunted down via Google. For example, here are a few of the problems I'm flagging:

  • responses with no baseline data (i.e., they're not actually part of my study)
  • participants who dropped out due to technical difficulties
  • too many identical responses (e.g., if someone answered all 5s)
  • too few valid responses

Here's what the basic procedure looks like:

Compute nobaseline = 0.
If id = 'ID1' or id = 'ID2' nobaseline = 1.
Execute.

*note that you don't need the quotation marks if your id variable is in numerical form; mine happen to be in string form

Compute dropped = 0.
If id = 'ID1' or id = 'ID2' dropped = 1.
Execute.

Count NumZeros=Var1 to Var10 (0).

Count NumOnes=Var1 to Var10 (1).
...
Count NumNines=Var1 to Var10 (9).
Count NumTens=Var1 to Var10 (10).
Execute.

Compute tooidentical=0.
If NumZeros > 20 tooidentical=1.
If NumOnes > 20 tooidentical=1.
...
If NumNines > 20 tooidentical=1.
If NumTens > 20 tooidentical=1.
Execute.

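*this means, for example, that if a participant responds with more than 20 "zeros" out of a possible 23 questions, the data is suspicious (the Var1 to Var10 range above is shortened for illustration; a count across only ten variables could never exceed 10)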
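
A stricter variant of the same idea is to flag any report where every item received exactly the same answer. This is only a sketch, not part of the procedure above: it assumes Var1 to Var10 sit next to each other in the file, and rowsd and allsame are hypothetical names.

```spss
* Sketch only: rowsd and allsame are hypothetical variable names.
* SD returns 0 when all of the listed items hold the same value.
COMPUTE rowsd = SD(Var1 TO Var10).
COMPUTE allsame = 0.
IF (rowsd = 0) allsame = 1.
EXECUTE.
```

Reports with fewer than two answered items get a system-missing rowsd and are left unflagged.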

Now create a variable that will be used to filter out invalid responses, adding the above conditions:

Compute valid = 1.
If (nobaseline = 1) valid = 0.
If (dropped = 1) valid = 0.
If (tooidentical = 1) valid = 0.
Execute.

Ok, so now that you've marked various problems, how do you find out how many valid responses are remaining? First, you want to apply the filter you've created, to filter out invalid responses (i.e., valid = 0):

USE ALL.
FILTER BY valid.
EXECUTE.
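
FILTER only hides the flagged rows from procedures; nothing is deleted, so it can be switched off again at any point. As a rough sketch, this tabulates how many reports each flag catches and then switches the filter back on:

```spss
* Show all rows again, count the flagged reports, then re-apply the filter.
FILTER OFF.
USE ALL.
FREQUENCIES VARIABLES=nobaseline dropped tooidentical valid.
FILTER BY valid.
```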

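To find out how many valid responses and participants remain (for reporting, rather than data-cleaning purposes), this syntax identifies duplicate cases. Because MATCH FILES with BY expects the rows to be in id order, the file needs to be sorted by id beforehand (e.g., with SORT CASES BY id):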

MATCH FILES
  /FILE=*
  /BY id
  /LAST=PrimaryLast.
VARIABLE LABELS  PrimaryLast 'Indicator of each last matching case as Primary'.
VALUE LABELS  PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL  PrimaryLast (ORDINAL).
FREQUENCIES VARIABLES=PrimaryLast.
EXECUTE.

The total frequency corresponds to the total valid response count, whereas the number of primary cases tells you how many valid participants there are.

What about the number of valid responses per person? We need to use the aggregate function and break it by id; this creates a new variable with each participant's final valid response count:

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=id
  /responsecount=N(id).

And we can then use it to compute another "problem data" variable:

Compute toofewtotal=0.
If responsecount < 20 toofewtotal=1.
Execute.

And now add it to our dichotomous valid data variable, and filter them all out!

If (toofewtotal = 1) valid = 0.
Execute.

FILTER BY valid.
EXECUTE.

At this stage, before actually deleting any data, you'll probably want to save a version of this file with all exclusions marked, but with all data still retained.
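
A minimal sketch of that step, assuming a hypothetical file name (esm_flagged.sav) saved to the current working directory. The filter is switched off first so that every row, flagged or not, goes into the saved copy; SELECT IF then permanently drops the invalid rows from the working copy:

```spss
* Keep a full copy with every exclusion flag (esm_flagged.sav is a hypothetical name).
FILTER OFF.
USE ALL.
SAVE OUTFILE='esm_flagged.sav'.
* Then permanently remove the invalid rows from the working copy.
SELECT IF (valid = 1).
EXECUTE.
```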

2. How to Centre Variables Within-Person

For experience sampling data, generally you'll want to centre each variable around a person's mean. This lets you separate out the trait and state effects. The state component will then have a mean of 0 for each participant, where positive deviations (e.g., 1.2) are greater than the participant's average levels, and negative deviations (e.g., -0.8) are less than their usual levels.

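At this stage, I'm assuming you've already computed your scales from the raw item scores. The first step is to compute each person's mean on each scale, which gives you the aggregate "trait" summaries (despite the "Centred" suffix in the variable names, these are the person means, not yet the centred scores):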

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=id
  /Var1Centred=MEAN(Var1)
  /Var2Centred=MEAN(Var2).

Now, to compute the state component, just subtract the trait component (the person mean) from the raw scale score:

compute Var1State=Var1-Var1Centred.
compute Var2State=Var2-Var2Centred.
execute.
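
A quick sanity check on the centring, sketched under the assumption that Var1State was computed as above (Var1StateMean is a hypothetical name): each person's mean of the state scores should come out as 0, give or take rounding error.

```spss
* Per-person mean of the centred scores; should be 0 for everyone.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=id
  /Var1StateMean=MEAN(Var1State).
DESCRIPTIVES VARIABLES=Var1StateMean
  /STATISTICS=MEAN MIN MAX.
```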

3. How to Add a Time Lag

You can't really infer causality from most experience sampling data, since it's usually correlational (unless you're measuring the effects of an intervention delivered in real-time). However, it is possible to add a time lag to at least meet one of the requirements for causality - precedence (i.e., a change in the theorised cause preceded a change in the outcome of interest).

The important thing is to set the condition that the lag only applies if the id at the current time point is the same as the id at the previous time point. Otherwise, the lag will cross participants. You'll also want to mark these missing values (in this case, I've chosen -99).
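
LAG simply reads the previous row in the file, so the rows need to be in chronological order within each participant before the lag is computed. A sketch, assuming two hypothetical variables, day and beep, that record the study day and the prompt number within that day:

```spss
* day and beep are hypothetical names for the timing variables.
SORT CASES BY id (A) day (A) beep (A).
```

With the rows in order, the lag can then be computed: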

IF (id = lag(id)) var1lag = lag(Var1).
IF ($casenum = 1 or id ne lag(id)) var1lag = -99.
IF (id = lag(id)) var2lag = lag(Var2).
IF ($casenum = 1 or id ne lag(id)) var2lag = -99.
execute.

For example, if the participant responded "7", "8" and "9" at T1, T2 and T3, then the lagged variable will be -99, 7, 8.

In the lagged analyses, you'll need to remove the lines of data with missing values. So, create a filter:

Compute lagged = 1.
execute.

If var1lag = -99 lagged = 0.
If var2lag = -99 lagged = 0.
execute.
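
To use this flag alongside the earlier exclusions, the two can be combined, since a new FILTER BY replaces whichever filter was active before. The sketch below assumes the valid flag from section 1 and uses a hypothetical name, validlagged. It also declares -99 as user-missing so that SPSS procedures don't treat it as a real score; this is done last because a comparison against -99 no longer evaluates as true once -99 has been declared missing.

```spss
* validlagged is a hypothetical name combining both flags.
COMPUTE validlagged = 0.
IF (valid = 1 AND lagged = 1) validlagged = 1.
EXECUTE.
FILTER BY validlagged.
* Declare -99 as user-missing only after the flags have been computed.
MISSING VALUES var1lag var2lag (-99).
```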

4. Creating an Easy Filter for Level 2 Analyses

There might be some analyses where you only want to look at level 2 (between-person) variables. As an easy way to "virtually" collapse the data set (i.e., being able to select one row per participant), you can compute and apply another filter:

Compute level2 = 0.
if $casenum = 1 or id ne lag(id) level2 = 1.
execute.

*flags the first case for each participant

FILTER BY level2.
EXECUTE.
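
With the filter on, procedures see one row per participant, which suits person-level summaries. A sketch, assuming the person means from section 2 (Var1Centred, Var2Centred) are in the file:

```spss
* Person-level descriptives, one row per participant.
FILTER BY level2.
DESCRIPTIVES VARIABLES=Var1Centred Var2Centred
  /STATISTICS=MEAN STDDEV MIN MAX.
* Switch the filter off again before any within-person analyses.
FILTER OFF.
USE ALL.
```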

5. Saving a Pared-Down Data File

Once I get to MPlus, I don't really want my raw variables getting in the way. So, here's a way to save only the variables that you need for your analyses:

SAVE OUTFILE= '/Users/Jessie/Desktop/esmmplus.sav'/KEEP= id Var1 Var2 Var3 etc... level2 lagged /COMPRESSED.

*the format of the file directory will be different on a PC
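
Mplus reads plain-text data files rather than .sav files, so an export step is usually needed as well. The following is only a sketch: the file name and the variable list are assumptions, and it presumes that id has been converted to a number, since Mplus expects numeric data. Leaving out the FIELDNAMES subcommand keeps variable names out of the first row.

```spss
* File name and variable list are illustrative assumptions.
SAVE TRANSLATE OUTFILE='esmmplus.csv'
  /TYPE=CSV
  /KEEP=id Var1State Var2State var1lag var2lag level2 lagged
  /REPLACE.
```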

Concluding Thoughts: The Joys of Syntax

When I first started learning how to pre-process this data, I was introduced to some cool tricks and functions in Excel, including Pivot Tables and VLOOKUP. And there are still a couple of steps I'll probably use Excel for, including extracting elements from a string date to convert it into a general date format (unless there is a way to do this in SPSS?).
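
One possible answer to that last question, as a hedged sketch: assuming a hypothetical string variable datestr that holds dates in dd/mm/yyyy form (the real export format may well differ), the pieces can be pulled out with CHAR.SUBSTR and assembled with DATE.DMY.

```spss
* datestr and respdate are hypothetical names; the dd/mm/yyyy layout is an assumption.
COMPUTE respdate = DATE.DMY(
    NUMBER(CHAR.SUBSTR(datestr, 1, 2), F2),
    NUMBER(CHAR.SUBSTR(datestr, 4, 2), F2),
    NUMBER(CHAR.SUBSTR(datestr, 7, 4), F4)).
FORMATS respdate (DATE11).
EXECUTE.
```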

Overall, however, it's now clear that SPSS syntax is a far more efficient and less clunky way to prepare data. For one thing, by writing the syntax ahead of time I can simply run it as soon as I have my data on hand. For another, it means that things are better documented: I, my supervisor and anyone who wants to see the syntax and data will be able to see exactly how the data was prepared, and what decisions were made. If any errors are made during data preparation, we can trace them back to the syntax. Finally, this documentation means that the next time I need to pre-process experience sampling data or teach someone else how to do it, I can just modify, disseminate and reuse the syntax I've already written, again saving time (and the need to remember various manual Excel steps).

1 comment:

  1. Do I have to center the variables before using them in HLM? Since you can choose "add variables centered to group-mean" or "add variable centered to grand-mean" in HLM, you might not need to center before using HLM. Or is this possibility in HLM just to demonstrate that the added variables are still centered to group mean or grand mean?
