Identifying bad respondents


I was wondering how to identify bad/random respondents for my CBC experiment (including 2 CBC exercises and some demographic questions). I have read about RLH, completion time and straightliners, but I am not really sure how to use these criteria in practice.

- How should you compute RLH? I understand how to compute the median, but I am not sure how to calculate the 95th percentile, or whether you should exclude or include respondents below that RLH value. And should you use both the median and the 95th-percentile RLH, or just one of them?

Also, I have 2 CBC exercises. Should I run the HB analysis for each exercise separately (for both the random respondents and my own data), calculate the median/95th-percentile RLH of the random respondents for each exercise, and then compare my own data against the cutoff for the matching exercise? Or should I first combine the 2 exercises and calculate the RLH of the random respondents with a single HB analysis covering both exercises? (If the latter, how could I do that?)

- What would you say is a 'fast completion time'? Is there any rule for that?

- Should you look for straightliners manually, or is there a tool to easily see which respondents are straightliners?

Are there any more criteria to identify these bad respondents?

Thanks in advance!
asked Apr 20 by -
edited Apr 20
As mentioned in the paper, one way to identify "bad" respondents is to remove those with poor RLH values. To do so, you run an artificial study in Lighthouse Studio with, say, n = 200 random respondents and then find the 95th percentile of their RLH values in Excel. An example of what to expect: the minimum robotic RLH value (in the sample of 200 respondents) might be 0.32 and the maximum 0.49, with a 95th percentile of, e.g., 0.47. In your real study, you would then exclude all respondents with RLH values < 0.47.
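If you prefer a script over Excel, the cutoff step could be sketched roughly like this in Python. This is only a minimal sketch: the function names, the respondent IDs and all RLH values are made up for illustration, and it assumes you have already exported the RLH values of the random respondents and of your real respondents. The percentile uses linear interpolation, which is what Excel's PERCENTILE.INC does.

```python
import math

def percentile_inc(values, p):
    """Linearly interpolated percentile (p between 0 and 1), like Excel's PERCENTILE.INC."""
    s = sorted(values)
    k = (len(s) - 1) * p          # fractional position in the sorted list
    lo = math.floor(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def flag_low_rlh(real_rlh, random_rlh, p=0.95):
    """Return the cutoff and the IDs of real respondents whose RLH falls below it."""
    cutoff = percentile_inc(random_rlh, p)
    flagged = [rid for rid, rlh in real_rlh.items() if rlh < cutoff]
    return cutoff, flagged

# Hypothetical RLH values of random (robotic) respondents
random_rlh = [0.32, 0.35, 0.38, 0.40, 0.41, 0.43, 0.44, 0.45, 0.47, 0.49]
# Hypothetical RLH values of real respondents, keyed by respondent ID
real_rlh = {"r1": 0.62, "r2": 0.41, "r3": 0.55, "r4": 0.36}

cutoff, flagged = flag_low_rlh(real_rlh, random_rlh)
print(f"cutoff = {cutoff:.3f}, flagged = {flagged}")
```

With a real study you would pass in all 200 random-respondent RLH values rather than the ten shown here.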

Regarding speeders: in other posts I have read about cutoffs of 40% or 50% faster than the median survey time. There is no fixed rule; it is up to you to define the threshold, but this should be a good rule of thumb.

Straightliners: I guess this depends entirely on your context. In your study you might include a test such as a pair of (not obviously) contradictory questions, and exclude people who fail it.
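As a rough sketch of that rule of thumb in Python: here I am assuming that "50% faster than the median" means a completion time below half the median, which is only one possible reading. The function name, IDs and times are hypothetical.

```python
from statistics import median

def flag_speeders(times_sec, fraction_of_median=0.5):
    """Flag respondents whose time is below a chosen fraction of the median time.

    fraction_of_median=0.5 is an assumed reading of "50% faster than the median";
    use 0.6 for "40% faster".
    """
    threshold = fraction_of_median * median(times_sec.values())
    speeders = [rid for rid, t in times_sec.items() if t < threshold]
    return threshold, speeders

# Hypothetical completion times in seconds, keyed by respondent ID
times_sec = {"r1": 600, "r2": 150, "r3": 540, "r4": 720, "r5": 660}

threshold, speeders = flag_speeders(times_sec)
print(f"threshold = {threshold} s, speeders = {speeders}")
```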
Thank you very much, that is really helpful!
I was wondering one more thing: should you first remove respondents based on RLH, then calculate the median survey time on the reduced dataset (with the respondents whose RLH values fall below the 95th percentile already removed), then remove speeders, and only after that remove the straightliners (e.g. I saw one respondent who always answered 4, the None option)?

Or is it better to see which respondents you need to remove based on the RLH, see which respondents you need to remove based on the time (where you calculate the median time across all complete ones, so before removing the ones based on RLH), see which respondents are straightliners and then after that compare which respondents you would need to remove in the 3 cases and remove those? (so maybe there is a respondent who needs to be removed based on the RLH as well as time)
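To make the second option concrete, here is a minimal Python sketch that evaluates each criterion independently on the full set of completes and then lists, per respondent, which criteria they failed. It is not meant to say which order is better. All names and data are hypothetical, and the straightliner check simply assumes "same answer in every choice task".

```python
def is_straightliner(choices):
    """True if the respondent gave the same answer in every task (needs 2+ tasks)."""
    return len(choices) > 1 and len(set(choices)) == 1

def combine_flags(**flag_sets):
    """Map each flagged respondent ID to the list of criteria they failed."""
    reasons = {}
    for criterion, ids in flag_sets.items():
        for rid in sorted(ids):
            reasons.setdefault(rid, []).append(criterion)
    return reasons

# Hypothetical chosen concept per CBC task (4 = None option)
choices = {
    "r1": [1, 3, 2, 4, 2],
    "r2": [4, 4, 4, 4, 4],
    "r3": [2, 1, 1, 3, 2],
    "r4": [3, 1, 2, 2, 1],
}
straightliners = {rid for rid, c in choices.items() if is_straightliner(c)}

# Hypothetical results of the RLH and speeder checks, each run on all completes
low_rlh = {"r2", "r4"}
speeders = {"r2"}

reasons = combine_flags(rlh=low_rlh, speed=speeders, straightline=straightliners)
for rid, failed in sorted(reasons.items()):
    print(rid, failed)
```

A respondent such as r2 in this made-up data fails several criteria at once, which is exactly the overlap described above.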

Hope my question is clear to you :)

Actually, I have one more question, I am sorry. For calculating the median time, I look at sys_SumPageTimes, as I understood that it is a better measure than sys_ElapsedTime, right? I have a few respondents with really unreasonable times, like 14 hours. Should I exclude those respondents when calculating the median time or not?
I'm having the same issue currently. I'm tending to first exclude respondents with poor RLH values (as I guess it does not matter how long it takes for them, poor RLH is poor RLH), and afterwards exclude the speeders. And then all other exclusion criteria. I think there is no perfect solution to this.

Yes, sys_SumPageTimes would be the time to use. However, I noticed something very interesting yesterday, which is still an open question: https://legacy.sawtoothsoftware.com/forum/30072/discrepancy-between-sumpagetimes-and-cumulated-page-times

I would not exclude respondents with long survey times. A respondent might take a lunch break, or start your survey and come back to finish it the next day. So I'd recommend excluding only the speeders.
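On whether to drop the very long times before computing the median: a quick check with made-up numbers suggests it matters little, because the median is not very sensitive to a few extreme values (unlike the mean). A small Python illustration, with hypothetical times in seconds:

```python
from statistics import mean, median

FOURTEEN_HOURS = 14 * 3600  # 50400 seconds

# Hypothetical completion times; the last respondent took 14 hours
times_sec = [150, 540, 600, 660, FOURTEEN_HOURS]
trimmed = [t for t in times_sec if t < FOURTEEN_HOURS]

median_all = median(times_sec)      # median with the 14-hour respondent
median_trimmed = median(trimmed)    # median without that respondent
mean_all = mean(times_sec)          # the mean, for comparison

print(f"median (all) = {median_all}, median (trimmed) = {median_trimmed}, mean (all) = {mean_all}")
```

In this toy data the median shifts only slightly when the 14-hour respondent is left out, while the mean is dominated by that single value.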
Oh yeah that makes sense!

Thanks! That is indeed interesting. After you wrote this, I looked at one respondent in my data and saw the same discrepancy that you describe in your question. I don't know how this is possible either.

Thanks for your help! I will reply to Keith's answer below to ask whether he can help us with the issues mentioned above. Maybe he can give us the answer :)

1 Answer

0 votes
We discuss three methods for identifying random respondents in this white paper:  https://sawtoothsoftware.com/resources/technical-papers/diagnostics-for-random-respondents-in-choice-experiments
answered Apr 20 by Keith Chrzan Platinum Sawtooth Software, Inc. (102,700 points)
Thank you!
Hi Keith, could you please help Danny and me out with our issues that we described above? :)