Cleaning the data set of respondents with inconsistent answers to identical holdout questions

Dear Sawtooth team,

I am wondering whether I should remove respondents who provided inconsistent answers to my identical holdout tasks (of approximately 1,300 respondents in total, about 380 provided inconsistent answers).

I have run an HB analysis with:
(a) the data set reduced by the fastest respondents (based on time needed for the entire survey, not only the CBC section):
Pct. Cert.: Current: 0.755 , Average: 0.759
Fit statistics: Current: 0.734, Average: 0.738
Avg Variance: Current: 3.384, Average: 3.474
Parameter RMS: Current: 3.110, Average: 3.095
(b) the data set reduced by the fastest respondents and respondents who inconsistently answered the identical holdout tasks:
Pct. Cert.: Current: 0.761, Average: 0.763
Fit statistics: Current: 0.740, Average: 0.741
Avg Variance: Current: 3.581, Average: 3.447
Parameter RMS: Current: 3.070, Average: 3.094
(c) the data set reduced by the fastest respondents and respondents who answered more than one choice task in 4 seconds or less:
Pct. Cert.: Current: 0.7666, Average: 0.765
Fit statistics: Current: 0.744, Average: 0.744
Avg Variance: Current: 3.324, Average: 3.324
Parameter RMS: Current: 3.107, Average: 3.107
(d) the data set reduced by the fastest respondents, respondents who answered more than one choice task in 4 seconds or less, and respondents who inconsistently answered the identical holdout tasks:
Pct. Cert.: Current: 0.756, Average: 0.755
Fit statistics: Current: 0.735, Average: 0.734
Avg Variance: Current: 2.912, Average: 2.970
Parameter RMS: Current: 2.903, Average: 2.923

Would you compare the current or the average values?

Based on the model-fit indicators, I would conclude that I should not reduce my sample by removing the respondents who answered inconsistently. Is that right?

How would you use the holdout tasks in your analysis (I included three, two of them being identical)?

Thank you very much in advance!

Best regards
Mel
asked Jan 24, 2020 by Mel (250 points)

1 Answer

+1 vote
Hi, Mel.

We typically don't look at the RLH statistic in aggregate for determining which respondents might be inconsistent responders.  We do look at it at the respondent level, and we have a recommended way of doing so described on our LinkedIn Sawtooth Software User's Group, here:  https://www.linkedin.com/pulse/identifying-consistency-cutoffs-identify-bad-respondents-orme/?trackingId=pI%2FkvpPoo65MNbNIuYtwsQ%3D%3D
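
For illustration only, here is a minimal sketch of applying such a respondent-level RLH cutoff once the individual RLH values have been exported from the HB run. The file name, column names, and the cutoff value are hypothetical placeholders; the cutoff itself should come from the random-responder procedure described in the LinkedIn article, not from an arbitrary constant.

```python
# Minimal sketch (hypothetical file/column names): drop respondents whose
# individual RLH falls below a consistency cutoff. The cutoff used here is a
# placeholder; derive it from the random-responder procedure in the article.
import pandas as pd

rlh = pd.read_csv("hb_individual_rlh.csv")   # columns: respondent_id, rlh (hypothetical)
cutoff = 0.35                                # placeholder value

flagged = rlh[rlh["rlh"] < cutoff]
print(f"{len(flagged)} of {len(rlh)} respondents fall below the RLH cutoff of {cutoff}")

keep = rlh.loc[rlh["rlh"] >= cutoff, "respondent_id"]
keep.to_csv("respondents_to_keep.csv", index=False)
```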

I agree that you may not want to use your holdouts to establish inconsistency.  A certain amount of inconsistency is assumed by the logit model, and if your two identical holdout questions use product concepts that happen to have similar total utilities, we would expect more inconsistencies rather than fewer.  Again, the LinkedIn article noted above describes a better way to measure respondent inconsistency.

I do pay attention to the Percent Certainty as a measure of aggregate fit, and based on the results you report above, I also wouldn't eliminate the respondents who answered your two identical holdouts inconsistently.
answered Jan 24, 2020 by Keith Chrzan Platinum Sawtooth Software, Inc. (102,700 points)
I agree with Keith.  Good respondents who are answering CBC questionnaires carefully will not always answer a repeated CBC holdout task the same way.  For many respondents, it's nearly a tie which concept within the set is the best.  Even among samples that have been cleaned in other ways for speeding and consistency, we see respondents only answering the same way perhaps 75-80% of the time in CBC triples or 65-70% of the time from quads.
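
As an illustration of that point (my own worked numbers, not from the thread): if the two administrations of the repeated task are answered independently according to a logit model, the probability of picking the same concept twice is the sum of the squared choice probabilities, so a near-tie drags expected consistency down even for a careful respondent.

```python
# Illustrative only: expected probability of answering a repeated task the
# same way, assuming the two administrations are independent logit choices.
def repeat_consistency(probs):
    """P(same concept chosen both times) = sum of squared choice probabilities."""
    return sum(p * p for p in probs)

# One concept clearly best: expected consistency is high.
print(round(repeat_consistency([0.90, 0.07, 0.03]), 3))   # ~0.816

# Near-tie among three concepts: expected consistency drops toward 1/3.
print(round(repeat_consistency([0.40, 0.35, 0.25]), 3))   # ~0.345
```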
Thank you Keith and Bryan!
Two more questions:
(1) Would you typically report/compare the current or the average values of the fit indicators (Pct. Cert., RLH, ...)?

(2) I have run my HB estimation using the default settings of DF=5 and Var=1. When I change these settings according to the recommendations from your meta-analysis (What Are the Optimal HB Priors Settings for CBC and MaxDiff Studies?), the fit indicators decrease. Should I nevertheless use the adjusted priors?

Thank you so much in advance!
In terms of the online display, we see current statistics (for the current iteration) and average statistics (which represent an exponential moving average where, if my memory is correct, the last 500 or so iterations contribute about half or more of the weight).  I don't tend to use either.  I usually look at the average results across all the used draws.  If I'm in a hurry, I'll refer to the "average" results that are printed to the screen on the last iteration of the full analysis.

Regarding question number 2, are you saying the fit to holdout tasks decreases by changing the DF & prior Var?  Or, are you saying the internal fit to the tasks used in HB estimation decreases?  If you have a large number of holdout tasks (say, 5 or more) and you are saying the holdout hit rate decreases due to changing the priors, then I'd say pay attention to that and don't go that direction.  But, if you are saying that the internal RLH and Pct Cert fit (to the tasks used in estimation) decrease, then that is to be expected if you decrease the prior variance.  You shouldn't tune the HB priors to the data used in utility estimation (where increasing the prior variance will always lead to better fit, even if it means overfitting).  Some sort of holdout procedure should be done instead.
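
If you do want to compute a holdout hit rate, a rough sketch follows; it is not a built-in feature, and the file names, column layout, and dummy coding of the holdout concepts are all assumptions. The idea is to score each holdout concept with each respondent's utilities, take the highest-utility concept as the predicted choice, and compare it with the actual answer.

```python
# Rough sketch (hypothetical exports): first-choice hit rate for one holdout task.
# Assumes the individual utilities and the holdout answers were exported to CSV,
# and each holdout concept is coded as a row of the same parameter columns used
# for the utilities. The None alternative is ignored here for simplicity.
import numpy as np
import pandas as pd

utils = pd.read_csv("hb_utilities.csv", index_col="respondent_id")        # respondents x parameters
answers = pd.read_csv("holdout1_answers.csv", index_col="respondent_id")  # column "choice" (1-based)
design = np.loadtxt("holdout1_design.csv", delimiter=",")                 # concepts x parameters

totals = utils.values @ design.T           # total utility of each concept per respondent
predicted = totals.argmax(axis=1) + 1      # predicted concept (1-based)

hit_rate = (predicted == answers.loc[utils.index, "choice"].values).mean()
print(f"Holdout hit rate: {hit_rate:.1%}")
```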
Thank you for the fast answer, Bryan! That helped me a lot!

I was referring to the internal fit, so your answer solves my question :)

Regarding the fit to holdout tasks: I only included 3 holdout tasks in my study, of which 2 are identical. I guess those are too few holdouts to check the holdout hit rate? And even if it were possible, which tool can I use? Do I have to use the Simulator to predict individual choices for the holdout tasks and then compare those to the actual answers, or is there another option within the HB analysis tool? :)
Three holdouts, two of them identical, probably don't give enough evidence to override the norms (regarding prior DF and prior variance) from the meta-analysis in which we looked at dozens of CBC datasets.
So I'll keep the prior DF and prior Variance as presented in your meta-analysis results.
However, how exactly do I check how well my model predicts my holdout tasks (which tool can I use)? Or would you not test that at all with only two different holdouts?
There isn't an automatic way to check fixed holdout predictive validity (holdouts that you added to your CBC questionnaire) in our software.  What we do is to specify these holdout scenarios as new simulation scenarios in our simulator.  We also tabulate the raw answers to the holdout questions so we know what percent of respondents picked each concept in each holdout.  Then, we compare the predictions from the simulator to the actual choice probabilities from the respondents to the holdouts.  Mean Absolute Error or Mean Squared Error for the predicted versus actual concept probabilities is typically the route we take.
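
A small sketch of that comparison, with made-up share values standing in for real tabulations and simulator output:

```python
# Illustrative sketch: compare actual holdout shares with simulated shares.
# The numbers below are placeholders; substitute your own tabulated answers
# and simulator output for each holdout scenario.
import numpy as np

actual    = np.array([10.0, 25.0, 15.0, 50.0])   # percent choosing each concept / None
predicted = np.array([12.0, 23.0, 16.0, 49.0])   # simulator shares for the same scenario

mae = np.mean(np.abs(predicted - actual))        # mean absolute error (percentage points)
mse = np.mean((predicted - actual) ** 2)         # mean squared error
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}")
```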
Perfect! Thank you so much, Bryan!
In the meantime I have run the simulation.
Holdout question one (as I asked this exact holdout question twice, I took the average shares here):
Actual shares:
Concept 1: 4.3%
Concept 2: 16.4%
Concept 3: 2.9%
None: 76.5%

Simulation shares:
Concept 1: 5.8%, Std Err 0.3%, CI (5.3%, 6.4%)
Concept 2: 14.9%, Std Err 0.6%, CI (13.8%, 16.1%)
Concept 3: 3.4%, Std Err 0.3%, CI (2.9%, 3.9%)
None: 75.8%, Std Err 0.9%, CI (74.0%, 77.6%)

I have done the same for holdout question two.

(1) The actual shares for Concepts 1 and 2 do not lie within the CIs of the predicted shares. Would you nevertheless say that the model is good enough? Or would you continue to try to improve the prediction accuracy for the holdouts by adding, for example, interactions or further covariates to the HB analysis?

(2) In this case, how would you calculate the mean absolute error? Is the following approach correct?
MAE = (1.5% + 1.5% + 0.5% + 0.7%) / 4 = 1.05%

Thank you very much in advance!
Mel, your MAE calculation is correct, and that's the statistic folks usually look at when they use holdouts to assess the quality of the model.  I would not be surprised or concerned to see actual shares fall outside of the predicted confidence intervals.
Great! Thanks for the super fast reply!
And would you say an MAE of 1.05 is good?
If you had a larger number of holdouts, you would see that some of them have a larger MAE and some a smaller one.  1.05% is good, but it's only one observation, and without having seen an MAE based on a larger number of holdouts, it's difficult to get too excited about it.  But to answer your question, yes, all by itself an MAE of 1% is pretty good (though an MAE of 1% based on 10 holdout questions would have been more impressive).
Thanks a lot, Keith! :)