There are multiple ways to statistically test the utility of one level vs. another (within the same attribute). Very helpfully, Bryan Orme did a presentation on this at the recent Amsterdam conference a few months ago (loved it!) (explanations are also on this forum). However, I have noticed that in practice the statistical tests can give very difference answers which I find very curious.
Specifically, a paired samples t-test makes the most sense from a frequentist's perspective (we have observations from the same person for both combinations). From a bayesian perspective, there is the simple "count the draws" technique, which makes quite a bit of sense.
My question is if HB is perhaps too sensitive in it's wobbly nature. Both the paired samples t-test and the bayesian method are based on the utilities run through the same HB algorithm, so if it is indeed too sensitive, then we're over-stating the statistical power. I've found that a 1% difference in shares estimated is significant at the 99% confidence level.