Specifically, a paired-samples t-test makes the most sense from a frequentist perspective (we have observations from the same person for both combinations). From a Bayesian perspective, there is the simple "count the draws" technique, which also makes quite a bit of sense.
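To make the two approaches concrete, here is a minimal sketch of each using only the Python standard library. The function names and the toy numbers are my own illustrations, not output from any HB package: the first function assumes one share (or utility) per respondent for each combination, and the second assumes one value per posterior draw for each combination.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, df) for a paired-samples t-test on two equal-length sequences.

    a[i] and b[i] are assumed to come from the same respondent.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


def count_the_draws(draws_a, draws_b):
    """Return the fraction of posterior draws in which A exceeds B.

    draws_a[i] and draws_b[i] are assumed to come from the same draw.
    """
    wins = sum(1 for x, y in zip(draws_a, draws_b) if x > y)
    return wins / len(draws_a)


# Toy, made-up values purely to show the calls.
t_stat, df = paired_t_statistic([1, 2, 3, 4], [0, 1, 1, 2])
prob_a_beats_b = count_the_draws([1, 2, 3, 4], [0, 3, 2, 5])
```

The t statistic still has to be compared against a t distribution with `df` degrees of freedom, whereas the count-the-draws fraction is read directly as the posterior probability that A beats B.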

My question is whether HB is perhaps too sensitive given its wobbly nature. Both the paired-samples t-test and the Bayesian method are based on utilities run through the same HB algorithm, so if it is indeed too sensitive, then we're overstating the statistical power. I've found that a 1% difference in estimated shares comes out significant at the 99% confidence level.
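As a rough sanity check on that last observation, the paired t statistic is just the mean difference divided by its standard error, so a small difference can clear the 99% threshold whenever the per-respondent differences vary little or the sample is large. The sketch below uses hypothetical numbers (a 0.01 mean difference in share, a 0.05 standard deviation of the differences, 300 respondents), not my actual data, and uses the normal approximation for the critical value.

```python
import math
from statistics import NormalDist


def t_for_mean_difference(mean_diff, sd_diff, n):
    """Paired t statistic given the mean and SD of per-respondent differences."""
    return mean_diff / (sd_diff / math.sqrt(n))


# Hypothetical inputs, chosen only for illustration.
t = t_for_mean_difference(mean_diff=0.01, sd_diff=0.05, n=300)

# Two-sided 99% critical value under the normal approximation (large n).
z_crit = NormalDist().inv_cdf(0.995)

is_significant = abs(t) > z_crit
```

With these assumed inputs the statistic exceeds the critical value, so whether a 1% difference "should" be significant depends heavily on how tight the differences in the HB-based shares are.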