F-Ratio Consensus Clustering

Dear Sawtooth-Community,

I tried to segment my customers based on their part-worth utilities (zero-centered diffs) and have a question concerning the F ratios.

For example, I obtain a very high reproducibility for a 3-cluster solution, but its F ratio is lower than that of the 2-cluster solution. Should I choose the solution based on the reproducibility or on the F ratio?

Furthermore, how can I calculate the F ratio for a specific variable? For instance, in your CCEA Manual on page 25, how do I arrive at an F ratio of 58.52 for the variable "Acceleration time 0-"? In the case of part-worths, how can I calculate a single F ratio across an attribute's 3 levels (e.g., Level 1: 50, Level 2: 30, Level 3: 10 - what is the F ratio for the overall attribute)?

Finally, which clustering validation method do you recommend? I calculated the silhouette as suggested by Retzer but end up with an average silhouette width of approx. 0.25 - even though my ensemble is highly diverse and achieves an adjusted reproducibility of virtually 100 percent.

Thank you very much for your answers and your help!
asked Jun 3, 2014 by Arnold
edited Jun 3, 2014

1 Answer

0 votes
Dear Arnold,

My thoughts on your three questions:

1.  All else being equal, F-ratios tend to decrease as the number of clusters increases.  The same is true of reproducibility - it tends to fall as the number of clusters rises, so when it spikes up instead, that may suggest potential segment structure.  I would pay more attention to a spike in reproducibility than to F falling, which is what it is expected to do.

2.  F is a statistic that comes from analysis of variance (ANOVA).  Of course you can read about this in a standard textbook or even on Wikipedia.  F is the ratio of the variance between segments to the variance within segments - the higher it is for a variable, the more differentiated the segments are on that variable.  I am not aware of an easy way to compute this for multiple variables using the outputs you have available in CCEA, but I suppose you could compute the relevant variances in SAS or Excel or some other software.

3.  I've looked at a lot of silhouette plots.  Few of the data sets I've seen used in marketing research perform well in silhouette analysis.  I am not sure what kinds of data produce high silhouette scores, but I suspect it may not be survey research data.  I often see scores below 0.50 and even below the 0.25 you see.  Your average silhouette score is borderline between no structure and weak structure, but this is true of many of the commercial segmentation studies I've seen, even very successful ones.
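
To make point 2 concrete, here is a minimal sketch of the textbook one-way ANOVA F ratio for a single variable (for example, one part-worth level), given each respondent's value and cluster membership. The function name f_ratio and the data are hypothetical, made up only for illustration, and it is an assumption that CCEA's reported F follows exactly this formula; treat the sketch as a way to check the arithmetic rather than as a description of the software.

```python
from statistics import mean


def f_ratio(values, labels):
    """Textbook one-way ANOVA F for one variable across cluster labels.

    F = (between-cluster sum of squares / (k - 1))
        / (within-cluster sum of squares / (n - k))
    where n is the number of respondents and k the number of clusters.
    """
    # Collect each cluster's values.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n = len(values)
    k = len(groups)
    grand_mean = mean(values)

    # Between: how far each cluster mean is from the grand mean.
    ss_between = sum(
        len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within: how far respondents are from their own cluster mean.
    ss_within = sum(
        (v - mean(g)) ** 2 for g in groups.values() for v in g
    )

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: 9 respondents, 3 clusters, one variable.
utilities = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
clusters = [1, 1, 1, 2, 2, 2, 3, 3, 3]

print(f_ratio(utilities, clusters))  # 27.0
```

In this toy example the cluster means are 2, 5 and 8 around a grand mean of 5, so the between-cluster sum of squares is 54 (2 degrees of freedom) and the within-cluster sum of squares is 6 (6 degrees of freedom), giving F = 27 / 1 = 27. The sketch handles one variable at a time and does not address a combined F across several attribute levels.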
answered Jun 4, 2014 by Keith Chrzan Platinum Sawtooth Software, Inc. (97,975 points)