CCEA k-means cluster and choice of variables (levels) from ACBC data

Dear all,

I am currently trying to create clusters from the data I gathered in my ACBC study. The issue with my HB estimates is that price has an extremely high importance on average (50%), while the remaining 7 attributes share the other 50%. This is probably because the price range I chose for the product covers the full range of market prices, from cheapest to most expensive, so the majority of respondents have strong negative utilities for the upper price range.
This, however, becomes an issue with the clustering: if I run an analysis on all levels including the price breakpoints, the resulting f-statistics for the different levels vary drastically, and price in particular is extremely high. To give you an idea for a 3-group cluster:

For the non-price levels (24 of them), the f-statistic averages around 30 and ranges up to 60. For the price breakpoints (10), it ranges from 1 to 500 (!) with an average of roughly 200.

If I understand the concept of the f-statistic correctly, it indicates how much a particular variable (level) influences the clustering. Taking my values, the clusters would be somewhat random on the non-price levels, and price would basically be the only differentiator, since that is where the part-worth differences are the highest.

I am not sure whether these results are useful, or whether it would make more sense to take the results from a run without price. Those clusters (without price) are somewhat more diverse on the non-price levels and, I would say, easier to interpret. However, I am afraid that by using this solution I would ignore the fact that price is a main driver for this product, since its utilities vary tremendously, and excluding the attribute may give me "crude" results.

So, is it OK to leave out attributes that were used in the ACBC and in the HB estimation when clustering? Especially when this attribute (price) has by far the highest importance? Wouldn't the clusters in this case have extreme variances on the price levels, and would that be a "valuable" result?

Any ideas?

Many thanks Lukas
asked Nov 27, 2017 by Luke (280 points)
edited Nov 27, 2017 by Luke

1 Answer

0 votes
First, we strongly recommend you use the zero-centered diffs utilities when clustering for ACBC or CBC, not the raw utilities.  Using raw utilities can be very bad, because different respondents can have very different scaling of the utilities (magnitude of differences from zero).
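As a rough sketch of what the zero-centered diffs rescaling does (this is my reading of the transformation, not Sawtooth's implementation): each respondent's part-worths are zero-centered within each attribute, then scaled so the respondent's average utility range per attribute equals 100, which removes between-respondent scale differences.

```python
import numpy as np

def zero_centered_diffs(raw_utils, attr_slices):
    """Rescale one respondent's raw HB part-worths to zero-centered diffs.

    raw_utils   : 1-D array of part-worths for all levels (None excluded)
    attr_slices : list of slices, one per attribute, indexing that
                  attribute's levels within raw_utils
    """
    u = raw_utils.astype(float).copy()
    # 1) zero-center each attribute's part-worths
    for s in attr_slices:
        u[s] -= u[s].mean()
    # 2) scale so the average utility range per attribute equals 100
    ranges = [u[s].max() - u[s].min() for s in attr_slices]
    return u * (100.0 * len(attr_slices) / sum(ranges))

# Toy respondent: attribute 1 has 2 levels, attribute 2 has 3 levels
utils = np.array([1.0, -1.0, 2.0, 0.0, -2.0])
zc = zero_centered_diffs(utils, [slice(0, 2), slice(2, 5)])
```

Applying this per respondent before clustering means each respondent contributes on a comparable scale, regardless of how extreme their raw HB draws were.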

Also, I'm assuming you are dropping the "None" utility and only clustering based on the other attribute levels in your study.  Using the None utility in the clustering for ACBC is not advised.

Next, it seems you are using a piecewise function for price. In that case, I'd recommend using the utilities for the two endpoints and a few interior points, such that the total number of price levels used is similar to the number of levels for the other attributes in your study.
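One simple way to do this point selection (a hypothetical sketch with made-up utility values, not Sawtooth output) is to take evenly spaced indices across the breakpoints, which always includes both endpoints:

```python
import numpy as np

# Hypothetical: 10 piecewise price breakpoint utilities for one respondent.
# Keep the two endpoints plus two interior points, so price enters the
# clustering with ~4 basis variables, comparable to the other attributes.
price_utils = np.array([80.0, 75.0, 65.0, 60.0, 52.0,
                        40.0, 18.0, -12.0, -70.0, -300.0])
keep = np.linspace(0, len(price_utils) - 1, num=4).round().astype(int)
price_basis = price_utils[keep]  # endpoints plus two interior points
```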

If you forget to use zero-centered diffs and instead use the raw HB utilities, this can also be bad for piecewise price, since these utilities are not constrained to be zero-centered.  These raw piecewise price utilities can shift around quite a bit (shifting positive or negative) based on how strong the raw None utility is.
answered Nov 27, 2017 by Bryan Orme Platinum Sawtooth Software, Inc. (176,815 points)
Hi Bryan,

thanks for your answer. To your comments: I read all the recommendations and use individual zero-centered diffs part-worths from the HB estimation, and I dropped the None utility. With regards to piecewise, you are right, I did use all the steps. I can try another run with just the endpoints and a few interior points, though I doubt this will solve my issue. Please see below my ZC-diffs utilities (mean = total) for 2 of the 7 non-price attributes and their levels as an example (I didn't include more to keep the comment short, but they are similar in magnitude/importance), plus price with its piecewise steps, as well as a clustering I ran based on all variables.
As you can see from the utilities at the endpoints and the differences between clusters, the extremely varying f-statistics do make sense from a theoretical perspective (I guess?), but does it actually make any sense if the only decider for a cluster is price? On the other hand, does it make sense to exclude a variable like this?
Label    Total    Cluster1    Cluster2    Cluster3
A1L1    12.55    4.58    20.78    25.55
A1L2    -12.55    -4.58    -20.78    -25.55
A2L1    -57.79    -40.11    -54.54    -95.61
A2L2    21.22    15.18    22.05    33.34
A2L3    36.57    24.93    32.49    62.27
A7L1    -37.09    -22.44    -49.63    -62.08
A7L2    16.77    12.32    22.61    23.51
A7L3    20.32    10.12    27.02    38.57
PRICE: 4.99     80.38    106.13    44.99    42.04
PRICE: 13.99    77.64    102.17    44.92    40.66
PRICE: 17.99    66.66    85.64    39.13    38.98
PRICE: 20.99    61.86    78.45    37.90    37.64
PRICE: 23.99    54.49    69.93    31.84    32.09
PRICE: 26.99    42.03    53.91    18.88    27.15
PRICE: 30.99    17.98    19.57    0.78    21.86
PRICE: 36.99    -11.38    -20.92    -2.84    4.74
PRICE: 46.99    -72.27    -82.55    -52.70    -59.24
PRICE: 70.99    -317.38    -412.32    -162.90    -185.92

thanks again,

If you indeed used just the two extreme prices to represent the Price attribute in your basis variables, then maybe Price is just an extremely strong driver of the segmentation (since it's such an important attribute).

However, if you believe that you are artificially making Price so important by the calculations involving the total possible range of price across all other options, then you can reduce the effect of Price on your segmentation solution by multiplying the two extreme price utilities (that are being used as basis variables) by 1/2, 1/3, 1/4, etc.  Then, re-run the clustering.
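The downweighting step is trivial to apply to a respondents-by-variables basis matrix. A minimal sketch (the column layout and shrink factor are hypothetical):

```python
import numpy as np

def downweight(X, price_cols, factor=1.0 / 3.0):
    """Shrink selected basis columns before clustering.

    X          : respondents x basis-variables matrix
    price_cols : column indices holding the extreme price utilities
    factor     : e.g. 1/2, 1/3, 1/4 to damp price's influence
    """
    Xw = X.astype(float).copy()
    Xw[:, price_cols] *= factor
    return Xw

# Toy matrix: 2 respondents, 3 basis variables; last column is a price utility
X = np.array([[10.0, -5.0, 90.0],
              [-4.0,  6.0, -60.0]])
Xw = downweight(X, price_cols=[2], factor=0.5)
```

After shrinking, re-run k-means on `Xw`; the squared-distance contribution of the shrunk columns drops by the square of the factor.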

Or, another approach to give each basis variable equal weight (if that is your desire) is to normalize the basis variables, such that each basis variable has a mean of 0 and variance of 1.
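Standardizing the basis variables to mean 0 and variance 1 is a one-liner with numpy; a minimal sketch (assuming no basis column is constant, which would give a zero standard deviation):

```python
import numpy as np

def standardize(X):
    """Give each basis variable (column) mean 0 and variance 1,
    so every variable carries equal weight in the clustering."""
    X = X.astype(float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy basis matrix: 4 respondents, 2 variables on very different scales
X = np.array([[100.0, 1.0],
              [-50.0, 2.0],
              [ 30.0, 3.0],
              [-80.0, 4.0]])
Z = standardize(X)
```

The standardized matrix `Z` can then be fed to any k-means routine in place of the raw basis variables.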
Despite price being very important and diverse among respondents, it would be overestimated. I tried your approach of weighting and it seems to work fine. Is there any rule/recommendation for what weighting may make sense or should not be exceeded, e.g. not more than 1/10? As I read, there was a weighting option in the old CCEA version. Do you still have documentation on the weighting?


This stuff can become very "art-filled" with creativity, leading to an infinite number of possible approaches and solutions.  So, although we are using a quantitative, statistical technique, it becomes very subjective.  Beauty is in the eye of the beholder.  And, the most important thing is that the segmentation serve the client well and lead to good strategic thinking.

The "normalization" pre-processing procedure built into our CCEA software indeed zero-centers and gives a variance of 1 to all basis variables.  It is a way to give each variable equal weight in the solution.

However, some might say that if price is much more important than other attributes from the conjoint solution, then it should be allowed to have greater variance and greater impact on the clustering as-is.

Also be aware that using more variables within a dimension (such as more levels of price) gives greater weight to the price variable in a clustering procedure.  Similarly, if two variables are highly related, then they will overweight that underlying (latent) factor in the clustering procedure.
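The overweighting effect from repeated or correlated variables is easy to see in the distance metric itself. In this toy example (not Sawtooth output), duplicating one variable doubles its contribution to the squared Euclidean distance that k-means minimizes:

```python
import numpy as np

# Two respondents differing on two variables, each included once
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
d_single = np.sum((a - b) ** 2)  # each variable contributes 1

# Same respondents, but the first variable is included twice
a2 = np.array([1.0, 1.0, 0.0])
b2 = np.array([0.0, 0.0, 1.0])
d_dup = np.sum((a2 - b2) ** 2)   # duplicated variable now contributes 2
```

Two highly correlated basis variables behave almost like the duplicated column here, which is why they overweight their shared latent factor in the clustering.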