Segmentation on HB utilities for categorical levels

If I need to run a segmentation on HB part-worths (zdiffs) for categorical levels, how can I make sure my segment solutions are not biased toward attributes that have more levels?
Having more levels in a categorical attribute means having more input variables for that attribute in the segmentation. Therefore my segment solutions would reflect variability within the bigger attributes more than variability between attributes.

Think of 3 categorical attributes with 20, 4 and 4 levels.
I would have 20 + 4 + 4 = 28 input variables for segmentation, and 20 of those 28 (71% of all input variables) would describe just the first attribute.
Therefore my segments would mostly reflect variability within this first attribute.
But I want my segments to be driven more by between-attribute variability than by within-attribute variability. I want all 3 attributes to be equally important in my segmentation.

How can I do that?  

I am thinking about first running a segmentation within each categorical attribute separately. This would classify the set of utilities within that attribute into several common patterns, producing a single categorical variable per attribute instead of several input variables for the same attribute. That way I remove the effect of the number of levels in the attribute.
Then I would run the final segment solution on these newly created categorical variables.
Every categorical attribute treated that way would carry the same weight in the final segment solution.
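The two-stage idea could be sketched like this (hypothetical data; the within-attribute clustering step is simplified here to "which level has the highest part-worth" — a real analysis would run a proper cluster routine on each attribute's block of columns instead):

```python
# Collapse each attribute's block of part-worths to one categorical
# label per respondent, so a 20-level attribute and a 4-level attribute
# each contribute exactly one input variable to the final segmentation.
# NOTE: argmax is a simplified stand-in for within-attribute clustering.

attr_sizes = [20, 4, 4]  # the 20/4/4 example from the question

def pattern_labels(utilities, attr_sizes):
    """For each respondent, return one label per attribute: the index
    of the best-liked level within that attribute's block."""
    labels = []
    for row in utilities:
        start, row_labels = 0, []
        for size in attr_sizes:
            block = row[start:start + size]
            row_labels.append(block.index(max(block)))
            start += size
        labels.append(row_labels)
    return labels

# Two made-up respondents with 28 utilities each
utilities = [
    [0.1] * 19 + [0.9] + [0.2, 0.8, 0.1, 0.0] + [0.5, 0.1, 0.2, 0.3],
    [0.9] + [0.1] * 19 + [0.8, 0.2, 0.1, 0.0] + [0.1, 0.5, 0.2, 0.3],
]

print(pattern_labels(utilities, attr_sizes))  # [[19, 1, 0], [0, 0, 1]]
```

Each respondent is now described by 3 categorical variables (one per attribute) instead of 28 metric ones, so the number of levels no longer drives an attribute's weight.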

But this is a labor-intensive and questionable approach.
So my questions are:

1) Does this approach make sense at all?
2) Is there an easier way to handle this bias?

asked Oct 19, 2015 by furoley Bronze (885 points)
retagged Oct 20, 2015 by Walter Williams

1 Answer

+1 vote
I haven't seen that done before, but I suppose it could work.  And the ability of CCEA to use custom ensembles would be a natural way for you to run the final analysis: just treat your several solutions as individual members of an ensemble and let the algorithm do its thing.

I agree, though, that getting to that point would be labor-intensive for you.

An old-fashioned way of giving more weight to under-represented attributes is simply to replicate their columns in the database.  For example, say you have a 10-level attribute and a 2-level attribute.  If you want to give the two attributes equal weight, you could make 4 more copies of the columns containing the utilities for the 2-level attribute.  Now you have 10 columns for your 10-level attribute and 10 columns for your 2-level attribute.  I've always thought of this approach as being a bit clunky, but perhaps it's a fit for your need.
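The column-replication trick might look like this as a sketch (made-up utilities; the helper function is hypothetical):

```python
# Give a 2-level attribute the same weight as a 10-level attribute in
# a Euclidean-distance clustering by appending 4 extra copies of its
# two utility columns, so both attributes end up with 10 columns.

def replicate_columns(rows, start, width, copies):
    """Append `copies` extra copies of columns [start, start+width)
    to every row."""
    out = []
    for row in rows:
        block = row[start:start + width]
        out.append(row + block * copies)
    return out

# 10 columns for attribute A, then 2 columns for attribute B
row = [0.0] * 10 + [1.0, -1.0]
weighted = replicate_columns([row], start=10, width=2, copies=4)
print(len(weighted[0]))  # 20 columns: 10 for A and 10 (2 x 5) for B
```

Replicating a column multiplies its squared-distance contribution, so 5 total copies of each of B's 2 columns match A's 10 columns in the distance computation.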
answered Oct 19, 2015 by Keith Chrzan Platinum Sawtooth Software, Inc. (97,975 points)
Thank you Keith! As usual it's a real pleasure discussing above and beyond things with you.
I never thought of using ensemble CCEA in the final run - I was planning to use Latent Gold on categorical data.

It is a very interesting idea to use ensembles for plain categorical data. It opens a whole new Pandora's box: using CCEA for any type of categorical input, or even for mixed data input.

If that is doable and valid, there should be publications and a Nobel Prize in it.
Yes, while the cluster analysis part of CCEA (the part that was the original CCA) likes metric inputs, the ensembles part uses categorical variables (cluster membership) as inputs.  It doesn't really allow you to mix variable types in a single analysis.  

For that you could run separate analyses for your categorical and metric variables and then combine them in an ensemble; or you could create segments from your metric variables and then add those segment assignments to your categorical variables in an ensembles analysis.  Alternatively, you could go the Latent Gold route or use one of the R cluster analysis packages that allow a distance metric called Gower's distance.
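For reference, Gower's distance handles mixed data by scoring each variable on a 0–1 scale and averaging: numeric variables contribute |x − y| divided by that variable's observed range, categorical variables contribute 0 on a match and 1 on a mismatch. A minimal sketch (hypothetical records; in R it is available e.g. via cluster::daisy with metric = "gower"):

```python
# Minimal Gower's distance for mixed numeric/categorical records.

def gower_distance(a, b, is_numeric, ranges):
    """a, b: records as lists; is_numeric[j] flags variable type;
    ranges[j] is the observed range of numeric variable j
    (ignored for categoricals)."""
    parts = []
    for j, (x, y) in enumerate(zip(a, b)):
        if is_numeric[j]:
            parts.append(abs(x - y) / ranges[j])   # scaled to [0, 1]
        else:
            parts.append(0.0 if x == y else 1.0)   # simple mismatch
    return sum(parts) / len(parts)

# Two respondents: one metric variable (range 10) and two categoricals.
a = [3.0, "seg1", "seg2"]
b = [8.0, "seg1", "seg3"]
d = gower_distance(a, b, is_numeric=[True, False, False],
                   ranges=[10.0, None, None])
print(d)  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```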
Don't forget that an attribute's importance perhaps determines its impact on the final cluster solution more than the number of levels it has.  That's why most researchers don't worry very much about the issue you are raising.  Of course, this assumes you aren't post-processing the data to standardize each of the basis variables you are submitting to clustering.  If you were to do that, then indeed the number of levels an attribute has would have a big impact on its weight in the clustering outcome.
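A quick numerical illustration of that last point (made-up column of utilities): after z-scoring, every basis variable has variance 1, so an attribute's total share of the variance entering the clustering becomes proportional to its number of columns — 20 levels out of 28 standardized columns is 20/28 of the total, regardless of how important the attribute actually is.

```python
# Z-scoring forces every column to variance 1, which is exactly why
# standardization makes the number of levels drive attribute weight.

def zscore(column):
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    return [(x - mean) / var ** 0.5 for x in column]

col = [1.0, 2.0, 4.0, 9.0]  # made-up utilities for one level
z = zscore(col)
mean_z = sum(z) / len(z)
var_z = sum((x - mean_z) ** 2 for x in z) / len(z)
print(round(var_z, 6))  # 1.0 for every standardized column
```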