Getting negative Silhouette numbers when combining 2 ensemble sets in CCEA

Sorry to bother you again.
I followed the advice on segmentation combining categorical and scale variables that has been suggested here several times: first dummy-code the categorical variables, run CCEA Ensemble on these alone, and save the ensemble set. Do the same on the scale variables and save the second ensemble set. Then combine the two ensemble sets manually and run a new ensemble analysis on the custom combined ensemble file.

I got consensus solutions and checked the Silhouette numbers for the solutions with the highest reproducibility. Most segments have negative Silhouette numbers. Hit rates from MDA are also low. The individual solutions were better. I wonder if the consensus solution results (not their individual ensemble files) should be used as input for the combined ensemble file.

Please give me any clue as to why this combined ensemble run is so much worse than the two separate inputs. Something might be wrong with the algorithm if it produces a negative Silhouette. It might be zero, but I have never seen it be negative.
Please help.
asked Jul 14, 2016 by furoley Bronze (885 points)
retagged Jul 18, 2016 by Walter Williams

1 Answer

+1 vote
Dear furoley,

I see negative silhouette scores when I have a segmentation that's really not working well.  If your categorical variables just aren't very related to your metric variables,  then it's possible for the metric variables by themselves to produce high quality clusters and for the categorical variables by themselves to produce high quality clusters while combining the two together produces poor clusters.  I've seen this happen often enough that I suspect this is what's going on in your data.
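On the question of whether a negative silhouette is even possible: the standard silhouette value for a point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster, so it goes negative whenever a point sits closer to another cluster than to its own. The sketch below is a minimal pure-Python illustration using made-up points and Euclidean distance. It is not CCEA's implementation, and the function and variable names are hypothetical.

```python
import math

def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for each point.

    Assumes at least two distinct cluster labels are present.
    """
    n = len(points)
    values = []
    for i in range(n):
        # Sum and count distances from point i to all other points, per cluster.
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            # Singleton cluster: the silhouette is conventionally set to 0.
            values.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in sums if c != own)  # nearest other cluster
        values.append((b - a) / max(a, b))
    return values

# Illustrative data: two tight pairs of points that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette_values(points, [0, 0, 1, 1])  # clusters match the geometry
bad = silhouette_values(points, [0, 1, 0, 1])   # clusters cut across the geometry

print("good:", [round(v, 3) for v in good])
print("bad: ", [round(v, 3) for v in bad])
```

With the second labeling, every point is closer on average to the other cluster than to its own, so every silhouette value comes out below zero. That is the pattern a poorly fitting segmentation produces, and it does not by itself indicate a faulty algorithm.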

Did you include in the mixed ensemble several solutions from the metric data and several from the categorical data?  I would probably do that rather than have an ensemble of the one best metric solution and the one best categorical solution.

I imagine you may already have done this, but I'd run some correlations to see how related your categorical and metric variables are. I suspect they'll be low, and if they are, there's not a lot you can do to create decent segments: any algorithm you try will starve from lack of useful input to work with. If they're not so low, maybe next you could try using the Gower distance metric (it works for both metric and categorical variables and is available through the daisy function in the R package "cluster").
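To make the Gower idea concrete, here is a minimal pure-Python sketch of the basic calculation: each metric variable contributes the absolute difference divided by that variable's range, each categorical variable contributes 0 for a match and 1 for a mismatch, and the per-variable scores are averaged. The respondents, variable names, and helper functions are invented for illustration. The sketch leaves out what the R function adds, such as handling of missing values, ordinal variables, and variable weights.

```python
def numeric_ranges(rows, is_categorical):
    """Range (max - min) of each metric column; None for categorical columns."""
    ranges = []
    for k, cat in enumerate(is_categorical):
        if cat:
            ranges.append(None)
        else:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
    return ranges

def gower_distance(row_a, row_b, is_categorical, ranges):
    """Average per-variable dissimilarity between two respondents."""
    total = 0.0
    for a, b, cat, rng in zip(row_a, row_b, is_categorical, ranges):
        if cat:
            # Categorical: simple mismatch (0 if equal, 1 otherwise).
            total += 0.0 if a == b else 1.0
        else:
            # Metric: absolute difference scaled by the column range.
            total += abs(a - b) / rng if rng > 0 else 0.0
    return total / len(row_a)

# Hypothetical respondents: (age, spend, region).
rows = [
    (25, 100.0, "north"),
    (35, 300.0, "north"),
    (45, 500.0, "south"),
]
is_categorical = (False, False, True)
ranges = numeric_ranges(rows, is_categorical)

# Full pairwise distance matrix.
dist = [
    [gower_distance(a, b, is_categorical, ranges) for b in rows]
    for a in rows
]
for line in dist:
    print([round(d, 3) for d in line])
```

The resulting matrix can be passed to any clustering method that accepts precomputed distances. Whether that helps still depends on the categorical and metric variables sharing some structure.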
answered Jul 14, 2016 by Keith Chrzan Platinum Sawtooth Software, Inc. (97,375 points)
Thank you, Keith. Yes, you are right. A PLS regression between the categorical and scale variables showed single-digit variance explained. The link is very weak.

But I used the whole ensemble file (70 solutions) from the metric variables and the whole ensemble file (another 70 solutions) from the dummy-coded categorical variables, so my combined ensemble had 140 solutions. Are you suggesting I do some cherry-picking from the resulting consensus solutions on each side when creating the combined ensemble file, as opposed to using both whole ensemble files?
No, I think if your linkage is that weak, there's just not enough in common between the categorical and metric variables to think that you're going to find a strong segmentation solution based on the two sets of more or less unrelated inputs.
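For readers who want a quick check of how strongly a categorical variable is linked to a metric one, a simpler alternative to the PLS regression mentioned above is the correlation ratio (eta squared): the share of the metric variable's variance explained by the category means. The sketch below uses it in place of PLS and does not reproduce the analysis in this thread. The data and function name are made up.

```python
def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares over total sum of squares."""
    grand_mean = sum(values) / len(values)

    # Collect the metric values observed within each category.
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)

    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return ss_between / ss_total if ss_total > 0 else 0.0

# Hypothetical metric variable and two candidate categorical variables.
values = [1.0, 2.0, 9.0, 10.0]
strong_link = ["a", "a", "b", "b"]  # categories line up with the metric values
weak_link = ["a", "b", "a", "b"]    # categories cut across the metric values

eta_strong = correlation_ratio(strong_link, values)
eta_weak = correlation_ratio(weak_link, values)
print(round(eta_strong, 3), round(eta_weak, 3))
```

Values near zero across most categorical and metric pairs would match the single-digit variance explained reported above. They would also suggest that the two sets of inputs are describing largely unrelated structure.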