Testing the CBC Design

 


Introduction

 

In CBC, a design refers to the sum total of the task descriptions across all respondents.  The design contains information about the combinations of attribute levels that make up the product concepts within the tasks.  The design is saved to a design file that you upload to your web server.  Optimally efficient CBC designs can estimate all part-worths with optimal precision, meaning that the standard errors of the estimates are as small as possible given the total number of observations (respondents x tasks), the number of product concepts displayed per task, and respondent preferences.

 

CBC's random design strategies generally result in very efficient designs.  These designs are not optimally efficient, but are nearly so.  In the case of large sample sizes, a large number of questionnaire versions in the design file, and no prohibitions, one can confidently field a questionnaire without testing the design.

 

However, there are conditions that can result in inefficient designs.  Sometimes, a design can be so inefficient as to defy all attempts to compute reasonable part-worth utilities.  We have heard of entire data sets with hundreds of respondents going to waste because the user neglected to test the design.

 

Therefore, it is imperative to test your design whenever any of the following conditions exist:

 

any prohibitions are included

sample size (respondents x tasks) is abnormally small

the number of questionnaire versions you plan to use is small

 

The Test Design option simulates "dummy" respondent answers and reports the standard errors (from a logit run) along with D-efficiency.  

 

Our Test Design approach assumes aggregate analysis, though most CBC users eventually employ individual-level estimation via HB.  That said, CBC's design strategies can produce designs that are efficient at both the aggregate and individual levels.

 

Prohibitions are often the culprit when it comes to unacceptable design efficiency.  If your prohibitions result in unacceptably low design efficiency under the Complete Enumeration or Balanced Overlap Methods, you should try the Shortcut or Random design strategies.  These latter two methods are less constrained than the more rigorous former ones, and will sometimes result in higher design efficiencies in the case of many prohibitions.

 


Testing the Efficiency of Your Design

 

When you choose Test Design from the CBC exercise Design tab, CBC automatically tests the design and displays the results within the results window.  CBC automatically generates simulated (dummy) respondent answers (default n=300) appropriate for advanced design testing.

 


The Frequency Test

 

When you generate a design or when you run Test Design, a preliminary counting report is shown:

 

           CBC Design: Preliminary Counting Test

         Copyright Sawtooth Software

 

         Task Generation Method is 'Complete Enumeration' using a seed of 1

         Based on 10 version(s).

         Includes 100 total choice tasks (10 per version).

         Each choice task includes 3 concepts and 6 attributes.

 

     ---------------------------------------------------------------

       Att/Lev  Freq.  

         1 1      75  Brand A

         1 2      75  Brand B

         1 3      75  Brand C

         1 4      75  Brand D

 

         2 1     100  1.5 GHz

         2 2     100  2.0 GHz

         2 3     100  2.5 GHz

 

         3 1     100  3 lbs

         3 2     100  5 lbs

         3 3     100  8 lbs

 

         4 1     100  60 GB Hard Drive

         4 2     100  80 GB Hard Drive

         4 3     100  120 GB Hard Drive

 

         5 1     100  512 MB RAM

         5 2     100  1 GB RAM

         5 3     100  2 GB RAM

 

         6 1     100  $500

         6 2     100  $750

         6 3     100  $1,000

 

 

For each level, the number of times it occurs within the design is counted and provided under the column titled "Freq."  Optimally efficient designs show levels within each attribute an equal number of times.  Designs do not have to have perfect balance to be quite efficient in practice.  

 

A two-way frequency count is also reported.  This describes how often each level of each attribute appears together with each level of every other attribute.
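
To make the counting logic concrete, the short Python sketch below (a rough illustration, not Sawtooth Software's code) tallies one-way and two-way level frequencies for a design stored as one row per concept and one column per attribute.  The tiny design fragment is hypothetical.

from collections import Counter
from itertools import combinations

# Hypothetical fragment of a design: one row per concept, one column per
# attribute, each cell holding the level code shown for that attribute.
design = [
    [1, 2, 3],
    [2, 1, 1],
    [3, 3, 2],
    [4, 2, 1],
]
n_attributes = len(design[0])

# One-way frequencies: how often each level of each attribute appears.
for att in range(n_attributes):
    freq = Counter(row[att] for row in design)
    for level in sorted(freq):
        print(f"Att {att + 1}  Level {level}  Freq {freq[level]}")

# Two-way frequencies: how often each level of one attribute appears
# together with each level of every other attribute.
for a, b in combinations(range(n_attributes), 2):
    pairs = Counter((row[a], row[b]) for row in design)
    for (lev_a, lev_b), count in sorted(pairs.items()):
        print(f"Att {a + 1} Lev {lev_a}  with  Att {b + 1} Lev {lev_b}: {count}")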

 

You should run Test Design (described below) to assess more accurately the precision of your design given your expected sample size.

 


Test Design

 

Rather than just analyzing your design via counting, Test Design (Advanced Test) estimates the absolute precision of the parameter estimates under aggregate estimation, given your experimental design and expected sample size.  Test Design is useful for both standard and complex designs that include interactions or alternative-specific effects.  It also reports a widely accepted measure of design efficiency called D-efficiency, which summarizes the overall relative precision of the design.

 

The estimated standard errors are only absolutely correct if the assumptions regarding the underlying part-worths and the error in responses are correct.  Technically, the utility balance among the concepts within each choice task also affects overall design efficiency, and thus respondents' preferences would need to be known to fully assess the efficiency of a design.  However, most researchers are comfortable planning designs that are efficient with respect to uninformative (zero) part-worth utility values, and that is the approach we take.

 

Test Design simulates random (dummy) respondent answers for your questionnaire, for as many respondents as you plan to interview.  The test is run with respect to a given model specification (main effects plus optional first-order interactions that you specify).  

 

To perform Test Design, you need to supply some information:

 

Number of Respondents

% None (if applicable to your questionnaire)

Included Interaction Effects (if any)

 

With this information, CBC simulates random respondent answers to your questionnaire.  Using random respondent answers is considered a robust approach, because it estimates the efficiency of the design for respondents with unknown preferences.

 

Simulated respondents are assigned to the versions of your questionnaire (the first respondent receives the first version, the second respondent the second version, etc.).  If you are simulating more respondents than versions of the questionnaire, once all versions have been assigned, the next respondent starts again with the first version.  If your study includes a None alternative, then the None is selected with expected probability equal to the value you previously specified.  
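
As a rough illustration of this simulation step (a sketch of the general idea, not Sawtooth Software's implementation), the Python code below assigns dummy respondents to versions in round-robin fashion and draws random choices, selecting the None alternative with a specified probability.  The parameter values are hypothetical, chosen to echo the worked example later in this section.

# Minimal sketch of simulating random ("dummy") respondent answers.
# Assumptions (illustrative only): 10 versions, 12 tasks per respondent,
# 4 product concepts per task plus a None alternative chosen with 15%
# probability.
import numpy as np

rng = np.random.default_rng(seed=1)

n_respondents = 300
n_versions    = 10
n_tasks       = 12
n_concepts    = 4
p_none        = 0.15

answers = []
for r in range(n_respondents):
    version = r % n_versions + 1               # round-robin assignment to versions
    for task in range(1, n_tasks + 1):
        if rng.random() < p_none:
            choice = n_concepts + 1            # the None alternative
        else:
            choice = rng.integers(1, n_concepts + 1)   # random concept, chosen uniformly
        answers.append((r + 1, version, task, choice))

# The simulated answers would then be submitted to an aggregate logit run.
none_share = np.mean([a[3] == n_concepts + 1 for a in answers])
print(f"Observed None share: {none_share:.1%}")   # close to 15% with 300 respondents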

 

Once the data set has been simulated, Test Design performs an aggregate logit (MNL) run, estimating the effects you selected (by default, only main effects are considered).  Sample results are shown below:

 

Logit Report with Simulated Data

------------------------------------------------------------

Main Effects: 1, 2, 3, 4, 5, 6

Interactions: 1x6

 

Build includes 300 respondents.

 

Total number of choices in each response category:

 

     Category Number Percent

----------------------------------------------------

            1    787  21.86%

            2    753  20.92%

            3    778  21.61%

            4    792  22.00%

            5    490  13.61%

 

There are 3600 expanded tasks in total, or an average of 12.0 tasks per respondent.

               

        Std Err    Attribute Level

 1      0.03186    1 1 Brand A

 2      0.03182    1 2 Brand B

 3      0.03195    1 3 Brand C

 4      0.03215    1 4 Brand D

 

 5      0.02638    2 1 1.5 GHz

 6      0.02639    2 2 2.0 GHz

 7      0.02654    2 3 2.5 GHz

 

 8      0.02638    3 1 3 lbs

 9      0.02631    3 2 5 lbs

10      0.02645    3 3 8 lbs

 

11      0.02655    4 1 60 GB Hard Drive

12      0.02620    4 2 80 GB Hard Drive

13      0.02644    4 3 120 GB Hard Drive

 

14      0.02644    5 1 512 MB RAM

15      0.02661    5 2 1 GB RAM

16      0.02615    5 3 2 GB RAM

 

17      0.02610    6 1 $500

18      0.02647    6 2 $750

19      0.02669    6 3 $1,000

 

20      0.04975    Brand A by $500

21      0.05101    Brand A by $750

22      0.05124    Brand A by $1,000

23      0.05083    Brand B by $500

24      0.05040    Brand B by $750

25      0.05060    Brand B by $1,000

26      0.05050    Brand C by $500

27      0.05104    Brand C by $750

28      0.05094    Brand C by $1,000

29      0.05038    Brand D by $500

30      0.05102    Brand D by $750

31      0.05164    Brand D by $1,000

 

32      0.04862    NONE

 

The strength of design for this model is: 3,256.006

(The ratio of strengths of design for two designs reflects the D-Efficiency of one design relative to the other.)

 

Note that for testing design efficiency we report only the standard errors of the estimates.  The utility values (the effects) are based on random answers and therefore are not of interest.  Details regarding the logit report may be found in the section entitled Estimating Utilities with Logit.

 

The beginning of the report lists the effects we are estimating (main effects for attributes 1 through 6, plus the interaction effect between attributes 1 and 6).  All random tasks are included in estimation.

 

Next, we see that 300 respondents, each answering 12 tasks, were simulated using random responses, taking into account the expected probability for None.  We specified that the None would be chosen with 15% likelihood, and indeed the observed None percentage is very close to that (13.61%).  If we had used more respondents, the observed percentage would likely have been even closer to 15%.  The remaining choices are spread approximately evenly across the four other alternatives in the questionnaire.

 

Next, the standard errors from the logit report based on the random responses are shown.  The standard errors reflect the precision we obtain for each parameter.  Lower error means greater precision.  This design included no prohibitions, so the standard errors are quite uniform within each attribute.  If we had included prohibitions, some levels might have been estimated with much lower precision than others within the same attribute.

 

For our simulated data above, the levels within the three-level attributes all have standard errors around 0.026.  The one four-level attribute has standard errors for its levels around 0.032.  We have obtained less precision for the four-level attribute, since each of its levels appears fewer times in the design than the levels of the three-level attributes.  The interaction effects have standard errors around 0.051.
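
For readers who want to see where these numbers come from: in aggregate logit, the standard error of each parameter is the square root of the corresponding diagonal element of the inverse of the information matrix (the negative Hessian of the log-likelihood at the solution).  Below is a minimal Python sketch of that final step; the information matrix shown is made up for illustration and is not taken from the report above.

# Minimal sketch: standard errors from a logit information matrix.
# 'info' stands in for the information matrix (negative Hessian) produced by
# an aggregate MNL estimation; the values below are illustrative only.
import numpy as np

info = np.array([[950.0,  -12.0,    8.0],
                 [-12.0,  980.0,  -15.0],
                 [  8.0,  -15.0,  965.0]])

covariance = np.linalg.inv(info)              # asymptotic covariance matrix
std_errors = np.sqrt(np.diag(covariance))     # one standard error per parameter

for k, se in enumerate(std_errors, start=1):
    print(f"Parameter {k}: Std Err = {se:.5f}")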

 

Suggested guidelines are:

 

Standard errors within each attribute should be roughly equivalent

Standard errors for main effects should be no larger than about 0.05

Standard errors for interaction effects should be no larger than about 0.10

Standard errors for alternative-specific effects (an advanced type of design) should be no larger than about 0.10

 

These criteria are rules of thumb based on our experience with many different data sets and our opinions regarding minimum sample sizes and minimum acceptable precision.  Ideally, we prefer standard errors from this test of less than 0.025 and 0.05 for main effects and interaction effects, respectively.  These simulated data (300 respondents with 12 tasks each) almost meet that higher standard for this particular attribute list and set of effects.
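
If you want to automate this check, a simple script can compare each reported standard error against the thresholds above.  The sketch below is illustrative only; the labels and values echo the simulated report, and the threshold names are our own.

# Minimal sketch: flag parameters whose simulated-data standard errors exceed
# the rule-of-thumb thresholds quoted above. Labels and values are illustrative.
THRESHOLDS = {"main": 0.05, "interaction": 0.10, "alt_specific": 0.10}

estimates = [
    ("Brand A",           "main",        0.032),
    ("1.5 GHz",           "main",        0.026),
    ("Brand A by $500",   "interaction", 0.050),
]

for label, effect_type, se in estimates:
    limit = THRESHOLDS[effect_type]
    status = "OK" if se <= limit else "CHECK DESIGN"
    print(f"{label:20s} {effect_type:12s} SE={se:.3f} (limit {limit:.2f}) -> {status}")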

 


D-Efficiency

 

D-efficiency summarizes how precisely this design can estimate all the parameters of interest with respect to another design, rather than how well the design can estimate the utility of each level of each attribute (as with the simpler default test).  D-efficiency is described in an article by Kuhfeld, Tobias, and Garratt (1994), "Efficient Experimental Design with Marketing Research Applications," Journal of Marketing Research, 31 (November), 545-557.

 

To arrive at D-efficiency, we should define a few terms:

 

X_t = design matrix for task t, with a row for each alternative
x_i = ith row of X_t
p_i = probability of choice of alternative i
v   = probability-weighted mean of the rows:  v = sum_i p_i x_i
Z_t = matrix with ith row z_i = p_i^(1/2) (x_i - v)
Z   = matrix made by appending all Z_t matrices

 

Z'Z is known as the "Information Matrix"

The determinant of Z'Z measures the strength of the design.

 

Because the magnitude of the determinant of Z'Z depends on the number of parameters estimated, to provide a measure of strength independent of p we consider the pth root of the determinant:

 

 |Z'Z|^(1/p)

 

Where Z is the probability-centered design matrix, Z'Z is the "Information Matrix," and p is the number of parameters estimated.

 

The pth root of the determinant is not a single value bounded by 0 and 1.0 (as with the simpler test efficiency report), and this value is meaningless without reference to the same quantity computed for a comparison design.  This value also depends on the number of respondents x tasks, so when comparing two designs, it is important to hold the number of respondents x tasks constant.  We use the term "efficiency" to compare the relative strengths of two designs.  The relative D-efficiency of one design with respect to the other is given by the ratio of the pth roots of the determinants of their information matrices.  The design with the larger value is the more efficient design.  (Note: we only consider the precision of parameters other than the "None" parameter when computing the strength of the design.)

 

Consider design A with no prohibitions and design A' with prohibitions.  The pth root of the determinant of the information matrix is computed for both (holding the number of respondents, tasks, concepts per task, and None % constant).  If design A' has a value of 2,500 and design A has a value of 3,000, design A' is 2,500/3,000 = 83.3% as efficient as design A.  The inclusion of prohibitions resulted in a 1 - 0.833 = 16.7% loss in efficiency.
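
To make the computation concrete, here is a minimal Python sketch (not Sawtooth Software's code) that builds Z under the assumption of zero part-worths (so all concepts within a task are equally likely to be chosen), forms the information matrix Z'Z, and takes the pth root of its determinant.  The tiny effects-coded design is hypothetical; comparing the value for two designs, as in the example above, gives their relative D-efficiency.

# Minimal sketch of the strength-of-design computation described above.
# Each task is an array with one effects-coded row per alternative; the tiny
# two-task design below is hypothetical and for illustration only.
import numpy as np

tasks = [
    np.array([[ 1.0,  0.0], [ 0.0,  1.0], [-1.0, -1.0]]),   # task 1: 3 alternatives, p = 2 parameters
    np.array([[ 0.0,  1.0], [-1.0, -1.0], [ 1.0,  0.0]]),   # task 2
]

def strength_of_design(tasks):
    z_blocks = []
    for X_t in tasks:
        n_alt = X_t.shape[0]
        p_i = np.full(n_alt, 1.0 / n_alt)          # zero part-worths -> equal choice probabilities
        v = p_i @ X_t                              # probability-weighted mean of the rows
        Z_t = np.sqrt(p_i)[:, None] * (X_t - v)    # ith row: p_i^(1/2) * (x_i - v)
        z_blocks.append(Z_t)
    Z = np.vstack(z_blocks)                        # append all Z_t matrices
    info = Z.T @ Z                                 # the "Information Matrix" Z'Z
    p = info.shape[0]
    return np.linalg.det(info) ** (1.0 / p)        # strength of design: |Z'Z|^(1/p)

strength_A = strength_of_design(tasks)
print(f"Strength of design: {strength_A:.6f}")

# Relative D-efficiency of a second design A' (holding respondents x tasks
# constant) would be: strength_of_design(tasks_A_prime) / strength_A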

 


Efficiency for Specific Parameters (Attribute Levels)

 

Sometimes, you may be more concerned about the efficiency of the design for estimating a specific parameter (such as the utility for your client's brand) rather than an overall efficiency of the design across all parameters.  Let's assume that your client asked you to implement prohibitions between the client's brand name and other levels.  Further assume that the overall relative strength of the design with prohibitions relative to the design without prohibitions is 97%.  On the surface, this seems like little overall loss in efficiency.  However, you note that the standard error (from the logit report using simulated data) for your client's brand was 0.026 prior to implementing the prohibition, but 0.036 afterward.  The relative efficiency of the design with prohibitions relative to the non-prohibited design with respect to this particular attribute level is:

 

 a^2 / b^2

 

Where b is the standard error of the estimate for the client's brand name after the prohibition and a is the standard error prior to the prohibition.  In this case, the relative design efficiency of the prohibited compared to the non-prohibited design with respect to this particular level is:

 

 0.026^2 / 0.036^2 = 0.52

 

And the impact of these prohibitions on estimating your client's brand utility is more fully appreciated.
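
As a quick check, the calculation above can be reproduced directly from the two standard errors (a minimal sketch; the function name is ours):

# Minimal sketch: relative efficiency of a prohibited vs. non-prohibited design
# for one specific level, computed from the two simulated-data standard errors.
def relative_level_efficiency(se_before, se_after):
    """Return a^2 / b^2, where a = SE before and b = SE after the prohibition."""
    return (se_before ** 2) / (se_after ** 2)

print(round(relative_level_efficiency(0.026, 0.036), 2))   # 0.52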

 

Additional Note:  The pattern of random answers (respondent preferences) for random respondent data will have a small, yet perceptible, effect on the reported Strength of Design (relative D-Efficiency).  For the purpose of estimating the absolute standard errors of the parameters, we suggest using the same number of dummy respondents as you plan to achieve with your study.  But, for comparing the merits of one design to another, you can reduce the effect of the random number seed by increasing the dummy respondent sample size significantly, such as to 5000 or 10000 respondents.  Increasing the sample size will greatly reduce the variation of the Strength of Design measure due to the random seed used for respondent answers.
