CBC HB Estimation Settings


Respondent ID

If estimating utilities using Latent Class or HB, you should select the Respondent ID to use for identifying the respondents in the utility files and segment membership files.  If using the utilities within the SMRT system, make sure that the variable chosen leads to unique numeric values in the range of 0 to 999999999.  Our other two market simulators (online simulator and choice simulator) can handle unique integers or unique alphanumeric strings up to 255 characters long.

 

Respondent Filter

Filters allow you to perform analysis on a subset of the data.  Categories of the sys_RespStatus variable (Incomplete, Disqualified, or Qualified/Complete) may be selected for inclusion in the analysis.  Using the Manage Filters... dialog, additional filters may be selected from variables in your study, or from new variables that you build based on existing variables or combinations of existing variables.

 

Weights

Counts, Logit, and Latent Class can use weights to give some respondents more importance in the calculations than others.  You may select an existing variable to use as a weight, and values associated with each case are applied as weights.  Or, you may specify weights to apply to categories within an existing variable (such as Male=1.25, Female=0.75) if you use the Analysis | Create New Segments... dialog.  Weights may not be applied during HB estimation.

 

Tasks

This area allows you to select which tasks to include in analysis.  Fixed (holdout) tasks are omitted from analysis by default.  You may wish to omit other tasks as well, such as the first task, which often serves as a warm-up.

 

Attribute Coding

This dialog lets you select which attributes to include in your analysis and how to code them (part-worth or linear coding).  If you select linear coding, a single utility coefficient is fit to the attribute, such as a slope for price, speed, or weight.

 

Linear Coding of Quantitative Attributes:

 

When you specify that an attribute should be coded as a linear term, a single column of values is used for this attribute in the independent variable matrix during utility estimation.  A weight (slope) is fit for that independent variable that provides the best fit.  A column opens up for you to specify the Value used in the estimation matrix for each level of the quantitative attribute.  By default, these are 1, 2, 3, etc.  However, you should specify values that metrically correspond to the quantities shown to respondents.  For example, if respondents saw levels of "1 lb., 2 lb., and 6 lb.," then the values associated with those levels should be 1, 2, and 6.  Please note that CBC will automatically zero-center any values you specify when creating the independent variable matrix.  So, values of 1, 2, and 6 will be converted to -2, -1, and +3 in the independent variable matrix prior to utility estimation.

 

If using linear coding and HB, please note that level values should be specified at a magnitude of roughly single digits to encourage quick and proper convergence.  In other words, rather than specifying 10000, 40000, 70000, one should specify 1, 4, 7.  And rather than specifying 0.01, 0.04, 0.07, one should also specify 1, 4, 7.
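
To make the arithmetic concrete, here is a minimal sketch (illustrative Python, not Sawtooth Software code) of the rescaling and zero-centering described in the two paragraphs above.

    # Illustrative sketch only: rescale quantitative level values to
    # single-digit magnitude, then zero-center them as CBC does before
    # building the independent variable matrix.
    def rescale(values, divisor):
        return [v / divisor for v in values]

    def zero_center(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    print(zero_center([1, 2, 6]))                               # [-2.0, -1.0, 3.0], as in the example above
    print(zero_center(rescale([10000, 40000, 70000], 10000)))   # [-3.0, 0.0, 3.0]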

 

Interactions

 

Sometimes estimating a separate set of utility values for each attribute (main effects) does not fit the data as well as when also fitting selected interaction effects.  This occurs when the utilities for attributes are not truly independent.  We encourage you to consider interaction terms that can significantly improve fit.  But, we caution against adding too many interaction terms, as this can lead to overfitting and slow estimation times.  

 

The interaction between two attributes with j and k levels leads to (j-1)(k-1) interaction terms to be estimated.  But, when the utilities are "expanded" to include the reference levels in the reports and utility files, a total of jk interaction terms are reported.
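
For example, two attributes with 3 and 4 levels yield (3-1)(4-1) = 6 estimated interaction terms, which expand to 3 x 4 = 12 reported terms.  The sketch below illustrates one way such an expansion could be computed under effects-coding constraints (each row and column of the expanded table sums to zero); it is an illustration, not Sawtooth Software's implementation.

    import numpy as np

    # Illustrative sketch: expand a (j-1) x (k-1) block of estimated interaction
    # effects to the full j x k table, assuming effects-coding constraints
    # (every row and every column of the expanded table sums to zero).
    def expand_interaction(block):
        block = np.asarray(block, dtype=float)
        full = np.zeros((block.shape[0] + 1, block.shape[1] + 1))
        full[:-1, :-1] = block
        full[:-1, -1] = -block.sum(axis=1)   # omitted column = negative row sums
        full[-1, :-1] = -block.sum(axis=0)   # omitted row = negative column sums
        full[-1, -1] = block.sum()           # corner makes the last row and column sum to zero
        return full

    estimated = [[0.4, -0.1, 0.2],           # (3-1) x (4-1) = 6 estimated terms
                 [-0.3, 0.5, -0.1]]
    print(expand_interaction(estimated))     # full 3 x 4 = 12 reported terms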

 

Constraints

If certain attributes have levels with known utility order (best to worst, or worst to best) that you expect all respondents would agree with, you may decide to constrain these attributes so that all respondents (or groups, in the case of logit or latent class) adhere to those utility constraints.  For more information, please see the Latent Class or CBC/HB manuals.

 


Estimation Settings

 

Iterations

 

Number of iterations before using results

This determines the number of iterations that will be done before convergence is assumed.  The default value is 10,000, but we have seen data sets where fewer iterations were required, and others that required many more (such as with very sparse data relative to the number of parameters to estimate at the individual level).  One strategy is to accept this default but to monitor the progress of the computation, and halt it earlier if convergence appears to have occurred.  Information for making that judgment is provided on the screen as the computation progresses, and a history of the computation is saved in a file named studyname.log.  The computation can be halted at any time and then restarted.

 

Number of draws to be used for each respondent

The number of iterations used in analysis, such as for developing point estimates.  If not saving draws (described next), we recommend accumulating draws across 10,000 iterations for developing the point estimates.  If saving draws, you may find that using more than about 1,000 draws can lead to truly burdensome file sizes.

 

Save random draws

Check this box to save random draws to disk, in which case final point estimates of respondents' betas are computed by averaging each respondent's draws after iterations have finished.  The default is not to save random draws; instead, the means and standard deviations of each respondent's draws are accumulated as iterations progress.  If draws are not saved, the means and standard deviations are available immediately after iterations finish, with no further processing.  We believe the means and standard deviations of the draws summarize almost everything about them that is likely to be important to you.
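
For readers who do save draws, the relationship between draws and point estimates is simply a per-respondent average.  Here is a minimal sketch (illustrative Python with hypothetical dimensions, not Sawtooth Software code) of that computation.

    import numpy as np

    # Illustrative sketch: saved draws arranged as (respondents, draws, parameters).
    # The point estimates are the per-respondent means; the within-respondent
    # standard deviations summarize the spread of each respondent's draws.
    rng = np.random.default_rng(0)
    draws = rng.normal(size=(500, 1000, 25))    # hypothetical: 500 respondents, 1,000 draws, 25 part worths

    point_estimates = draws.mean(axis=1)        # one row of betas per respondent
    within_resp_sd = draws.std(axis=1)          # per-respondent standard deviations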

 

However, if you choose to save draws to disk for further analysis, there is a trade-off between the benefit of greater statistical precision and the time required for estimation and the potential difficulty of dealing with very large files.  Suppose you were estimating 25 part worths for each of 500 respondents, a "medium-sized" problem.  Each iteration would require about 50,000 bytes of hard disk storage.  Saving the results for 10,000 iterations would require about 500 megabytes.  Approximately the same amount of additional storage would be required for interim results, so the entire storage requirement for even a medium-sized problem could exceed one gigabyte.
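
The storage figures above can be reproduced with simple arithmetic.  The sketch below assumes roughly 4 bytes per stored value, which is consistent with the numbers quoted; it is illustrative only.

    # Rough storage estimate for saved draws (illustrative; assumes about
    # 4 bytes per stored value, consistent with the figures above).
    respondents, parameters, iterations = 500, 25, 10_000
    bytes_per_value = 4

    bytes_per_iteration = respondents * parameters * bytes_per_value   # about 50,000 bytes
    total_megabytes = bytes_per_iteration * iterations / 1_000_000     # about 500 MB
    print(bytes_per_iteration, total_megabytes)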

 

Skip factor for saving random draws (if saving draws)

This is only applicable when saving draws to disk.  The skip factor is a way of compensating for the fact that successive draws of the betas are not independent.  A skip factor of k means that results will only be used for each kth iteration.  Recall that only about 30% of the "new" candidates for beta are accepted in any iteration; for the other 70% of respondents, beta is the same for two successive iterations.  This dependence among draws decreases the precision of inferences made from them, such as their variance.  If you are saving draws to disk, because file size can become critical, it makes sense to increase the independence of the draws saved by conducting several iterations between each two for which results are saved.  If 1,000 draws are to be saved for each respondent and the skip factor is 10, then 10,000 iterations will be required to save those 1,000 draws.

 

We do not skip any draws when draws are "not saved," since skipping draws to achieve independence among them is not a concern if we are simply collapsing them to produce a point estimate.  It seems wasteful to skip draws if you do not plan to analyze the draws file separately.  We have advocated using the point estimates available in the .HBU file, as we believe that draws offer little incremental information for the purposes of running market simulations and summarizing respondent preferences.  However, if you plan to save the draws file and analyze it, we suggest using a skip factor of 10.  In that case, you will want to use a more practical number of draws per person (such as 1,000 rather than the 10,000 default used when not saving draws), to avoid extremely large draws files.

 

Skip factor for displaying in graph

This controls the amount of detail that is saved in the graphical display of the history of the iterations.  If using a large number of iterations (such as >50,000), graphing the iterations can require significant time and storage space.  In that case, we recommend increasing the skip factor to keep estimation running smoothly.

 

Skip factor for printing in log file

This controls the amount of detail that is saved in the studyname.log file to record the history of the iterations.  Several descriptive statistics for each iteration are printed in the log file.  But since there may be many thousands of iterations altogether, it is doubtful that you will want to record every one of them.  We suggest recording only every hundredth.  In the case of a very large number of iterations, you might want to record only every thousandth.

 

Data Coding

 

Total task weight for constant sum data

This option is only applicable if you are using allocation-based responses rather than discrete choices in the data file.  If you believe that respondents allocated ten chips independently, you should use a value of ten.  If you believe that the allocations of chips within a task are entirely dependent on one another (such as if every respondent awards all chips to the same alternative), you should use a value of one.  The truth probably lies somewhere in between, and for that reason we suggest 5 as a default value.  A data file using discrete choices always uses a total task weight of 1.

 

Include 'none' parameter if available

We generally recommend always estimating the none parameter (but perhaps ignoring it during later simulation work).  However, you can omit the "none" parameter by unchecking this box.  In that case, any tasks where None has been answered are skipped.  The None parameter (column) and None alternative are omitted from the design matrix.

 

Tasks to include for best/worst data  

If you used the best/worst input option, you can select which tasks to include in utility estimation: best only, worst only, or best and worst.

 

Code variables using effects/dummy coding

With effects coding, the last level within each attribute is "omitted" to avoid linear dependency, and is estimated as the negative sum of the other levels within the attribute.  With dummy coding, the last level is also "omitted," but is assumed to be zero, with the other levels estimated with respect to that level's zero parameter.
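
As a concrete illustration (a simplified sketch, not the exact internal design-matrix layout), here is how one 3-level attribute would be coded under each scheme, and how the omitted level's part worth is recovered.

    # Illustrative coding of a single 3-level attribute, with the last level "omitted".
    dummy_coding = {
        1: [1, 0],
        2: [0, 1],
        3: [0, 0],       # omitted level: utility fixed at zero
    }
    effects_coding = {
        1: [1, 0],
        2: [0, 1],
        3: [-1, -1],     # omitted level: utility = negative sum of the other levels
    }

    # Recovering the omitted level's part worth from hypothetical estimates for levels 1 and 2:
    estimated = [0.8, -0.2]
    omitted_dummy = 0.0                # dummy coding: the reference level is zero
    omitted_effects = -sum(estimated)  # effects coding: -(0.8 + (-0.2)) = -0.6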

 

Since the release of CBC v1 in 1993, we have used effects coding for estimation of parameters in CBC studies.  Effects coding and dummy coding produce identical results (within an additive constant) for OLS or logit estimation.  But the part worths estimated using effects coding are generally easier to interpret than those from dummy coding, especially for models that include interaction terms, since the main effects and interactions are orthogonal (and separately interpretable).

 

For HB analysis (as Rich Johnson pointed out in his paper "The Joys and Sorrows of Implementing HB Methods for Conjoint Analysis"), the results can depend on the design coding procedure when there is limited information available at the unit of analysis relative to the number of parameters to estimate.  Even though we have introduced negative prior correlations in the off-diagonal elements of the prior covariance matrix to reduce or eliminate the problem with effects coding and the "omitted" parameter for extreme data sets, there may be cases in which some advanced analysts still prefer to use dummy coding.  This is a matter of personal preference rather than a question of one method being substantially better than the other.

 

Miscellaneous

 

Target acceptance

This is used to set the target rate at which new draws of beta are accepted (the jump size is dynamically adjusted to achieve the target rate of acceptance).  The default value of 0.3 indicates that on average 30% of new draws should be accepted.  The target acceptance has a range between 0.01 and 0.99.  Reports in the literature suggest that convergence will occur more rapidly if the acceptance rate is around 30%.

 

Starting seed

The starting seed is a value used to seed the random number generator used to draw multivariate normals during estimation.  If a non-zero seed is specified, the results are repeatable for that seed.  If the seed is zero, the system will use the computer clock to randomly choose a seed between 1 and 10000.  The chosen seed will be written to the estimation log.  When using different random seeds, the posterior estimates will vary, but insignificantly, assuming convergence has been reached and many draws have been used.

 


Advanced

 

Covariance Matrix

 

Prior degrees of freedom

This value is the additional degrees of freedom for the prior covariance matrix (not including the number of parameters to be estimated), and can be set from 2 to 100000.  The higher the value, the greater the influence of the prior variance, and the more data are needed to change that prior.  The scaling for degrees of freedom is relative to the sample size.  If you use 50 and you only have 100 subjects, then the prior will have a big impact on the results.  If you have 1,000 subjects, you will get about the same result whether you use a prior of 5 or 50.  As an example of an extreme case, with 100 respondents, a prior variance of 0.1, and prior degrees of freedom set to the number of parameters estimated plus 50, each respondent's resulting part worths will vary relatively little from the population means.  We urge users to be careful when setting the prior degrees of freedom, as large values (relative to sample size) can make the prior exert considerable influence on the results.

 

Prior variance

The default prior variance is 1 per parameter for CBC/HB and ACBC/HB, but users can modify this value.  You can specify any value from 0.1 to 999.  Increasing the prior variance tends to place more weight on fitting each individual's data and less emphasis on "borrowing" information from the population parameters.  The resulting posterior estimates are relatively insensitive to the prior variance, except when 1) there is very little information available within the unit of analysis relative to the number of estimated parameters, and 2) the prior degrees of freedom for the covariance matrix (described above) are relatively large.

 

Use custom prior covariance matrix

HB uses a prior covariance matrix that works well for standard CBC and ACBC studies.  Some advanced users may wish to specify their own prior covariance matrix.  Check this box and click the chevron icon to make the prior covariance matrix visible.  The number of parameters can be adjusted by using the up and down arrows on the Parameters field, or you may type a number in the field.  The number of parameters needs to be the same as the number of parameters to be estimated.  Values for the matrix may be typed in or pasted from another application such as Excel.  The user-specified prior covariance matrix overrides the default prior covariance matrix as well as the prior variance setting.

 


Alpha Matrix

 

Most users will not change the default alpha matrix.  Advanced users may specify new values for alpha using this dialog.

 

Covariates are a new feature with our latest versions of HB.  More detail on the usefulness of covariates in HB is provided in the white paper, "Application of Covariates within Sawtooth Software's CBC/HB Program: Theory and Practical Example," available for download from our Technical Papers library at www.sawtoothsoftware.com.

 

Use default prior alpha

Selecting this option will use a default alpha matrix with prior means of zero and prior variances of 100.  No demographic variables will be used as covariates.

 

Use a custom prior alpha

Users can specify their own prior means and variances to be used in the alpha matrix.  The means and variances are expanded by clicking the chevron icon.

 

The number of parameters for the means and variances can be adjusted by using the up and down arrows of the Parameters field, or you may type a number in the field.  The number of parameters needs to be the same as the number of parameters to be estimated (k-1 levels per attribute, prior to utility expansion).  Values for the matrix may be typed or pasted from another application such as Excel.

 

Use Covariates

 

HB allows demographic variables to be used as covariates during estimation.  The available covariates can be expanded by clicking the chevron icon.  If you have just merged the variable to be used as a covariate and it doesn't appear on the list, click the Refresh list link.

 

Individual variables can be selected for use by clicking the 'Include' checkbox.  The labels provided are for the benefit of the user and are not used in estimation.  Each covariate can be either Categorical or Continuous.

 

Categorical covariates such as gender or region are denoted by distinct values (1, 2, etc.) in the demographic file.  If a covariate is categorical, the number of categories is requested (e.g., gender would have two categories: male and female).  The number of categories is necessary because the categories are expanded using dummy coding for estimation.
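
For example, a region covariate coded 1, 2, 3 in the demographic file would be expanded into indicator columns.  The sketch below is illustrative only and assumes the last category serves as the reference; the software's internal choice of reference category may differ.

    # Illustrative dummy expansion of a categorical covariate with 3 categories
    # (assumes, for illustration, that the last category is the reference).
    def dummy_expand(value, n_categories):
        return [1 if value == c else 0 for c in range(1, n_categories)]

    for region in (1, 2, 3):
        print(region, dummy_expand(region, 3))
    # 1 -> [1, 0]
    # 2 -> [0, 1]
    # 3 -> [0, 0]   (reference category)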

 

Continuous covariates are not expanded and are used as-is during estimation.  We recommend zero-centering continuous covariates for ease of interpreting the output.

 

You may wish to open the ExerciseName_hb_alpha.csv file to interpret the estimated alpha parameters associated with the covariates.  For each utility parameter, an intercept and the effects for the covariates are reported, per iteration (see the labels in this file to help interpret the data).  You should ignore the initial iterations prior to convergence (typically the first 20,000) and focus your analysis on the draws of alpha after convergence.
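
A small sketch of that kind of post-processing appears below.  It assumes the layout described above (one row per saved iteration, with the iteration number in the first column); treat the exact column names as assumptions and check them against the labels in your own file.

    import pandas as pd

    # Illustrative post-processing of the alpha draws file described above.
    # Assumes one row per saved iteration with the iteration number in the
    # first column; verify against the labels in your own file.
    alpha = pd.read_csv("ExerciseName_hb_alpha.csv")

    burn_in = 20_000                              # discard draws prior to assumed convergence
    iteration_col = alpha.columns[0]              # assumed to hold the iteration number
    converged = alpha[alpha[iteration_col] > burn_in]

    # Posterior means of the intercepts and covariate effects after convergence
    print(converged.drop(columns=[iteration_col]).mean())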

 


Advanced Output Options

 

By default, Lighthouse Studio writes out only the HB output files that most users need.  In this dialog, select any additional files that you want the software to write.

 

The studyname_alpha.csv file contains the estimated population mean for part worths.  There is one row for each recorded iteration.  The average part worths are "expanded" to include the final levels of each categorical attribute which were temporarily deleted during estimation.
 
The studyname_meanbeta.csv file is only created if you have specified constraints.  There is one row for each recorded iteration.  The average part worths are "expanded" to include the final levels of each categorical attribute which were temporarily deleted during estimation.
 
The studyname_covariances.csv file contains the estimated variance-covariance matrix for the distribution of part worths across respondents, for each saved iteration. Only the elements on-or-above the diagonal are saved.
 
The studyname_utilities.csv file contains point-estimates of part worths or other parameters for each respondent, along with variable labels on the first row.
 
The studyname_draws.csv file contains estimated parameters for each individual saved from each iteration, if you specified that it should be created.  Values in the studyname.hbu file are obtained by averaging the draws found in the studyname_draws.csv file.  This file is formatted like the studyname_utilities.csv file except that it also includes a column for the draw.  It can be very large because it contains not just one record per respondent, but as many as you decided to save - perhaps thousands.
 
The studyname.hbu file contains point-estimates of part worths or other parameters for each respondent.  (Detailed file format shown further below.)
 
The studyname_priorcovariances.csv file contains the prior covariance matrix used in the computation.
 
The studyname_stddev.csv file is only created if you elect not to save random draws.  In that case, it contains the within-respondent standard deviations among random draws.  There is one record for each respondent, consisting of respondent number, followed by the standard deviation for each parameter estimated for that respondent.
