Model building strategies

Model building aims to select the variables that result in the best model to explain the observed data. It relies on statistical methods, experience and common sense. The epidemiologist, not the software package, is responsible for the analysis and the model-building process.

The most frequent approach to model building is to aim for the smallest model (in number of variables) that still explains the data. The smallest model is chosen because it is also the most stable. A further objective is to provide the best possible control of confounding within the data set.

The selection of variables should start with a careful univariate analysis of each variable. This involves deciding whether the variable is best described as dichotomous, polytomous or continuous, and verifying linearity assumptions. It also involves carrying out, before the logistic regression analysis, a careful stratified analysis by means of 2xn contingency tables. This provides a unique way to look at the data (what is in each cell of the 2x2 tables, including zeros).
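
As a minimal sketch of the stratified look at the data described above, the following Python snippet builds one 2x2 table per stratum and lists any empty cells. The record layout and the variable names (`age_group`, `ate_salad`, `ill`) are hypothetical and serve only as illustration.

```python
from collections import Counter

def stratified_tables(records, exposure, outcome, stratum):
    """Return {stratum level: [[a, b], [c, d]]} where
    a = exposed cases,   b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases."""
    counts = Counter((r[stratum], r[exposure], r[outcome]) for r in records)
    tables = {}
    for level in sorted({r[stratum] for r in records}):
        tables[level] = [
            [counts[(level, 1, 1)], counts[(level, 1, 0)]],
            [counts[(level, 0, 1)], counts[(level, 0, 0)]],
        ]
    return tables

def zero_cells(table):
    """List the (row, column) positions of empty cells in a 2x2 table."""
    return [(i, j) for i in range(2) for j in range(2) if table[i][j] == 0]

def make(level, exposed, ill, n):
    """Create n identical hypothetical records (1 = yes, 0 = no)."""
    return [{"age_group": level, "ate_salad": exposed, "ill": ill}
            for _ in range(n)]

# Invented data for illustration only
records = (
    make("child", 1, 1, 2) + make("child", 1, 0, 1)
    + make("child", 0, 1, 1) + make("child", 0, 0, 3)
    + make("adult", 1, 1, 3)  # no exposed non-cases among adults
    + make("adult", 0, 1, 1) + make("adult", 0, 0, 2)
)

tables = stratified_tables(records, "ate_salad", "ill", "age_group")
for level, table in tables.items():
    print(level, table, "empty cells:", zero_cells(table))
```

Empty cells found this way are worth noting before any regression is fitted, since they often explain unstable or inestimable coefficients later on.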

Once the univariate analysis is completed, we select all variables for which the statistical test gives a p-value below a predefined cut-off level. A cut-off of p < 0.25 is often used. We should also include all variables we believe are important biologically or for public health. According to the literature, a more conservative, traditional level (p < 0.05) does not always identify all variables known to be important. One should also remember that a group of variables that are not individually important may play a collective role in the model (confounding).
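
The screening step can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: it assumes dichotomous exposures summarised as 2x2 tables, uses the Pearson chi-square test without continuity correction as the univariate test, and takes a `forced` list for variables kept on biological or public health grounds. All variable names and counts are invented.

```python
import math

def chi2_pvalue_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 df, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # no variation in exposure or outcome: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

def screen(tables, cutoff=0.25, forced=()):
    """Keep variables with p < cutoff, plus any variable listed in `forced`."""
    selected = []
    for name, (a, b, c, d) in tables.items():
        if chi2_pvalue_2x2(a, b, c, d) < cutoff or name in forced:
            selected.append(name)
    return selected

# Hypothetical counts: (exposed cases, exposed non-cases,
#                       unexposed cases, unexposed non-cases)
tables = {
    "ate_salad": (30, 10, 10, 30),
    "drank_tap_water": (22, 18, 18, 22),
    "ate_chicken": (24, 16, 16, 24),
    "age_over_60": (20, 20, 20, 20),
}

liberal = screen(tables, cutoff=0.25, forced=("age_over_60",))
strict = screen(tables, cutoff=0.05)
print("p < 0.25 plus forced variables:", liberal)
print("p < 0.05 only:", strict)
```

In this invented example `ate_chicken` passes the 0.25 cut-off but would be lost at 0.05, which is the situation the more liberal cut-off is meant to guard against.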

Several methods can be used to arrive at the best-fitting model. They include:

  • forward or backward step-by-step approach monitored by the analyst,
  • stepwise forward or backward (the software uses a precise algorithm to add or drop variables),
  • the best subset method.
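
To make the backward approach concrete, here is a sketch of its control flow only. The model fitting itself is left to a callback, `fit_pvalues`, which is a hypothetical function the analyst would supply (for example a wrapper around a logistic regression routine). The removal threshold of 0.05 and the `keep` list of variables retained on biological grounds are assumptions, and `toy_fit` returns invented p-values.

```python
def backward_elimination(candidates, fit_pvalues, threshold=0.05, keep=()):
    """Drop, one at a time, the variable with the largest p-value until
    every droppable variable has p <= threshold.

    fit_pvalues(variables) must refit the model with exactly those
    variables and return {variable: p_value}.
    Variables listed in `keep` are never dropped.
    """
    current = list(candidates)
    history = []  # (dropped variable, its p-value), in order of removal
    while current:
        pvalues = fit_pvalues(current)
        droppable = [v for v in current if v not in keep]
        if not droppable:
            break
        worst = max(droppable, key=lambda v: pvalues[v])
        if pvalues[worst] <= threshold:
            break  # everything left is judged statistically important
        current.remove(worst)
        history.append((worst, pvalues[worst]))
    return current, history

def toy_fit(variables):
    """Stand-in for a real model fit; all p-values are invented."""
    base = {"ate_salad": 0.001, "ate_chicken": 0.08,
            "drank_tap_water": 0.40, "age_over_60": 0.30}
    pvalues = {v: base[v] for v in variables}
    # Pretend that ate_chicken becomes significant after refitting
    # without drank_tap_water, to show why refitting at each step matters.
    if "drank_tap_water" not in variables and "ate_chicken" in variables:
        pvalues["ate_chicken"] = 0.03
    return pvalues

final_model, dropped = backward_elimination(
    ["ate_salad", "ate_chicken", "drank_tap_water", "age_over_60"],
    toy_fit,
    keep=("age_over_60",),
)
print("retained:", final_model)
print("dropped:", dropped)
```

Returning the removal history keeps the process monitored by the analyst: each dropped variable can be reviewed rather than silently discarded.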

Once the best model fit has been achieved, the importance of each variable should be verified by comparing the crude association with the results of the model, including a comparison of confidence intervals and statistical significance. The process of adding, fitting, dropping and refitting continues until all variables in the model are judged either statistically or biologically important.
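
The comparison of crude and adjusted results might be sketched as below. The crude odds ratio and its Woolf (log-based) 95% confidence interval are computed from a 2x2 table; the adjusted odds ratio is a hypothetical value standing in for a model output, and the 10% change-in-estimate rule of thumb is an assumption added here, not a criterion stated in the text.

```python
import math

def crude_odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval for a 2x2 table.

    a = exposed cases,   b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    if 0 in (a, b, c, d):
        raise ValueError("Woolf interval is undefined when a cell is zero")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def percent_change(crude, adjusted):
    """Relative change of the adjusted estimate with respect to the crude one."""
    return 100 * (adjusted - crude) / crude

# Invented counts and a hypothetical adjusted estimate from a fitted model
crude, lower, upper = crude_odds_ratio(30, 10, 10, 30)
adjusted = 6.0
change = percent_change(crude, adjusted)
notable = abs(change) > 10  # assumed rule of thumb, not from the text

print(f"crude OR {crude:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"adjusted OR {adjusted:.2f}, change {change:.1f}%, notable: {notable}")
```

A large shift between the crude and adjusted estimates suggests that the other variables in the model are acting as confounders and should be retained even if they are not individually significant.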

Once we have a model with all relevant variables, we should consider whether interaction terms should be added. This presupposes that the categories of polytomous variables and the linearity assumptions for continuous variables have already been verified.
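
One common way to judge whether a single interaction term is worth adding is a likelihood ratio test between the model without and with the term. The text does not prescribe this test, so the sketch below is an assumption; it covers only the case of one added parameter (1 degree of freedom), and the log-likelihood values are invented.

```python
import math

def lr_test_one_term(loglik_reduced, loglik_full):
    """Likelihood ratio test for one extra parameter (1 df).

    loglik_reduced: log-likelihood of the model without the interaction term
    loglik_full:    log-likelihood of the model with the interaction term
    Returns (test statistic, p-value).
    """
    statistic = 2 * (loglik_full - loglik_reduced)
    if statistic < 0:
        raise ValueError("the full model cannot fit worse than the reduced one")
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical log-likelihoods reported by a logistic regression routine
statistic, p_value = lr_test_one_term(loglik_reduced=-120.4, loglik_full=-117.9)
print(f"LR statistic {statistic:.2f}, p-value {p_value:.4f}")
```

An interaction between polytomous variables adds several parameters at once, in which case the degrees of freedom change and this one-parameter shortcut no longer applies.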

FEM PAGE CONTRIBUTORS 2007

Editor
Fernando Simon
Original Author
Alain Moren
Contributors
Arnold Bosman
Fernando Simon
