# A Clinician-Educator's Roadmap to Choosing and Interpreting Statistical Tests

^{1}Department of Internal Medicine, Yale University School of Medicine, New Haven, CT, USA

^{2}Department of Biostatistics, The Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

**This article has been corrected.** See J Gen Intern Med. 2006 September; 21(9): 1009.

## Abstract

As educators seek confirmation of successful trainee achievement, medical education must move toward a more evidence-based approach to teaching and evaluation. Although medical training often provides physicians with a general background in biostatistics, many are not prepared to apply these skills. This can hinder clinician educators as they wish to develop, analyze and disseminate their scholarly work. This paper is intended to be a concise educational tool and guide for choosing and interpreting statistical tests aimed toward medical education assessment. It includes guidelines and examples that clinician-educators can use when analyzing and interpreting studies and when writing methods and results sections of reports.

**Keywords:** medical education, educational research, statistics, faculty development

As accreditation bodies seek confirmation of successful trainee achievement, ^{1}^{, }^{2} medical education must move toward a more evidence-based approach to teaching and evaluation. ^{3}^{, }^{4} To meet these challenges, educators must have knowledge and skills in developing, analyzing and disseminating educational interventions as part of their scholarly work. Effective development and evaluation require a fundamental knowledge of study design and statistical methods. Although medical training often provides physicians with a general background in epidemiology and biostatistics, many physicians are not prepared to apply these skills. ^{5}^{– }^{7}

While an effort has been made to help educators apply epidemiology to educational research, ^{8} we found no references that help educators understand how to use statistical tests to evaluate educational interventions. This paper is intended to be a concise educational tool and guide for choosing a statistical test during medical education assessment and for interpreting and analyzing educational studies without relying on mathematical theory. To provide a framework for understanding statistical concepts and to illustrate the decision-making process needed to choose a statistical test, we present an educational intervention detailing the hypothesis testing, data analysis, and interpretation of the results. Examples of statistical tests recently used in the educational literature are provided in Appendix 1, and statistical terms appearing in boldface are defined in Appendix 2.

## BACKGROUND CONSIDERATIONS

Before determining which statistical test to use, one must consider study hypotheses, study design, number of study groups, whether groups are matched or paired for certain characteristics, type of outcome data, and how data are distributed in the sample. A checklist of questions addressing these areas is provided in Table 1. First, we present a sample educational intervention to illustrate the statistical concepts presented later in the text.

### Hypothetical Example: Study Design and Methods

We developed a 1-month curriculum to improve second-year medical students' physical examination skills, interpersonal skills and confidence level. We conducted a randomized controlled trial in which half of the class received the new curriculum and the other half served as controls. We collected information regarding student age, gender, and college major. We evaluated all students' physical examination and interpersonal skills using a standardized patient exam 1 week after the intervention (Note: for simplicity, we will consider only one station of a standardized patient exam). We assessed the number of relevant physical examination maneuvers performed correctly by each student (total of 6 maneuvers), a 20-item interpersonal score rated by the standardized patient, and whether the patient would recommend the student to a friend. Each interpersonal item was rated on a 5-point Likert scale (1 = poor, 5 = excellent). We assessed each intervention student's confidence level in performing physical examination techniques before and after the curriculum using a 4-point Likert scale (1 = not very confident, 4 = very confident).

We used a Student's *t*-test to compare the mean number of physical examination maneuvers performed correctly and the Wilcoxon rank-sum test to compare overall interpersonal scores between groups. We used the Wilcoxon signed-rank test to compare intervention students' confidence level before and after the curriculum. To assess the relationship between student characteristics and the likelihood of being recommended to a friend, we performed simple logistic regression.

With a sample size of 60 students in each group, the study had 80% power to detect a difference of 1.2 maneuvers between the intervention and control groups in the mean number of relevant physical examination maneuvers performed correctly.
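
The power statement above can be checked with a short calculation. The following is a minimal sketch using the normal approximation for a two-sided comparison of two means. The article does not report the standard deviation assumed at the planning stage, so the value `sd=2.3` is an assumption chosen purely for illustration, and `two_sample_power` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(diff, sd, n_per_group, alpha=0.05):
    """Approximate power to detect `diff` between two group means.

    Normal approximation, two-sided test, equal group sizes and equal SDs.
    The tiny chance of rejecting in the wrong direction is ignored.
    """
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    return z.cdf(diff / se - z_crit)


# sd=2.3 is an ASSUMED planning value, not a figure taken from the article.
power = two_sample_power(diff=1.2, sd=2.3, n_per_group=60)
print(f"approximate power: {power:.2f}")
```

Under this assumed standard deviation the approximate power comes out close to 80%. A smaller standard deviation or a larger sample would give higher power for the same detectable difference.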

### Statistical Overview

Statistics is the scientific use of data to describe and draw inferences about true associations or phenomena by assessing the strength of the evidence for or against a hypothesis. It is used to make predictions and comparisons about a larger population based on data collected from a smaller sample. Since we usually cannot test an entire population (e.g., all second-year medical students), we must rely on sample data to guide our understanding of the truth. How well our sample represents the larger population determines how **generalizable** our findings are.

Data collected in any study are subject to variation. Some variation comes from random error and some from systematic error (bias). **Bias** can be introduced in any stage of the study from its development to reporting of the results. ^{9} The goals of any study should include decreasing bias and minimizing error.

### Variable Types

Studies generally have 2 variable types: the response variable (also called the outcome or **dependent variable**) and the explanatory variable (also called a **covariate** or **independent variable**). These variables can be quantitative or qualitative in nature. **Quantitative variables** are numerical and can be continuous or discrete. **Continuous variables** have no gaps in the values (e.g., age), whereas **discrete variables** have gaps (e.g., the number of study participants). **Qualitative variables** describe certain attributes and are either ordinal or nominal. **Ordinal variables** have an implicit ranking associated with them (e.g., Likert scales), whereas **nominal variables** are descriptive and cannot be ordered (e.g., college major). The types of dependent and independent variables used to make comparisons influence what statistical tests are needed.

### Study Design

The appropriate use of statistics depends upon the research question(s) being asked. These questions and study hypotheses influence the study design and should be determined before conducting a study. Two types of study designs are commonly used in research: observational and experimental. **Observational studies** examine groups at one or more points in time (e.g., case-control, cross-sectional, and cohort studies). **Experimental studies**, or controlled trials, allocate participants to one or more groups and make comparisons across groups to assess differences in outcomes. Our study was a randomized controlled trial. Random allocation involves chance in the assignment of participants to intervention and control groups. This avoids a potential bias called selection bias that may be present if group assignment is known, as is often the case in observational studies. Selection bias can produce comparison groups that are different from each other from the study onset. This can limit the interpretation and generalizability of the study results.

The study design and the type of comparison group influence the statistical analyses performed. If the study uses a pre-post design, each participant is assessed by the same instrument at different points in time. The results obtained for each individual during different measurements are more likely to be highly correlated than the results of 2 randomly selected participants. Statistical analyses in this case should be performed using **paired methods** such that each participant serves as his/her own comparison. Our study requires the use of paired methods to assess differences in student confidence level before and after the intervention.

### Exploratory Data Analysis (Descriptive Statistics)

The first step in any analysis is to explore the data collected to ensure that they are reasonable, accurate and not affected by measurement or recording errors. Exploratory data analysis, or **descriptive statistics**, is a method of organizing, summarizing and displaying data. It includes calculating measures of central tendency (e.g., **mean** and **median**) along with measures of dispersion (e.g., **standard deviation** and **interquartile range**). Graphically displaying the data in histograms, **stem-and-leaf plots** or **box-and-whisker plots** will also aid in assessing patterns of dispersion and can identify potential outlying values that may influence study results. Understanding the type of data collected and how it is dispersed helps determine which types of statistical analyses can be performed.
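
As an illustration of these summary measures, here is a minimal sketch using Python's standard `statistics` module. The scores are hypothetical values invented for the example (number of maneuvers performed correctly by 10 students), and `describe` is an illustrative helper name.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Return basic measures of central tendency and dispersion."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),            # sample standard deviation
        "iqr": (q1, q3),                # interquartile range as (Q1, Q3)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical data: maneuvers performed correctly (0 to 6) by 10 students.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 3]
for name, value in describe(scores).items():
    print(name, value)
```

Comparing the mean with the median, and the minimum and maximum with the quartiles, gives a quick first check for skew and outlying values before any plot is drawn.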

### Confirmatory Data Analysis (Inferential Statistics)

Confirmatory data analysis, or **inferential statistics**, uses estimation and hypothesis testing to assess the strength of the evidence, make comparisons, make predictions and draw conclusions about a population based on the sample data. Types of inferential statistics include **bivariate analyses** that investigate relationships between 1 dependent and 1 independent variable, and **multivariable analyses** that investigate relationships between 1 dependent and multiple independent variables while controlling for the possible **confounding** influence of several independent variables on the dependent variable. In our example, we use bivariate analyses to compare interpersonal scores between groups and to quantify the association of each student characteristic with the patient's recommendation; a multivariable analysis would model all of the student characteristics together.

The results of inferential statistics are reported according to the type of data collected and the statistical test or method used to determine the result (e.g., mean number of physical examination maneuvers performed correctly in each group using a Student's t-test). Results are also described by a level of **statistical significance** expressed as a **P-value** or estimated with a confidence interval (CI).

### Hypothesis Testing

In hypothesis testing, the **null hypothesis** is a statement of no effect or no association. The null hypothesis regarding our main study goal would be: Participants and controls do not differ in the mean number of relevant physical examination maneuvers performed correctly at the end of the curriculum. The alternative hypothesis is that there is a difference.

Two types of errors can occur when making conclusions regarding the null hypothesis: **Type I error** and **Type II error**. A Type I error refers to rejecting the null hypothesis when the null hypothesis is true (false positive). A Type II error refers to failing to reject the null hypothesis when it is false (false negative). The goal is to limit the probability of making a Type I error. Most studies set this probability, known as the significance level, at .05. In statistical tests, *P*-values are calculated as the probability of obtaining an outcome as extreme or more extreme than the observed study result under the assumption that the null hypothesis is true. If the *P*-value is less than the significance level, the result is considered statistically significant (e.g., *P* <.05). When statistical significance is not observed, either the null hypothesis is true (i.e., no difference really exists) or the sample size was not large enough to detect a difference (i.e., insufficient statistical **power**). The relationship between sample size, **effect size**, and statistical power is important to consider and is described elsewhere. ^{10}^{, }^{11}
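
The definition of a *P*-value can be made concrete with an exact permutation test. This is not one of the tests used in the article; it is swapped in here only because it computes "as extreme or more extreme under the null hypothesis" by direct counting. The data are hypothetical, and `permutation_p_value` is an illustrative helper name.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation P-value for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so every
    way of splitting the pooled data into two groups is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    total = 0
    as_extreme = 0
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Hypothetical scores for 4 intervention and 4 control students.
intervention = [5, 4, 6, 5]
control = [3, 2, 4, 3]
print(permutation_p_value(intervention, control))
```

With groups this small there are only 70 possible splits, which is why the *P*-value cannot become very small even though the groups barely overlap. This illustrates the link between sample size and power described above.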

Although *P*-values are used ubiquitously in the literature, they have several limitations. *P*-values do not indicate the strength or direction of the association, nor do they provide a direct interpretation of the results. For this reason, a 95% confidence interval (CI) associated with the result should be used when possible. A 95% CI is constructed so that, if the study were repeated many times, about 95% of the resulting intervals would contain the true value. The true value refers to the outcome that we would expect if we could test the entire population. In our example, we wanted to determine whether there was a difference in the mean number of relevant physical examination maneuvers performed correctly between groups. The 95% CI for the true difference in mean scores was 0.85 to 1.7, suggesting that the true difference lies approximately in the range of 1 to 2 maneuvers. Studies with larger sample sizes and less variation will have narrower CIs, indicating more precision in the results. Those with smaller sample sizes and higher variation will have wider CIs, indicating less precision.

Before conducting a study, one should decide what will count as statistically significant and what will count as clinically (practically) significant. To do this, one needs to define the magnitude of detectable difference that would provide a meaningful change in outcome. In some studies, statistical significance may be reached due to large sample size, but the practical significance of the outcome may not be noteworthy. Conversely, statistical significance may not be reached due to small sample size, but the outcome may be clinically relevant. In our example, we wished to see if the intervention improved the average number of physical exam maneuvers performed correctly by students. We needed to ascertain in advance, either from other research or practical experience, the increase in average number of exam maneuvers that would constitute a meaningful change in results, and establish a sample size that would allow statistical detection of this change.

### Data Distribution

The distribution of data assessed during exploratory data analysis helps determine whether **parametric** or **nonparametric tests** should be used to make comparisons. Parametric tests are based upon the assumption that the data are sampled from a known population distribution (Note: we will consider only the **normal (bell-shaped) distribution** for continuous outcome data and the **binomial distribution** for dichotomous outcomes). If continuous outcome data in a sample are **skewed** toward either higher or lower values, or if the sample size is small, nonparametric tests should be used. Ordinal variables are usually analyzed using nonparametric tests; however, parametric tests can be used when values of separate variables are summed together to produce a total score which follows a normal distribution (e.g., summing each student's 20-item interpersonal ratings to obtain an overall score). Nonparametric tests use ranked observations rather than the actual values and do not assume that the shape of the distribution is known. ^{12} These tests are more conservative, but are important to use when parametric considerations do not hold.

## SELECTING THE APPROPRIATE STATISTICAL TEST

We will use the steps outlined in Table 1 and the diagrams in Appendix 1 to illustrate how to select the appropriate statistical test for each of the 4 study hypotheses.
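
The decision process can be sketched as a small lookup function. This is a simplified illustration that mirrors only the four worked hypotheses in this paper, not the full diagrams in Appendix 1. The function name `choose_test` and its parameters are hypothetical, and the paired *t*-test branch is the standard parametric counterpart rather than one of the paper's worked examples.

```python
def choose_test(outcome, paired=False, parametric_ok=False):
    """Suggest a test for a 2-group or pre-post comparison.

    outcome       -- "continuous", "ordinal", or "dichotomous"
    paired        -- True for pre-post or matched measurements
    parametric_ok -- True if the outcome appears normally distributed
                     (including summed ordinal scores that do)
    """
    if outcome == "dichotomous":
        # Hypothesis 4: association of covariates with a yes/no outcome.
        return "logistic regression"
    if outcome not in ("continuous", "ordinal"):
        raise ValueError(f"unknown outcome type: {outcome!r}")
    if paired:
        # Hypothesis 3 uses the nonparametric branch. The paired t-test is
        # the usual parametric counterpart, added here as an assumption.
        return "paired t-test" if parametric_ok else "Wilcoxon signed-rank test"
    # Hypotheses 1 (parametric) and 2 (nonparametric).
    return "Student's t-test" if parametric_ok else "Wilcoxon rank-sum test"


print(choose_test("continuous", parametric_ok=True))  # Hypothesis 1
print(choose_test("continuous"))                      # Hypothesis 2
print(choose_test("ordinal", paired=True))            # Hypothesis 3
print(choose_test("dichotomous"))                     # Hypothesis 4
```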

### Hypothesis 1

Participants and controls do not differ in the mean number of relevant physical examination maneuvers performed correctly at the end of the curriculum.

- *Study design and study question*: Randomized controlled trial comparing 2 unpaired groups (intervention and control students) (Appendix 1b).
- *Outcome variable*: The number of relevant physical exam maneuvers performed correctly is handled as a continuous variable for analysis purposes.
- *Distribution of the outcome variable*: The distribution of the number of physical exam maneuvers for each group plotted on a histogram appeared normally distributed, suggesting a parametric test should be used.
- *Statistical test*: Student's *t*-test.
- *Results*: The mean number (standard deviation) of relevant physical examination maneuvers performed correctly by the intervention group was 4.4 (1.1) compared with 3.1 (1.1) for the control group, *P* <.0001, 95% CI for the true difference in means (0.85 to 1.7).
- *Interpretation*: Our *P*-value suggests a highly statistically significant difference, a difference that is unlikely due to chance alone, in mean number of physical examination maneuvers performed between groups. The 95% CI for the true difference in means also indicates a significant difference as it does not include the value of 0 (which would suggest that each group performed similarly). Thus, we reject the null hypothesis and conclude that the intervention students scored higher than the controls.
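
The Hypothesis 1 result can be roughly reproduced from the reported summary statistics alone. The sketch below computes the equal-variance *t* statistic and an approximate 95% CI from the rounded means and standard deviations, using 60 students per group as stated in the methods. It uses a normal critical value in place of the *t* critical value, which is an approximation that matters little with 118 degrees of freedom. `diff_in_means_summary` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_means_summary(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Equal-variance t statistic and approximate CI from summary data."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))  # SE of the difference
    diff = mean1 - mean2
    t_stat = diff / se
    # Normal critical value used as an approximation to the t critical value.
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, t_stat, (diff - crit * se, diff + crit * se)


# Rounded summary values as reported for the intervention and control groups.
diff, t_stat, (lower, upper) = diff_in_means_summary(4.4, 1.1, 60, 3.1, 1.1, 60)
print(f"difference = {diff:.2f}, t = {t_stat:.2f}")
print(f"approximate 95% CI: {lower:.2f} to {upper:.2f}")
```

Working from rounded summaries gives an interval of roughly 0.9 to 1.7, close to but not identical to the reported 0.85 to 1.7. A small discrepancy of this kind is expected when the inputs have been rounded. Either way, the interval excludes 0, which matches the interpretation above.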

### Hypothesis 2

Participants and controls do not differ in their overall interpersonal scores at the end of the curriculum.

- *Study design and study question*: Randomized controlled trial comparing 2 unpaired groups (intervention and control students) (Appendix 1b).
- *Outcome variable*: The overall interpersonal score is the sum of the 20-item interpersonal scores rated on a 5-point Likert scale. This score is continuous, ranging from 20 to 100.
- *Distribution of the outcome variable*: Although the outcome is continuous, the distribution of the scores plotted on a histogram appeared skewed toward higher values, suggesting that a nonparametric test should be used and that the median rather than the mean should serve as the summary measure.
- *Statistical test*: Wilcoxon rank-sum test.
- *Results*: The median (interquartile range, IQR) interpersonal score for the intervention students was 78 (IQR 66 to 94) compared with 73 (IQR 66 to 84) for the control students, *P* =.07. (The *P*-value in this case refers to the test of the difference in the distribution of ranked scores as assessed by the Wilcoxon rank-sum test and not the direct comparison of median scores. There is no analog of the 95% CI for this test).
- *Interpretation*: The *P*-value is not statistically significant and the interquartile ranges overlap. Thus, we cannot reject the null hypothesis; our data do not provide sufficient evidence that the curriculum improved interpersonal skills.
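
To show what "ranked observations" means in practice, here is a minimal sketch of the Wilcoxon rank-sum test using the large-sample normal approximation with a correction for ties. The scores are hypothetical, the helper names are illustrative, and for samples this small a statistical package's exact method would normally be preferred.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation)."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b, n = len(group_a), len(group_b), len(pooled)
    ranks = average_ranks(pooled)
    w = sum(ranks[:n_a])                # rank sum for group A
    mean_w = n_a * (n + 1) / 2          # expected rank sum under the null
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_w = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p


# Hypothetical overall interpersonal scores (possible range 20 to 100).
intervention = [78, 94, 66, 85, 90, 72]
control = [73, 66, 84, 60, 70, 65]
w, z, p = rank_sum_test(intervention, control)
print(f"W = {w}, z = {z:.2f}, P = {p:.3f}")
```

Because the test works on ranks, an extreme score of 94 counts no more than it would at 91. This is why the test is robust to the skew seen in the histogram.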

### Hypothesis 3

Participants' confidence level in performing physical examination maneuvers does not differ before and after the curriculum.

- *Study design and study question*: Pre-post design comparing 1 paired group (intervention students before and after the curriculum) (Appendix 1c).
- *Outcome variable*: The confidence level is measured on a 4-point Likert scale and is an ordinal variable.
- *Distribution of the outcome variable*: The distribution of the confidence level plotted on a histogram is nonnormally distributed, suggesting a nonparametric test should be used.
- *Statistical test*: Wilcoxon signed-rank test.
- *Results*: The median number (IQR) for the confidence level of students before the intervention was 2 (2 to 3) compared with 3.5 (3 to 4) after the course, *P* <.0001.
- *Interpretation*: The *P*-value suggests a statistically significant difference between pre- and postintervention ratings. The IQRs show minimal overlap between the two scores, which also supports a statistically significant difference. Thus, we reject the null hypothesis and conclude that the intervention was successful at improving students' confidence.
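
The paired analysis can be sketched in the same spirit. The code below implements the Wilcoxon signed-rank test with the normal approximation and a tie correction, dropping students whose rating did not change. The ratings are hypothetical, `signed_rank_test` is an illustrative helper name, and the sketch assumes at least one student's rating changed.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_test(before, after):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Pairs with no change are dropped, and tied absolute differences share
    their average rank.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs_sorted[j + 1] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + j) / 2 + 1  # average 1-based rank
        t = j - i + 1                             # size of this tie group
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, z, p


# Hypothetical confidence ratings (1 to 4) for 10 students, before and after.
before = [2, 2, 3, 2, 1, 3, 2, 2, 3, 2]
after = [3, 4, 4, 3, 3, 4, 4, 3, 3, 3]
w_plus, z, p = signed_rank_test(before, after)
print(f"W+ = {w_plus}, z = {z:.2f}, P = {p:.4f}")
```

Each student serves as his or her own comparison: only the within-student changes enter the calculation, which is what distinguishes this from the unpaired rank-sum test.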

### Hypothesis 4

No association exists between a student's age, gender, and college major with the patient's recommendation of the student to a friend.

- *Study design and study question*: Randomized controlled trial quantifying the association of 3 independent variables with the outcome variable (patient's recommendation) (Appendix 1g).
- *Outcome variable*: The recommendation is dichotomous (yes or no).
- *Distribution of the outcome variable*: The distribution of the outcome variable is binomial.
- *Statistical test*: A simple logistic regression was used to test the hypothesis of no association between each individual covariate and recommendation. A more advanced analysis would extend this to a multiple logistic regression where potential **confounding variables** could be controlled for in the analysis.
- *Results*: For each increase in 1 year of age, the odds are reduced by 1% that the student will be recommended to a friend (odds ratio [OR] =0.99; 95% CI, 0.83 to 1.19), *P* =.99. Compared with males, females have a 25% decrease in the odds of being recommended (OR =0.75; 95% CI, 0.33 to 1.69), *P* =.49. Compared with science majors, nonscience majors have a 23% decrease in the odds of being recommended (OR =0.77; 95% CI, 0.28 to 2.11), *P* =.61.
- *Interpretation*: For each of the hypotheses, there was no statistically significant association between the covariate and the outcome, as shown by the large *P*-values and the 95% CIs that include the value 1. Thus, we cannot reject any of the null hypotheses of no association between each student characteristic and the likelihood of recommendation by the standardized patient. This may be due to insufficient statistical power in our study.
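
For a single two-level covariate such as gender, the odds ratio from a simple logistic regression coincides with the cross-product ratio of the 2x2 table, so the calculation can be sketched without fitting a model. The counts below are hypothetical and are not the study's data. The helper names are illustrative, and the CI uses the log-based (Woolf) method.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and log-based CI for a 2x2 table.

    a, b -- comparison group with / without the outcome
    c, d -- reference group with / without the outcome
    """
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - crit * se_log)
    upper = exp(log(odds_ratio) + crit * se_log)
    return odds_ratio, (lower, upper)


def percent_change_in_odds(odds_ratio):
    """Express an odds ratio as a percent change in the odds."""
    return round((odds_ratio - 1) * 100, 1)


# Hypothetical counts: 30 of 60 females and 34 of 60 males recommended.
or_, (lower, upper) = odds_ratio_ci(a=30, b=30, c=34, d=26)
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print(percent_change_in_odds(0.75))  # the reported OR of 0.75 as a percent change
```

An interval that spans 1, as in this hypothetical table and in all three reported results, is compatible with both lower and higher odds, so no association can be claimed.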

## FINAL CONSIDERATIONS

This paper illustrates the decision-making processes clinician-educators can use to select statistical tests for interventions with 2-group comparisons. Examples of comparisons between 3 or more groups, correlations, and different regression analyses can be found in Appendix 1. Other tests or analyses may be needed depending on the research question of interest. Studies using observer ratings should be analyzed for **interrater** and/or **intrarater reliability** to assess consistency of results. When multiple comparisons will be performed, researchers may need to adjust the significance level to a smaller value (e.g., *P* =.001) to decrease the probability of finding a statistically significant result by chance alone. When performing **regression analyses**, certain assumptions must be checked to assess whether a specific regression model is appropriate and whether the potential for **confounding** and **effect modification** by certain covariates should be considered. ^{13}
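
One common way to adjust the significance level for multiple comparisons is the Bonferroni correction, which divides the overall level by the number of tests. The paper does not name a specific method, so this choice is an assumption made for illustration. The *P*-values below are the six reported in the worked hypotheses, with .0001 standing in for the two results reported as *P* <.0001.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold and a significance flag per test."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Stand-in values for the six reported tests (Hypotheses 1 to 4).
p_values = [0.0001, 0.07, 0.0001, 0.99, 0.49, 0.61]
threshold, flags = bonferroni(p_values)
print(f"adjusted significance level: {threshold:.4f}")
for p, significant in zip(p_values, flags):
    print(p, "significant" if significant else "not significant")
```

The correction is conservative, so with many comparisons it can noticeably reduce power. This is one more reason to limit the number of hypotheses tested and to fix them before the study begins.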

With this guide, we hope to provide educators with a tool for improving the quality of medical education research conducted and presented in the literature. To obtain appropriate advice for both statistical design and analyses, we suggest the consultation of a statistician early in a study. Other resources such as textbooks and references for clinical research ^{10}^{, }^{11} may be needed to address areas not covered in this paper.

## Supplementary Material

Appendix A (130K, doc)

Appendix B (42K, doc)


Journal of General Internal Medicine. Jun 2006; 21(6): 656.
