NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

National Research Council (US) Committee on an Assessment of Research Doctorate Programs; Ostriker JP, Kuh CV, Voytuk JA, editors. A Data-Based Assessment of Research-Doctorate Programs in the United States. Washington (DC): National Academies Press (US); 2011.

## A Data-Based Assessment of Research-Doctorate Programs in the United States.

This appendix explains in detail how the various parts of the rating and ranking process for graduate programs fit together and how the process is carried out. Figure J-1 provides a graphical overview of the entire process and forms the basis for this appendix. The appendix addresses each of the boxes in Figure J-1 separately, starting at the top and generally working downward and to the right. The topics in this appendix include:

- a summary of the sources of data used in the rating and ranking process,
- the survey (S)-based weights, the regression (R)-based weights, and the details of the calculations of the endpoints of the 90 percent ranges,
- the simulation of the uncertainty in the weights by random-halves sampling,
- the simulation of the uncertainty in the values of the program variables,
- the combination of the simulated weights for the significant program variables with the simulated standardized values of the program variables to obtain simulated rankings,
- the resulting 90 percent ranges of rankings that are the primary rating and ranking quantities that we report, and
- a description of an alternative ranking methodology that combines measures of interest to the user.

## THE METHOD FOR CALCULATING THE R AND S RANKINGS

### The Three Data Sets

The empirical basis of the NRC ratings and rankings is the three data sets indicated in the three unlabeled boxes at the top of Figure J-1. The first, denoted by **X**, is the collection of faculty *importance measures* that were derived from data that were collected in the faculty questionnaire. The data in **X** are used to derive the *direct or survey-based (S) weights* discussed more extensively below. The second data set, denoted by **P**, is the collection of the values of the 20 *program variables* that were collected from various sources for each program. The data in **P** are used in the final ratings and rankings of the programs and are discussed in greater detail below. The third, denoted by **R**, is the collection of *ratings of programs by faculty raters*. These ratings were made separately from the faculty questionnaire and involved only a sample of programs from each field and only a sample of faculty raters from that field. This sample of faculty ratings plays a crucial role in the derivation of the *regression-based weights*, discussed more extensively below.

### Box (1b) The Direct Weights From the Faculty Questionnaire^{1}

Let us turn first to the *survey (S) or direct weights* in box (1b) in Figure J-1, leaving boxes (1) and (1a) to the later discussion of how the uncertainty in these data was simulated.

The faculty questionnaire asks each graduate-program faculty respondent to indicate how important each of 21 characteristics is to the quality of a program in his or her field of study.^{2} This information is then used to derive the *survey (S) or direct weights* for each surveyed faculty member, as described below.

The original 21 program characteristics listed on the faculty questionnaire are shown in Table J-1, and they were divided into three categories—faculty, student, and program characteristics. Of the original 21, there are 20 for which adequate data were deemed to be available to use in the rating process, and these 20 data values for each program became the 20 *program variables* used in this study to which we repeatedly refer.

Faculty respondents were first asked to indicate up to four characteristics in each category that they thought were “most important” to program quality. Each characteristic that was listed received an initial score of 1 for that faculty respondent. These preferences were then narrowed by asking the faculty members to further identify a maximum of *two* characteristics in each category that they thought were the most important. Each of these selected characteristics received an additional point, resulting in a score of 2. Given this approach, at most 12 of the program characteristics can have a non-zero value for any given faculty member; of these 12, at most 6 will have a score of 2, and the rest will have a score of 1. At least 8 program characteristics will therefore have a score of 0 for each faculty respondent, and more than 8 will be zero if the respondent selected fewer than four “important” or fewer than two “most important” characteristics per category. A final question asked faculty respondents to indicate the *relative importance* of each of the three categories by assigning them values that summed to 100 over the three categories.^{3} For each faculty respondent, his or her *importance measure* for each program characteristic was calculated as the product of the score that it received times the relative importance value assigned to its category. Finally, the 20 importance measures for each faculty respondent were transformed by dividing each one by the sum of his or her importance measures across the 20 program variables.
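As a concrete sketch of this scoring and normalization, the Python function below computes one respondent's importance measures (the function name, the category labels, and the example numbers in the comments are ours, invented for illustration):

```python
import numpy as np

def importance_measures(most_important, top_two, category_weights, categories):
    """One respondent's normalized importance measures.

    most_important : indices checked as "most important" (up to 4 per
                     category); each receives an initial score of 1.
    top_two        : the narrowed choices (at most 2 per category); each
                     receives an additional point, for a score of 2.
    category_weights : category label -> relative importance, summing
                       to 100 over the three categories.
    categories     : category label of each of the 20 characteristics.
    """
    scores = np.zeros(len(categories))
    scores[list(most_important)] = 1.0
    scores[list(top_two)] = 2.0      # narrowed "most important" choices
    # importance measure = score x relative importance of its category
    raw = scores * np.array([category_weights[c] for c in categories])
    return raw / raw.sum()           # the 20 measures sum to 1.0
```

Averaging these vectors over all respondents in a field then yields the survey-based weights.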

We will use the following notation consistently: *i* for a *faculty respondent*, *j* for a *program* in a field, and *k* for one of the 20 *program variables*. Thus, *x _{ik}* denotes the measure of importance placed on program variable *k* by faculty respondent *i*. The values, *x _{ik}*, are nonnegative and, over *k*, sum to 1.0 for each faculty respondent *i*. The *importance measure vector* for faculty respondent *i* is the collection of these 20 values, **x** _{i} = (*x _{i1}*, *x _{i2}*, ..., *x _{i20}*).

The entries in these **x**-vectors are non-negative and sum to 1.00. Denote the vector of *average importance weights*, averaged across the entire set of faculty respondents in a field, by **x̄** = (*x̄* _{1}, *x̄* _{2}, ..., *x̄* _{20}).

The mean value, *x̄ _{k}*, is the average weight of the importance given to the *k*^{th} program variable by all the surveyed faculty respondents in the field. The averages, {*x̄ _{k}*}, are the *direct or survey-based weights* of the faculty respondents because they directly give the average relative importance of each program variable, as indicated by the faculty questionnaire responses in the field of study. Thus, the final 20 importance measures of the program characteristics for each faculty respondent are non-negative and sum to 1.0.

### Boxes (2b), (4) The Regression-Based Weights

We next consider the processes in boxes (2b) and (4) in Figure J-1 that lead to the *regression-based weights*. Again, we leave boxes (2) and (2a) to our later discussion of how we simulated the uncertainty in these data.

The regression-based weights represent our attempt to ascertain how much weight is implicitly given to each program variable by faculty members when they rate programs using their own *perceived* quality of the programs they are rating. We used linear regression to predict average faculty ratings from the 20 program variables and interpreted the resulting regression coefficients as indicating the *implicit importance* of each program variable for faculty ratings. This is different from the survey or direct weights just described. We have broken the process of obtaining the regression-based weights into the three parts indicated by boxes (2b), (4), and (5a), which we now discuss in turn.

#### Box (2b) The average ratings for the sampled programs

The ratings data in R of Figure J-1 are the ratings given by the sampled faculty members to the sample of programs that they were requested to rate. A randomly selected faculty member, *i*, rates a randomly selected program, *j*, on a scale of 1 to 6 in terms of his or her *perception* of its quality. Denote this rating by *r _{ij}*. The matrix sampling plan used was designed so that a sample of up to 50 of the programs in a field was rated by a sample of the graduate faculty members in that same field. Each rater rated about 15 programs, and none rated his or her own program. On average, each rated program was rated by about 44 faculty raters. The rater sample was stratified to ensure proportionality by geographic region, program size (measured by number of faculty), and academic rank. The program sample was stratified to ensure proportionality by geographic region and program size.

R is the array of all the values of *r _{ij}*. Note that R is an *incomplete array* because many faculty members who responded to the questionnaire did *not* rate programs and many programs in a field were *not* rated, except for the small fields. Box (2b) indicates that we compute the average of these ratings for program *j*, and denote this average rating by *r̄ _{j}*. Because each program’s average rating is determined by a different random sample of graduate faculty raters, it is highly unlikely that any two programs will be evaluated by exactly the same set of raters. Denote the *vector* of the average ratings for the sampled programs in a field by **r̄**.

The values of the average ratings in **r̄** are the *dependent variable* in the regression analyses used to form the regression-based weights.

#### Box (4) The program variables and standardizing

Denote the value of program variable *k* for program *j* by *p _{jk}*, and define the vector of all program variables for program *j* by **p** _{j} = (*p _{j1}*, *p _{j2}*, ..., *p _{j20}*), and the array with rows given by **p** _{j} by **P**. A cursory examination of the program characteristics listed in Table J-1 shows that they are on *different scales*. For example, the number of publications per faculty member (numbers in the fives and tens), the median GRE scores of entering students (numbers in the hundreds), and the percentage of entering students who complete a doctoral degree in 10 years or less (fractions) are reported in values that are of very different *orders of magnitude*. If these values are left as they are, the size of any regression coefficient based on them will be influenced by *both* the importance of that program variable for predicting the average ratings (which is what we are interested in) *and* the scale of that variable (which is arbitrary and does not interest us). The program variables with *large values*, such as the median GRE scores, will have very small coefficients to reflect the change in scale in going from GRE scores (in the hundreds) to ratings (in the 1 to 6 range). Conversely, program variables with *small values*, such as proportions, will have larger regression coefficients to reflect the change in scale in going from numbers less than 1 to ratings (in the 1 to 6 range).

To avoid the ambiguity between the influence of the scale and the real predictive importance of a variable, we needed to modify the values of the different program variables so they have *similar scales*. This would ensure that program variables with the same influence on the prediction of faculty ratings would have similar regression-coefficient values. Our solution is the very common one of *standardizing* the *p _{jk}*-values by subtracting their mean across the programs in a field and dividing by the corresponding standard deviation. This will result in program variables that have the same mean (0.0) and standard deviation (1.0) across the programs in the field. In this way, no program variable will have substantially larger or smaller values than any other program variable across the programs in a field. For the regressions of box (4), the standardization was done only over the programs that were sampled for rating.
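A minimal sketch of this standardization step (the function name and the `binary_cols` argument, which exempts the ±1-coded variables discussed below, are ours):

```python
import numpy as np

def standardize(P, binary_cols=()):
    """Z-score each column (program variable) across programs; columns
    listed in binary_cols (e.g., the +/-1-coded Student Work Space and
    Health Insurance variables) are left as they are."""
    P = np.asarray(P, dtype=float)
    Pstar = P.copy()
    for k in range(P.shape[1]):
        if k in binary_cols:
            continue
        Pstar[:, k] = (P[:, k] - P[:, k].mean()) / P[:, k].std()
    return Pstar
```

After this step every standardized variable has mean 0.0 and standard deviation 1.0 across the programs in the field.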

We denote the values of the standardized program variables with an asterisk (*p _{jk}** and **P***). Two program variables (Student Work Space and Health Insurance) were coded as 1 (present) or −1 (absent). We felt there was no need for additional standardization of these two program variables, and they were not standardized to have mean 0 and variance 1.

The standardized program variables for the sampled and rated programs served as the *predictor or independent variables* in the regressions that lead to the regression-based weights.

#### Box (5a) The regressions and the regression-based weights

The statistical problem addressed in box (5a) is to use **r̄** and **P*** as the *dependent* and *independent* variables, respectively, in a linear regression, to obtain the vector of regression-based weights, **m̂**, using least squares. It should be noted that only the data in **P*** for the *sampled* programs are used. The data for the non-sampled programs in **P*** are not used in this step of the process.

Two immediate problems arise: (1) the number of observations (i.e., the number of sampled programs in a field) is 50 or less, while the number of independent variables (i.e., the program variables in **P***) is 20, and (2) a number of the program variables are correlated with each other across the programs in a field. This is less than ideal for obtaining *stable* regression coefficients. There are too few observations to hope for stable estimates of the coefficients for 20 variables, and the fact that these variables are also correlated does not help matters. If we had ignored these two problems, least-squares regression methods would have tended to assign coefficients rather arbitrarily to one particular variable or to other variables correlated with it, and the outcome would have depended on which programs were included in the sample of rated programs. The resulting unstable regression coefficients would have been unusable for our purposes.

For example, as expected, when we fit a linear model that included all 20 of the program variables, we found that for a number of the variables, the coefficients and their signs did not make intuitive sense. They made more sense when we used various step-wise selection methods to reduce the number of variables used as predictors. With only 50 cases, we could not expect to use all 20 variables in the prediction equations without adjustments.

After examining a variety of approaches, we settled on a backwards, step-wise selection method applied to the 20 *principal component* (PC) variables formed from the 20 program variables (rather than the original 20 program variables themselves). The regression coefficients obtained for the remaining PC variables were then transformed back to the scale of the original 20 program variables, with the result that all 20 program variables had non-zero coefficients, but these coefficients were subject to several linear constraints implied by the deleted PC variables.

The principal component variables are linear combinations of the original 20 program variables that have two properties: (1) they are uncorrelated in the sample, and (2) they can give exactly the same predictions as do the original variables—that is, every prediction equation that is possible with the original variables is also possible to form using the PC variables, using different regression coefficients. The PC variables are usually ordered by their variances from largest to smallest, but this plays no role here. There are as many PC variables as there are original variables—in our case, 20.

If we denote the array of original 20 standardized variables for the sample of rated programs as **P***, then the corresponding array of the 20 PC variables, **C**, is given by the matrix multiplication, **C** = **P*****V**, where **V** is the 20 by 20 orthogonal matrix specified by, among other things, the *singular value decomposition* of **P***. After the regression coefficients are estimated using the PC variables, we get back to the coefficients for the original standardized variables in **P*** by transforming the vector of regression coefficients by the transformation, **V**.

Our step-wise use of the PC variables proceeded as follows. We begin with a least-squares prediction equation, predicting **r̄** from **C**, that includes all of the PC variables. Then a series of analyses is performed, with one PC variable at a time being left out of the prediction equation; the PC variable that has the least impact on the fit of the predicted ratings (as measured by its t-statistic) is removed. This process is repeated, removing one PC variable each time, until the remaining PC variables each add statistically significant improvements to the fit of the predictions of the ratings (at the 0.05 level). The result is a set of regression coefficients, the *PC coefficients*, **ĉ**, which predict the sample of program ratings from a subset of the PC variables, i.e.,

*M̂ _{j}* = *ĉ* _{0} + *ĉ* _{1}*c _{j1}* + ... + *ĉ* _{20}*c _{j20}*, (Equation 5)

where *c _{jk}* is the value of the *k*^{th} PC variable for program *j*.

In Equation 5, the caret denotes estimation. Moreover, for the PC variables that have been eliminated during the backwards selection process, the corresponding PC coefficients, *ĉ _{k}*, are zero. These zeros mean that we are setting the *coefficients* of certain *linear combinations of the original variables* to zero rather than setting the coefficients for some of the original program variables to zero. This was regarded as a virtue, because we did not *necessarily* eliminate any of the original program variables from the prediction equation used to find the regression-based weights. By proceeding this way, we are not forced to give a zero weight to one of two collinear variables in the step-wise procedure. Instead, both collinear variables will typically load onto the same principal components and get some weight when the matrix **V** is applied to the PC coefficients to obtain the coefficients for the original program variables, i.e., **m̂** = **V** **ĉ**.

In the same way, the matrix of estimated variances and covariances of **ĉ**, obtained from the least-squares output, may be transformed to the corresponding matrix for **m̂**.^{4}

The regression coefficient for the *k*^{th} program variable, denoted by *m̂ _{k}*, is the *regression-based weight* for program characteristic *k* as a predictor of the average ratings of the programs by the faculty raters, and **m̂** = (*m̂* _{1}, *m̂* _{2}, ..., *m̂* _{20}).
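The pipeline just described, forming the PC variables from the SVD of **P***, backwards elimination of the least significant PC, and transformation back by **V**, might be sketched as follows (assumptions of ours: an intercept is included, and a |t| > 2 cutoff stands in for the 0.05 significance test; the committee's actual implementation may differ in these details):

```python
import numpy as np

def pc_regression_weights(Pstar, rbar, t_crit=2.0):
    """Backwards step-wise regression on principal component variables.

    Returns the coefficient vector on the original standardized
    variables, m = V c, with zeros assigned to eliminated PCs."""
    n, p = Pstar.shape
    # PC variables: C = Pstar V, with V from the SVD of Pstar
    U, s, Vt = np.linalg.svd(Pstar, full_matrices=False)
    V = Vt.T
    C = Pstar @ V
    keep = list(range(p))
    while keep:
        X = np.column_stack([np.ones(n), C[:, keep]])
        beta, *_ = np.linalg.lstsq(X, rbar, rcond=None)
        resid = rbar - X @ beta
        dof = n - X.shape[1]
        sigma2 = resid @ resid / dof
        cov = sigma2 * np.linalg.inv(X.T @ X)
        t = beta[1:] / np.sqrt(np.diag(cov)[1:])
        worst = np.argmin(np.abs(t))
        if np.abs(t[worst]) < t_crit:
            keep.pop(worst)          # drop the least significant PC
        else:
            break                    # all remaining PCs are significant
    c = np.zeros(p)                  # zeros for the eliminated PCs
    if keep:
        X = np.column_stack([np.ones(n), C[:, keep]])
        beta, *_ = np.linalg.lstsq(X, rbar, rcond=None)
        c[keep] = beta[1:]
    return V @ c                     # coefficients on original variables
```

Because the returned vector is **V** applied to the PC coefficients, every original variable can receive some weight even when PCs have been deleted.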

The predicted perceived quality rating for a sampled program can be expected to *differ* somewhat from the actual average rating for that program. For example, for the two fields studied in *Assessing Research Doctorate Programs: A Methodology Study*, the root-mean-square deviation between the predictions and the average ratings was 0.42 on a 1-to-6 rating scale for both mathematics and English. In addition, the (adjusted) *R*^{2} of the regressions of average ratings on measured program characteristics was 0.82 for mathematics and 0.80 for English. These values indicate that the predictions account for about 80 percent of the variability in average ratings. We regarded this level of agreement between predicted and actual ratings as satisfactory for using these methods in this study.

These results show that the *predicted* perceived quality ratings agree fairly well with the *actual* ratings. However, these results do not indicate how well a prediction equation that was based on a *sample of programs* will reproduce the predictions of the equation for the *whole population of programs* in a field. The data for mathematics, reported in *Assessing Research Doctorate Programs: A Methodology Study*, indicate that using 49 programs did a reasonably good job of reproducing the predictions based on the whole field of 147 mathematics programs.^{5} Thus, we decided that in developing the regression-based ratings, we would use a sample of 50 programs from a field if it had more than 50 programs and use almost all of the programs in fields with 50 or fewer programs. When there were fewer than 30 programs in a field, it was combined with a larger discipline with similar direct weights for the purposes of estimating the regression-based weights.^{6} In two cases, computer engineering and engineering science and materials, there were fewer than 25 programs, and these fields were not ranked, although data are reported for all 20 characteristics.^{7}

There is one final alteration in the values of **m̂** that needs to be mentioned. The survey-based or direct weights, {*x̄ _{k}*}, have absolute values that sum to 1.0. This is not necessarily true of the regression coefficients, {*m̂ _{k}*}. The scale of *m̂ _{k}* depends on both the scale of *p _{jk}* and the scale of the average ratings, {*r̄ _{j}*}. We decided, because initially our intent was to *combine* these two sources of the importance of the various program variables, that they needed to be on similar scales. We decided to force them *both* to sum to 1.0 in absolute value.^{8} This allows the direct and regression-based weights to have negative values where they arise, typically in the regression-based weights, without requiring anything complicated to deal with this. Using the sum of absolute values allows the sign of the regression-based weights to be determined by the data rather than by an a priori hypothesis. Thus, we divided each regression coefficient, *m̂ _{k}*, by the sum of the absolute values of all the regression coefficients. In this way, both the direct and regression-based weights are fractional values, mostly positive but some negative, whose absolute sums equal 1.0.^{9}
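The final rescaling is a one-liner (illustrative helper; the name is ours):

```python
import numpy as np

def normalize_weights(m):
    """Rescale a weight vector so the absolute values of its entries
    sum to 1.0, preserving the signs chosen by the data."""
    m = np.asarray(m, dtype=float)
    return m / np.abs(m).sum()
```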

### Boxes (1), (1a), (2) and (2a) Simulating the Uncertainty in the Direct and Regression-Based Weights

The survey-based (S) or direct weight vector, **x̄**, is subject to uncertainty; that is, a different set of respondent faculty would have led to different values in **x̄**. Disagreement among the graduate faculty on the relative importance of the 20 program variables is the source of the uncertainty of the direct or survey-based weights. The average ratings of the sampled faculty in **r̄** are also subject to uncertainty; a different sample of raters or programs would have produced different values in **r̄**. One way to reflect this uncertainty is to use the sampling distributions of **x̄** and **r̄**. There are various ways that these sampling distributions may be realized. We chose an empirical approach that made no assumptions about the shapes of the various distributions involved but that allowed us to use computer-intensive methods to let the sampling variability of both **x̄** and **r̄** influence the final ratings and rankings. We examined two empirical approaches: Efron’s *bootstrap* and a *random-halves* (RH) procedure suggested by the committee chairman. We found that both gave very similar final results in terms of the final ranges of rankings and ratings. The bootstrap requires taking a sample of *N* with replacement from the relevant empirical distribution. The RH procedure requires taking a sample of *N*/2 without replacement from the same empirical distribution. We chose to use the RH procedure because it cut the sampling computations in half, is fairly easy to explain, and, as far as we could tell, gave essentially the same results as the bootstrap for ranking and rating.

#### Boxes (1) and (2) The random halves procedure

The RH procedures for **x̄** and **r̄** are nearly the same, with the same justifications. **X** is a complete array whose rows denote the *N* faculty respondents, while **R** is an incomplete array whose rows denote the *n* sampled faculty raters for a field. In the case of **X**, the RH procedure requires a random sample of size *N*/2 of the *faculty respondents*. In the case of **R**, the RH procedure requires a random sample of size *n*/2 of the *faculty raters*. Repeated draws of these random half samples are then used to simulate the uncertainty in **x̄** and **r̄**, respectively.

Alert readers may worry that these half samples will exhibit *too much* variability in the resulting averages; after all, a half sample has only half the number of cases of a full sample, and the bootstrap always takes a full sample of *N* or *n*. The explanation of why a half sample without replacement has essentially the same variability as a full sample with replacement is most easily seen by considering the variance of the mean of a sample without replacement from a finite population. It is well known from sampling theory that the variance of the mean from a sample of size *N*/2, drawn without replacement from a population of size *N* with variance σ^{2}, is, essentially,

(σ^{2}/(*N*/2)) × (1 − (*N*/2)/*N*) = σ^{2}/*N*. (Formula 11)

That is, because of the “finite sampling correction,” the variance from a random half sample without replacement is essentially the same as the variance of the mean of a random sample of twice the size with replacement (there is a small “*N* versus *N* − 1” effect that Formula 11 ignores). This is why the bootstrap and the RH methods give such similar results in our application to the uncertainty of the direct weights. There are other reasons to expect the RH method to produce a useful simulation of the uncertainty of averages.^{10}
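A small simulation (the population size and replication counts are invented for illustration) shows the two sampling schemes producing means with essentially the same variance, as Formula 11 predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
pop = rng.normal(size=1000)      # stand-in for the N faculty respondents
N = len(pop)

def half_sample_means(reps=2000):
    """Means of N/2 drawn WITHOUT replacement (the RH procedure)."""
    return np.array([rng.choice(pop, size=N // 2, replace=False).mean()
                     for _ in range(reps)])

def bootstrap_means(reps=2000):
    """Means of N drawn WITH replacement (the bootstrap)."""
    return np.array([rng.choice(pop, size=N, replace=True).mean()
                     for _ in range(reps)])
```

Both collections of means have variance close to σ^{2}/*N*, so the ratio of the two empirical variances comes out close to 1.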

The same reasoning applies to the RH sampling of the faculty raters in **R** to simulate the uncertainty in the average ratings, **r̄**, used to obtain the regression-based weights. The procedure was to sample a random half of all raters for programs in a field and compute the average rating for each program from that half sample.

The regression-based weights are subject to uncertainty from *two* sources. The first is the uncertainty arising from sampling the faculty raters, and, as indicated above, the RH sampling directly addresses this source. The second is from using average ratings from a sample of programs rather than all the programs to develop the regression equation from which the regression-based weights are derived. In the discussion of box (4), above, we gave our reasoning for believing the sample of 50 programs is adequate and explained how we pool the data from other related fields when the number of programs in a field is smaller than 50. In addition, while the use of ratings for a sample of programs has the practical value of reducing the workload of the faculty raters, our *implicit* use of the predicted average ratings, {*M̂ _{j}*}, from Equation 5 above, rather than the actual average ratings, {*r̄ _{j}*}, also reduces some of the uncertainty due to the sampling of the programs to be rated. For these two reasons, we believe that this second source of uncertainty is not as important as that simulated by the RH procedure for the uncertainty in the average ratings and, consequently, for the regression-based weights, **m̂**.

We always drew the RH samples 500 times, and those for **x̄** were statistically independent of those for **r̄**. This gives us 500 replications of the direct or survey-based weights and 500 replications of the regression-based weights.

### Boxes (3) and (3a) Incorporating Uncertainty into the Program Variables

In addition to the uncertainty in the survey-based (direct) and regression-based weights discussed above, there is also some uncertainty in the values of the program variables themselves. Some of the 20 program variables used to calculate the ratings vary, or have an error associated with their values, due to year-to-year fluctuations. Data for five of the variables (publications per faculty, citations per publication, GRE scores, Ph.D. completion, and number of Ph.D.’s) were collected over time, and averages over a number of years were used as the values of these program variables. If a different time period had been used, the values would have been different. To express this type of uncertainty, a *relative error factor*, *e _{jk}*, was associated with each program variable value, *p _{jk}*. The relative error factor was calculated by dividing the standard deviation over the series by the square root of the number of observations in the series, and then dividing that number by the value of the variable, *p _{jk}*. For example, the publications-per-faculty variable is the average number of allocated publications per allocated faculty member over 7 years, and a standard error was calculated for this variable as SD/√7. This standard error was then divided by the value of the publications-per-faculty variable to get the relative error factor for this program variable.
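The computation for a variable observed over several years can be sketched as follows (we assume the sample standard deviation and that the variable's value is the multi-year average, as in the publications example; the function name is ours):

```python
import numpy as np

def relative_error_factor(series):
    """Relative error factor for a variable observed over several years:
    (SD / sqrt(number of observations)) / value of the variable,
    where the variable's value is the average over the series."""
    series = np.asarray(series, dtype=float)
    value = series.mean()                    # the program-variable value
    se = series.std(ddof=1) / np.sqrt(len(series))
    return se / value
```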

For the other 15 program variables used in the ratings, no data on variability were directly obtained during the study, and we assigned a relative error of 0, 0.1, or 0.2 to these variables. The relative errors for the variables Student Work Space and Health Insurance were set to 0, because these variables were thought to have little or no temporal fluctuation over the interval considered; for Percent of Faculty Holding Grants, the error assigned was 0.2, because an examination of data from the *National Science Foundation Survey of Research Expenditure* indicated this to be an appropriate estimate. The remaining 12 program variables were assigned a relative error of 0.1. Each program had its own relative error factor for each program variable, *e _{jk}*.

Just as we had simulated values from the sampling distributions of **x̄** and **r̄** via RH sampling, we also wanted to reflect the uncertainty in the values of the program variables themselves rather than using the fixed values, {*p _{jk}*}, in computing program ratings. We did this in the following way. The value, *p _{jk}*, was *perturbed* by drawing randomly from the Gaussian distribution, *N*(*p _{jk}*, (*e _{jk}p _{jk}*)^{2}). This distribution has a mean equal to the variable value, *p _{jk}*, and a standard deviation equal to the relative error, *e _{jk}*, times the variable value, *p _{jk}*. Thus, the entire array **P** is randomly perturbed to a new array. This perturbing process is repeated 500 times, and each resulting array is standardized so that each of the 20 program variables has mean 0.0 and standard deviation 1.0 across the programs, producing 500 standardized, perturbed arrays.
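One perturbation replication might be sketched as follows (function and argument names are ours; `binary_cols` again exempts the ±1-coded variables from re-standardization):

```python
import numpy as np

def perturb_and_standardize(P, E, rng, binary_cols=()):
    """One perturbed replication of the program-variable array P: each
    entry is drawn from N(p_jk, (e_jk * p_jk)^2), then each column is
    re-standardized to mean 0, SD 1 (binary +/-1 columns kept as drawn)."""
    P = np.asarray(P, dtype=float)
    E = np.asarray(E, dtype=float)
    Pt = rng.normal(loc=P, scale=np.abs(E * P))
    for k in range(P.shape[1]):
        if k not in binary_cols:
            Pt[:, k] = (Pt[:, k] - Pt[:, k].mean()) / Pt[:, k].std()
    return Pt
```

Calling this 500 times with the same `P` and `E` but fresh random draws produces the 500 standardized, perturbed arrays.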

### Boxes (5b) and (5c) The Ninety Percent Ranges of the S and R Rankings

Box (5b) combines each of the 500 replications of the survey-based weights, and box (5c) combines each of the 500 replications of the regression-based weights for the given field [from box (2b)], with one of the 500 replications of the standardized, perturbed version of **P** that contains the program variable data for all of the programs to be rated in the field. Each such combination yields one replication of the rating, *R _{j}*, of every program *j* in the field.

For either measure, denote the *k*^{th} replication of *R _{j}* by *R _{j}*^{(k)}. To obtain the *k*^{th} replication of the *rankings* of the programs, sort the values of *R _{j}*^{(k)} over *j* from high to low and assign the rank of 1 to the program with the highest rating in this set. In case of tied ratings, we use the standard procedure in which the ranks are averaged over the tied cases: the common rank given to the tied programs is the average of the ranks that would otherwise have been given to them. For each replication of the ratings, there is a corresponding replication of the rankings of the programs, resulting in 500 replications of the ranking of each program.
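The high-to-low ranking with tie-averaging can be sketched as follows (equivalent in effect to the "average" method of common ranking routines):

```python
import numpy as np

def rank_high_to_low(ratings):
    """Rank programs so the highest rating gets rank 1; tied ratings
    receive the average of the ranks they would otherwise occupy."""
    ratings = np.asarray(ratings, dtype=float)
    order = np.argsort(-ratings, kind="stable")   # best program first
    ranks = np.empty(len(ratings))
    i = 0
    while i < len(order):
        j = i
        # extend the block of programs tied with the one at position i
        while j + 1 < len(order) and ratings[order[j + 1]] == ratings[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + 1 + j + 1) / 2.0  # average rank for ties
        i = j + 1
    return ranks
```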

Instead of reporting a single ranking of the programs in a field, we report the ninety percent range of the rankings for each program. This is an interval starting with the rank at the 5th percentile of the distribution of the 500 replications of the ranks for the given program and ending at the 95th percentile of this distribution. The ninety percent range is *the range that covers the middle ninety percent of the rankings* and reflects the uncertainty in the survey-based (direct) and regression-based weights and in the program data values: five percent of a program’s rankings in our process are lower than this interval and five percent are higher. The interval itself represents what we would expect the typical rankings for that program to be, given the uncertainty in the process and the ratings of the other programs in the field.^{11} These ninety percent ranges are reported for the R and S measures, as well as for the three dimensional measures.
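Given the 500 replicated ranks of one program, the ninety percent range is then just two percentiles (a sketch; `np.percentile`'s default linear interpolation between order statistics is one reasonable convention):

```python
import numpy as np

def ninety_percent_range(rank_replications):
    """5th and 95th percentiles of a program's replicated ranks."""
    r = np.asarray(rank_replications, dtype=float)
    return np.percentile(r, 5), np.percentile(r, 95)
```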

## AN ALTERNATIVE APPROACH TO CONSTRUCTING RANKINGS: COMBINING THE R AND S MEASURES

The prepublication version of the revised Methodology Guide appeared in July 2009 and explained the methodology developed by the committee at that time, one that combined the R-based and S-based measures in the way described below. In July 2009 the committee had estimated ranges of rankings for only a handful of fields and assumed that this method of estimation would be generally satisfactory. In theory it is, but when applied to data for additional fields it became clear that there were some fields for which the range of program rankings based on the S measure differed considerably from that based on the R measure. Further, the committee came to view any set of ranges of rankings that it might develop as illustrative; that is, any range of rankings depends critically on the characteristics chosen and the weights applied to those characteristics. The R-based and S-based ranges of rankings are two examples of data-based ranking schemes, but there are others; the dimensional measures described in the body of this Guide are an example.^{12} The further steps that the committee carried out to obtain ranges of rankings using the combined measures are described in this section, beginning with an alternative conceptual diagram.

### Boxes (6) and (7) The Combined Weights

To motivate our method of combining the direct and regression-based weights, we start by describing the direct and regression-based *ratings*. Remembering that the standardized values of the program variables for program *j* are denoted by *p _{jk}**, the *direct rating* for program *j*, using the average direct weight vector, **x̄**, is *X _{j}*, given by

$$X_j = \sum_{k} \bar{x}_k \, p^*_{jk} \tag{6}$$

The *regression-based rating* for program *j*, using the regression-based weight vector, **m̄**, is *M _{j}*, given by

$$M_j = \sum_{k} \bar{m}_k \, p^*_{jk} \tag{7}$$

Note that the regression-based rating is a linear transformation of the predicted ratings used to obtain the regression-based weights, because the constant term of the regression is deleted and the weights have been scaled by a common value so that their absolute sum is 1.0. The procedure for computing regression-based ratings can be used for any program, sampled or not, in the given field: simply use *M _{j}* as defined in Equation 7 above, where {*p _{jk}**} comes from the data for program *j* and the {*m̄ _{k}*} are the regression-based weights based on the sample of programs and raters.^{13}

We combined the direct ratings with the regression-based ratings as follows. Let *w* denote a *policy weight* and form the following *combination* of the direct and regression-based ratings:

$$R_j = (1 - w)\,X_j + w\,M_j \tag{8}$$

The *policy weight*, *w*, is chosen in box (5) of Figure J-1 and is the amount the regression-based ratings are allowed to influence the combined rating, *R _{j}*. When *w* = 0, the regression-based rating has *no* influence on *R _{j}*. When *w* = 1, the *R _{j}* are *totally* based upon the regression-based ratings. Any *compromise value* of *w* lies between 0 and 1.

We did not actually form both the direct and regression-based ratings in our work. Instead, we exploited the simple linear form of these given by

$$R_j = \sum_{k} \bar{f}_k \, p^*_{jk} \tag{9}$$

where the combined weight, *f̄ _{k}*, is given by

$$\bar{f}_k = (1 - w)\,\bar{x}_k + w\,\bar{m}_k \tag{10}$$

The representation of the combined rating given in Equations 9 and 10 is a linear combination of the program variables that uses the *combined weights*, {*f̄ _{k}*}, defined in Equation 10. The combined weight *f̄ _{k}* is applied to the *k*^{th} standardized program characteristic, *p _{jk}**, for each *k*, and then all 20 of these weighted values are summed to obtain the final combined rating for program *j*.
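The algebraic equivalence between Equation 8 and Equations 9 and 10 can be checked numerically. The sketch below uses made-up standardized data and weights; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n_programs, n_vars = 6, 20
p_star = rng.standard_normal((n_programs, n_vars))  # standardized program data p*_jk

x_bar = rng.random(n_vars)
x_bar /= np.abs(x_bar).sum()        # direct weights, absolute sum 1.0
m_bar = rng.random(n_vars)
m_bar /= np.abs(m_bar).sum()        # regression-based weights, absolute sum 1.0
w = 0.5                             # policy weight

X = p_star @ x_bar                  # direct ratings (Equation 6)
M = p_star @ m_bar                  # regression-based ratings (Equation 7)
f_bar = (1 - w) * x_bar + w * m_bar # combined weights (Equation 10)
R = p_star @ f_bar                  # combined ratings (Equation 9)

# Equation 9 with the combined weights reproduces Equation 8 exactly
assert np.allclose(R, (1 - w) * X + w * M)
```

The final assertion holds by linearity: weighting the ratings and weighting the weights are the same operation.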

However, because both *x̄ _{k}* and *m̄ _{k}* are subject to uncertainty, we made one additional adjustment to Equation 10, which is described below, following the discussion of how we simulated the uncertainty in both the direct weights and in the average ratings used to form the regression-based weights.

#### Box (7) Using the optimal fraction to combine the direct and regression-based weights

In deriving the ranges of ratings that reflect the uncertainty in *x̄ _{k}* and *m̄ _{k}*, simulated values, *m _{k}* and *x _{k}*, are drawn from the sampling distributions of *m̄ _{k}* and *x̄ _{k}*, respectively, using independent RH samples from the appropriate parts of **R** and **X**. These two simulated values are to be combined to form a simulated value, *f _{k}*, for *f̄ _{k}* in Equation 11. However, the simple weighted average in Equation 11 only reflects the effect of the policy weighting, *w*, and ignores the fact that both *m _{k}* and *x _{k}* are independent random draws from distributions, rather than fixed values. We want to combine *m _{k}* and *x _{k}* in such a way as to bring the simulated value, *f _{k}*, as close as possible to *f̄ _{k}* on average, and in a way that will also reflect the policy weight, *w*, appropriately. This section outlines our approach to choosing the *optimal fraction* to apply to *m _{k}* to achieve this. The optimal fraction is the amount of weight applied to *m _{k}* that minimizes the mean-square error of *f _{k}*, treating *f̄ _{k}* as a target parameter to be estimated.

First, consider a general weighting, *f _{k}*(*u*), that uses a fraction, *u*. This weighting has the form

$$f_k(u) = (1 - u)\,x_k + u\,m_k$$

By construction of the RH procedure, the mean of the distribution of *m _{k}* is *m̄ _{k}* (the regression coefficient that is obtained when the data from all *n* faculty raters are used). Similarly, the mean of the distribution of *x _{k}* is *x̄ _{k}*, the mean importance value that is obtained when the data from all *N* faculty respondents are averaged. We may regard *f _{k}*(*u*) as an estimator of *f̄ _{k}*, as given by Equation 10. The problem then is to find the value of *u* that will minimize the mean-square error (MSE) of *f _{k}*(*u*), given by

$$\mathrm{MSE}(u) = E\bigl(f_k(u) - \bar{f}_k\bigr)^2 \tag{14}$$

where, in Equation 14, the notation E(*f _{k}*(*u*) − *f̄ _{k}*)^{2} denotes the *expectation* or *average* taken over the independent RH distributions of *m _{k}* and *x _{k}*. The MSE is a measure of the combined uncertainty in *f _{k}*(*u*).

The MSE in (14) can be written as

$$\mathrm{MSE}(u) = E\bigl[(1-u)(x_k - \bar{x}_k) + u(m_k - \bar{m}_k) + (w-u)(\bar{x}_k - \bar{m}_k)\bigr]^2 \tag{15}$$

The point of re-expressing Equation 14 as Equation 15 is that now, when the squaring is carried out, all of the terms except the squared ones have zero expected values and can be ignored. If we denote the variance of the sampling distribution of *x _{k}* by σ^{2}(*x̄ _{k}*) and the variance of *m _{k}* by σ^{2}(*m̄ _{k}*), then Equation 15 becomes

$$\mathrm{MSE}(u) = (1-u)^2\,\sigma^2(\bar{x}_k) + u^2\,\sigma^2(\bar{m}_k) + (w-u)^2\,(\bar{x}_k - \bar{m}_k)^2 \tag{16}$$

It is now a straightforward task to differentiate Equation 16 in *u*, set the result to zero, and solve for the optimal *u*-value, *u*_{0}* _{k}*, which we call the *optimal fraction*. This calculation results in

$$u_{0k} = \frac{\sigma^2(\bar{x}_k) + w\,(\bar{x}_k - \bar{m}_k)^2}{\sigma^2(\bar{x}_k) + \sigma^2(\bar{m}_k) + (\bar{x}_k - \bar{m}_k)^2} \tag{17}$$

The optimal fraction in Equation 17 has some useful and intuitive properties. It takes on the value *w* when there is no uncertainty about the direct and regression-based weights. Moreover, *w* has no influence on the optimal fraction when *x̄ _{k}* and *m̄ _{k}* are equal. In that case, the direct weights and regression-based weights on the *k*^{th} program characteristic are the same, and the optimal fraction combines the two simulated values in a way that is inversely proportional to their variances, so that the value with less variation gets more weight. Note also that the value in Equation 17 is the same for all of the RH simulated values of *m _{k}* and *x _{k}*.

The two variances in Equation 17, σ^{2}(*m̄ _{k}*) and σ^{2}(*x̄ _{k}*), may be found in standard ways. The value of σ^{2}(*x̄ _{k}*) is given by

$$\sigma^2(\bar{x}_k) = \frac{\sigma^2(x_k)}{N_F} \tag{18}$$

where *N _{F}* denotes the number of faculty in the field who supply direct weight data, and σ^{2}(*x _{k}*) denotes the variance of the individual direct weights given to the *k*^{th} program variable by these faculty respondents. The value of σ^{2}(*m̄ _{k}*) is obtained from the regression output that produces *m̄ _{k}* when the data from all faculty raters in a field are used. Its square root, σ(*m̄ _{k}*), is the standard error of the regression coefficient, *m̄ _{k}*. Finally, because we rescaled the *m̄ _{k}* so that their absolute sum was 1.0, the same divisor must be applied to σ(*m̄ _{k}*) to put it on the corresponding scale.

If we now replace the *u* in the general weighting *f _{k}*(*u*) with the optimal fraction *u*_{0}* _{k}* given in Equation 17, we obtain the combined weight that optimally combines the two simulated values of the weights, *m _{k}* and *x _{k}*, into the combined rating, given by

$$R_{0j} = \sum_{k} f_{0k}\, p^*_{jk} \tag{19}$$

where

$$f_{0k} = (1 - u_{0k})\,x_k + u_{0k}\,m_k \tag{20}$$

and *u*_{0}* _{k}* is given by Equation 17. The vector of optimally combined weights is denoted by *f*_{0}.^{14}

The values of *R*_{0}* _{j}* from Equations 19 and 20 are used as the 500 simulated values of the combined ratings for the purpose of determining the ranges of rankings for each program, as discussed below.
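One replication of the optimally combined rating can be sketched end to end. This is an illustrative reconstruction with made-up inputs: for brevity the RH draws are mimicked here by normal draws, which is our simplification, not the committee's procedure, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n_programs, n_vars = 5, 20
p_star = rng.standard_normal((n_programs, n_vars))  # standardized program data p*_jk

w = 0.5                                  # policy weight
x_bar = rng.random(n_vars)
x_bar /= np.abs(x_bar).sum()             # mean direct weights x̄_k
m_bar = rng.random(n_vars)
m_bar /= np.abs(m_bar).sum()             # regression-based weights m̄_k
sd_x = 0.02 * np.ones(n_vars)            # σ(x̄_k), assumed known here
sd_m = 0.04 * np.ones(n_vars)            # σ(m̄_k), rescaled regression SEs

# one replication: independent draws mimicking the two sampling distributions
x_k = rng.normal(x_bar, sd_x)
m_k = rng.normal(m_bar, sd_m)

# optimal fraction (Equation 17); note the limiting behavior:
# with sd_x = sd_m = 0 it reduces to w, and with x_bar == m_bar it is
# sd_x**2 / (sd_x**2 + sd_m**2), independent of w.
d2 = (x_bar - m_bar) ** 2
u0 = (sd_x**2 + w * d2) / (sd_x**2 + sd_m**2 + d2)

f0 = (1 - u0) * x_k + u0 * m_k           # optimally combined weights (Equation 20)
R0 = p_star @ f0                         # combined ratings (Equation 19)
```

Repeating the two draws 500 times yields the 500 simulated combined ratings referred to above.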

In performing the RH sampling to mimic the uncertainty in the direct and regression-based weights, it should be emphasized that the random half samples from **X** and **R** were statistically independent. This is our justification for assuming that the random draws, *m _{k}* and *x _{k}*, are statistically independent in the calculation of the optimal fraction, *u*_{0}* _{k}*.^{15}

As a final point, we recognized that the approach to calculating the optimal fraction described above does not take into account any correlation between the direct and regression-based weights for *different* program variables. We examined a method that did; it produced a matrix version of Equation 17 that reduced to the procedure we used when the program variables were uncorrelated but was otherwise difficult to implement with the resources available to us.

### Box (8) Eliminating Non-Significant Program Variables

After we obtained the 500 simulated values of the combined weights by applying Equations 17 and 20 to the 500 simulated values of the direct and regression-based weights, we were in a position to examine the distributions of these 500 values of the combined weights for each program variable. The distributions of the combined weights for some of the program variables did not contain zero and were not even near zero. Other program variables, however, had combined-weight distributions that did contain zero. If zero is inside the middle 95 percent of this distribution, we declare the combined weight for that program variable to be *non-significant* for the rating and ranking process (in analogy with the usual way that distributions of parameters are tested for statistical significance). If the combined weight for a program variable is not significantly different from zero, the variable for that coefficient is dropped from further computations. This elimination of program variables required us to recalculate everything above box (8) in Figure J-2: the eliminated program variables are ignored in calculating the direct and regression-based weights for the other variables, new RH samples are drawn, the direct weights are rescaled so that the absolute sum of the remaining direct weights is 1.0, the regressions are re-run using the reduced set of program variables as predictors, and new optimal fractions are computed to combine the direct and regression-based weights. Finally, the 500 simulated combined coefficients are again tested for significant difference from zero. This process is repeated until a final set of combined weights, each significantly different from zero, is obtained. Only after this testing and retesting is performed are the final sets of 500 combined coefficients ready for use in the computation of the intervals of rankings that are discussed in box (5) of Figure J-1.
The values for the combined weights that correspond to the eliminated variables are set to 0.0 in each of the final 500 simulated values of *f*_{0}. These 500 vectors of combined weights are used in the production of the ratings that are used to produce the final intervals of rankings for each program, as discussed later.
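The middle-95-percent screen described above can be sketched as follows; this is an illustrative reconstruction, and the function name and the 500-by-variables array layout are ours.

```python
import numpy as np

def nonsignificant_variables(combined_weight_draws, level=0.95):
    """combined_weight_draws: (500, n_vars) simulated combined weights.
    Returns a boolean mask marking variables whose middle `level` share of
    draws contains zero, i.e., whose weight is declared non-significant."""
    tail = 100 * (1 - level) / 2
    lo = np.percentile(combined_weight_draws, tail, axis=0)        # 2.5th pct
    hi = np.percentile(combined_weight_draws, 100 - tail, axis=0)  # 97.5th pct
    return (lo <= 0) & (0 <= hi)
```

Variables flagged by the mask would have their combined weights set to 0.0 and the whole weighting pipeline re-run on the reduced variable set, as the text describes.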

Empirically, the examination of three fields suggests that this process has two useful effects. First, the middle of the inter-quartile ranges of rankings of programs changes very little, so that the ranges before eliminating non-significant program variables and those after this elimination are centered in nearly the same places.^{16} Second, the widths of these inter-quartile ranges are slightly reduced or unchanged. These are the effects we would expect from eliminating variables that have only a noisy effect on the rating and ranking process, and for this reason we have continued to include box (8) in our rating and ranking process. Nonetheless, the inter-quartile intervals do shift more markedly than the medians when estimated coefficients are set to zero, largely for departments near the middle of the rankings, because quartile estimates are more variable than median estimates. There are even rare instances in which the intervals calculated both ways do not overlap.

From this point on, the calculation of the ranges of rankings is carried out as described in the section above on the R and S ranges of rankings.


## Footnotes

- 1
The importance of program attributes to program quality is surveyed in Section G of the faculty questionnaire.

- 2
The number of student publications and presentations was not used because consistent data on it were unavailable. The direct or survey-based and regression-based weights were calculated without it.

- 3
The faculty task can be thought of as asking faculty how many percentage points should be assigned to each category. The sum of the percentage point weights adds up to 100.

- 4
If the weights from the R and S measures were to be combined, the variances from this matrix would be used later [in box (6) of the computation of combined weights] in the computation of the “optimal fraction” for combining the survey-based and regression-based weights.

- 5
See Appendix G of *Assessing Research Doctorate Programs: A Methodology Study*, National Research Council (2003).

- 6
The fields for which this was done were:

Small Field | Surrogate Field |
---|---|
Aerospace engineering | Mechanical engineering |
Agricultural economics | Economics |
American studies | English literature |
Astrophysics and astronomy | Physics |
Entomology | Plant science |
Forestry | Plant science |
Food science | Plant science |
Theatre and performance | English literature |

- 7
Ranges of rankings are not provided for three fields that were in the original taxonomy: 1) Languages, Societies, and Cultures, for which the sub-fields were too diverse to treat it as a coherent field; and 2) Engineering Science and Materials and 3) Computer Engineering, which fell below the minimum of 25 programs required to permit the calculation of rankings for a field. The committee had not anticipated this when it developed the taxonomy, or these would not have been included as separate fields.

- 8
We use the absolute value here because, for time to degree, a higher value should receive a negative weight. Note that normalization has no effect on relative rankings, since it is simply a linear transformation.

- 9
The estimated standard deviations of the {*m̄ _{k}*}, obtained in standard ways from the regression output, were also divided by this sum to make them the correct size for use in the process of combining the direct and regression-based weights, discussed below.
- 10
The random-halves procedure has a place in the statistical literature, but with other names. It is an example of the “delete-d” jackknife, with d = n/2, as described in Efron, B., and R. Tibshirani (1993), *An Introduction to the Bootstrap*, New York: Chapman and Hall, p. 149. It was described by Kirk Wolter in a private communication as an example of “balanced repeated replication” or “balanced half samples,” treated in Wolter, K. M. (2007), *Introduction to Variance Estimation*, 2nd ed., New York: Springer-Verlag.
- 11
In an earlier draft of this guide, we chose an inter-quartile range, but this choice, rather than some other range (eliminating the top and bottom quintile, for example) is arbitrary. The current approach uses broader ranges which result in greater overlap of ranges, but has the advantage of covering most of the rankings a program might achieve. The point of introducing uncertainty in our calculations is that we do not know the “true” ranking of a program. The purpose of presenting a ninety percent range is to provide a range in which a program’s ranking is likely to fall.

- 12
In most cases, it would not make sense to combine the dimensional measures because they yield differing results for most programs.

- 13
We have throughout estimated linear regressions. Is this assumption justified? We can only say that, empirically, we tried alternative specifications that included quadratic terms for the most important variables (publications and citations) and did not find an improved fit.

- 15
The fact that the raters for each field were a subset of those who answered the faculty questionnaire may confuse some into thinking that our independence assumption may not be justified. This is an unfortunate misunderstanding of the simulation of uncertainty in the rating and ranking process. It is the statistical independence of the two RH sampling processes that matters, nothing else.

- 16
Examination of the effect of this procedure gave correlations of .99 between the median rankings with and without the elimination of nonsignificant variables.
