
National Research Council (US) Committee to Examine the Methodology for the Assessment of Research-Doctorate Programs; Ostriker JP, Kuh CV, editors. Assessing Research-Doctorate Programs: A Methodology Study. Washington (DC): National Academies Press (US); 2003.


6 Reputation and Data Presentation

INTRODUCTION

Since the first study of research-doctorate programs in 1925, users have focused on the reputational rating of programs as the primary measure for assessing the quality of doctoral programs. Even with the introduction of many quantitative measures in the 1982 and 1995 Studies, ratings of scholarly quality of faculty by other scholars in the same field and the resulting rankings of programs have remained the primary object of attention. Recognizing this fact, the Committee and its Panel on Reputational Measures and Data Presentation set as their task the development of procedures that would:

  • Identify useful reputational measures,
  • Select raters who have a knowledge of the programs that they are asked to rate,
  • Provide raters with information about the programs they are rating, and
  • Describe clearly the variation in ratings that results from a sample survey, and present program ratings in a manner that meaningfully reflects this variation.

A useful reputational measure is one that reflects peer assessment of the scholarly quality of program faculty. Ideally, such a measure would be based only on the knowledge and familiarity of the raters with the scholarly quality of the faculty of the programs they are asked to rate and would not be directly influenced by other factors, such as the overall reputation of the program's institution (a “halo effect”) or the size or age of the program. Both the 1982 and the 1995 Studies presented correlations of reputation with a number of other quantitative measures. The next assessment should expand on these correlational analyses and consider including and interpreting multivariate analyses.

An example of an expanded analysis that would be of considerable interest is one that explores the relation between scholarly reputation and program size. The 1982 Study found a linear relation between the scholarly reputation of program faculty and the square root of program size. Ehrenberg and Hurst (1998) also found a positive effect of program size on reputation. Both of these analyses suggest diminishing returns: beyond some point, an increase in program size is associated with little additional gain in reputational rating, yet it is also clear that small programs are not rated as highly as medium-sized and large programs. Further analyses along these lines would be useful.
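As a minimal sketch of this kind of size-reputation analysis, the fragment below fits a least-squares line relating mean program ratings to the square root of program size. The data and variable names are synthetic illustrations, not the 1995 Study files.

```python
# Illustrative only: synthetic data standing in for program-level study variables.
import numpy as np

rng = np.random.default_rng(0)
faculty_count = rng.integers(5, 80, size=120)          # hypothetical program sizes
rating = 2.0 + 0.35 * np.sqrt(faculty_count) + rng.normal(0, 0.4, size=120)

# Fit rating = a + b * sqrt(size) by ordinary least squares.
X = np.column_stack([np.ones(len(faculty_count)), np.sqrt(faculty_count)])
coef, *_ = np.linalg.lstsq(X, rating, rcond=None)
print(f"intercept = {coef[0]:.2f}, slope on sqrt(size) = {coef[1]:.2f}")
```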

The Committee believes that the reputational measure of the scholarly quality of faculty is important and consequential. A highly reputed program may have an easier time attracting excellent students, good faculty, and research resources than a program that is less highly rated. At the same time, reputation is not everything. Students, faculty, and funders need to examine detailed program directions and offerings to be able to assess the quality of a program for their particular objectives.

THE MEASUREMENT OF SCHOLARLY QUALITY OF PROGRAM FACULTY: PRACTICES AND CRITICISMS

The Reputational Measure of Scholarly Quality of Program Faculty

To obtain the reputational measure of scholarly quality, raters have been presented with lists of program faculty and the number of doctorates awarded in the program over the previous 5 years. They were then asked to indicate:

  1. On a 3-point scale, their familiarity with the work of the program faculty, and
  2. On a 6-point scale, their view of the scholarly quality of the program faculty (a seventh category, “Do not know well enough to evaluate,” was also included).

For years, the use of a reputational survey to assess the scholarly quality of program faculty and the effectiveness of a doctoral program has attracted criticism. Critics cite program size as a factor that correlates with quality; the “halo effect,” which raises the perceived quality of all programs in an institution that is considered to have a good reputation; the national visibility of a department or institution; and the “star effect,” in which a few well-known faculty members can raise a program's ratings. There are nonreputational measures by which individuals can assess programs, such as educational or research facilities and the quality of graduate-level teaching and advising, but these are often not widely known outside the doctoral program, and raters who are not closely associated with a program would have limited information on which to base a judgment. In fact, the strong correlation between the reputational measure of scholarly quality of the program faculty (“Q”) and the effectiveness of the doctoral program in training scholars (“E”) in past studies suggests that raters have little knowledge of educational programs independent of faculty lists.

Rater Selection

For the 1995 Study, a large enough number of raters was selected to provide 200 ratings for each program in nonbiological fields and 300 ratings for each program in the biological sciences. For example, if there were 150 programs in a nonbiological field, then 600 raters would be needed to provide the 200 ratings, since each rater was asked to rate 50 programs. In the biological sciences the number of raters needed to rate 150 programs was 750, since 60 programs appeared on each questionnaire and 300 ratings per program was the goal. The reason for the larger number of raters and ratings stems from the last study committee's realization that its taxonomy did not accurately describe fields in the biological sciences, so that the field of some raters often did not match that of the programs they were asked to rate.
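The arithmetic behind these rater counts is simple; the short sketch below reproduces the two examples given above (a minimal illustration, not part of the study methodology).

```python
def raters_needed(n_programs, ratings_per_program, programs_per_rater):
    """Total ratings required divided by the number of ratings each rater supplies."""
    return n_programs * ratings_per_program / programs_per_rater

print(raters_needed(150, 200, 50))   # nonbiological example: 600.0 raters
print(raters_needed(150, 300, 60))   # biological example: 750.0 raters
```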

Raters in the 1995 Study were selected in an almost random manner with the following restrictions: at least one rater was selected from each program; the number of raters from a particular program was proportional to the size of the program; and if more than three raters came from a program, they were selected on the basis of faculty rank, with the first chosen from among a pool of full professors, the second from among associate professors, and so on. The response rate for this sample was about 50 percent across the 41 fields in the study, and the more visible national programs received most of the responses, often about 100 ratings each. Programs at regional universities received fewer ratings, and in some cases too few remained after trimming for scores to be averaged. It was also noted that when the question asking for a rater's “familiarity” with the program faculty was used to weight the responses to the question on program quality, ratings increased for the higher-rated programs and decreased for the lower-rated programs. It appears that more reliable and usable ratings would result if rater familiarity were taken into account.
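As an illustration of the trimming step mentioned above, the sketch below computes a trimmed mean rating for a single program and returns nothing when too few ratings survive trimming. The trimming fraction is an assumption chosen for illustration, not the value used in the 1995 Study.

```python
import numpy as np

def trimmed_mean(ratings, trim_fraction=0.1):
    """Drop the top and bottom trim_fraction of ratings, then average the rest.
    Returns None when too few ratings remain to form an average."""
    ratings = np.sort(np.asarray(ratings, dtype=float))
    k = int(len(ratings) * trim_fraction)
    kept = ratings[k:len(ratings) - k]
    return kept.mean() if kept.size > 0 else None

print(trimmed_mean([3, 4, 4, 5, 5, 5, 2, 1, 5, 4]))   # 4.0 after dropping one low and one high rating
```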

Program Information

The last two assessments provided raters with a limited amount of program information. Faculty names by rank were listed on the questionnaire, and for some fields the number of program graduates over a 5-year period was also included. This information was provided to assist raters in associating researchers with their institutions, but a sample of raters asked to indicate the number of names they recognized typically recognized at most one or two faculty members in most programs. Thus it may be that only the most visible scholars and scientists determined reputational ratings and that the faculty lists were of little assistance to raters. Additional program information or cues might assist raters in assessing program quality.

Variability of Reputational Measures

Since the National Surveys of Graduate Faculty for past studies were sample surveys, there is a certain amount of variability in the results. If a different sample of raters had been selected, the ratings would, in general, have been different.1 This possible variability was described for past studies by estimating the confidence intervals for the scores of each program and displaying the results graphically to show the overlaps. However, this analysis was generally ignored by users and the rank order of the programs remained the focus of attention. An important remaining issue is the communication of uncertainty or variability of the ratings to users and the presentation of data that reflects the variability. Doing so can help to dispel a spurious impression of accuracy in ratings.

IS SCHOLARLY REPUTATION WORTH MEASURING?

While the 1995 Study has been criticized for many of the measures it reported, the major objection was its ranking of programs on the basis of the scholarly reputation of program faculty. In particular, critics argued that few scholars know enough about more than a handful of programs in their discipline, that programs change more rapidly than the reputations that follow them, that response bias presents a false sense of program ratings, that reputation is dependent on program size, and that weak programs at well-known institutions benefit from a “halo effect.” On the other hand, reputations definitely exist, for individual programs as well as for universities. Reputational standing is real in its consequences and correlates strongly with other indicators of quality. Perceptions of program quality held by knowledgeable outsiders are important to deans, department chairs, and other administrators in designing and promoting their programs; to governing boards in allocating resources across programs; and to prospective students in choosing among programs. More importantly, reputational measures provide a benchmark against which other quantitative measures can be calibrated.

The Panel on Reputational Measures and Data Presentation took the criticisms of the reputational measure as a challenge, recognizing that the techniques used in earlier studies to generate reputational ratings were developed in an era when there were fewer doctoral programs, program faculty were less specialized, and the mission of most doctoral programs was the training of students for academic positions. Although many doctorate holders were taking nonacademic jobs by the time of the 1995 Study, the desire to maintain continuity with earlier studies dictated a continuation of the earlier methodology. These changes in the doctoral education environment have made the task of developing a meaningful reputational measure more difficult, but at the same time the technological developments of the past decade make it possible to use online questionnaires to enhance and expand the scope of a survey. Modern database analysis methods also give users techniques to analyze both the results of reputational surveys and the quantitative measures from the study to address their program, institutional, and research needs.

ADDRESSING ISSUES RELATED TO REPUTATIONAL MEASURES

The issues to be addressed fell into two major categories: 1) the development of procedures that would improve the quality of a reputational survey, and 2) the presentation of data from the reputational survey in a way that would minimize spurious inferences of precision in program ratings by users.

Efforts to improve the quality of reputational surveys focused on having a more informed rater pool by either providing raters with additional information about the programs they were rating or matching the characteristics of raters with those of the programs. Matching raters to programs appears to be a good idea, but it introduces many complications, since the variety of missions and subfields present in any one of the fields in the taxonomy would rapidly create a multi-dimensional stratification of the rater pool and introduce unknown biases. Developing a large rater pool with few constraints would provide ratings that could be analyzed on the basis of program and rater characteristics. This would enable a better understanding of the process that generates reputational ratings. It would also provide a sufficient number of ratings so that institutions could evaluate the study findings based on a sample of ratings they judge to be meaningful. For example, a program could analyze only those program ratings from raters at peer institutions. This would also allow institutions to analyze their programs with particular subfield specializations against those in other similarly specialized programs to gain a more accurate assessment. This could be done through the use of an online data-extraction program where there is a quantitative database for each program, and certain data, such as the list of program faculty, could be linked to the database to provide information on faculty productivity and scholarship.
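As a hedged sketch of the kind of data extraction suggested above, the following fragment restricts a program's ratings to those supplied by raters at a user-chosen list of peer institutions and averages them. The table layout, column names, and peer list are hypothetical.

```python
import pandas as pd

# Hypothetical ratings table: one row per (rater, program) rating.
ratings = pd.DataFrame({
    "program": ["A", "A", "A", "B", "B"],
    "rater_institution": ["Univ1", "Univ2", "Univ3", "Univ1", "Univ4"],
    "quality_rating": [4.0, 5.0, 3.0, 4.5, 2.5],
})

peer_institutions = {"Univ1", "Univ2"}   # chosen by the program doing the analysis
peer_view = ratings[ratings["rater_institution"].isin(peer_institutions)]
print(peer_view.groupby("program")["quality_rating"].mean())
```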

Beyond the issue of survey methodology is the issue of data presentation for all the measures, reputational and quantitative, from the study. For the 1995 Study the data were collated into a large publication consisting primarily of statistical results, with tables for each field displaying data for the various measures. This will no longer be practical given the increase in the number of measures, programs, and fields. For the 1995 Study a CD-ROM was also produced that contained the raw data from different sources and was intended to serve as a research tool for specialized analyses. While this basic data set will again be available for the next study in electronic form, there will also be a public-use file that allows general users to access, retrieve, and analyze data for any program included in the study. The printed study would provide examples of analyses that could be conducted using the data.

MODELS OF REPUTATION

Another criticism of the reputational measure of scholarly quality is that it ages between studies and, since the study is conducted only every 10 years or so, users must rely on an obsolete measure of reputation during the interim period. In fact, reputational ratings change very slowly over time, but users might find it helpful to be able to approximate the effects of program changes on their reputational status. One approach would be to construct a statistical model of reputation, dependent on quantitative variables. Using that model, it would then be possible to predict how the range of ratings would change when a quantitative variable changed, assuming the other variables remained constant. The parameters of such a model would measure the statistical effect of both the intrinsic and standardized quantitative variables on the mean of the reputational variable for all programs in a field. This would permit a program to estimate the effect on reputation of, for example, shortening time to degree or increasing the percentage of faculty with research funding. Examination of outliers in this estimation would permit the identification of those programs for which such a model “underpredicts” or “overpredicts” reputation. Programs experiencing a “halo effect” would have a better reputation than that predicted by the quantitative variables in the model alone. A technical description of such a model and examples of it using data from the 1995 Study are shown in Appendix G. Such a model could be used to estimate ratings during the period between studies, if programs updated their quantitative information regularly on a study Website.
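A minimal sketch of the kind of model described above (the full specification appears in Appendix G): mean reputational ratings are regressed on a few quantitative program characteristics by ordinary least squares, and programs with large positive residuals are flagged as candidates for a “halo effect.” The characteristics, coefficients, and data here are illustrative assumptions, not the Appendix G model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Hypothetical standardized program characteristics.
pubs_per_faculty = rng.normal(0, 1, n)
pct_faculty_funded = rng.normal(0, 1, n)
sqrt_program_size = rng.normal(0, 1, n)
rating = (3.0 + 0.5 * pubs_per_faculty + 0.3 * pct_faculty_funded
          + 0.2 * sqrt_program_size + rng.normal(0, 0.3, n))

# Ordinary least squares fit of rating on the program characteristics.
X = np.column_stack([np.ones(n), pubs_per_faculty, pct_faculty_funded, sqrt_program_size])
beta, *_ = np.linalg.lstsq(X, rating, rcond=None)
residuals = rating - X @ beta

# Programs the model "underpredicts" (large positive residuals) behave like halo-effect cases.
halo_candidates = np.argsort(residuals)[-5:]
print("coefficients:", np.round(beta, 2))
print("largest positive residuals at program indices:", halo_candidates)
```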

However, there is a cautionary note for this type of analysis. It assumes that the relationships (the parameters) of the model are invariant over time and that only the values of the program characteristics change. If program characteristics in a field change sufficiently during the period between assessments, the assumption will not be valid. At this time it is not possible to judge the effects of time on the model or the soundness of this analysis, but when data are collected for the next assessment it will be possible to compare the model parameters in Appendix G with those estimated using new data on the same characteristics. The current analysis is also limited by the number of characteristics for which data were collected for the 1995 Study; since the next assessment will collect data on more characteristics, the model might be improved with the expanded data set and refined further in subsequent assessments.

FINDINGS AND RECOMMENDATIONS

Why Measure Scholarly Reputation at All?

The large amount of data collected during previous assessments of research-doctorate programs has been widely used, and scholarly reputation in particular is a significant component of the evaluation of faculty and programs, with consequences for student choices, institutional investments, and resource acquisition. Reputation is one part of the “reality” of higher education that affects a tremendous number of decisions: where graduate students choose to study, where faculty choose to locate, and where resources may flow. It also correlates strongly with honorific recognition of faculty. Critics have given reasons for discounting the reputational rating, including many stated earlier, but it is the most widely quoted and used statistic from the earlier studies, and with better sampling methods and clearer ways of presenting survey results it can become a more accurate and useful measure of the quality of research-doctorate programs. Institutions use the reputational measure to benchmark their programs against peer programs. If the measure were eliminated, institutions would no longer be able to track changes in their programs in this admittedly ill-defined, but important, respect. The reputational measure also provides a metric against which program resources and characteristics can be compared, since comparable quantitative measures for similar programs are available across a large list of institutions. Although students were not considered to be potential users of past studies, they in fact used the reputational ratings in conjunction with the other measures in the reports to select programs for graduate study. Future studies should encourage this use by students and provide both reputational and quantitative measures to assist them in their decisions.

The care taken by the NRC in conducting studies is another factor to consider with regard to retaining the reputational measure. NRC studies are subjected to a rigorous review process, and the study committee would be composed primarily of academic faculty, university administrators, and others whose work involves the judgment of doctoral program quality. This may be the only reputational study of program quality that limits raters of programs to members of the discipline being rated, and the proposed study will go even further to ensure that ratings are made by people who know the programs they rate. Further, unlike studies conducted by the popular press, NRC ratings are not based on weighted averages of factors: the reputational measure reflects an evaluation of the scholarly reputation of program faculty alone, and the quantitative measures are presented unweighted. Thus users can apply the data from the study to reflect their own preferences, analyze the position of their own programs, and conduct their own comparisons. This cannot be accomplished with weighted measures.

Recommendation 6.1: The next NRC survey should include measures of scholarly reputation of programs based on the ratings by peer researchers in relevant fields of study.

Applying New Methods For Data Presentation

The presentation of average ratings in previous surveys has led to an emphasis on a single ordering of programs based on these average ratings and has given a spurious sense of precision to program rankings. Using a different set of raters would probably lead to a different set of average scores and a different rank ordering of programs. This is demonstrated by the confidence interval analyses that appeared in the last two NRC study reports. However, variance in the ratings and rankings implied by the confidence interval analysis did not translate into the way the ratings (calculated to two decimal places) were used. To show the variance in a more direct way, modern statistical methods of data display, based on resampling, can be used to show that there is actually a range of plausible ratings and, consequently, a range of plausible rankings for programs. These methods show that it is not unusual for these ranges to overlap, thereby dispelling the notion that a program is ranked precisely number 3, for example, but, rather, that it could have been ranked anywhere from first to fifth.

The question then arises: What is the best way to calculate statistically the range of uncertainty for a program? Such a presentation would go beyond reporting the mean and standard error. The panel investigated two statistical methods, Random Halves and Bootstrap, to display the variability of results from a sample survey. These techniques are discussed technically in Appendix G.

The Random Halves method is a variation of the “Jackknife Method”: for each draw, half of the ratings are selected without replacement and a mean rating is computed for each program from that half. A different random half of the whole sample is taken for each subsequent draw, so that after a large number of draws a range of mean ratings results for each program. The interquartile range of these ratings would then be presented as the program rating.
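A minimal sketch of the Random Halves idea for a single program's ratings; the ratings are synthetic and the number of draws is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
ratings = rng.integers(1, 6, size=80).astype(float)   # synthetic ratings for one program

half = len(ratings) // 2
half_means = []
for _ in range(100):                                   # number of draws, chosen for illustration
    sample = rng.choice(ratings, size=half, replace=False)   # half the ratings, no replacement
    half_means.append(sample.mean())

q25, q75 = np.percentile(half_means, [25, 75])
print(f"interquartile range of mean ratings: {q25:.2f} to {q75:.2f}")
```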

The Bootstrap method would be applied by taking a random draw from the pool of raters equal to the number of responses to the survey, then computing the mean rating and ranking for each program. This would be done “with replacement,” i.e., a rater and the corresponding rating could be selected more than once. If this process were continued for a large number of draws, a range of ratings would be generated and a segment of that range for each program, such as the interquartile range, would be the range of possible ratings.
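A corresponding sketch of the Bootstrap variant, again for one program's synthetic ratings; here draws are made with replacement and the interquartile range of the resampled means is reported.

```python
import numpy as np

rng = np.random.default_rng(3)
ratings = rng.integers(1, 6, size=80).astype(float)   # synthetic ratings for one program

boot_means = []
for _ in range(100):                                   # number of draws, chosen for illustration
    sample = rng.choice(ratings, size=len(ratings), replace=True)   # resample with replacement
    boot_means.append(sample.mean())

q25, q75 = np.percentile(boot_means, [25, 75])
print(f"interquartile range of mean ratings: {q25:.2f} to {q75:.2f}")
```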

Both methods produce similar results if the number of samples taken is sufficiently large (greater than 50), since the variance of the average ratings for the two methods is nearly the same. It might be argued that neither method produces a true rating or ranking of a program by peers in the field, but unless the survey asked every person in the field to assess every program in the field and the response rate were 100 percent, the reputational rating would be subject to error. Presenting that error in a clear way would be helpful to users of the assessment.

An illustration of data presentation in which the rankings are de-emphasized can be found in Chart 6–1A. The Random Halves method was applied to reputational survey data from the 1995 Study for programs in English Language and Literature; the data were resampled 100 times, and the programs are listed alphabetically. Chart 6–1B is an example of the Bootstrap method applied to the same programs. Charts 6–2A and 6–2B present the same calculations for programs in mathematics. Tables 6–1A, 6–1B, 6–2A, and 6–2B, which show the same applications of the Random Halves and Bootstrap methods in tabular form, can be found at the end of this chapter, following the charts.

Chart 6–1A. Interquartile Range of Program Rankings in English Language and Literature—Random Halves.

Chart 6–1B. Interquartile Range of Program Rankings in English Language and Literature—Bootstrap.

Chart 6–2A. Interquartile Range of Program Rankings in Mathematics—Random Halves.

Chart 6–2B. Interquartile Range of Program Rankings in Mathematics—Bootstrap.

TABLE 6–1A. Interquartile Range of Program Rankings in English Language and Literature—Random Halves.

TABLE 6–1B. Interquartile Range of Program Rankings in English Language and Literature—Bootstrap.

TABLE 6–2A. Interquartile Range of Program Rankings in Mathematics—Random Halves.

TABLE 6–2B. Interquartile Range of Program Rankings in Mathematics—Bootstrap.

The Committee favors the use of the Random Halves method over the Bootstrap method, since it corresponds to surveying half the individuals in a rater pool and may be more intuitive to users of the data; however, either would be suitable. Both Random Halves, as a variation of the Jackknife Method, and Bootstrap are well known in the statistics literature. Regardless of which technique is used, the interquartile range is then calculated in order to eliminate outliers. The results of either analysis could be presented in tabular or graphic form for programs listed alphabetically. These charts and tables are shown at the end of the chapter.

The use of either of these methods has the advantage of displaying variability in a manner similar to the confidence interval computations in past reports, but without the technical assumption of a normal distribution that underlies the construction of a confidence interval. These methods provide ranges rather than a single number, and so differ from the presentation of survey results in the 1982 and 1995 Studies. Those studies presented program ratings as just one among many program characteristics in order to de-emphasize their importance: tables in the 1982 Study presented the data in alphabetical order by institution, and in the 1995 Study programs were ordered by faculty quality ratings. In both cases, however, ratings were quickly converted into rankings by the press and by academic administrators, and programs were compared on that basis. If used properly, rankings have value over ratings, since raters distribute programs across the scale subjectively and differently, and this effect can only be eliminated by renormalization (or standardization). Rankings have the advantage of all nonparametric statistical measures: they are independent of variable and shifting rater scales. The Committee therefore concluded that if methods such as Random Halves or Bootstrap were used to address the issue of spurious accuracy, some of the defects attributed to the misuse of rankings would be alleviated. The committee that will actually conduct the next assessment will have the option of presenting the data in alphabetical order, in rank order of a measure such as the average faculty quality rating, or by the ranking range obtained from either the Bootstrap or Random Halves method.

Recommendation 6.2: Resampling methods should be applied to ratings to give ranges of rankings for each program that reflect the variability of ratings by peer raters. The panel investigated two related methods, one based on Bootstrap resampling and another closely related method based on Random Halves, and found that either method would be appropriate.

The Use and Collection of Auxiliary Data

Previous reputational surveys have contributed little to our understanding of the causes and correlates of scholarly reputation. Raters were selected randomly and were asked to provide only a limited amount of personal data. For the 1982 Study, a simple analysis showed that raters rated programs higher if they had received their doctorate from that institution. Other information that could influence raters includes the number of national conferences they have attended in recent years and their use of the Internet. These data might help to explain general questions of rater bias and the “halo effect.” They may also be useful to programs and to university administrators in attempting to understand ratings and improve their programs.

New technologies such as Web-based surveys and matrix sampling make it possible to collect significant additional information on programs and on peer raters, allowing a better understanding of the causes and correlates of scholarly reputation. For example, statistical analyses could be conducted to relate rater characteristics to ratings. Beyond that, matrix sampling could be used to explore how ratings vary when raters are given information beyond lists of faculty names.2
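As one hedged example of relating a rater characteristic to ratings, the fragment below compares the mean rating given by raters who earned their doctorate at the rated institution with the mean from all other raters, echoing the 1982 finding noted above. The table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical rater-level data joined to the ratings they supplied.
df = pd.DataFrame({
    "program_institution": ["U1", "U1", "U2", "U2", "U3", "U3"],
    "rater_phd_institution": ["U1", "U5", "U2", "U6", "U7", "U3"],
    "quality_rating": [5.0, 4.0, 4.5, 3.5, 3.0, 4.0],
})

df["rater_is_alumnus"] = df["program_institution"] == df["rater_phd_institution"]
print(df.groupby("rater_is_alumnus")["quality_rating"].mean())
```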

Recommendation 6.3: The next study should have sufficient resources to collect and analyze auxiliary information from peer raters and the programs being rated to give meaning and context to the rating ranges that are obtained for the programs. Obtaining the resources to collect such data and to carry out such analyses should be a high priority.

Survey Questions and Previous Survey

In the 1982 and 1995 assessments of research-doctorate programs, three qualitative questions were asked of peer reviewers. These addressed the quality of the program faculty (Q), the effectiveness of the graduate program (E), and the change in program quality over the past 5-year period (C). Only the question regarding the scholarly quality of the program faculty seemed to produce significant results. The effectiveness question correlated highly with the quality question but did not appear to provide any additional useful information. The results for the change question were also not significant, and the 1995 study committee instead relied on a comparison of data and quality scores from the 1982 and 1995 Studies to analyze change in quality, in addition to change in program size and time to degree.

Recommendation 6.4: The proposed survey should not use the two reputational questions on educational effectiveness (E) and change in program quality over the past 5 years (C). Information about changes in program quality can be found from comparisons with the previous survey, analyzed in the manner we propose for the next survey.

The Selection of Peer Raters for Programs

Peer raters in a field were selected almost randomly, as described earlier, and only from the pool of faculty listed by the programs. Yet many Ph.D.s teach or work outside of research universities. While in some fields a large number of new Ph.D.s go into academic careers, this is far from universal; in many fields, such as those in engineering, a large number of doctorates take industrial or governmental positions. How well programs serve the needs of employers in these other sectors has been a long-standing question. The 1995 Study investigated the possibility of surveying supervisors of research teams or human resource officers to obtain their opinions of academic programs, but it concluded that many companies hire regionally and that there did not appear to be a way to integrate this information into a useful measure.

The issue of expanding the rater pool has not been resolved and various constituencies have asked that peer raters for programs be drawn from a wider pool than from the academic programs being rated. This could be assisted, in part, if the next committee included members who could represent industrial and governmental research, as well as academic institutions that are not research universities. The pool of raters could be expanded to include: industrial researchers in engineering; government researchers in fields such as physics, biomedical sciences, and mathematics; and faculty at 4-year colleges. It might be possible to identify a pool of raters from these sectors through nominations by professional organizations whose membership extends beyond academics.

Recommendation 6.5: Expanding the pool of peer raters to include scholars and researchers employed outside of research universities should be investigated with the understanding that it may be useful and feasible only for particular fields.

Consideration of Program Mission

Doctoral programs and institutions have varying missions and they serve different student populations and employment sectors. While large institutions have the capacity for programs that span many subfields of a discipline, smaller institutions may be limited to developing excellence in only one or two subfields. Comparison of broad programs to such “niche” programs would possibly be biased by the visibility of broader programs. Similarly, programs may have as their mission the training of researchers for regional industries and would, therefore, not have the same national prestige as programs whose graduates go into academic positions. One main criticism of past assessments was that these factors were not taken into account.

Taking subfield differences and program mission into consideration in selecting raters for the reputational survey appears to be an obvious way to obtain more meaningful results. However, fragmenting rater pools into many segments based, for example, on subfields would complicate the survey process by expanding the current 56 fields in the taxonomy to several hundred, and to many more if factors such as the employment sectors of graduates were also considered. A more manageable way to account for program mission and other factors would be to assemble a sufficiently diverse rater pool and to collect data on rater and program characteristics, so that individual programs could compare themselves with like programs on the basis of ratings from raters who have knowledge of those programs.

Recommendation 6.6: Ratings should not be conditioned on the mission of programs, but data to conduct such analyses should be made available to those interested in using them.

Providing Peer Raters with Additional Information

It is clear from the familiarity and visibility measures used in past studies that raters generally have little or no knowledge on which to base their ratings of many programs. The limited program information provided to raters in the last study may not have helped, since many raters in the sample were unable to identify any faculty member in programs rated in the lower half of the rankings; it is therefore unclear on what basis many ratings were made. Information provided to raters could influence their ratings, especially for lower-rated programs, but this phenomenon is not well understood. Since the reputational survey of faculty will probably be Web-based, there is an opportunity to provide a large amount of quantitative data, such as the honors of individual faculty members or their publication records, directly in the questionnaire as links to a database. Exploring this approach for a sample of programs and raters might provide insight into the use and value of reputational surveys.
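A minimal sketch of the kind of embedded cue experiment contemplated in Recommendation 6.7 below: raters are randomly assigned to one of several cue conditions and mean ratings are compared across conditions. The condition labels and the ratings are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n_raters = 300
cue_conditions = ["names_only", "names_plus_honors", "names_plus_publications"]   # hypothetical cue sets

assignment = rng.choice(cue_conditions, size=n_raters)   # random assignment of raters to cues
ratings = rng.normal(3.5, 0.8, size=n_raters)            # synthetic ratings; a real survey would supply these

df = pd.DataFrame({"condition": assignment, "rating": ratings})
print(df.groupby("condition")["rating"].agg(["mean", "count"]))
```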

Recommendation 6.7: Serious consideration should be given to the cues that are provided to peer raters. The possibility of embedding experiments using different sets of cues given to random subsets of peer raters should be seriously considered in order to increase the understanding of the effects of cues.

THE EFFECTS OF THE FAMILIARITY OF PEER RATERS WITH PROGRAMS ON THEIR RATINGS

It is well known that raters who are more familiar with a program will rate it higher than raters who are less familiar. This was demonstrated for the 1995 Study by weighting the ratings with responses to the familiarity question; however, those results were not used in compiling the final ratings. In fact, the only familiarity measure used in that study was a visibility measure for each program: the percentage of raters who gave “Don't know well enough to evaluate” or “Little or no familiarity” as one or more of their responses to the five questions. Comparing this measure with the faculty quality measure makes clear that the more highly ranked programs were also more visible. While accounting for familiarity in compiling program ratings may not change the ranking of programs, it does lend validity to the ratings by providing some basis for them.
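A minimal sketch of familiarity weighting as described above: each rating for a program is weighted by the rater's self-reported familiarity before averaging. The 3-point familiarity codes follow the survey description earlier in the chapter; the particular weighting scheme is an assumption for illustration, not the one examined in the 1995 Study.

```python
import numpy as np

# Ratings for one program and the raters' familiarity responses (1 = little or none, 3 = considerable).
ratings = np.array([5.0, 4.0, 3.0, 4.0, 2.0])
familiarity = np.array([3, 3, 1, 2, 1])

unweighted = ratings.mean()
weighted = np.average(ratings, weights=familiarity)   # familiarity-weighted mean
print(f"unweighted mean = {unweighted:.2f}, familiarity-weighted mean = {weighted:.2f}")
```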

Recommendation 6.8: Raters should be asked how familiar they are with the programs they rate and this information should be used both to measure the visibility of the programs and, possibly, to weight differentially the ratings of raters who are more familiar with the program.

Footnotes

1

2 Doing this would confuse “reputation” with more detailed knowledge of faculty productivity and other factors, but learning whether such information changes reputational ratings would be important to understanding what reputational measures actually tell us. This issue is discussed in greater detail below.

* Data from 1995 Study.

Copyright © 2003, National Academy of Sciences.
Bookshelf ID: NBK43477
