An Experimental Comparison of Statistical and Case History Methods of Attitude Research

III. Reliability of the Case History Ratings

Samuel Stouffer


Before attempting the present investigation, the writer did some previous experimenting.

First, the members of Professor Ellsworth Faris's graduate seminar in Social Attitudes were asked to write anonymously 1000-word accounts of their experiences in the use of liquor and of their feelings about liquor laws. The 12 documents received within the time set were read by four graduate students and ranked in the order of the apparent favorableness of attitudes toward prohibition. The ranks assigned by one pair of judges were added and correlated with the ranks assigned by another pair. The mean Pearsonian intercorrelation of the ranks assigned by all possible combinations of pairs of judges was only +.49 (raw). The documents, though apparently done conscientiously, suffered from lack of naiveté and from lack of the specificity which a good set of informal questions might have encouraged. Moreover, so many of the students tended to be drys that the range was too short for easy discrimination. The results were reported to the seminar and a number of useful ideas were forthcoming from Professor Faris and the students. As a result, a tentative set of informal suggestions for writing autobiographies, somewhat on the pattern of E. T. Krueger's,[1] was prepared.


Second, a "dress rehearsal" of the projected experiment was held with the members of one of Mr. Paul F. Cressey's Campus Classes in Sociology 110 as subjects. They took the Smith test and wrote autobiographical documents. An analysis of the results removed some of the skepticism which the writer and others had entertained as to the possibility of getting a sufficiently detailed and revealing study compressed into one thousand words. Twenty-three students took part. Different groups of the 23 case histories were distributed among six judges such that each judge read 16 papers and such that each paper was read by at least four judges. The judges were Professor Ellsworth Faris, Professor E. W. Burgess, Dr. Herbert Blumer, Mr. Paul F. Cressey, Mr. Frederick F. Stephan, and Mr. E. V. Stonequist. The papers were rated by placing a check mark on a graphic rating scale similar to that in Appendix A, page 70. A numerical score was assigned by measuring the distance from the left end of the scale to the check mark, in tenths of inches. The total of the raw scores given by Faris, Blumer, and Stonequist correlated +.96 with the total of the raw scores given by Burgess, Cressey, and Stephan. The reliability of the Smith test was found to be +.98 (raw). The correlation between the test and the combined ratings of the six judges (four ratings on each case history) was +.86 (raw). These coefficients were not regarded very seriously, of course, because they were based on only 23 subjects, and also because the class appeared to be rather heavily loaded with extreme anti's and pro's. Several minor (21) changes were made, as the fruit of criticisms by the judges who read the papers and of some further statistical tabulation, and the present project was put under way.

Each student who was to write the documents used in the larger study was given the mimeographed set of suggestions shown in Appendix A, pages 66-68. The method of guaranteeing anonymity and at the same time rewarding prompt and apparently conscientious work was carefully explained. In establishing rapport and eliciting cooperation, the present writer was greatly helped by the instructors, who made the preparation of the documents a required assignment and pointed out the value of attitude research. The students were not told specifically that their documents would be compared statistically with their test sheets and self-ratings. On the contrary, they were urged to disregard their answers to the test or their self-ratings, just as far as they liked, in writing their papers. It was made particularly emphatic that consistency with their answers to the test or with their self-ratings was not a basis of giving a class-room grade for their work. What was desired, they were told, was simply a full, truthful paper. It was pointed out that considerable inconsistency in attitudes from one day to the next might occur with many people.

As mentioned previously, out of the 249 students present, 238 wrote autobiographical documents within the time limit of about a week. Five other papers were turned in too late for use. This near unanimity of response seemed to forestall a possible bias due to selection of students who had had (22) the most striking experiences or who were most willing to write. Some of the 238 documents, of course, were much better than others. Though a few seemed too inadequate to do justice to the case study method, the investigator hesitated to assume responsibility for throwing them out. The whole 238 were handed over to the judges without alteration except for writing the sex and age of the author at the top of each document. If certain papers, of a sort which probably would be discarded in a practical application of the case study method, had been discarded in this investigation, the degree of agreement between the test scores and judges' ratings probably would have been slightly higher than that which was found.

The four judges were graduate students chosen by a committee of the faculty in the Department of Sociology as among those best equipped, by technical experience in the interpretation of case materials, knowledge of the theoretical literature on attitudes, and quality of insight into human experience, to make the judgments. The judges were Leonard S. Cottrell, Jr., Robert Faris, Everett V. Stonequist, and Edgar T. Thompson.

Each judge interpreted each of the 238 papers with respect to the author's attitude toward prohibition laws and with respect to his attitude toward drinking liquor. The only instructions on the graphic rating sheet [2] were:

( 23)

Please put one check mark (✓) on each of the above lines at a point which seems to represent the writer's present attitude best. Use any standards of 'favorable' or 'unfavorable' which you wish; simply endeavor to judge this and all other papers by the same standards.

Two other suggestions were added verbally. One was to regard the line on which the judges placed their check mark as a scale,[3] with the far extremes to be reserved only for the most extreme cases. The other was to formulate their judgments cumulatively while reading a given document and to come to a tentative decision before reading the last two paragraphs.[4]

As was anticipated, the judges had none too easy a task in making their ratings. "Just what is meant by liquor? Suppose a person detests whisky but likes a glass of wine now (24) and then; how can one resolve this into a single judgment as to his attitude toward liquor?" "Just what is meant by prohibition laws? Suppose a person favors state-wide prohibition but opposes national prohibition; how can one resolve this into a single judgment as to his attitude toward prohibition laws?" "What definition of attitudes shall be used?" Careful consideration of several formulations which might provide a uniform standard among the judges for considering such questions led to the conclusion that an effort to define such a standard more specifically would lead only into deeper water. Therefore, the judges were thrown back on the original instructions to use any standards which they wished, but simply to judge all the papers by the same standards. This meant, of course, the renunciation of effort to draw significant conclusions from a comparison of the absolute markings of the various judges as to their central tendency, dispersion, or shape of frequency distribution. But, if a judge was consistent in judging all papers by the same standard, a high linear or non-linear correlation between the judges' relative ratings would be just as possible theoretically as if the absolute markings were comparable.[5]

(25) The judges had about three weeks in which to read the papers in their "spare hours." While this length of time (26) tended to reduce errors due to fatigue, it tended presumably to work against holding a consistent standard throughout. This might have been checked somewhat by asking the judges to reread at the end some of the papers which they had rated at the beginning. It seemed impracticable, however, because the task of reading and carefully interpreting 238 papers was long and tedious, in spite of the intrinsic interest of the stories related. The investigator is under a deep debt of obligation to these four graduate students for their earnest cooperation.

After the papers were returned by the judges, each rating was expressed as a number by measuring the distance from the left of the scale to the check mark, in tenths of inches. The units were made finer than would have been necessary if the appropriate class intervals of a frequency distribution of the ratings could have been determined in advance. The ratings of a given judge were then converted into standard scores, by expressing them as a deviation from their mean and dividing by their standard deviation.[6] There was little significant difference (27) between the means assigned by the four judges, but the standard deviation of Faris' scores was less than that of the others. The means and standard deviations were, respectively: Cottrell, 2.376, 1.518; Faris, 2.443, .857; Stonequist, 2.516, 1.3971; Thompson, 2.517, 1.595.
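The conversion to standard scores described above can be sketched in a few lines. This is only an illustration: the function name `standard_scores` and the five scale readings are hypothetical, and the use of the population standard deviation (rather than the sample form) is an assumption, since the text does not say which was computed. The recoding for Hollerith cards mentioned in note 6 is not reproduced.

```python
from statistics import mean, pstdev


def standard_scores(ratings):
    """Express each rating as a deviation from the judge's mean,
    divided by the judge's standard deviation."""
    m = mean(ratings)
    sd = pstdev(ratings)  # population SD: an assumption, not stated in the text
    return [(r - m) / sd for r in ratings]


# Hypothetical scale readings (inches from the left end) for one judge
readings = [0.4, 1.2, 2.5, 3.1, 4.8]
scores = standard_scores(readings)
print([round(s, 2) for s in scores])
```

Because each judge's scores are centered on his own mean and scaled by his own dispersion, a judge who used a narrow part of the line, as Faris did, contributes on the same footing as the others when the scores of two or more judges are added.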

With respect to attitudes toward prohibition laws, the six intercorrelations between the standard scores of the ratings of each possible pair of the four judges were computed. The standard scores of each pair were then added and the three intercorrelations of each pair with the other pair were computed. As shown by Table 2, the average intercorrelation of one judge's ratings with another judge's was +.87. The average intercorrelation of the composite ratings of two judges with the composite ratings of the other two was +.92. Either using the average of six individual score correlations as a base and estimating the reliability of a composite based on the combined scores of four judges instead of the scores of an individual judge, or using the average of three composite score correlations as a base and estimating the reliability of a composite based on the combined scores of four judges instead of the composite scores of two judges, one gets a reliability coefficient (28) of +.96.[7]

Table 2. Correlations of Judges' Ratings on Attitude Toward Prohibition Laws in 238 Case Histories

Judges                                           Correlation
Cottrell v Faris                                 +.84
Cottrell v Stonequist                            +.91
Cottrell v Thompson                              +.83
Faris v Stonequist                               +.86
Faris v Thompson                                 +.87
Stonequist v Thompson                            +.88
Average                                          +.87

Cottrell and Faris v Stonequist and Thompson     +.93
Cottrell and Stonequist v Faris and Thompson     +.94
Cottrell and Thompson v Stonequist and Faris     +.92
Average                                          +.92

Average reliability of composite ratings of four judges, estimated by Spearman-Brown formula, using as a base either of the above averages carried out to three decimal places: +.96
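As a rough check on the last row of Table 2, the Spearman-Brown estimate can be worked from the two printed averages. The sketch below uses the two-decimal values +.87 and +.92 as printed; the original computation used averages carried to three decimals, which are not given here, so agreement is to be expected only after rounding. The function name `spearman_brown` is chosen for illustration.

```python
def spearman_brown(r, k):
    """Predicted reliability when a measure with reliability r
    is lengthened k-fold (here: k times as many judges)."""
    return k * r / (1 + (k - 1) * r)


r_single = 0.87  # average correlation of one judge with another (Table 2)
r_pair = 0.92    # average correlation of one pair of judges with the other pair

from_single = spearman_brown(r_single, 4)  # one judge -> four judges
from_pair = spearman_brown(r_pair, 2)      # two judges -> four judges

print(round(from_single, 2), round(from_pair, 2))  # prints: 0.96 0.96
```

Both bases lead to the same two-decimal estimate, which is the agreement remarked upon in note 7.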


At face value, a reliability coefficient of +.96 seems high in view of the apparent difficulty of inferring an attitude from a document only about a thousand words long. An examination of the correlation tables in Appendix B will show, however, that there tended to be a rather heavy loading at the extremes. This justifies a warning about too confident generalization of these results to other studies. This is not to say, of course, that the correlation coefficient is a seriously inaccurate description, on the whole, of the reliability found in this particular study, by this particular group of judges, under this particular set of conditions. Of course, many other studies are necessary in order to determine the variability of the reliability under other sets of conditions. [8]

( 31)

This measure of reliability does not take directly into account the possible shifts in ratings of an individual judge. It merely shows the extent to which the judges agreed on the basis of one rating by each. If a judge could have rated all the papers twice, with an interval of several months elapsing to reduce memory effect, his two ratings probably would have given a somewhat more dependable index than a single rating. If the result of this would have been to raise the reliability coefficient, the validity coefficient, corrected for attenuation, between the test scores and the composite ratings of the four judges might have been slightly lowered, unless the increased reliability of the judges' ratings produced a corresponding increase in the raw validity coefficient.
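The point about correction for attenuation may be seen in a small numerical sketch. All three coefficients below are hypothetical and are not results of this study; the formula is the usual one, in which the raw correlation is divided by the square root of the product of the two reliability coefficients.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Raw correlation divided by the geometric mean of the two reliabilities."""
    return r_xy / sqrt(r_xx * r_yy)


# Hypothetical values, chosen only to show the direction of the effect
raw_validity = 0.80
test_reliability = 0.95

lower = correct_for_attenuation(raw_validity, test_reliability, 0.90)
higher = correct_for_attenuation(raw_validity, test_reliability, 0.96)

# With the raw validity held fixed, a higher reliability of the ratings
# yields a smaller corrected coefficient.
print(round(lower, 3), round(higher, 3))
```

If the raw validity coefficient itself rose along with the reliability of the ratings, the two changes would offset one another, which is the qualification made in the text.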

It will be observed that the reliability of the judges' ratings is not a direct measure of the reliability of the case histories themselves. No scheme for splitting the case histories into parts and correlating the judgments on attitudes from each part seemed practicable. Moreover, it is perhaps the essence of the case study method that the interpretation is made on the basis of the total configuration of activities and feelings woven into the document. It is quite conceivable that a few of the papers, either because of inadequate vocabulary, distorted emphasis on certain minor events, or deliberate invention of a fictitious story, might have deceived all four of the judges alike. Naturally, careful efforts were made by the present writer and by the instructors to reduce this effect by impressing the students with the importance of sincerity and (32) with the need of care in giving an accurate total impression by a well-balanced selection of concrete details. The latter caution seemed vital in this investigation particularly, since the documents had to be short. It was feared that some students might be tempted to telescope the description of their later experiences if they found that they had used up most of their thousand words in describing childhood events. Therefore, it was asked that 500 words be written about the period before coming to college and 500 words about the period after coming to college. In the mimeographed suggestions this admonition appears:

Important. While writing, please don't bother about how long your document is getting. After you finish, read over what you have written. Make sure that the picture you make would give a stranger, who never met you, a truthful, undistorted impression of your experiences and attitudes, past and present.

Later, if you find that you have written too much, as you possibly will, go through your first draft again. Cut out all unnecessary phrases and cut out the incidents which throw the least light on your experiences and attitudes.

In conclusion, the students were advised:

In writing this paper, the most important factor is sincerity. You need have no hesitation in telling intimate things, for what you write is strictly anonymous. It is not necessary to dress this up in fine English. A simple, frank narrative of your experiences and feeling will make the best contribution to social science that a document of this type can make.

The extent to which this advice was taken will have to be judged from the documents themselves. A careful reading of the 30 cases in Appendix C, p. 132, selected not by virtue of their vividness or human interest or sincerity but by a strictly mechanical method explained in the introduction to Appendix C, (32) may help the research worker, by whatever intuitive process or systematic check of internal evidence he has faith in, to form at least a fragmentary conception of their value.[9] It was impracticable to check the accuracy of the documents by means of personal interviews or other external means at the disposal of the user of the case study method in some investigations. In some instances, of course, an overstatement or even a deception, when it can be detected, makes the document just as useful for interpretative purposes as if it were accurate. And it is an error to assume that trained interpreters of case material take every statement at its face value. It is the investigator's opinion, corroborated by the judgment of others more competent than he in evaluating documents of this type, that the 238 case histories used in this study are, on the whole, a faithful effort to present a true picture.

One can be quite certain, however, that if a document intentionally or unintentionally misled all the judges alike this ordinarily would have resulted in lowering and not raising the validity coefficient found between the test scores and the judges' ratings. It would have this effect because it would decrease the raw correlation between the test scores and judges' ratings without a compensating decrease in the reliability (33) coefficient used in the denominator of the formula for correction for attenuation. An exception would be a case in which the student was shrewd enough by some sort of second sight to guess accurately his score on the Smith test and then to write a story deliberately intended to lead the judges to a conclusion which would correspond to his test score. At least five reasons may be suggested why this was rather unlikely. First, no premium was placed on doing this; second, the students were not told exactly what use would be made of their documents; third, they had to fill out the test so rapidly that it was unlikely that they had time to appraise their answers as a whole or to remember many of the specific opinions endorsed; fourth, a successful attempt at manufacturing consistency would imply that the person could estimate with close accuracy his relative attitude score in filling out a test about whose complicated scoring methods he knew nothing; and fifth, it would imply that he knew enough about his associates to devise a case history which would be rated by judges unknown to him at a certain relative position on an unknown scale with close accuracy.

A more serious possibility is that the judges tended to agree because they were all familiar with the theoretical points of view on the subject of attitudes as taught in the Department of Sociology at the University of Chicago. This, if true, might be an important limitation on the generality of the results of this experiment. The validity coefficient might show merely that attitudes as measured by the test scores correspond (34) rather closely to attitudes inferred by judges indoctrinated with a special point of view on attitudes. Of course, it is not necessarily a reflection on the case study method if technical theoretical knowledge is essential to an interpretation of the documents, any more than it would be a reflection upon research in biology because a microscopic worker needed technical theoretical knowledge for an interpretation of what went on beneath his lens. If the interpretation is to be of causal factors, the need for technical theoretical knowledge seems obvious. But were ratings of the four judges who read these 238 documents forced into a peculiarly high agreement because these judges had somewhat the same theoretical training? Entangled with the question is also the problem of the influence of judges' individual bias on the subject of prohibition.[10]

It was thought that if two other judges could be secured, one of whom was known to be decidedly dry and one of whom was known to be decidedly wet and neither of whom had extensive knowledge of the theoretical literature on attitudes, a crucial experiment might be made. If the ratings of these two judges agreed with one another rather closely and agreed rather closely also with the ratings of the four graduate students, one could have some confidence that the factors of (35) special theoretical knowledge and of personal bias were fairly well controlled.[11]

The cooperation was enlisted of Dr. George Safford, superintendent of the Illinois Anti-Saloon League, and Mr. Emil Thiele, secretary and director of the Illinois Association Opposed to Prohibition. It seemed too much to ask them to read the entire group of 238 documents. A sample of about a hundred cases was decided upon. In order to make the experiment as decisive as possible, it was arranged to make the sample conform to a uni-modal symmetrical distribution of Smith test scores. A theoretical distribution with a population of 99 was set up by use of an approximation to the binomial (36) expansion. Code numbers were drawn by lot by Mr. Paul F. Cressey from the code numbers of subjects falling within each class interval of the test scores. The result, of course, was a group of papers which in no sense was representative of the entire 238, since most of the extreme papers, about which agreement among the graduate student judges had been relatively uniform, were automatically discarded. But by this means it was sought to make agreement difficult rather than easy. The two outside judges were not informed of the nature of the distribution of test scores or any other scores in the papers given to them, nor were they given any instructions other than those given to the four graduate students as reported above, pages 23-24.
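The selection procedure just described (a symmetrical target distribution of 99 cases, filled by lot within each class interval of test scores) might be sketched as follows. Everything specific here is an assumption made for illustration: the number of class intervals (nine), the rounding rule, the pools of code numbers, and the function names `binomial_quotas` and `draw_by_lot`. The text does not report the intervals or the exact approximation to the binomial expansion that was used.

```python
import random
from math import comb


def binomial_quotas(n_classes, total):
    """Allot `total` cases to classes in proportion to binomial coefficients,
    giving leftover cases to the largest fractional parts (assumed rule)."""
    weights = [comb(n_classes - 1, k) for k in range(n_classes)]
    exact = [w * total / sum(weights) for w in weights]
    quotas = [int(x) for x in exact]
    leftover = total - sum(quotas)
    by_fraction = sorted(range(n_classes),
                         key=lambda i: exact[i] - quotas[i], reverse=True)
    for i in by_fraction[:leftover]:
        quotas[i] += 1
    return quotas


def draw_by_lot(codes_by_class, quotas, rng):
    """Draw the quota for each class at random, without replacement."""
    sample = []
    for codes, quota in zip(codes_by_class, quotas):
        sample.extend(rng.sample(codes, min(quota, len(codes))))
    return sample


quotas = binomial_quotas(9, 99)  # nine class intervals: hypothetical
# Hypothetical pools of code numbers, 30 per class interval
codes_by_class = [[f"{c}-{i}" for i in range(30)] for c in range(9)]
sample = draw_by_lot(codes_by_class, quotas, random.Random(0))
print(quotas, len(sample))
```

Under these assumptions the extreme intervals receive few or no cases, which illustrates why the resulting sample cannot be taken as representative of the whole 238.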

Generously Dr. Safford and Mr. Thiele gave their time to a study of these 99 documents and rated them on the graphic rating scale. The ratings of each man with respect to attitudes toward prohibition laws were correlated with those of the other and also with the ratings of each of the four graduate student judges. In addition, the six intercorrelations of the ratings of the four individual graduate student judges were recomputed on the basis of the 99 cases. The results are summarized in Table 3. The average correlations ranged between +.71 and +.80, and the difference, based on 99 cases, is too small to be statistically significant.[12] The average correlation (37) of the ratings of one student judge with those of each of the other three student judges is very little higher than the average correlations of the ratings by either Thiele or Safford with the ratings of each of the four student judges. The latter correlations, together with the correlation of +.65 between Safford and Thiele, seem too high, when compared with the others, to permit the assumption that the attitude of the judges themselves toward prohibition or a peculiarly uniform technical training in interpretation of attitudes was responsible in any large way for the agreement in ratings by the four graduate students.

Table 3. Comparison of Correlation Coefficients of Ratings of Six Judges on Attitudes Toward Prohibition Laws Based on 99 Cases, with Extreme Wet and Extreme Dry Cases Eliminated

             Cottrell   Faris   Stonequist   Thompson   Safford   Thiele
Cottrell        ..      +.72      +.82         +.72      +.66     +.63
Faris          +.72      ..       +.80         +.80      +.74     +.69
Stonequist     +.82     +.80       ..          +.79      +.72     +.78
Thompson       +.72     +.80      +.79          ..       +.74     +.75

Average        +.75     +.77      +.80         +.77      +.72     +.71

The correlation between Safford and Thiele was +.65.

In conclusion, it seems conservative to say that in this particular investigation the judges of the case history documents showed a high agreement in their ratings, an agreement rather accurately reflected in the reliability coefficient found.


  1. See E. T. Krueger, Autobiographical Documents and Personality, Ph. D. Thesis, University of Chicago, September, 1925, Chapter VIII.
  2. See Appendix A, page 69.
  3. For a discussion of the advantages of the graphic rating scale method of scoring see Max Freyd, "The Graphic Rating Scale," Journal of Educational Psychology, February, 1923, pp. 83-102.
  4. In the last two paragraphs the subjects had been asked to sum up in a few words their own conceptions of their attitudes toward prohibition laws and drinking, respectively. A preliminary inspection of the documents suggested that in some cases this rational formulation varied considerably from the inference a reader might make from the narrative preceding it. For an interesting example of this see Cases 12 and 18 in Appendix C. The judges were asked to consider the last two paragraphs, of course, in making their final decision, but, by making a tentative decision without reading the conclusion, to avoid giving the rational formulation excessive weight. This caution seemed desirable, because the fatigue of reading a large number of papers at a sitting might lead one gradually and unwittingly to place too much reliance on a convenient and succinct closing statement, instead of performing the more arduous task of formulating a cumulative judgment from the entire narrative.
  5. There has been considerable skepticism about the likelihood of agreement by competent judges in their interpretations of case history material. For example, Read Bain, in the Journal of Educational Sociology, November, 1929, pp. 155-56, writes: "When, if ever do life histories and diaries become valid data for science? ....Whenever they furnish materials which are clearly enough defined and frequent enough in occurrence so that a number of competent observers, working independently, can arrive at like conclusions both as to the existence and meaning of the defined data. When we apply such a rigid methodological criterion, it is evident that most so-called 'scientific' results from the use of life documents, life stories, interviews, diaries, autobiographies, letters, journals, etc., are pure poppy-cock. Two independent observers will seldom find the same 'facts' in a given document and still more seldom draw the same conclusions from it. Conclusions will usually vary as much as two independent analyses of the same dream. This does not deny the possibility of deriving valid scientific generalizations from such materials. It merely means that, so far, the methods of using such materials are too subjective, moralistic, valuational, personal, unique, and unstandardized to be scientifically valid." Professor Bain has been especially dubious about interpretation of such a subjective factor as an attitude. Feeling that neither the case method nor tests really get at attitudes as subjectively defined, he would abandon the attempt, and deal with material more strictly on a behavioristic level. His position is set forth, with numerous bibliographical references, in "An Attitude toward Attitude Research," American Journal of Sociology, May, 1928, pp. 940-57.
    Professor Ellsworth Faris discussed some of the principal objections raised by Professor Bain and others, and maintained that the study of attitudes as conceived subjectively is fundamental to personality research and that valid research can be done. Faris, American Journal of Sociology, op. cit., especially pp. 274-77.
    One rather unrelated field in which extensive experimentation has been begun on the agreement of judges in the interpretation of written documents is that of the essay type examinations. Widely varying results are reported, depending on the method in which the experiments have been set up. The earlier work seemed to show greater unreliability than some more carefully controlled later experiments. Among the most extensive studies which may be cited are W. S. Monroe and L. B. Souders, "The Present Status of Written Examinations and Suggestions for Their Improvement," University of Illinois Bulletin, XXI: 13, Bureau of Educational Research Bulletin No. 17, 1923; G. M. Ruch, et al., "Objective Examination Methods in the Social Studies," Chicago, 1926; and S. G. Brinkley, "Values of New Type Examinations in the High School with Special Reference to History," Teachers College, Columbia University Contributions to Education, No. 161, 1924. Odell, Educational Examinations and New Type Tests, New York, 1928, gives a critical bibliography of about 100 selected titles dealing more or less specifically with this problem. Reliability coefficients found vary all the way from minus values almost to unity. The median reliability found by Monroe and Souders, on ratings by pairs of judges of essay type examinations, was +.65. Brinkley found an average reliability of +.46 for 10-minute examinations in history, +.56 for 15 1/2-minute examinations, and +.72 for 31-minute examinations. One must be cautious in comparing reliability coefficients in various studies without a full knowledge of the experimental methods used and of the variability and shape of the frequency distributions.
  6. In order to retain the maximum number of significant digits and still keep the number small enough so that all columns could be punched on a Hollerith card, the standard deviation was expressed in class intervals instead of original units and the number 1 was added to each standard score to eliminate plus and minus signs. All the composite scores given in the correlation tables are computed from standard scores in this form.
  7. The agreement of estimates, using both bases, is incidentally a rather interesting partial check on the applicability to this data of the Spearman-Brown formula, which is based on the assumption of equal standard deviations and equal correlation coefficients. One of the three standard deviations differs rather widely from the others, and there is a range in the correlation coefficients of +.83 to +.91. Nevertheless, the correlation between the composite scores of two judges with the composite scores of two other judges is almost exactly what is predicted by the Spearman-Brown formula based on the intercorrelations of individual judges.
  8. The worst example of the effect on the correlation coefficient of the loading of the extremes is seen in Table 10 in Appendix B, in the correlation between the ratings of Cottrell and Thompson. The correlation coefficient is +.83. If the distribution should be mutilated by cutting it into four quadrants at the means, the correlation found in any quadrant would be rather low. This is precisely what might have happened if the experimental group had comprised solely a group of rather decided drys or solely a group of rather decided wets. The effect on the correlation coefficient of mingling heterogeneous pairs is shown algebraically by Yule, Introduction to Statistics, 8th edition revised, pp. 218-19. Table 10 is a sufficient violation of the principle of homogeneity to make the use of correlational methods rather dubious. The other distributions are somewhat better, while, (what is, after all, the vital thing) the distributions of the composite standard scores of one pair of judges correlated with the composite standard scores of another pair of judges seem quite satisfactory for correlational computation. This will serve to make clear, perhaps, why a correlation like that found in the present investigation is hardly to be expected in another experiment with a narrower range and different distribution of attitudes.
  9. All of the 238 case histories have been placed in the archives of the Department of Sociology of the University of Chicago. Although obtained for the special purpose of comparing two methods of research, they may be found to contain some material of interest to the case study worker seeking light on the processes of attitude formation.
  10. The judges themselves took the Smith test. Their scores showed a rather wide range from dry to wet, being 3.4, 5.1, 6.7, and 7.1, respectively. No direct effect of differential influence of bias could be detected from a judge's individual ratings.
  11. The factors of special point of view and of bias might be present also among the 300 judges whose ratings formed the basis for determining the scale values of the Smith test. This has not been checked with reference to the prohibition scale, but an elaborate experiment was performed to check bias with reference to a scale of attitudes toward Negroes which was constructed by the same methods as Dr. Smith's test. Professor Elmer D. Hinckley of the University of Florida had three scales of attitudes toward Negroes constructed independently, one by a group of white people who were thought by their answers to a preliminary test to be strongly prejudiced against Negroes, a second by a group of white people who were thought by their answers to a preliminary test to be less prejudiced against Negroes, and a third by a group of Negro college students. Six hundred white judges and 250 Negro judges were used in all. The correlation found between the scale values assigned to 114 opinions by the group of presumably prejudiced whites as compared with the less prejudiced whites was +.98. The correlation found between the scale values assigned to the 114 opinions by the group of presumably prejudiced whites as compared with the Negroes was +.94. The conclusion was that personal bias of the judges was no factor in determining the scale values. See Hinckley, The Influence of Individual Opinion on Construction of an Attitude Scale, Ph. D. thesis, University of Chicago, August, 1929, pp. 7-11 and 19-22.
  12. The average intercorrelation of the ratings by the four student judges was +.77 as compared with +.87 on the entire 238 cases. The difference is due, of course, to the elimination of extreme cases in choosing this mutilated experimental sample.
