An Experimental Comparison of Statistical and Case History Methods of Attitude Research
III. Reliability of the Case History Ratings
Before attempting the present investigation, the writer did some preliminary experimenting.
First, the members of Professor Ellsworth Faris's graduate seminar in Social Attitudes were asked to write anonymously 1000-word accounts of their experiences in the use of liquor and of their feelings about liquor laws. The 12 documents received within the time set were read by four graduate students and ranked in the order of the apparent favorableness of attitudes toward prohibition. The ranks assigned by one pair of judges were added and correlated with the ranks assigned by another pair. The mean Pearsonian intercorrelation of the ranks assigned by all possible combinations of pairs of judges was only +.49 raw. The documents, though apparently done conscientiously, suffered from lack of naiveté and from lack of the specificity which a good set of informal questions might have encouraged. Moreover, so many of the students tended to be drys that the range was too short for easy discrimination. The results were reported to the seminar and a number of useful ideas were forthcoming from Professor Faris and the students. As a result, a tentative set of informal suggestions for writing autobiographies, somewhat on the pattern of E. T. Krueger's, was prepared.
Second, a "dress rehearsal" of the projected experiment was held with the members of one of Mr. Paul F. Cressey's Campus Classes in Sociology 110 as subjects. They took the Smith test and wrote autobiographical documents. An analysis of the results removed some of the skepticism which the writer and others had entertained as to the possibility of getting a sufficiently detailed and revealing study compressed into one thousand words. Twenty-three students took part. Different groups of the 23 case histories were distributed among six judges such that each judge read 16 papers and such that each paper was read by at least four judges. The judges were Professor Ellsworth Faris, Professor E. W. Burgess, Dr. Herbert Blumer, Mr. Paul F. Cressey, Mr. Frederick F. Stephan, and Mr. E. V. Stonequist. The papers were rated by placing a check mark on a graphic rating scale similar to that in Appendix A, page 70. A numerical score was assigned by measuring the distance from the left end of the scale to the check mark, in tenths of inches. The total of the raw scores given by Faris, Blumer, and Stonequist correlated +.96 with the total of the raw scores given by Burgess, Cressey, and Stephan. The reliability of the Smith test was found to be +.98 (raw). The correlation between the test and the combined ratings of the six judges (four ratings on each case history) was +.86 (raw). These coefficients were not regarded very seriously, of course, because they were based on only 23 subjects, and also because the class appeared to be rather heavily loaded with extreme anti's and pro's. Several minor changes were made, as the fruit of criticisms by the judges who read the papers and of some further statistical tabulation, and the present project was put under way.
Each student who was to write the documents used in the larger study was given the mimeographed set of suggestions shown in Appendix A, pages 66-68. The method of guaranteeing anonymity and at the same time rewarding prompt and apparently conscientious work was carefully explained. In establishing rapport and eliciting cooperation, the present writer was greatly helped by the instructors, who made the preparation of the documents a required assignment and pointed out the value of attitude research. The students were not told specifically that their documents would be compared statistically with their test sheets and self-ratings. On the contrary, they were urged to disregard their answers to the test or their self-ratings, just as far as they liked, in writing their papers. It was made particularly emphatic that consistency with their answers to the test or with their self-ratings was not a basis of giving a class-room grade for their work. What was desired, they were told, was simply a full, truthful paper. It was pointed out that considerable inconsistency in attitudes from one day to the next might occur with many people.
As mentioned previously, out of the 249 students present, 238 wrote autobiographical documents within the time limit of about a week. Five other papers were turned in too late for use. This near unanimity of response seemed to forestall a possible bias due to selection of students who had had the most striking experiences or who were most willing to write. Some of the 238 documents, of course, were much better than others. Though a few seemed too inadequate to do justice to the case study method, the investigator hesitated to assume responsibility for throwing them out. The whole 238 were handed over to the judges without alteration except for writing the sex and age of the author at the top of each document. If certain papers, of a sort which probably would be discarded in a practical application of the case study method, had been discarded in this investigation, the degree of agreement between the test scores and judges' ratings probably would have been slightly higher than that which was found.
The four judges were graduate students chosen by a committee of the faculty in the Department of Sociology as among those best equipped, by technical experience in the interpretation of case materials, knowledge of the theoretical literature on attitudes, and quality of insight into human experience, to make the judgments. The judges were Leonard S. Cottrell, Jr., Robert Faris, Everett V. Stonequist, and Edgar T. Thompson.
Each judge interpreted each of the 238 papers with respect to the author's attitude toward prohibition laws and with respect to his attitude toward drinking liquor. The only instructions on the graphic rating sheet were:
Please put one check mark (✓) on each of the above lines at a point which seems to represent the writer's present attitude best. Use any standards of 'favorable' or 'unfavorable' which you wish; simply endeavor to judge this and all other papers by the same standards.
Two other suggestions were added verbally. One was to regard the line on which the judges placed their check mark as a scale, with the far extremes to be reserved only for the most extreme cases. The other was to formulate their judgments cumulatively while reading a given document and to come to a tentative decision before reading the last two paragraphs.
As was anticipated, the judges had none too easy a task in making their ratings. "Just what is meant by liquor? Suppose a person detests whisky but likes a glass of wine now and then; how can one resolve this into a single judgment as to his attitude toward liquor?" "Just what is meant by prohibition laws? Suppose a person favors state-wide prohibition but opposes national prohibition; how can one resolve this into a single judgment as to his attitude toward prohibition laws?" "What definition of attitudes shall be used?" Careful consideration of several formulations which might provide a uniform standard among the judges for considering such questions led to the conclusion that an effort to define such a standard more specifically would lead only into deeper water. Therefore, the judges were thrown back on the original instructions to use any standards which they wished, but simply to judge all the papers by the same standards. This meant, of course, the renunciation of effort to draw significant conclusions from a comparison of the absolute markings of the various judges as to their central tendency, dispersion, or shape of frequency distribution. But, if a judge was consistent in judging all papers by the same standard, a high linear or non-linear correlation between the judges' relative ratings would be just as possible theoretically as if the absolute markings were comparable.
The judges had about three weeks in which to read the papers in their "spare hours." While this length of time tended to reduce errors due to fatigue, it tended presumably to work against holding a consistent standard throughout. This might have been checked somewhat by asking the judges to reread at the end some of the papers which they had rated at the beginning. It seemed impracticable, however, because the task of reading and carefully interpreting 238 papers was long and tedious, in spite of the intrinsic interest of the stories related. The investigator is under a deep debt of obligation to these four graduate students for their earnest cooperation.
After the papers were returned by the judges, each rating was expressed as a number by measuring the distance from the left of the scale to the check mark, in tenths of inches. The units were made finer than would have been necessary if the appropriate class intervals of a frequency distribution of the ratings could have been determined in advance. The ratings of a given judge were then converted into standard scores, by expressing them as a deviation from their mean and dividing by their standard deviation. There was little significant difference between the means assigned by the four judges, but the standard deviation of Faris' scores was less than that of the others. The means and standard deviations were, respectively: Cottrell, 2.376, 1.518; Faris, 2.443, .857; Stonequist, 2.516, 1.3971; Thompson, 2.517, 1.595.
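The conversion to standard scores described above can be illustrated with a minimal sketch. The ratings below are hypothetical, not taken from the study, and the use of the population standard deviation is an assumption, since the text does not say which divisor was used.

```python
from statistics import mean, pstdev

def standard_scores(ratings):
    """Express each rating as a deviation from the judge's mean, in SD units."""
    m = mean(ratings)
    sd = pstdev(ratings)  # assumption: population SD (divisor N)
    return [(x - m) / sd for x in ratings]

# Hypothetical raw ratings in inches from the left end of the scale,
# read to the nearest tenth of an inch (illustrative values only).
judge_a = [0.4, 1.2, 2.5, 3.9, 4.6]
judge_b = [1.5, 2.0, 2.4, 3.1, 3.3]

z_a = standard_scores(judge_a)
z_b = standard_scores(judge_b)

# A composite rating for a pair of judges is the sum of their standard scores,
# so a judge who spreads his marks widely does not dominate the composite.
composite = [a + b for a, b in zip(z_a, z_b)]
```

Standardizing each judge separately is what allows judges with unequal dispersions, such as the narrower spread reported for Faris, to be combined on equal terms.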
With respect to attitudes toward prohibition laws, the six intercorrelations between the standard scores of the ratings of each possible pair of the four judges were computed. The standard scores of each pair were then added and the three intercorrelations of each pair with each other were computed. As shown by Table 2, the average intercorrelation of one judge's ratings with another judge's was +.87. The average intercorrelation of the composite ratings of two judges with the composite ratings of the other two was +.92. Either using the average of six individual score correlations as a base and estimating the reliability of a composite based on the combined scores of four judges instead of the scores of an individual judge, or using the average of three composite score correlations as a base and estimating the reliability of a composite based on the combined scores of four judges instead of the composite scores of two judges, one gets a reliability coefficient of +.96.
Table 2

|Ratings correlated|Correlation|
|---|---|
|Cottrell v Faris|+.84|
|Cottrell v Stonequist|+.91|
|Cottrell v Thompson|+.83|
|Faris v Stonequist|+.86|
|Faris v Thompson|+.87|
|Stonequist v Thompson|+.88|
|Average of the six individual intercorrelations|+.87|
|Cottrell and Faris v Stonequist and Thompson|+.93|
|Cottrell and Stonequist v Faris and Thompson|+.94|
|Cottrell and Thompson v Stonequist and Faris|+.92|
|Average of the three composite intercorrelations|+.92|
|Average reliability of composite ratings of four judges, estimated by Spearman-Brown formula, using as a base either of the above averages carried out to three decimal places|+.96|
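The Spearman-Brown estimate can be retraced approximately from the entries of Table 2. The sketch below uses the two-decimal correlations as printed, so its averages may differ in the last place from the three-decimal figures the author worked from; the function and variable names are illustrative.

```python
def spearman_brown(r, k):
    """Predicted reliability when the number of judges is multiplied by k."""
    return k * r / (1 + (k - 1) * r)

# Correlations as printed in Table 2 (two decimals only).
single = [0.84, 0.91, 0.83, 0.86, 0.87, 0.88]  # one judge v one judge
paired = [0.93, 0.94, 0.92]                    # two judges v two judges

avg_single = sum(single) / len(single)
avg_paired = sum(paired) / len(paired)

# One judge -> four judges is a fourfold lengthening;
# two judges -> four judges is a doubling.
from_single = spearman_brown(avg_single, 4)
from_paired = spearman_brown(avg_paired, 2)

print(round(from_single, 2), round(from_paired, 2))  # prints 0.96 0.96
```

Both routes arrive at about +.96, which is the agreement between the two bases that the text reports.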
At face value, a reliability coefficient of +.96 seems high in view of the apparent difficulty of inferring an attitude from a document only about a thousand words long. An examination of the correlation tables in Appendix B will show, however, that there tended to be a rather heavy loading at the extremes. This justifies a warning about too confident generalization of these results to other studies. This is not to say, of course, that the correlation coefficient is a seriously inaccurate description, on the whole, of the reliability found in this particular study, by this particular group of judges, under this particular set of conditions. Of course, many other studies are necessary in order to determine the variability of the reliability under other sets of conditions. 
This measure of reliability does not take directly into account the possible shifts in ratings of an individual judge. It merely shows the extent to which the judges agreed on the basis of one rating by each. If a judge could have rated all the papers twice, with an interval of several months elapsing to reduce memory effect, his two ratings probably would have given a somewhat more dependable index than a single rating. If the result of this would have been to raise the reliability coefficient, the validity coefficient, corrected for attenuation, between the test scores and the composite ratings of the four judges might have been slightly lowered, unless the increased reliability of the judges' ratings produced a corresponding increase in the raw validity coefficient.
It will be observed that the reliability of the judges' ratings is not a direct measure of the reliability of the case histories themselves. No scheme for splitting the case histories into parts and correlating the judgments on attitudes from each part seemed practicable. Moreover, it is perhaps the essence of the case study method that the interpretation is made on the basis of the total configuration of activities and feelings woven into the document. It is quite conceivable that a few of the papers, either because of inadequate vocabulary, distorted emphasis on certain minor events, or deliberate invention of a fictitious story, might have deceived all four of the judges alike. Naturally, careful efforts were made by the present writer and by the instructors to reduce this effect by impressing the students with the importance of sincerity and with the need of care in giving an accurate total impression by a well-balanced selection of concrete details. The latter caution seemed vital in this investigation particularly, since the documents had to be short. It was feared that some students might be tempted to telescope the description of their later experiences if they found that they had used up most of their thousand words in describing childhood events. Therefore, it was asked that 500 words be written about the period before coming to college and 500 words about the period after coming to college. In the mimeographed suggestions this admonition appears:
Important. While writing, please don't bother about how long your document is getting. After you finish, read over what you have written. Make sure that the picture you make would give a stranger, who never met you, a truthful, undistorted impression of your experiences and attitudes, past and present.
Later, if you find that you have written too much, as you possibly will, go through your first draft again. Cut out all unnecessary phrases and cut out the incidents which throw the least light on your experiences and attitudes.
In conclusion, the students were advised:
In writing this paper, the most important factor is sincerity. You need have no hesitation in telling intimate things, for what you write is strictly anonymous. It is not necessary to dress this up in fine English. A simple, frank narrative of your experiences and feeling will make the best contribution to social science that a document of this type can make.
The extent to which this advice was taken will have to be judged from the documents themselves. A careful reading of the 30 cases in Appendix C, p. 132, selected not by virtue of their vividness or human interest or sincerity but by a strictly mechanical method explained in the introduction to Appendix C, may help the research worker, by whatever intuitive process or systematic check of internal evidence he has faith in, to form at least a fragmentary conception of their value. It was impracticable to check the accuracy of the documents by means of personal interviews or other external means at the disposal of the user of the case study method in some investigations. In some instances, of course, an overstatement or even a deception, when it can be detected, makes the document just as useful for interpretative purposes as if it were accurate. And it is an error to assume that trained interpreters of case material take every statement at its face value. It is the investigator's opinion, corroborated by the judgment of others more competent than he in evaluating documents of this type, that the 238 case histories used in this study are, on the whole, a faithful effort to present a true picture.
One can be quite certain, however, that if a document intentionally or unintentionally misled all the judges alike, this ordinarily would have resulted in lowering and not raising the validity coefficient found between the test scores and the judges' ratings. It would have this effect because it would decrease the raw correlation between the test scores and judges' ratings without a compensating decrease in the reliability coefficient used in the denominator of the formula for correction for attenuation. An exception would be a case in which the student was shrewd enough by some sort of second sight to guess accurately his score on the Smith test and then to write a story deliberately intended to lead the judges to a conclusion which would correspond to his test score. At least five reasons may be suggested why this was rather unlikely. First, no premium was placed on doing this; second, the students were not told exactly what use would be made of their documents; third, they had to fill out the test so rapidly that it was unlikely that they had time to appraise their answers as a whole or to remember many of the specific opinions endorsed; fourth, a successful attempt at manufacturing consistency would imply that the person could estimate with close accuracy his relative attitude score in filling out a test about whose complicated scoring methods he knew nothing; and fifth, it would imply that he knew enough about his associates to devise a case history which would be rated by judges unknown to him at a certain relative position on an unknown scale with close accuracy.
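The argument about the denominator can be made concrete with the usual formula for correction for attenuation, in which the raw correlation is divided by the square root of the product of the two reliabilities. In the sketch below the reliabilities of +.98 and +.96 are the figures quoted earlier in this section for the Smith test (dress rehearsal) and for the four-judge composite; the two raw validity values are hypothetical and serve only to show the direction of the effect.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Raw correlation divided by the geometric mean of the two reliabilities."""
    return r_xy / sqrt(r_xx * r_yy)

test_reliability = 0.98    # Smith test, as found in the dress rehearsal
judge_reliability = 0.96   # composite of four judges

# Hypothetical raw validity coefficients (not reported in this section):
raw_if_no_one_misled = 0.80
raw_if_all_misled_alike = 0.78  # judges still agree, but depart from the test

corrected_before = correct_for_attenuation(
    raw_if_no_one_misled, test_reliability, judge_reliability)
corrected_after = correct_for_attenuation(
    raw_if_all_misled_alike, test_reliability, judge_reliability)
```

Because a document that misleads all judges alike leaves their mutual agreement, and hence the denominator, unchanged, any drop in the raw correlation passes straight through to the corrected coefficient.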
A more serious possibility is that the judges tended to agree because they were all familiar with the theoretical points of view on the subject of attitudes as taught in the Department of Sociology at the University of Chicago. This, if true, might be an important limitation on the generality of the results of this experiment. The validity coefficient might show merely that attitudes as measured by the test scores correspond rather closely to attitudes inferred by judges indoctrinated with a special point of view on attitudes. Of course, it is not necessarily a reflection on the case study method if technical theoretical knowledge is essential to an interpretation of the documents any more than it would be a reflection upon research in biology because a microscopic worker needed technical theoretical knowledge for an interpretation of what went on beneath his lens. If the interpretation is to be of causal factors, the need for technical theoretical knowledge seems obvious. But were the ratings of the four judges who read these 238 documents forced into a peculiarly high agreement because these judges had somewhat the same theoretical training? Entangled with the question is also the problem of the influence of judges' individual bias on the subject of prohibition.
It was thought that if two other judges could be secured, one of whom was known to be decidedly dry and one of whom was known to be decidedly wet and neither of whom had extensive knowledge of the theoretical literature on attitudes, a crucial experiment might be made. If the ratings of these two judges agreed with one another rather closely and agreed rather closely also with the ratings of the four graduate students, one could have some confidence that the factors of special theoretical knowledge and of personal bias were fairly well controlled.
The cooperation was enlisted of Dr. George Safford, superintendent of the Illinois Anti-Saloon League, and Mr. Smil Thiele, secretary and director of the Illinois Association Opposed to Prohibition. It seemed too much to ask them to read the entire group of 238 documents. A sample of about a hundred cases was decided upon. In order to make the experiment as decisive as possible, it was arranged to make the sample conform to a uni-modal symmetrical distribution of Smith test scores. A theoretical distribution with a population of 99 was set up by use of an approximation to the binomial expansion. Code numbers were drawn by lot by Mr. Paul F. Cressey from the code numbers of subjects falling within each class interval of the test scores. The result, of course, was a group of papers which in no sense was representative of the entire 238, since most of the extreme papers, about which agreement among the graduate student judges had been relatively uniform, were automatically discarded. But by this means it was sought to make agreement difficult rather than easy. The two outside judges were not informed of the nature of the distribution of test scores or any other scores in the papers given to them, nor were they given any instructions other than those given to the four graduate students as reported above, pages 23-24.
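The sampling plan just described can be sketched as quota sampling within class intervals. The text does not give the number of class intervals, the exact quotas, or the rounding rule, so the seven intervals, the interval memberships, the largest-remainder rounding, and the random seed below are all assumptions made for illustration.

```python
import random
from math import comb

def binomial_quotas(n_intervals, total):
    """Quotas proportional to binomial coefficients, rounded to sum to total."""
    coeffs = [comb(n_intervals - 1, k) for k in range(n_intervals)]
    scale = total / sum(coeffs)
    exact = [c * scale for c in coeffs]
    quotas = [int(e) for e in exact]          # round down first
    shortfall = total - sum(quotas)
    # Give the leftover cases to the intervals with the largest remainders.
    order = sorted(range(n_intervals),
                   key=lambda i: exact[i] - quotas[i], reverse=True)
    for i in order[:shortfall]:
        quotas[i] += 1
    return quotas

def draw_by_lot(codes_by_interval, quotas, rng):
    """Draw the quota of code numbers at random within each class interval."""
    sample = []
    for codes, quota in zip(codes_by_interval, quotas):
        # If an interval held fewer papers than its quota, all would be taken.
        sample.extend(rng.sample(codes, min(quota, len(codes))))
    return sample

# Hypothetical split of 238 code numbers into 7 test-score intervals,
# heavier at the extremes than a binomial distribution would be.
sizes = [40, 30, 30, 38, 30, 30, 40]
codes = iter(range(1, 239))
codes_by_interval = [[next(codes) for _ in range(s)] for s in sizes]

quotas = binomial_quotas(7, 99)               # [2, 9, 23, 31, 23, 9, 2]
sample = draw_by_lot(codes_by_interval, quotas, random.Random(1930))
```

Under these assumptions only two papers are kept from each extreme interval, which illustrates why the resulting sample could not be representative of the whole 238 and why agreement among judges was made harder to attain.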
Generously, Dr. Safford and Mr. Thiele gave their time to a study of these 99 documents and rated them on the graphic rating scale. The ratings of each man with respect to attitudes toward prohibition laws were correlated with those of the other and also with the ratings of each of the four graduate student judges. In addition, the six intercorrelations of the ratings of the four individual graduate student judges were recomputed on the basis of the 99 cases. The results are summarized in Table 3. The average correlations ranged between +.71 and +.80, and the difference, based on 99 cases, is too small to be statistically significant. The average correlation of the ratings of one student judge with those of each of the other three student judges is very little higher than the average correlations of the ratings by either Thiele or Safford with the ratings of each of the four student judges. The latter correlations, together with the correlation of +.65 between Safford and Thiele, seem too high, when compared with the others, to permit the assumption that the attitude of the judges themselves toward prohibition or a peculiarly uniform technical training in interpretation of attitudes was responsible in any large way for the agreement in ratings by the four graduate students.
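The statement that a difference between +.71 and +.80 on 99 cases is too small to be significant can be checked roughly with Fisher's z transformation. This is a present-day convenience and not necessarily the procedure the author used; it also treats the two coefficients as if they came from independent samples, which is an assumption that does not strictly hold here, since the same 99 papers and overlapping judges are involved.

```python
from math import atanh, sqrt, erfc

def fisher_z_difference(r1, r2, n1, n2):
    """Approximate z statistic and two-sided p for the difference of two r's."""
    z1, z2 = atanh(r1), atanh(r2)             # Fisher transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # SE assuming independent samples
    z = (z2 - z1) / se
    p = erfc(abs(z) / sqrt(2))                # two-sided normal tail area
    return z, p

# Lowest and highest average correlations quoted in the text, 99 cases each.
z, p = fisher_z_difference(0.71, 0.80, 99, 99)
```

On these assumptions the statistic falls short of the conventional two-sided 5 per cent level, which is in line with the conclusion drawn in the text.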
The correlation between Safford and Thiele was +.65.
In conclusion, it seems conservative to say that in this particular investigation the judges of the case history documents showed a high agreement in their ratings, an agreement rather accurately reflected in the reliability coefficient found.