An Experimental Comparison of Statistical and Case History Methods of Attitude Research

II. Reliability of the Attitude Test Scores

Samuel Stouffer

The attitude test scale employed was that constructed by Dr. Hattie N. Smith, using the method described in detail by Prof. L. L. Thurstone in 1928.[1] The test comprised 44 opinions, each with a predetermined scale value. Dr. Smith derived these scale values by the procedure described in detail in her dissertation.[2] A preliminary set of 135 opinions on prohibition was mimeographed on small slips, one opinion to a slip. Each set of 135 opinions was sorted by 300 judges into 11 piles, representing various degrees of attitudes from extremely favorable toward prohibition to extremely unfavorable. Using the method of equal appearing intervals, the judges laid the slips in piles which appeared to them to be equal distances apart. For each opinion a cumulative frequency distribution was constructed, showing the number of judges who had placed the opinion on or to the left of each pile. The psychologically spaced piles, numbered arbitrarily from 1 (extremely favorable toward prohibition) to 11 (extremely unfavorable toward prohibition), formed the base line or attitude continuum. The scale value of each opinion was then allocated to the attitude continuum by dropping to the base line a perpendicular drawn graphically from the median of a smooth curve which had been passed free-hand through the ordinates of the cumulative frequency distribution.
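The graphical median described above can be approximated numerically. The following is only a minimal sketch, not Dr. Smith's procedure: it replaces the free-hand smooth curve with linear interpolation, treats pile k as the interval from k - 0.5 to k + 0.5 (an assumption), and uses invented pile counts for a single hypothetical opinion. The function name `interpolated_quantile` is likewise hypothetical.

```python
def interpolated_quantile(pile_counts, p=0.5):
    """Point on the pile continuum below which a proportion p of judges fall.

    pile_counts[i] is the number of judges who placed the opinion in pile i + 1.
    Assumption: pile k covers the interval (k - 0.5, k + 0.5), and judgments
    are spread evenly inside it (linear interpolation, not a free-hand curve).
    """
    total = sum(pile_counts)
    if total == 0:
        raise ValueError("no judgments recorded")
    target = p * total
    cumulative = 0
    for index, count in enumerate(pile_counts):
        if count > 0 and cumulative + count >= target:
            lower_edge = (index + 1) - 0.5
            return lower_edge + (target - cumulative) / count
        cumulative += count
    return len(pile_counts) + 0.5


# Invented counts for one opinion: 300 judges, 11 piles.
hypothetical_counts = [0, 0, 10, 40, 100, 100, 40, 10, 0, 0, 0]
scale_value = interpolated_quantile(hypothetical_counts, 0.5)
print(scale_value)
```

With these symmetric invented counts the median falls midway between piles 5 and 6. A free-hand curve would usually give a slightly different value, which is part of the arbitrary element the text mentions.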

Out of the 170 trial opinions, Dr. Smith selected 44. There were two main criteria of selection. First, only those opinions were selected about which the 300 judges rather closely agreed. The degree of agreement on an opinion was measured graphically by dropping perpendiculars from the first and third quartile points on the smoothed cumulative frequency curve and reading off on the base line the semi-interquartile range. Second, the opinions were so chosen as to form two parallel groups of 22 opinions with scale values spaced equally, as nearly as possible, along the attitude continuum.[3]
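The measure of agreement used in the first criterion can be stated compactly. The symbols below are assumptions introduced only for illustration: Q_1 and Q_3 denote the points on the base line under the first and third quartiles of the smoothed cumulative frequency curve, and Q denotes the semi-interquartile range.

```latex
\[
  Q = \frac{Q_3 - Q_1}{2}
\]
```

A small Q means that the middle half of the judges placed the opinion within a narrow stretch of the continuum, that is, that they rather closely agreed on its position.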

Dr. Smith gave the test to 890 subjects. The subjects put a check mark opposite each of the 44 opinions with which they agreed. The reliability was determined by correlating the average scale value of opinions endorsed on one half of the test with the average scale value of the opinions endorsed on the presumably parallel half. The correlation between these two sets of scores was +.85, which became +.92 when adjusted for a test of double the length of each half, by use of the Spearman-Brown prediction formula.
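The adjustment from +.85 to +.92 can be checked with the standard form of the Spearman-Brown prediction formula. This is a small sketch only; the function name `spearman_brown` and the parameter `length_factor` are illustrative choices, not names from the source.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Predicted reliability of a test `length_factor` times as long
    as the test whose reliability is r_half."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# Correlation between the two parallel halves reported for Dr. Smith's subjects.
full_length = spearman_brown(0.85)
print(round(full_length, 2))  # 0.92
```

For a doubled test the formula reduces to 2r / (1 + r), so 2(.85) / 1.85 is approximately .919, which rounds to the +.92 given in the text.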

In the present investigation, the same 44 opinions selected by Dr. Smith were used.[4] To determine reliability, it seemed worth while to take into account what Prof. Thurstone has called the "intrinsic popularity" of a given opinion. Experience has shown that although two opinions may have the same scale value and semi-interquartile range, one may be endorsed by several times as many people as another. As soon as the 238 subjects in the present study had taken the test, the approximate number endorsing each of the 44 opinions was counted. As an extreme example of the influence of the "intrinsic popularity" factor, two opinions may be cited.

The opinion "Prohibition is not desirable now because there is not a sufficiently large majority in favor of it to make enforcement effective" (scale value, 5.6; semi-interquartile range, 1.1) was endorsed by 162 people, while the opinion "It is absolutely immaterial whether we have prohibition or not" (scale value, 5.5; semi-interquartile range, 0.6) was endorsed by only 9 people. In spite of its small semi-interquartile range, the latter opinion obviously is of no value in the test. There were a few other opinions which awakened relatively little response from anybody. It seemed best to rearrange the opinions into two parallel scales somewhat different from the parallel scales chosen by Dr. Smith. This was accomplished by requiring that the number of people endorsing questions within a corresponding small range on each scale should be the same, as nearly as possible. While this was done a posteriori, it introduces no spurious reliability, because the mere total number of people endorsing two parallel opinions gives no indication that the same people who endorsed the one also endorsed the other (unless perhaps each opinion should be endorsed by an extremely large proportion of the subjects, as was not generally the case in the present study). The parallel scales into which the 44 opinions were divided are shown in Appendix B, Table 6.

In the present investigation the average scores made by subjects on each parallel form yielded a correlation coefficient of +.88, slightly higher than that obtained by Dr. Smith in her group of 890 subjects. Adjusted by the Spearman-Brown formula for a test of double the length of either half, the reliability coefficient was +.94. This is not quite as high as the reliability coefficient of +.96 obtained by Thurstone and Chave on their scale of attitudes toward the church.[5]

There is an interesting difference between the distribution of the test scores in Table I and that of the composite ratings of the judges of the case history documents. The ratings have a somewhat heavier loading at both extremes. This difference may be due in part to the inability of the judges of the case histories to make fine enough distinctions near the extremes. It is very likely due in part, also, to the failure of the test to give people with extreme attitudes an opportunity to register their true feelings. A person's score on the test is the average scale value of the opinions which he endorses. If the opinion which is theoretically the best index of his attitude happens to be the most extreme opinion in the group of 44, his score is likely to be less extreme than his true position would justify, because all of the other opinions which he endorses will of necessity weight his score on the side of moderation. The arithmetic average, therefore, is not a good measure of central tendency for extreme cases, although it may be satisfactory for people whose true position is not too near either end of the scale.[6] Moreover, the scale values of the opinions near the ends of the scale are not quite so accurately determined as those near the middle of the scale, because of the "end effect" inevitable in the method of equal appearing intervals and of the arbitrary element introduced by free-hand extrapolations.[7] The fact that there is a difference between the number of people with extreme attitudes as measured by the test and by the judges' ratings of the case histories, just as would be expected in advance from a knowledge of the way in which the test scores are computed, provides apparently a useful warning against overconfidence in the exactness of a frequency distribution of test scores as indices of the true distribution of attitudes in the group. It is possible that some more refined methods of test construction now in process of development by Prof. Thurstone may eliminate some of the error at the extremes.
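The pull toward moderation described above follows directly from the way the score is computed. The sketch below uses invented scale values on the 1-to-11 continuum, not values from Dr. Smith's scale, and the function name `attitude_score` is hypothetical.

```python
def attitude_score(endorsed_scale_values):
    """Test score: arithmetic mean of the scale values of the endorsed opinions."""
    if not endorsed_scale_values:
        raise ValueError("subject endorsed no opinions")
    return sum(endorsed_scale_values) / len(endorsed_scale_values)


# Hypothetical subject whose best-fitting opinion is the most extreme one
# available (10.6); every other opinion he can endorse is necessarily
# less extreme than that.
extreme_subject = [10.6, 9.8, 9.1, 8.4]
score = attitude_score(extreme_subject)
print(round(score, 3))  # 9.475
```

Because no opinion lies beyond the most extreme one, every additional endorsement can only move the average inward. A subject near the middle of the scale has opinions available on both sides of his position, so the same bias does not arise.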

While the reliability of +.94 may be taken as an index of the reliability of the test in the circumstances under which it was given, it may be somewhat higher than the reliability which would be obtained by giving each parallel form of the test on different days. On the other hand, if four parallel forms, of 22 opinions each, could be devised, and if two forms could be given each day, the correlation between the average scores on each day's test, corrected for attenuation by dividing by the geometric mean of their respective reliabilities, would possibly be higher than the reliability coefficient of +.94 found in the present study.
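The correction for attenuation mentioned here can be written out as follows. The symbols are assumptions introduced for illustration: r_{xy} is the observed correlation between the first day's and the second day's average scores, and r_{xx} and r_{yy} are the reliabilities of the two days' tests.

```latex
\[
  \hat{r}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\]
```

Since each reliability is less than 1, the divisor is less than 1 and the corrected coefficient is never smaller than the observed correlation. The text gives no figures for such an experiment, so none are supplied here.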


  1. Thurstone, L. L., "Attitudes Can Be Measured," American Journal of Sociology, January, 1928, pp. 529-64. Since this time, Prof. Thurstone has conducted a number of experiments with various types of attitude scales. He and his students have considerably refined the methods reported in this paper, especially in the direction of developing a criterion of internal consistency designed to determine reliability of the individual opinions more decisively than before. The criterion of internal consistency also is a partial check on the validity of the scale. Thus far, no outside criterion of validity such as employed in the present investigation has been used, except for the students' self-ratings on a graphic rating scale and for frequency distributions of scores of various interest groups. Among the more fundamental of Prof. Thurstone's papers are, in addition to the one just mentioned, "Psychophysical Analysis," The American Journal of Psychology, Vol. 8, July, 1927; "The Measurement of Opinion," Journal of Abnormal and Social Psychology, Vol. 22, Jan.-March, 1928; "The Method of Paired Comparisons for Social Values," same journal, Vol. 21, Jan.-March, 1927; "A Law of Comparative Judgment," Psychological Review, Vol. 34, July, 1927; "Theory of Attitude Measurement," same journal, Vol. 36, May, 1929; "The Measurement of Nationality Preferences," Journal of General Psychology, Jan., 1929; "The Measurement of Psychological Value," Essays in Philosophy, edited by T. V. Smith and W. K. Wright, Chicago, 1929; and Thurstone and Chave, "The Measurement of Attitudes," a monograph published in Chicago in 1929.
  2. Smith, op. cit.
  3. The scale values of the 44 opinions are given in Smith, op. cit., p. 44. The semi-interquartile ranges are given by her on pages 35-36.
  4. The list may be found in Smith, op. cit., pp. 37-39. In this list Dr. Smith had 45 opinions, but in determining reliability she afterwards dropped out her opinion No. 5. This opinion also was omitted from the list used in the present investigation, leaving two sets of 22 presumably parallel opinions. In a preliminary experiment in which 23 students took part, the two parallel forms of the test as arranged by Dr. Smith were given consecutively by the present writer. The questions were numbered continuously from 1 to 44. Following Dr. Smith's method, the students were told to check only the opinions which they endorsed. It was found that they apparently paid closer attention to the first part of the test than to the later part, many of them checking twice as many of the early opinions as of the late ones. To reduce the constant error which this probable "fatigue effect" might introduce, the opinions on the two forms were scrambled in the later study, and also the subjects were asked to check every opinion. The opinions with which they agreed were to be checked with a plus sign; those with which they disagreed, with a minus sign; and those about which they could not decide, with a question mark. Only the opinions marked plus were considered in the final score.
  5. Thurstone and Chave, op. cit.
  6. Dr. Smith recognizes this problem and discusses it in some detail, op. cit., pp. 42-43. She suggests that it might be better to ask the subject to encircle the one opinion which he thinks is most representative of his attitude. She recognizes, however, that the reliability of a score based on one opinion is not likely to be as high as the reliability of a score based on an average of opinions, and that the loss in reliability hardly would be compensated by a gain in validity. A trial of the circle method in a preliminary test with 23 subjects by the present writer suggested that the circle method is probably much less reliable than the average method. The subjects were asked to encircle two opinions, and there proved to be a wide scatter between the scale values of the two opinions encircled.
  7. For an illustration of the "end effect" and problem of extrapolation, see Thurstone, "Attitudes Can Be Measured," American Journal of Sociology, Jan., 1928, pp. 546-7.
