An Experimental Comparison of Statistical and Case History Methods of Attitude Research
II. Reliability of the Attitude Test Scores
Samuel Stouffer
The attitude test scale employed was that constructed by Dr. Hattie N. Smith, using the method described in detail by Prof. L. L. Thurstone in 1928.[1] The test comprised 44 opinions, each with a predetermined scale value. Dr. Smith derived these scale values by the procedure described in detail in her dissertation.[2] A preliminary set of 135 opinions on prohibition was mimeographed on small slips, one opinion to a slip. Each of 300 judges sorted a set of the 135 opinions into 11 piles, representing various degrees of attitude from extremely favorable toward prohibition to extremely unfavorable. Using the method of equal appearing intervals, the judges laid the slips in piles which appeared to them to be equal distances apart. For each opinion a cumulative frequency distribution was constructed, showing the number of judges who had placed the opinion on or to the left of each pile. The psychologically spaced piles, numbered arbitrarily from 1 (extremely favorable toward prohibition) to 11 (extremely unfavorable toward prohibition), formed the base line or attitude continuum. The scale value of each opinion was then allocated to the attitude continuum by dropping to the base line a perpendicular drawn graphically from the median of a smooth curve which had been passed free-hand through the ordinates of the cumulative frequency distribution.
Out of the 170 trial opinions, Dr. Smith selected 44. There were two main criteria of selection. First, only those opinions were selected about which the 300 judges rather closely agreed. The degree of agreement on an opinion was measured graphically by dropping perpendiculars from the first and third quartile points on the smoothed cumulative frequency curve and reading off on the base line the semi-interquartile range. Second, the opinions were so chosen as to form two parallel groups of 22 opinions with scale values spaced equally, as nearly as possible, along the attitude continuum.[3]
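The graphical procedure described above can be approximated numerically. The following is only a sketch: it replaces the free-hand smooth curve with linear interpolation inside each pile (an assumption, treating pile k as the interval from k - 0.5 to k + 0.5), and the pile counts are hypothetical, not taken from Dr. Smith's data. The function name `percentile_point` is likewise illustrative.

```python
from typing import Sequence


def percentile_point(counts: Sequence[int], p: float) -> float:
    """Point on the 1..11 continuum below which a proportion p of judges fall.

    Assumption: linear interpolation within a pile, with pile k covering
    [k - 0.5, k + 0.5]. The source used a free-hand smooth curve instead.
    """
    total = sum(counts)
    target = p * total
    cumulative = 0
    for pile, n in enumerate(counts, start=1):
        if n > 0 and cumulative + n >= target:
            fraction = (target - cumulative) / n
            return (pile - 0.5) + fraction
        cumulative += n
    return len(counts) + 0.5


# Hypothetical sorting of one opinion by 300 judges into 11 piles.
pile_counts = [0, 0, 0, 30, 60, 120, 60, 30, 0, 0, 0]

scale_value = percentile_point(pile_counts, 0.50)   # median
q1 = percentile_point(pile_counts, 0.25)            # first quartile
q3 = percentile_point(pile_counts, 0.75)            # third quartile
semi_interquartile_range = (q3 - q1) / 2

print(scale_value, semi_interquartile_range)
```

Under these assumptions, a small semi-interquartile range indicates close agreement among the judges, which corresponds to the first selection criterion.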
Dr. Smith gave the test to 890 subjects. The subjects put a check mark opposite each of the 44 opinions with which they agreed. The reliability was determined by correlating the average scale value of opinions endorsed on one half of the test with the average scale value of the opinions endorsed on the presumably parallel half. The correlation between these two sets of scores was +.85, which became +.92 when adjusted for a test of double the length of each half, by use of the Spearman-Brown prediction formula.
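The adjustment can be checked with the standard Spearman-Brown prediction formula, r' = k r / (1 + (k - 1) r), where k is the factor by which the test is lengthened (k = 2 for doubling). The sketch below applies it to the two half-test correlations reported in this chapter, +.85 and +.88; the function name is illustrative only.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Predicted reliability of a test lengthened by `length_factor`."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# Half-test correlations reported in the text.
for r in (0.85, 0.88):
    print(r, round(spearman_brown(r), 2))
```

Rounded to two places, the results agree with the adjusted coefficients of +.92 and +.94 given in the text.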
In the present investigation, the same 44 opinions selected by Dr. Smith were used.[4] To determine reliability, it seemed worth while to take into account what Prof. Thurstone has called the "intrinsic popularity" of a given opinion. Experience has shown that although two opinions may have the same scale value and semi-interquartile range, one may be endorsed by several times as many people as another. As soon as the 238 subjects in the present study had taken the test, the approximate number endorsing each of the 44 opinions was counted. As an extreme example of the influence of the "intrinsic popularity" factor, two opinions may be cited. The opinion "Prohibition is not desirable now because there is not a sufficiently large majority in favor of it to make enforcement effective" (scale value, 5.6; semi-interquartile range, 1.1) was endorsed by 162 people, while the opinion "It is absolutely immaterial whether we have prohibition or not" (scale value, 5.5; semi-interquartile range, 0.6) was endorsed by only 9 people. In spite of its small semi-interquartile range, the latter opinion obviously is of no value in the test. There were a few other opinions which awakened relatively little response from anybody. It seemed best to rearrange the opinions into two parallel scales somewhat different from the parallel scales chosen by Dr. Smith. This was accomplished by requiring that the number of people endorsing opinions within a corresponding small range on each scale should be the same, as nearly as possible. While this was done a posteriori, it introduces no spurious reliability, because the mere total number of people endorsing two parallel opinions gives no indication that the same people who endorsed the one also endorsed the other (unless perhaps each opinion should be endorsed by an extremely large proportion of the subjects, as was not generally the case in the present study). The parallel scales into which the 44 opinions were divided are shown in Appendix B, Table 6.
In the present investigation the average scores made by subjects on each parallel form yielded a correlation coefficient of +.88, slightly higher than that obtained by Dr. Smith in her group of 890 subjects. Adjusted by the Spearman-Brown formula for a test of double the length of either half, the reliability coefficient was +.94. This is not quite as high as the reliability coefficient of +.96 obtained by Thurstone and Chave on their scale of attitudes toward the church.[5]
There is an interesting difference between the distribution of the test scores in Table I and that of the composite ratings of the judges of the case history documents. The ratings have a somewhat heavier loading at both extremes. This difference may be due in part to the inability of the judges of the case histories to make fine enough distinctions near the extremes. It is very likely due in part, also, to the failure of the test to give people with extreme attitudes an opportunity to register their true feelings. A person's score on the test is the average scale value of the opinions which he endorses. If the opinion which is theoretically the best index of his attitude happens to be the most extreme opinion in the group of 44, his score is likely to be less extreme than his true position would justify, because all of the other opinions which he endorses will of necessity weight his score on the side of moderation. The arithmetic average, therefore, is not a good measure of central tendency for extreme cases, although it may be satisfactory for people whose true position is not too near either end of the scale.[6] Moreover, the scale values of the opinions near the ends of the scale are not quite so accurately determined as those near the middle of the scale, because of the "end effect" inevitable in the method of equal appearing intervals and of the arbitrary element introduced by free-hand extrapolations.[7] The fact that there is a difference between the number of people with extreme attitudes as measured by the test and by the judges' ratings of the case histories, just as would be expected in advance from a knowledge of the way in which the test scores are computed, provides apparently a useful warning against overconfidence in the exactness of a frequency distribution of test scores as indices of the true distribution of attitudes in the group. It is possible that some more refined methods of test construction now in process of development by Prof. Thurstone may eliminate some of the error at the extremes.
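The pull toward moderation can be illustrated with a small numerical sketch. The scale values below are hypothetical and are not taken from the 44 opinions of the test; the function name `attitude_score` is illustrative. The only element drawn from the text is the scoring rule itself, the arithmetic average of the scale values of the endorsed opinions.

```python
from typing import Sequence


def attitude_score(endorsed_scale_values: Sequence[float]) -> float:
    """Score = arithmetic mean of the scale values of the endorsed opinions."""
    if not endorsed_scale_values:
        raise ValueError("subject endorsed no opinions")
    return sum(endorsed_scale_values) / len(endorsed_scale_values)


# Hypothetical subject whose best-fitting opinion is the most extreme one
# available (10.4), but who also endorses two neighboring opinions.
most_extreme_value = 10.4
endorsed = [most_extreme_value, 9.6, 8.9]

score = attitude_score(endorsed)
print(round(score, 2))
```

Because no opinion lies beyond the most extreme one, every additional endorsement in this example can only move the average inward, which is the effect described in the text.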
While the reliability of +.94 may be taken as an index of the reliability of the test in the circumstances under which it was given, it may be somewhat higher than the reliability which would be obtained by giving each parallel form of the test on different days. On the other hand, if four parallel forms, of 22 opinions each, could be devised, and if two forms could be given each day, the correlation between the average scores on each day's test, corrected for attenuation by dividing by the geometric mean of their respective reliabilities, would possibly be higher than the reliability coefficient of +.94 found in the present study.
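The correction for attenuation mentioned here divides the observed correlation between the two days' scores by the geometric mean of their reliabilities, r_xy / sqrt(r_xx * r_yy). A minimal sketch follows; all three input values are hypothetical, since the four-form experiment was not carried out, and the function name is illustrative.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Observed correlation divided by the geometric mean of the reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values: between-day correlation and each day's reliability.
r_between_days = 0.90
reliability_day_1 = 0.94
reliability_day_2 = 0.94

corrected = correct_for_attenuation(
    r_between_days, reliability_day_1, reliability_day_2
)
print(round(corrected, 3))
```

The corrected coefficient always exceeds the observed one when both reliabilities are below unity. Whether it would in fact exceed +.94 depends on data the present study does not supply.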