Warning - there is no shortcut to studying the whole of this section if you are going to analyse the performance of your questions, and indirectly of your teaching, responsibly.
Two statistical methods are employed to help you judge the effectiveness of your questions and the performance of your class: Correlation and Analysis of Variance (AnoVa). First let's review these statistical methods, then discuss how they apply to the analysis of test results.
Correlation is used to test the relationship between two variables. The method assumes Normal variation in both of the variables. The statistical value calculated is the Product Moment Correlation Coefficient, r. This number can range from -1.0 up to 1.0 (or -100% to 100%). A value of 100% indicates a perfect positive relationship between the two variables: if they were graphed against each other you would see a perfect straight line sloping upward. A value of -100% also indicates a perfect relationship, but an inverse one - if they were graphed against each other you would see a perfect straight line sloping downward. A value of 0% indicates no relationship between the variables - if graphed you would see a nebulous cloud of data points with no obvious way to fit a line through them. Intermediate values occur when a line can be fitted to the data but there is random variation on either side of that line.
Particularly with small data sets, it is possible to get a high value of r purely through chance variation in the data. So a statistical test (a type of t test) is made to find out whether r is significantly non-zero. If the significance is greater than 95% one can reasonably assume that there is a genuine relationship between the two variables. When using @Bodington@ MCQ tests you do not need to perform any calculations and you do not need to know the formulae used - you only need to look at the bottom-line level of significance and whether a significant relationship is positive or negative.
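If you would like to see the arithmetic behind these figures, the following is a minimal sketch in Python using the widely available scipy library. The marks are invented for illustration and @Bodington@ does not use Python - this is purely to show what a correlation coefficient and its significance look like when calculated.

```python
# Minimal sketch of a product moment correlation and its significance test.
# The data are invented; in the MCQ analysis x would be each student's mark
# on a question and y their mark on the whole paper.
from scipy.stats import pearsonr

x = [3, 1, 4, 2, 5, 0, 3, 4, 2, 5]             # first variable, one value per student
y = [55, 40, 70, 48, 82, 35, 60, 75, 45, 90]   # second variable, same students

# r lies between -1 and 1; p is the probability of seeing an r this far from
# zero if there were really no relationship between the variables.
r, p = pearsonr(x, y)
significance = (1 - p) * 100                   # expressed as a percentage

print(f"r = {r * 100:.0f}%, significance = {significance:.1f}%")
if significance > 95:
    direction = "positive" if r > 0 else "negative"
    print(f"Reasonable to assume a genuine {direction} relationship.")
else:
    print("No evidence of a relationship between the two variables.")
```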
AnoVa is used to compare two or more groups of figures with each other to see if there is significant variation between the group means. There are a number of varieties of AnoVa, but for MCQ analysis only the simplest scenario applies: students are sorted into two groups and the mean mark for one group is tested against the other. The obvious statistic is the difference between the two group means. It is possible that this difference is due only to random variation in the data (especially with small data sets), so AnoVa determines the significance of the difference. If the significance is greater than 95% one can reasonably assume that there is a genuine difference between the two groups.
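For comparison, here is an equally minimal sketch of the two-group case, again in Python with scipy and with invented figures. A one-way AnoVa on two groups is all that the MCQ analysis needs.

```python
# Minimal sketch of a one-way AnoVa comparing two groups of marks.
# The figures are invented for illustration.
from statistics import mean
from scipy.stats import f_oneway

group_a = [62, 70, 55, 81, 74, 68]   # e.g. paper marks (%) for one group of students
group_b = [48, 52, 60, 45, 57, 50]   # paper marks (%) for the other group

difference = mean(group_a) - mean(group_b)   # the obvious statistic
f_stat, p = f_oneway(group_a, group_b)       # tests whether the difference is real
significance = (1 - p) * 100

print(f"difference in means = {difference:.1f}%, significance = {significance:.1f}%")
if significance > 95:
    print("Reasonable to assume a genuine difference between the two groups.")
else:
    print("The difference could easily be due to random variation.")
```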
When you request question analysis, statistical values are calculated based on the class performance on the question compared with the class performance on the test paper as a whole. The selection of the analytical method depends on the type of question. For questions that have only one true statement an AnoVa is performed, but for questions with multiple true statements, and therefore a sliding scale of performance, a correlation is performed. Before you attempt to interpret the statistical values you must be very clear what performance on the test as a whole measures. Was it a random collection of general knowledge questions? Was it testing factual recall within a segment of your syllabus? Was it testing problem solving ability? If it was a self assessment test, does the final mark boil down to an inverse measure of apathy? Did it just test the ability of students to decipher waffly, ambiguous and misleading questions?
Question analysis looks at the ability of the question to discriminate between students on the same criteria as the test as a whole. A bad question in a bad test paper may score well!
When a question awards marks on a scale because there are multiple true statements, a correlation is performed. For each student, X is taken to be their mark on the question and Y is their mark on the whole test. One must take care in interpreting the results of the correlation and pay attention to the significance level; in particular, the correlation coefficient should be ignored if it is not significant. This leaves you with three scenarios:
Correlation of the mark on the question against the mark on the test paper can only highlight possible problem questions. The statistics can't tell you why the result came out the way it did, so it is probably better not to provide a practice MCQ paper unless you are prepared to do some investigation in response to the analysis. Talk to the author of the question, the person who taught the topic area and some of the students, and try to find out what went wrong.
A correlation doesn't make sense for questions where the student either gets it right or gets it wrong, because there isn't a sliding scale of performance on the question. For these questions the students are divided into two groups: those that got the question right (scored 1) and those that got it wrong (scored 0). Getting the question wrong could be due to not attempting it, selecting the wrong statement, or selecting more than one statement (if allowed). In the analysis you get to see how many students got the mark and how many didn't. You also see the mean performance of each group on the test as a whole. You would probably expect that the students who got this question right generally did well on the paper as a whole. If the question is a good discriminator, the students who got it right will have scored better on the paper than the students who got it wrong. To measure this the analysis presents the difference between the two group means - this runs on a scale from -100% to 100%. A value near zero suggests that the question doesn't discriminate well, because equally able students got this question both right and wrong. A non-zero value for this difference in means can occur due to random variation in the data, so a significance test (AnoVa) is done. If the significance exceeds 95% one can reasonably assume that the difference between the two means reflects a real difference in ability. A sketch of the whole procedure follows.
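As before, this is a minimal illustrative sketch in Python with scipy and invented marks; @Bodington@ carries out the real calculation for you.

```python
# Sketch of the discrimination analysis for a right/wrong question.
# Each student has a 0/1 score on the question and a percentage mark on the paper.
from statistics import mean
from scipy.stats import f_oneway

# Invented data: (question score, paper mark %)
students = [(1, 72), (1, 65), (0, 48), (1, 80), (0, 55),
            (0, 40), (1, 68), (0, 52), (1, 75), (0, 46)]

right = [paper for q, paper in students if q == 1]   # paper marks of those who got it right
wrong = [paper for q, paper in students if q == 0]   # paper marks of those who got it wrong

difference = mean(right) - mean(wrong)               # runs from -100% to 100%
f_stat, p = f_oneway(right, wrong)                   # significance of the difference
significance = (1 - p) * 100

print(f"{len(right)} right (mean paper mark {mean(right):.1f}%), "
      f"{len(wrong)} wrong (mean paper mark {mean(wrong):.1f}%)")
print(f"difference = {difference:.1f}%, significance = {significance:.1f}%")
```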
As with the correlation test above, questions will fall into one of three categories:
The explanations for these results are the same as before.
Item analysis looks at the individual statements within questions. These are all true/false responses, so AnoVa is used in the analysis in exactly the same way as for questions, as described above. You have to be more careful about the interpretation of the results, though. The marking scheme implemented on these tests means that students select statements they think are true, but they may leave statements unselected not because they think they are false but because they are uncertain about them and don't want to risk marks. So item analysis of false statements is much less meaningful than item analysis of true statements. The best advice is to treat true statements in the same way as the question as a whole, but for false statements you need not be as concerned when there is no significant difference in test performance. A significant negative relationship on a false statement is a cause for concern, because it indicates a false statement that is attractive to the most able students.
If you work to a particular significance threshold you will sometimes draw the wrong conclusion. For example, at the 95% level roughly one question in twenty that does not really discriminate will still be flagged by chance alone. If you want to institute a question review policy based on the statistical values, you ought to choose threshold values that balance Type I and Type II errors. (Type I = taking action which was not needed; Type II = not taking action when you should have.)
To help you with the appropriate interpretation of the results, a message is generated for each question and item analysed. It is possible to base all your conclusions and follow-up investigation on these messages and ignore the numerical information. Here is a full list of the messages:
Not enough data to reach conclusions.
Either too few students attempted the test, too few selected a particular box, or too few didn't select a particular box.

The more able students performed very significantly WORSE on this question!!!
This wording is used if the level of significance exceeds 99%.

The more able students performed very significantly better on this question.
This wording is used if the level of significance exceeds 99%.

The more able students performed significantly WORSE on this question!!!
This wording is used if the level of significance exceeds 95% but not 99%.

The more able students performed significantly better on this question.
This wording is used if the level of significance exceeds 95% but not 99%.

The more able students may have performed worse on this question.
This wording is used if the level of significance exceeds 90% but not 95%.

The more able students may have performed better on this question.
This wording is used if the level of significance exceeds 90% but not 95%.

There is no evidence that performance on this question relates to ability.
This wording is used if the level of significance is lower than 90%.
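To summarise the thresholds behind these messages, the selection of wording could be expressed as in the sketch below. This is an illustration in Python, not the actual @Bodington@ code, and it does not cover the "not enough data" case, which depends on the amount of data rather than the significance level.

```python
# Sketch of how a significance level and the direction of the relationship
# map onto the messages listed above. Not the actual implementation.
def analysis_message(significance: float, more_able_did_better: bool) -> str:
    """Return the message wording for a given significance level (%) and direction."""
    if more_able_did_better:
        word, end = "better", "."
    else:
        word, end = "WORSE", "!!!"
    if significance > 99:
        return f"The more able students performed very significantly {word} on this question{end}"
    if significance > 95:
        return f"The more able students performed significantly {word} on this question{end}"
    if significance > 90:
        return f"The more able students may have performed {word.lower()} on this question."
    return "There is no evidence that performance on this question relates to ability."

# Example: a question where the more able students did better at 97% significance.
print(analysis_message(97.0, more_able_did_better=True))
```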