Is there an alternative to significance testing? This question has become almost omnipresent this year, working its way into blog after blog of mine. For that reason I am going to confront the contentious issue directly, synthesising some of the arguments I have encountered.
Firstly, it is important to consider why the world of scientific research would want to venture away from significance testing in favour of another less prevalent method.
As mentioned in previous blogs of mine, the use of significance testing is contentious and has been heavily scrutinised for many decades. Why is this?
Schmidt (1996) strongly opposed the use of significance tests, claiming that there were severe deficiencies associated with their use: “significance tests always lead to error as the process is inferential”. The process may be confounded by sampling error or chance, because a sample cannot give you complete information about a population. Some samples are particularly bad representatives of the population and can be very misleading, producing countless misinterpretations of data that are often amusing for their folly, but also hair-raising for their consequence (Rothman, 2000).
Type I and II errors are born from such misleading samples: when the results obtained from a sample are misleading, the researcher has no option other than to reach an incorrect conclusion (Gravetter & Forzano, 2009). These errors can arise even after a careful evaluation of the research results, indicating that the problem lies not with the researcher but with the means of significance testing.
For this very reason Rothman (2000) suggested abandoning the use of significance testing altogether, quite a strong stance indeed. It is important to note, though, that this strong opposition to the significance test has been unceasing and relentless: there is a vast number of published papers calling for the abolition of significance testing.
To re-use a piece of evidence from a previous blog, the argument dates back, astonishingly, at least to Lykken (1968), which shows just how long-running it has been. Lykken (1968) argued that “statistical significance is possibly the least important aspect of a good experiment”, proposing that “a significance test is never a sufficient condition for claiming that (1) a theory has been successfully corroborated, (2) a meaningful empirical fact has been established, or (3) an experimental report ought to be published.” So why have we been so loyal to the inadequate significance test?
According to Hunter (1997), “the significance test has been a disaster for modern psychology”, and many people already know that the significance test does not work. However, they do not say this in public, and they still use significance tests in their research articles. The reason, Hunter argues, is fear: fear of social sanctions if they violate a social convention. Hunter (1997) bravely stuck his neck out in this piece, breaking the silence regarding why such emphasis is still being placed upon the significance test.
Hunter’s (1997) insightful article informed us that for more than thirty years, Cohen (of Cohen’s d) tried to save the significance test by requiring authors to perform power analyses so that they would know the actual error rates for their tests. Those who do this quickly learn to largely ignore significance test results, which Hunter regards as the only rational response to the 60% average error rate he reports for the significance test.
However, we now have thirty years of experience showing that power analysis cannot be taught successfully in the present environment. Authors do not learn and do not use power analysis. Cohen (1994) himself has given up on the training for power analysis and now admits that the significance test must be abandoned.
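To make the idea of a power analysis a little more concrete, here is a minimal sketch in Python. It is my own illustration, not anything taken from Hunter or Cohen: the function name `two_sample_power` is made up, and it assumes a two-sided, two-sample z-test with equal group sizes and a known, equal variance (a normal approximation to the usual t-test). It shows how the chance of missing a real effect (the Type II error rate) depends on the effect size and the sample size.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the standardised mean difference (Cohen's d). Both groups are
    assumed to have n_per_group observations and an equal, known variance.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the z statistic
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Illustrative values only: a "medium" effect (d = 0.5) at three sample sizes.
for n in (20, 64, 100):
    power = two_sample_power(d=0.5, n_per_group=n)
    print(f"n per group = {n:3d}  power = {power:.2f}  "
          f"Type II error rate = {1 - power:.2f}")
```

Under these assumptions, a study with 20 participants per group and a medium effect misses the effect well over half the time, which is the kind of figure a power analysis is meant to bring to the researcher's attention.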
This account clarifies how the world of psychological research approached the problematic significance test: it chose to make amendments to the test rather than search for a less questionable alternative.
Conversely, Schmidt (1996) propounded that teachers ‘revamp’ their courses to allow students to understand that reliance on statistical significance hinders the growth of cumulative research knowledge, that the benefits believed to flow from statistical significance testing do not in fact exist, and that significance methods must be replaced with point estimates and confidence intervals.
The aim is to teach students how reliance on significance testing damages the growth of psychological research, rather than to teach them how to amend the test itself: to inform them of the weaknesses associated with significance testing and the care that should be taken when using it, so that they do not simply accept its results unquestioningly.
As I have already pointed out there have been attempts to amend the “deficiencies” associated with significance testing, but this has not worked, leaving very few options other than to look for an alternative. We have just read Schmidt’s (1996) belief that confidence intervals should succeed significance testing – is this the common consensus in the field?
Hunter (1997) indicated that other areas of science, such as physics and chemistry, do not rely and never have relied on significance testing as psychology has, which shows that other methods must exist. The dominant technique used by mathematical statisticians is the confidence interval.
There are alternatives to the significance test that do not share the 60% error rate. For single studies, we can use confidence intervals to measure the potential sampling error in study results. A 95% confidence interval has only a 5% error rate, staggeringly lower than that of the average significance test (the significance test has an average error rate of 60%, worse than the error rate in a ‘coin flip’). Importantly, confidence intervals are not context dependent in the way significance tests are, and thus cannot have the very high error rates that the significance test possesses when people consistently use it in a context with which it is incompatible.
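The 5% figure for a 95% confidence interval can be checked with a small simulation. The sketch below is my own and rests on stated assumptions: the data are drawn from a normal population whose standard deviation is treated as known (so a z-interval can be used with only the Python standard library), and the population values (mean 100, SD 15), the sample size and the function name `ci_miss_rate` are all made up for illustration. It draws many samples, builds an interval around each sample mean, and counts how often the interval fails to contain the true mean.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def ci_miss_rate(true_mean=100.0, sigma=15.0, n=25, trials=10_000,
                 level=0.95, seed=1):
    """Fraction of simulated intervals that fail to contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / sqrt(n)  # sigma is treated as known here
    misses = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = fmean(sample)
        if not (mean - half_width <= true_mean <= mean + half_width):
            misses += 1
    return misses / trials


print(f"share of 95% intervals missing the true mean: {ci_miss_rate():.3f}")
```

In the long run the share of intervals that miss should sit close to 0.05, whatever the true mean happens to be, which is the sense in which the interval's error rate is fixed at 5%.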
Confidence intervals contain the information of a significance test, so there is no loss of information and no risk involved when confidence intervals replace significance tests (Brandstätter & Linz, 1999). Taken together, confidence intervals, in addition to replications, graphic illustrations and meta-analyses, seem to represent a methodologically superior alternative to significance tests. Hence, in the long run, confidence intervals appear to promise a more fruitful avenue for scientific research (Brandstätter & Linz, 1999).
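The claim that a confidence interval contains the information of a significance test can be illustrated directly: a two-sided test at the 5% level rejects a hypothesised mean exactly when that value falls outside the 95% interval. The following sketch is my own illustration of that correspondence. It assumes a known population standard deviation (a z-test), the scores are invented, and `z_interval_and_p` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_interval_and_p(sample, null_mean, sigma, level=0.95):
    """Return (lower, upper, p_value) for a sample mean with known sigma."""
    std_normal = NormalDist()
    mean = fmean(sample)
    se = sigma / sqrt(len(sample))
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    lower, upper = mean - z_crit * se, mean + z_crit * se
    z_stat = (mean - null_mean) / se
    p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))  # two-sided p-value
    return lower, upper, p_value


scores = [108, 112, 97, 105, 119, 101, 110, 104, 115, 99]  # made-up data
for null_mean in (100, 95):
    lower, upper, p = z_interval_and_p(scores, null_mean, sigma=15)
    excluded = not (lower <= null_mean <= upper)
    print(f"H0 mean = {null_mean}: CI = ({lower:.1f}, {upper:.1f}), "
          f"p = {p:.3f}, CI excludes H0 = {excluded}, p < .05 = {p < 0.05}")
```

The interval gives the test's verdict for every possible null value at once, and it also shows how precisely the mean has been estimated, which a bare p-value does not.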
To conclude, the use of significance testing has long been in dispute. Despite researchers being aware of the associated “deficiencies”, its use has not wavered in psychological research. There have been attempts to amend the significance test by recommending the use of accompanying power analyses, but these did not work. There therefore appears to be no other option than to use an alternative. The most popular alternative, used in other important areas of science, appears to be the confidence interval. The shift from significance testing to confidence intervals seems to come with no disadvantages: confidence intervals contain the same information as the significance test, so there would be no loss of information and no risk involved in the swap. Moreover, the 60% error rate belonging to the significance test would be replaced with the constant 5% error rate of the confidence interval, a marked improvement.
For further reading, a particularly insightful article:
– Brandstätter & Linz: http://www.psychologie.de/fachgruppen/methoden/mpr-online/issue7/art2/brandstaetter.pdf