In the game of life and evolution there are three players at the table: human beings, nature, and machines. I am firmly on the side of nature. But nature, I suspect, is on the side of the machines — Freeman Dyson

ESP and Skepticism

I often use ESP experiments as a vehicle to introduce hypothesis testing to students. One of the learning outcomes is that statisticians are actually easier to convince of claims of paranormal ability than lay persons. On the basis of a binomial calculation, I tell them that

I would be convinced of ESP if they could correctly select which of 5 cards I am holding more than 75 times out of 250 trials. You get 50 by guessing, and another 25 convinces the hard hearted statistician that you can bend spoons.

But I am actually telling them fibs. I would not be convinced. Does this make me irrational?

I recall many years ago hearing of the so-called Soal experiment, which was published in a book by Soal and Bateman (1954, Modern Experiments in Telepathy, London: Faber & Faber) as a demonstration of telepathy. My first reaction to such claims is always rather skeptical.

In fact, the very evidence that ESP researchers use to convince us of ESP can, under mild assumptions, lead us more strongly to the conclusion that ESP is not true. This is beautifully explained in the 2003 book Probability Theory: The Logic of Science by Edwin T. Jaynes, edited by G. Larry Bretthorst.

The experiment in question involved a woman getting 9410 trials out of 37100 correct, where the chance of success under random guessing is p=0.2. This is about 26 standard deviations above the null mean, and the binomial P-value is of the order of 1 in 10 to the power 139. But I for one am not convinced that she has ESP.
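These figures are easy to check. The sketch below is not part of the original analysis; it assumes the P-value means the upper tail P(X >= 9410) under Binomial(37100, 0.2), and it sums the tail in log space so that the very small terms are handled safely. The function names are made up for illustration. The last two lines apply the same function to the classroom rule of more than 75 correct out of 250.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log10_upper_tail(k, n, p):
    """log10 of P(X >= k), accumulated relative to the largest term."""
    logs = [log_binom_pmf(j, n, p) for j in range(k, n + 1)]
    peak = max(logs)
    total = sum(math.exp(v - peak) for v in logs)
    return (peak + math.log(total)) / math.log(10)

n, k, p = 37100, 9410, 0.2
z = (k - n * p) / math.sqrt(n * p * (1 - p))   # standard deviations above the null mean
log10_p = log10_upper_tail(k, n, p)
print(f"Soal data: z = {z:.1f}, log10(P-value) = {log10_p:.1f}")

# Classroom rule: more than 75 correct out of 250, i.e. at least 76.
log10_p_class = log10_upper_tail(76, 250, 0.2)
print(f"Classroom rule: log10(P-value) = {log10_p_class:.1f}")
```

The first line of output should agree with the "about 26 standard deviations" and "1 in 10 to the power 139" quoted above, give or take an order of magnitude depending on whether one quotes the tail or the single-point probability.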

Frequentist hypothesis testing allows for only two hypotheses. If we are limited to two then, in this case, the relevant ones seem to be HG: she is guessing and p=0.2, or HE: she is not guessing and p>0.2. The null (HG) is overwhelmingly rejected in favour of the conclusion (HE) that she has ESP.

Bayesians can also fall for the trap of considering two hypotheses, in which case they end up in the same place as frequentists. Let LG and LE be the likelihoods of the data under the null and the alternative respectively. Then the posterior probability that she has ESP is

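$$P(H_E \mid \text{data}) = \frac{\pi_E L_E}{\pi_E L_E + \pi_G L_G} = \frac{\pi_E}{\pi_E + \pi_G \,(L_G / L_E)}$$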

Assuming that our prior probability of ESP, πE, is small, and that the likelihood of the data under the hypothesis of ESP, LE, is not small, the posterior probability of ESP is close to 1 whenever the ratio

r = LG/LE << πE.

So it merely requires a sufficiently unlikely experimental result to overcome our prior bias against ESP and we will be convinced. So with only two hypotheses, the Bayesians are no better off than the frequentists. If my prior probability of ESP is zero then I will never be convinced. This is still formally rational but resolving to ignore any evidence of whatever strength is not rational in the informal sense.
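To see how quickly a tiny ratio r swamps a sceptical prior, here is a minimal sketch of the two-hypothesis posterior, written as πE / (πE + πG r) with πG = 1 − πE. The priors and the value r = 1e-130 are illustrative assumptions only; the real r depends on what value of p one assumes under ESP.

```python
def posterior_esp_two_hypotheses(prior_esp, r):
    """Posterior probability of ESP when guessing is the only rival hypothesis.

    prior_esp : prior probability of ESP (pi_E); guessing gets 1 - prior_esp
    r         : likelihood ratio L_G / L_E (small r = strong evidence against guessing)
    """
    prior_guess = 1.0 - prior_esp
    return prior_esp / (prior_esp + prior_guess * r)

# Illustrative priors (assumptions): mildly sceptical, very sceptical, dogmatic.
for prior in (1e-6, 1e-20, 0.0):
    print(prior, posterior_esp_two_hypotheses(prior, r=1e-130))
```

Any non-zero prior, however small, is overwhelmed once r is far below it; only a prior of exactly zero stays at zero.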

So am I irrational to not be convinced by the Soal experiment? Do I have a prior probability of zero for ESP? I am here to tell you that I do not.

There are more than two hypotheses in play, because there are other possible ways the data could have been generated besides a binomial experiment with p=0.2 or p>0.2. There are all sorts of possible deceptions that may have been perpetrated by the woman or by the experimenter. Let us treat these various deception mechanisms that could generate the data as hypotheses, label them HD1,…,HDk, and assign them priors π1,…,πk that are not all extremely small. The posterior of ESP is now

$$P(H_E \mid \text{data}) = \frac{\pi_E L_E}{\pi_E L_E + \pi_G L_G + \sum_{i=1}^{k} \pi_i L_i}$$

where Li is the likelihood of the data under deception hypothesis i. The various deception hypotheses make the observed data reasonably likely, just as the ESP hypothesis does. If we assume that these likelihoods are similar in magnitude to the likelihood LE then the posterior becomes

$$P(H_E \mid \text{data}) \approx \frac{\pi_E}{\pi_E + \pi_G \, r + \sum_{i=1}^{k} \pi_i}$$

where, as before, r=LG/LE measures evidence against guessing compared to ESP. The only terms in this expression that are not likely to be tiny are πG and the πi, and πG is multiplied by the tiny ratio r. Since πE is tiny compared with the sum of the πi, the conclusion is that the posterior of ESP will be close to zero rather than close to 1. Moreover, as the ratio r=LG/LE, which we think of as embodying the evidence, becomes smaller, the posterior just reverts to the priors with the guessing hypothesis eliminated. So if we start off thinking that deception is more likely than ESP then we can never be convinced of ESP.
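The same calculation with deception hypotheses added can be sketched as follows. It assumes, as in the text, that every deception likelihood Li equals LE, so the posterior is πE / (πE + πG r + Σπi). The particular priors (ESP at 1e-20, two deception mechanisms at 1e-3 and 1e-4) are invented purely for illustration.

```python
def posterior_esp_with_deception(prior_esp, deception_priors, r):
    """Posterior of ESP, assuming each deception likelihood L_i equals L_E.

    prior_esp        : prior probability of ESP (pi_E)
    deception_priors : list of priors pi_1..pi_k for the deception hypotheses
    r                : likelihood ratio L_G / L_E
    """
    prior_deception = sum(deception_priors)
    prior_guess = 1.0 - prior_esp - prior_deception
    if prior_guess < 0:
        raise ValueError("priors must sum to at most 1")
    return prior_esp / (prior_esp + prior_guess * r + prior_deception)

prior_esp = 1e-20                  # illustrative assumption
deception_priors = [1e-3, 1e-4]    # illustrative assumption

for r in (1.0, 1e-10, 1e-130):
    print(r, posterior_esp_with_deception(prior_esp, deception_priors, r))

# Limit as r -> 0: the priors renormalised with guessing eliminated.
limit = prior_esp / (prior_esp + sum(deception_priors))
print("limit:", limit)
```

However small r becomes, the posterior of ESP never rises above the limiting value, which is set entirely by the ratio of the ESP prior to the deception priors.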

This leads to the conclusion that an honest person may tell the truth about an extremely pertinent experimental result, and not be believed, and those who disbelieve are not being irrational. This statement makes no assumptions about who is actually correct.

The obvious, in fact only, way for ESP researchers to avoid this effect is to force us to reduce our priors on the deception hypotheses. How they could do that, I have no idea.

 



11 Responses to “ESP and Skepticism”

  1. Berwin Turlach Says:

    Love this one. :)

    Shows that with a good proper Bayesian analysis, including judicious choice of priors, one can come to any conclusion that one wants, regardless of the data; and without violating Cromwell’s rule :) By the way, what are the k deception mechanisms? And if you list them, are you sure I would not find another one or two (or more)? So should not k be another random quantity with a prior distribution on it? Even if we agree on an exhaustive list of possible deception mechanisms for an experiment, good luck in eliciting your k priors pi_i, i=1,…,k, from the various people who want to analyse the data.

    Actually, the link to Samuel Soal that you posted mentions that there “was also the testimony of 21 prominent observers who, individually, monitored Soal’s work with Shackleton, that they were satisfied with the conditions, and could conceive of no means by which the results could be obtained by normal means other than ESP”. This seems to indicate that, at least for those experiments, there were no plausible deception mechanisms and all the pi_i should be zero. So one is back to square one.

    Frequentist analysis, on the other hand, can solve this problem. And it is patently not true that there are only two possible hypotheses that can be tested here.

    Now, I am the first to admit that hypothesis testing is probably one of the most controversial issues in statistical inference; and to liven up the common room at tea time you just have to start a discussion on this topic. But here is my take:

    According to my training, and that is the stand I still take when I want to be orthodox, in hypothesis testing we can only make statements about the null hypothesis. We either retain or reject the null hypothesis; in particular, rejecting the null hypothesis does not imply that the alternative is “true”. This means that “rejecting the null” leaves you in a state of no-idea-what-is-going-on, something I always felt to be somewhat unsatisfactory but it is clearly necessary: if you test the null that the earth is flat against the alternative that the earth is a cube, rejecting the null obviously does not mean that the alternative is correct. Thus, according to my training, if we want to make a statement about the alternative, we have to reformulate it as our null.

    Nowadays, in my less orthodox moments, I am willing to agree that rejecting the null means that the alternative is true if these two alternatives are exhaustive and the model is appropriate. In this case it would actually be possible to check the model. With 37100 Bernoulli experiments, take the sum of every 100 consecutive trials and you should end up with 371 Binomial(100,p) variables if the model is correct. If the sample mean and sample standard deviation of these indicate over/under-dispersion, then the model is clearly wrong and any kind of hypothesis testing based on this model would be flawed.

    But what is actually the alternative to “p=0.2”? In my book, it is “p is not 0.2, i.e. she is not guessing” and not “she has ESP”. To test whether a person has ESP, we would first have to agree what constitutes evidence for ESP, and this is something that the scientific community would have to agree on. It could be something like “a person has ESP if in an experiment that assesses ESP his/her odds of getting the correct answer are X times the odds of random guessing”. And now we have to agree on what X is: 5? 2? 1.5? And here we can also discuss what we always preach to our students, the difference between “statistical significance” and “practical significance”.

    If the agreement of the scientific community is that X should be 2, then, to test whether the person has ESP, I would test the null that “p>=1/3” against the alternative that “p<1/3”; with X=1.5 the test would be “p>=27/99” against “p<27/99”. Again, the null would be rejected, with the observed number of successes being 8 standard deviations below the number of successes that we would expect under p=27/99.

    Thus, in summary, from my frequentist analysis I come to the same conclusion as you. The data seem to indicate that the person is not guessing at random; that null is rejected. But rejecting that null means just that: the person is not guessing at random. Now there are multiple explanations of what we have observed. Say we give the person the benefit of the doubt that she has some meaningful ESP; then a test for that rejects that null too. In summary, she does not guess at random but she does not seem to have any meaningful ESP. Which means she could just have gotten lucky or that there is some foul play (not sure how to test that, but also of little interest to me).

    Well, I guess we can now start a discussion on corrections for multiple testing, shall we? :-)
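The thresholds in the comment above can be checked numerically. Guessing has odds 0.2/0.8 = 1/4, so odds X times larger correspond to p = X/(X+4): 1/3 for X=2 and 27/99 for X=1.5. The sketch below (function names are illustrative, and the normal approximation to the binomial is assumed) computes how many standard deviations the observed 9410 successes lie from the mean under each null.

```python
import math

N_TRIALS, N_CORRECT, P_GUESS = 37100, 9410, 0.2

def p_from_odds_multiple(x, p_guess=P_GUESS):
    """Success probability whose odds are x times the odds of guessing."""
    odds = x * p_guess / (1 - p_guess)
    return odds / (1 + odds)

def z_score(p_null, n=N_TRIALS, k=N_CORRECT):
    """Standardised distance of the observed count from its mean under p_null."""
    return (k - n * p_null) / math.sqrt(n * p_null * (1 - p_null))

for x in (1.0, 1.5, 2.0):
    p_null = p_from_odds_multiple(x)
    print(f"X = {x}: p = {p_null:.4f}, z = {z_score(p_null):+.1f}")
```

A large positive z for X=1 and large negative z values for X=1.5 and X=2 correspond to rejecting both "she is guessing" and "she has ESP of at least that strength".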

  2. Sietse Brouwer Says:

    @Berwin Turlach:
    Curses, I was too slow. I was going to make the same points that
    (1) One’s hypotheses should be complementary (exhaustive and not overlapping; so p=.2 and p!=.2), and
    (2) One should be careful to conclude only one’s hypotheses. (If we conclude that the subject is not guessing, this doesn’t mean it’s necessarily ESP. Nor is it necessarily deception.)

    You do, however, go on to make a similar mistake in your own post:

    “This means that “rejecting the null” leaves you in a state of no-idea-what-is-going-on, something I always felt to be somewhat unsatisfactory but it is clearly necessary: if you test the null that the earth is flat against the alternative that the earth is a cube, rejecting the null obviously does not mean that the alternative is correct. Thus, according to my training, if we want to make a statement about the alternative, we have to reformulate it as our null.”

    The correct alternative, by (1), is not that the earth is a cube — it’s that it’s not flat.

    It leaves you in a state of not-knowing, yes. But the world being what it is, there’s a whole lot of other theories waiting to take its place; and though you don’t know which one of *them* is true, you know it’s not the flat-earth hypothesis. So you take a whole bunch of them, and treat each of them as the null; and any that survive get to stay a while longer.

    Perhaps none of the theories will survive. Then you’ll have to wait for someone to devise a new model. Perhaps several theories will survive; then you’ll have to invent new tests to weed out more of them.

    * * *

    That’s as far as orthodox hypothesis testing goes. In practice, exploratory model calibration is hugely practical; but it assumes that one’s causal model is (sufficiently) correct. (Although a bad fit in the calibration stage may lead to rejecting the model. But I’m still in favour of a validation stage.)

  3. Berwin Turlach Says:

    @Sietse Brouwer:
    I do not think that you were too slow since these are not the points I wanted to make. :)
    In particular, I do not subscribe to your point (1).

    For me, the alternative gives an indication in “which direction” you expect deviations from your null and helps you to decide on a test statistic to be used, namely the one “most powerful” for that alternative. You may lose power against other alternatives, but that is the price you have to pay. With this approach, the two hypotheses do not need to be exhaustive.

    Perhaps another example may illustrate this issue better. I have just been reading “Everyday Probability and Statistics: Health, Elections, Gambling and War” by Michael M. Woolfson. Somewhere in chapter 7 he analyses the chances of winning when playing craps. Then, in chapter 8, he assumes that the dice are loaded such that the probability of a 1 is 0.175, the probability of a 6 is 0.158333 and the probability of 2, 3, 4 or 5 is 1/6 each. He then shows that with such dice the house has an even bigger advantage. In chapter 8.5 he then discusses how to test for a loaded die, essentially using a chi-square test to test the null “all probabilities are 1/6” against the alternative “at least one probability differs from 1/6”, i.e. a typical chi-square goodness of fit test with 5 degrees of freedom. Apparently, the die has to be rolled quite often to obtain “significant” results.

    In such a situation, if I were suspicious that the die is biased towards rolling ones, I would be happy to test the null “all probabilities are 1/6” against the alternative “probability of a one is p, of a 6 is 1/3-p and of 2, 3, 4 and 5 is 1/6 each”. The resulting likelihood ratio test (or chi-square test) should have approximately a chi-square distribution with one degree of freedom. Of course, this test would have no power against some other manipulations of the die, but if the manipulation is of the manner I expect, I should be able to tell much faster (i.e. with fewer rolls of the die).

    The two possible statements that I could make after performing the test would be “considering this particular alternative, there is no evidence in the data to reject the null” and “the evidence in the data is not consistent with the null that all numbers are equally likely and we may reject the null; the die is biased”. Note that in the latter statement I would just claim “the die is biased” (negation of my null), not that it has the specific bias assumed under the alternative (since I did not test for that particular form of bias).

    We can probably now argue whether the second statement should also have the qualifier “considering this particular alternative”. Or whether I am fishing for significance by not doing the “standard” chi-square goodness of fit test. And that could then probably lead to a discussion on one-sided against two-sided tests. :-)

    As I said earlier, nothing livens up the common room as much as starting a discussion about hypothesis testing: how it should be done, how it should be taught, how it should be used,….
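The one-degree-of-freedom test described in the comment above can be sketched in a few lines. Under the alternative (p, 1/6, 1/6, 1/6, 1/6, 1/3-p) only the counts of ones and sixes carry information, the maximum-likelihood estimate is p = n1 / (3 (n1 + n6)), and the likelihood-ratio statistic is compared with a chi-square distribution with one degree of freedom. The function names and the example counts (the expected counts of Woolfson's loaded die over 6000 rolls, not real data) are assumptions for illustration.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def one_six_lr_test(n1, n6):
    """Likelihood-ratio test of a fair die against (p, 1/6, 1/6, 1/6, 1/6, 1/3 - p).

    n1, n6 : observed counts of ones and sixes (the other faces cancel out).
    Returns (statistic, approximate p-value from chi-square with 1 df).
    """
    if n1 + n6 == 0:
        raise ValueError("need at least one roll showing a 1 or a 6")
    p_hat = n1 / (3.0 * (n1 + n6))              # maximum-likelihood estimate of p
    stat = 2.0 * (xlogy(n1, p_hat * 6.0) + xlogy(n6, (1.0 / 3.0 - p_hat) * 6.0))
    stat = max(stat, 0.0)                       # guard against rounding below zero
    p_value = math.erfc(math.sqrt(stat / 2.0))  # P(chi-square with 1 df > stat)
    return stat, p_value

# Idealised counts: 6000 rolls with P(1) = 0.175 and P(6) = 0.158333.
stat, p_value = one_six_lr_test(1050, 950)
print(f"LR statistic = {stat:.2f}, p-value = {p_value:.3f}")
```

As the comment notes, a die whose ones and sixes are balanced passes this test whatever is wrong with the other four faces, because the statistic depends only on n1 and n6.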

  4. Chris Lloyd Says:

    Thanks for those interesting comments and suggestions Berwin.

    Just to define my terms: In the context of this experiment, when I say someone has ESP I mean that p>0.2. By any amount at all. The parameter p is defined to be the limiting long run proportion of correct selections (which we will never know). According to my personal definition of ESP, any limiting value greater than 0.2 means ESP, in the absence of cheating. I realise that some people might mean something stronger by ESP, but I do not. If your value of p is 0.201 then you have ESP. The fact that we reject the null p=1/3 in favour of p<1/3 is not really relevant to my definition. It just means she does not have heaps of ESP.

    I think I might view HT differently to you. You suggest that the alternative hypothesis sets a direction of deviation from the null, and a test statistic. That is true from the viewpoint of a statistical theorist, but in practice the alternative is what you suspect, what you are trying to prove, why you did the experiment in the first place. And the null is the status quo – the thing I am looking for is not there. The alternative is the central player, not the null.

    Also, I do not agree that rejecting the null leaves us in a state of no-idea-what-is-going-on. When there are two hypotheses, rejecting the null means accepting the alternative. The test decision rule defines a discrete estimator of a two point set (H0,H1). Bob Staudte has an entire inferential theory based on this idea and minimising a certain kind of risk. I still reckon that it is the two hypotheses that get the frequentist into trouble. Multiple hypotheses lead into, as you say, the vexing and fascinating field of multiple comparisons.

  5. Berwin Turlach Says:

    @Chris Lloyd:

    Regarding para 1: Fair enough, so everybody with p>0.2 has ESP. But how is a poor sod supposed to prove that under your approach? Declaring that there are k ways of cheating (where k is still unspecified), assuming that any evidence for ESP is equal evidence for each way of cheating and being able to assign arbitrary prior probability to each of the ways of cheating, one arrives at your last equation. As you correctly pointed out, this ratio will be closer to 0 than to 1, hence no evidence of ESP. Looks to me like the kind of Bayesian analysis where the prior belief completely dominates the posterior and I am left wondering: why bother collecting data? Wouldn’t it be more honest to just break Cromwell’s rule and say that your prior probability that anybody has ESP is zero?

    Regarding para 2: That is one use of hypothesis testing, and I agree that in this use the aim is usually to reject the null hypothesis (though I still maintain that this does not allow one to “accept” the alternative). Another use is, of course, when testing nested models. In that situation the alternative is typically the largest model that you are willing to consider but you hope you get away with a smaller model, i.e. you hope to retain the null hypothesis; and I noticed that students struggle with the fact that you have different hopes about rejecting/retaining the null based on the application. There is, of course, a school that says one should not use hypothesis testing for model selection, but I am not aware of any statistics departments that stopped teaching ANOVA or the F test for nested linear models.
    Also, I do not see much difference between our views. Yes, the alternative is “what you suspect, what you are trying to prove”, so it gives you the “direction” in which you expect to see differences from what you would expect to see under the null. Hence, you will choose your test statistic to have most power for the alternative that you are interested in. Our only difference seems to be that I do not subscribe to the idea of accepting the alternative if the null is rejected. Which leads me to:

    Regarding para 3: As I said earlier, in my book, accepting the alternative after rejecting the null is only justifiable if these two hypotheses are exhaustive (and the model used is reasonably plausible). Risking repeating myself: if you take the null “the earth is flat” (implying that you will drop off it if you venture too far west or east) and the alternative “the earth is a cube” (implying that if you keep going west (east) you will come back to the same spot from the east (west)), and you observe somebody sailing off into the sunset in the west and appearing several months later from the east, would you be happy to declare that the earth is a cube?

  6. Sietse Brouwer Says:

    @Berwin Turlach:

    One thing we certainly agree on: “accepting the alternative after rejecting the null is only justifiable if these two hypotheses are exhaustive”.

    So if you have exhaustive hypotheses, you will accept HA after rejecting H0, and be sure that you’re right. (Though HA may be broad.) If you have non-exhaustive hypotheses, you might accept HA after rejecting H0, but you can’t be sure whether you’re right. (Because you’re only sure of the complement of H0, of which HA is a subset.)

    Your dice-rolling example amounts to saying that if you test for a non-exhaustive HA, you gain power for that HA, although you lose the power to detect other HAs. I say that you are thereby either shifting those other HAs into H0, or that you can’t be certain that the HA you find is true.

    Your H0: pi=(1/6,1/6,1/6,1/6,1/6,1/6)
    Your HA: pi=(p,1/6,1/6,1/6,1/6,1/3-p).

    Your test is, e.g., something likelihood-ratio like. Now, assume the die is loaded as follows:

    pi=(1/6,1/6,0,1/3,1/6,1/6)

    This would be closer to your null hypothesis, so you would conclude H0, leading you to the wrong conclusion that the die is fair. If your H0 had been the complement of your HA, you would have correctly concluded that it wasn’t one-biased.

    On the other side, there are also outcomes that lead to your HA, although they really aren’t part of it. Say the die is loaded thus:

    pi=(0,1/2,1/18,1/18,1/18,1/3)

    This would lead you to accept your HA that specifies 0-6 bias when in fact the die is not fair for 2…5. If your H0 had been the complement of your HA, you would have correctly concluded that the die is not just one-biased.

    Conclusion: you might not call your approach a pair of exhaustive hypotheses, but it is really a too-specific phrasing of what *is* a pair of exhaustive hypotheses:

    H0: p = 1/6, HA: p != 1/6

    where p is estimated by both p(1) and 1/3 - p(6)

    Because that is what discriminates between your accepting H0 or HA. The same for every non-exhaustive pair of hypotheses you choose between.

    ___________________________________________________________________________

    Returning to the cube question: I agree that accepting HA *with conviction* is only permissible if the hypotheses are exhaustive.

    If I phrased my hypotheses exhaustively (flat vs. not-flat), and somebody circumnavigated the globe (block?), I would know for sure that the Earth was not flat. If I phrased my hypotheses non-exhaustively (flat vs. cube), I would reject my H0, but what good would the cube-HA be? As it was non-exhaustive I couldn’t be convinced it would be correct. All I’d know for sure is that H0 was false, which the exhaustive approach expresses more clearly.

    Having said all that: in science we have generally hopped from working hypothesis to improved working hypothesis, rather than from working hypothesis to truth. Although we would only be certain that H0 was false, we would tentatively replace it with the subset of HA that we thought most likely — until that new hypothesis was disproven, too.

    What if I thought the Earth was flat, and the only alternative I could think of was a cube? In both the exhaustive and the non-exhaustive case above, if the flat theory was disproven, I would happily take as my new working hypothesis the cube theory, as it fit the facts and I had no better explanation. (Until the sphere-theory came along.) Still, in the exhaustive case, I would have no false ideas of what my discriminant is.

  7. Berwin Turlach Says:

    @Sietse Brouwer:

    Happy to see that we agree on something. :-)

    I also like your re-interpretation of the hypotheses in the die rolling example.

    However, just a minor point. If I were to test these two hypotheses using a die biased in the way you describe, then I would not be led “to the wrong conclusion that the die is fair”, because this is not the conclusion I can draw. My conclusion would be “there is no evidence that the die is not fair (when testing against this particular alternative)”. “No evidence that the die is not fair” is something different from “evidence that the die is fair”.

    In general, absence of evidence is not evidence of absence. A point also often not adhered to. How often did you read phrases like “a t-test showed/proved that the means of the two populations are the same” (or ANOVA if there are more than two populations)? But this is not the conclusion that you can draw, the conclusion is “a t-test failed to show that the means of the two populations are different”.

  8. Chris Lloyd Says:

    So, getting back to the original problem, how can one ever establish ESP? You pointed out that in this case the priors dominate the data. I think that they do. No matter how impressive the outcome, I am inclined to believe that the explanation is that the experiment was rigged (which makes the observed highly likely) rather than the subject has ESP (which makes the observed likely but not highly likely unless the amount of ESP is exactly tuned to the observed).

    While observers of Soal’s experiment thought that it appeared sound, I am not sure that scientists are particularly skilled at anticipating the many creative subterfuges that are possible. Indeed, in later studies Soal probably was the victim of fraudulent signalling between subjects, for instance with ultra-sonic devices.

    The fact that ESP results have never been reliably repeated makes me confident that my conviction that ESP does not exist is correct. But the disturbing fact remains that if ESP did exist, proponents would have the devil of a time proving it.

  9. “But the disturbing fact remains that if ESP did exist, proponents would have the devil of a time proving it.”

    Really? Assume that we can agree that “the existence of ESP” can be equated to “the demonstrable presence on Earth of at least one individual who can guess card face values across all sorts of blind experiments in a satisfactorily repeatable manner” then all that needs to be done is to agree upon the definition of what constitutes ESP (before the next experiment), take that individual, the one who purports to have that skill (or has been thrown up as an outlier from some large “experiment”), and repeat the experiment .. more precisely, repeat the experimental design, or one more carefully thought out - while controlling for and/or randomizing over the factors that might affect the outcome.

    A person, perhaps the one and only person on earth, who “had ESP” (as defined above) should come through these new experiments with flying colours. Easy to prove, if true. Irrespective of the priors. Just repeat and randomize and repeat.

    Retrospective analysis of flawed experimental data can take us only so far.

    The design-and-collect-more-data cycle can, on the other hand, and with luck, direct us to a closer approximation of what is more likely to be “the truth”.

    Your point about whether people/scientists are likely to be aware of possible biases/fraudulent activity (whether unconscious or not) feeds into this. Over time we come to realize what might have gone wrong in those experiments, and can redesign accordingly.

    For example I think it odd that one would do 37,100 (?independent) trials with the one person (and what about the stopping rules that were in effect) unless one’s objective was to prove that this specific person, selected on the basis of her record so far, “had ESP” : if one was interested in the more general proposition that “people in general have some ESP” then one would presumably distribute the trials over the participants more evenly.

    Additionally, I would start to think about shuffling. Did the experimenter achieve a good shuffle, and was this tested for? If not, then we could quite reasonably surmise that the successful person could and did intuit some complex patterns in the sequence. Shuffling is interesting, and hard .. see for example http://en.wikipedia.org/wiki/Shuffling and http://www.merlyn.demon.co.uk/pas-rand.htm#SDD

    And so the next set of experiments should perhaps have a new experimenter, a new team of observers (perhaps including a statistician or two!), a new location, and a better shuffling method.

    Or maybe leave it to the TV “investigators” .. “for those in Australia, tune in for ‘The One’ on Ch.7, Tuesdays at 7.30pm to see Richard Saunders being the voice of reason on a new show about ‘psychics’.”

  10. Frank Tuyl Says:

    It would seem that p=0.2 is the kind of “general law” discussed regularly by Jeffreys. Assigning prior point probability mass m to it, and 1-m uniformly to the remainder, would seem like a reasonable option here too. Then setting p(x|p=0.2)=f, the probability that the general law holds, having observed x, is p(p=0.2|x) = mf/(mf + (1-m)/(n+1)).

    If I considered 1-m = 10^(-100), say, about right, the example of 9410/37100 (f=2*10^(-139)) should convince me of ESP. But e.g. x=8000 (7.5 standard deviations) and even x=9000 (20.5 standard deviations, f=5*10^(-90)) would not :-).
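The formula in the comment above, p(p=0.2|x) = mf/(mf + (1-m)/(n+1)), is simple to evaluate. In the sketch below the function names are illustrative; f is the Binomial(n, 0.2) probability of exactly x successes, and 1/(n+1) is the probability of any particular count when p has a uniform prior on [0, 1]. The prior is passed as 1-m because m itself is indistinguishable from 1 in floating point.

```python
import math

def binom_pmf(x, n, p):
    """Binomial(n, p) probability of exactly x successes, via log-gamma."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
               + x * math.log(p) + (n - x) * math.log(1 - p))
    return math.exp(log_pmf)

def posterior_guessing(x, n=37100, p0=0.2, one_minus_m=1e-100):
    """P(p = p0 | x) with prior mass m on p0 and 1 - m spread uniformly on [0, 1]."""
    m = 1.0 - one_minus_m
    f = binom_pmf(x, n, p0)
    return m * f / (m * f + one_minus_m / (n + 1))

for x in (8000, 9000, 9410):
    print(x, binom_pmf(x, 37100, 0.2), posterior_guessing(x))
```

With 1-m = 10^(-100), the posterior probability of guessing should stay essentially at 1 for x=8000 and x=9000 and collapse towards 0 for x=9410, matching the comment's conclusion.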

  11. Sietse Brouwer Says:

    @Berwin Turlach, much belatedly:

    In general, absence of evidence is not evidence of absence. A point also often not adhered to. How often did you read phrases like “a t-test showed/proved that the means of the two populations are the same” (or ANOVA if there are more than two populations)? But this is not the conclusion that you can draw, the conclusion is “a t-test failed to show that the means of the two populations are different”.

    A very good (and not-so minor) point, and one I had indeed forgotten. Thank you for the reminder. :-)
