'Cure' as the gold standard for likelihood ratio assessment: theoretical considerations

A.L.B. Rutten, C.F. Stolper, R.F.G. Lugten, R.W.J.M. Barthels
Commissie Methode en Validering VHAN (Dutch Association of Homeopathic Physicians), The Netherlands

A gold standard is necessary to assess the validity of homeopathic symptoms. The gold standard is 'cure', but this is difficult to define and depends on consensus. The likelihood ratio (LR) method will give valid results only if the gold standard is reliable. False positives (patients unjustly assessed as cured) weaken the results of LR investigation. Weakening the standard to enlarge the research population will seriously bias the results. The same gold standard should be used in LR assessment of all symptoms.

Homeopathy (2004) 93, 78-83

Introduction

In a previous paper we introduced the likelihood ratio (LR) as a modern epidemiological tool to help in the evaluation of homeopathic notions like keynotes and the peculiar and characteristic symptoms.(1) The LR enables us to estimate how much the probability that a medicine will work increases or decreases when a certain symptom is present or absent, i.e. the step from prior to posterior probability in Bayesian terms. The LRs of symptoms should be assessed by prospective research. In such research we compare the outcome of a test (presence of a symptom) with the outcome of a gold standard (GS) criterion. If, say, we assess ultrasonography for appendicitis, we compare the outcome of ultrasonography with the outcome of histopathology as GS. But is the histopathologic outcome identical to reality? There is reason for doubt even in this case, and a GS based on clinical observations will deviate from reality even more. In homeopathy a GS is not easy to define because we have no concrete outcome measure. Our GS is 'cure', but how do we define cure and what are the consequences of this definition?

Likelihood ratio (+) = (prevalence of a symptom in the population responding to a remedy) / (prevalence of the same symptom in the rest of the population). So: if the prevalence of 'Loquacity' in the 'Lachesis-population' is 40% and the prevalence of 'Loquacity' in the rest of the population is 10%, the LR+ = 40%/10% = 4.
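The definition above can be written out as a small calculation. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and the prior of 1% used for the Bayesian update is an assumed value chosen for illustration. The update itself uses the standard odds form, posterior odds = prior odds x LR.

```python
def lr_positive(prev_in_responders: float, prev_in_rest: float) -> float:
    """LR+ = prevalence of the symptom among responders / prevalence in the rest."""
    return prev_in_responders / prev_in_rest


def posterior_probability(prior: float, lr: float) -> float:
    """Update a prior probability with a likelihood ratio, working in odds."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1.0 + posterior_odds)


# 'Loquacity' in the Lachesis example: 40% among responders, 10% in the rest.
lr_loquacity = lr_positive(0.40, 0.10)

# Assumed prior of 1% that the medicine will work (illustrative value only).
posterior = posterior_probability(0.01, lr_loquacity)

print(f"LR+ = {lr_loquacity:.2f}, posterior probability = {posterior:.3f}")
```

With a low prior, even an LR+ of 4 leaves the posterior probability small, which is why several symptoms are usually needed before a prescription becomes likely to work.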

What is a gold standard?

To research diagnostic tools for a disease we must first define the population that has the disease. This is also done by a diagnostic test, but one that we already rely on. In diagnostics the GS is the perfect test with no false positive and no false negative results. The 2x2 table of such a diagnostic test looks like Table 1; there are no false positive (b=0) or false negative (c=0) results.
 

                 illness present   illness absent   total
test positive    a=10              b=0              10
test negative    c=0               d=990            990
total            10                990              1000

Table 1 2x2 table of the perfect test as gold standard. False positive (b)=0 and false negative (c)=0
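The cells of Table 1 can be turned into error rates directly. This is a minimal sketch with a hypothetical helper name; it expresses false positives as a share of those without the illness and false negatives as a share of those with it, which is one common convention and an assumption here.

```python
def error_rates(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Return (false positive rate, false negative rate) for a 2x2 table.

    a = test positive, illness present   b = test positive, illness absent
    c = test negative, illness present   d = test negative, illness absent
    """
    false_positive_rate = b / (b + d)  # share of those without illness flagged as ill
    false_negative_rate = c / (a + c)  # share of those with illness that the test misses
    return false_positive_rate, false_negative_rate


# Cells of Table 1: the perfect test.
fp_rate, fn_rate = error_rates(a=10, b=0, c=0, d=990)
print(f"false positive rate = {fp_rate:.1%}, false negative rate = {fn_rate:.1%}")
```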

Such perfect tests are rare. Tissue histopathology obtained from surgery is often taken as GS, but even this is not perfect. Even in histopathology 2% false positives and 3% false negatives occur.(2) We must redefine 'GS' as the best possible reference test given the circumstances and the goal of the symptom/test being evaluated. In fact many imperfect tests have been used as GS. Table 2 gives some examples.

Illness/diagnosis          Test                   Gold standard                 Author
Headache in children       list of criteria       diagnosis of neurologist      A. Ozge(3)
Thyroid size               palpation              ultrasonography               Castaneda(4)
Myocardial infarction      new 6 hour protocol    existing 48 hour protocols    K.R. Herren(5)
Diagnosis of chest pain    first consultation     follow-up consultation        F. Buntinx(6)

Table 2 Different kinds of gold standards

The GS tests in Table 2 are taken as standards because of various qualities including: objectivity (ultrasound), greater skill (medical specialist), prior acceptance (existing protocol), or 'time tells' (follow-up consultation). The tests under evaluation have different objectives, such as first line screening, cheaper and faster testing, and new diagnostic tools. The imperfection of these GSs can be accounted for by practical and ethical considerations.

We see that in most research the GS is not identical to truth. If we assess a diagnostic tool we should also 'assess' the GS, but that is not possible in most cases. In many cases we can only estimate the influence of the GS. It is possible to improve this estimation by a theoretical model, as we will show in this paper. Bias by GS has some similarities with misclassification and selection bias.(7)

Gold standard in homeopathic LR research

The LR+ of a symptom regarding a homeopathic medicine indicates the increase of likelihood of a curative action of that medicine. Our 'test' is not meant to diagnose an illness but a curative potential of a medicine. So 'cure' caused by the investigated medicine should be our GS. The clinical relevance of this GS is unarguable, but it is impossible to define 'cure' in a clear and unambiguous way. Therefore we must find an operational definition for cure. In homeopathy cure means more than the disappearance of a single symptom. Over two centuries of homeopathic practice have established a consensus about the meaning of 'cure' after homeopathic treatment.(8,9) The most important criteria for cure are improvement of well-being, improvement of energy and 'constitutional changes' (temperature, appetite, stool, sleep, etc.). Nevertheless, as Swayne states, "Collection of data and their analysis have not been systematic enough to define them clearly, to validate them absolutely, nor to tell us all that we need to know about them". There are several scales describing the outcome of treatment, one of which was developed in Glasgow (GHHOS).(10,11) This scale was further refined at a consensus meeting of 80 Dutch homeopathic physicians in 1996.(12) This refined version of the GHHOS has been used subsequently in the Netherlands (see Figure 1). Participating doctors present cases where the medicine under consideration was prescribed and the participants assess all cases. Outcomes are rated between -4 and +4. Generally 10 to 20 cases concerning one medicine are accepted as satisfying the criteria for a +3 or +4 result in a group of 10 to 15 homeopathic physicians, about one case for each doctor. The causal relation between cure and medicine is also assessed in this process. In comparison to the Glasgow version, two additional factors are specified in the Dutch version:

  1. The effect is related to natural course of illness, premorbid course and gravity of illness (acute situation).
  2. Positive or negative external influences on the course of illness (chronic situation).

These factors influence the objectivity of the scale because they rely mainly on clinical judgement. In fact we use a threshold value to distinguish cases cured by a certain medicine from cases cured by other factors or maybe not cured at all. By lowering this threshold value we increase the number of ‘cured’ cases.

Then there is the relation between the assessed symptom and the cure. There are several pitfalls here, like the reliability of history taking and expectation bias.(13) Other problems, like our ignorance about the differences in effect between potencies, must be kept in mind.

Figure 1 Specified GHHOS scale

Post aut propter?

Is the patient cured after or because of the medicine? There has to be an apparent causal relationship between the medicine and an effect. The judgement about this causal relationship depends on 'knowledge about the world': an expectation about the chance of spontaneous recovery and the spontaneous course of illness, based on experience in daily practice. Such knowledge is often not founded on epidemiological data. Furthermore there is the role of placebo effects. A treatment result of +3 or +4 on the GHHOS scale strengthens our opinion that the cure was caused by the treatment, but the natural course of disease is even more important. Most practitioners will see the recovery from an acute illness as spontaneous, but when a patient suffering from, say, emphysema experiences a substantial and lasting improvement in functionality after a certain therapy we strongly suspect a causal relationship. In daily practice each practitioner makes an (implicit) estimate of an effect, based on clinical experience. At the moment there is no objective scale to replace this intuitive procedure.

A physician can get an idea about the action of a medicine in a case by reading a full and adequate description. It would be desirable to standardise the description of cases. Each case-description should include a paragraph containing the following:

Consensus

Our GS, i.e. our assessment of whether the therapy resulted in a cure, relies mainly on consensus. But how reliable is the consensus process? The consensus procedure is not uncommon in medicine, because there is insufficient evidence for many health problems. Even if there is evidence, medical practice varies widely over the world, indicating that consensus (and evidence-based medicine) is not an infallible system.(14) Efforts have been made to improve this situation, for instance by the American Institute of Medicine (IOM).(15) The IOM has proposed a list of eight desirable attributes for clinical practice guidelines: 1. Validity. 2. Reliability/reproducibility. 3. Clinical applicability. 4. Clinical flexibility. 5. Clarity. 6. Multidisciplinary process. 7. Scheduled review. 8. Documentation. 'Validity' is explained as: "Practice guidelines are valid if, when followed, they lead to the health and cost outcomes projected for them. A prospective assessment of validity will consider the substance and quality of the evidence cited, the means used to evaluate the evidence, and the relationship between the evidence and recommendations. Practice guidelines should be accompanied by descriptions of the strength of evidence and the expert judgement behind them". 'Reliability and Reproducibility' is explained as: "Practice guidelines are reproducible and reliable (1) if - given the same evidence and methods for guidelines development - another set of experts produce essentially the same statements and (2) if - given the same clinical circumstances - the guidelines are interpreted and applied consistently by practitioners".

Nevertheless consensus is an inevitable and accepted process in defining GS. There is also statistical theory about consensus and models to assess rater performance. This theory states that the accuracy of a study is a function of the number of raters and the agreement between raters. Such models appear to be robust even when raters are biased.(16) In our research we need more raters in any case because of the large amount of data required. Therefore we believe that LR investigation must be accompanied by consensus meetings where cases are discussed to reach agreement about criteria for 'cure'. The investigation will last several years and in the meantime data from all participating practices will be gathered. Probably this will reveal differences in interpretation of 'cure'. These differences must be discussed to reach better consensus. The description of this consensus should spread in the homeopathic community and lead to further discussion.
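To make the dependence on the number of raters and their agreement concrete, one standard formalisation is the Spearman-Brown formula for the reliability of a pooled judgement. The sketch below is an illustration only: whether the cited consensus model uses exactly this form is an assumption, the function name is hypothetical, and the mean agreement of 0.5 is an invented value.

```python
def pooled_reliability(n_raters: int, mean_agreement: float) -> float:
    """Spearman-Brown formula: reliability of the pooled judgement of n raters,
    given the mean agreement (correlation) between pairs of raters."""
    return (n_raters * mean_agreement) / (1.0 + (n_raters - 1) * mean_agreement)


# Hypothetical mean agreement of 0.5 between raters (not a measured value).
for n in (1, 5, 10, 15):
    print(f"{n:2d} raters -> pooled reliability {pooled_reliability(n, 0.5):.3f}")
```

Under these assumptions the pooled reliability rises quickly with the first few raters and then levels off, which fits the group sizes of 10 to 15 physicians described earlier.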

The most important question is whether this consensus is valid and whether it leads to better prescribing. We expect clear-cut cases to have a definite relation between symptom, medicine and cure, which can be measured by LR. If this relation is not visible there can be two causes: 1. The symptom is not valid for this medicine, or 2. The GS (consensus) did not indicate the right cases. Is it possible to distinguish a better GS, in other words: is a GHHOS score of +4 better than a GHHOS score of +2? For this we need a quantitative model.

A model for gold standard

In most cases GS is not equal to reality, but the best possible approximation of reality. In researching diagnostic tests we do not assess reality but GS. So this research consists of two stages:

  1. Submitting the population to GS procedure to identify cases with the diagnosis (cases cured by the therapy).
  2. The actual research of the diagnostic instruments (symptoms) that lead to the diagnosis (cases cured by the therapy).

Our GS is cure as we observe it. Suppose we observe a population of 1010 patients after treatment. Some of the results are vague: some patients are cured but not necessarily by the medicine; in some patients the medicine worked but cure was prevented by other factors. Our observation is partly blurred, as if seen through a lens that leaves the edges of the image vague. Figure 2 shows a strict assessment of results designed to prevent cases being wrongly classified as ‘cured by the medicine’ (false positives). Strict assessment means that more doubtful cases will be qualified as ‘not cured by the medicine’. In homeopathic practice only a small part of the population is cured by one medicine. Here we assume that one homeopathic remedy can cure 1% of our population (in Bayesian terms we say that the prior chance of cure by one medicine is 1%). Therefore the rest-population is much larger than the medicine-population; a slight weakening of the rules for assessment of cure will cause a large increase in the number of ‘cured cases’, but this increase is misleading. Figure 2 shows that even with strict criteria for cure one can end up with a research population of which more than half is in fact not cured by the medicine.

Figure 2 Reality filtered by GS (assessment of cure); 1% false positives renders 10 cases unjustly qualified as cured
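The composition of the observed 'cured' group can be checked with a short calculation. This is a minimal sketch with hypothetical names. It assumes that 10 of the 1010 patients are truly cured by the medicine, that 1% of the remaining 1000 are wrongly assessed as cured, and that eight of the truly cured pass the strict criteria (the count used in the worked example of the next section).

```python
def false_positive_share(true_positives: float, false_positives: float) -> float:
    """Fraction of the observed 'cured' group that was not cured by the medicine."""
    return false_positives / (true_positives + false_positives)


rest_population = 1000        # patients not cured by the medicine
false_positive_rate = 0.01    # 1% of them are nevertheless assessed as cured
true_positives = 8            # truly cured patients that meet the strict criteria

false_positives = rest_population * false_positive_rate
share = false_positive_share(true_positives, false_positives)

print(f"{false_positives:.0f} false positives among "
      f"{true_positives + false_positives:.0f} observed 'cured' cases ({share:.1%})")
```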

The influence of false positives and negatives

We have seen that the low prior-chance of cure by any one homeopathic medicine increases the risk of false positives. Figure 3 shows how this influences LR.
Suppose that in reality symptom A occurs in 50% of the cured population and in 10% of the rest of the population. Four of the eight cured patients that fit our criteria for cure (true positives) have symptom A. One of the 10 patients that are wrongly classed as cured (false positives) has symptom A. In the same way, the symptom A cases in the observed rest of the population consist of 10% of the true negative population and 50% of the false negative population. Thus we get the following list:
 

              Symptom A in             Prevalence of A in    Symptom A in                Prevalence of A in
              cured population         cured population      rest-population             rest-population
Reality       10*50% = 5               50%                   1000*10% = 100              10%
Observation   (8*50%)+(10*10%) = 5     5/18 = 27.8%          (990*10%)+(2*50%) = 100     100/992 ≈ 10%

So the real LR+=50/10=5.

The observed LR+=27.8/10=2.78.

When the population cured by a medicine is much smaller than the rest of the population, as is the case in homeopathy, the influence of false positives is great. In the false positive population the prevalence of symptom A is lower than in the true positive population. In the false negative population the prevalence of symptom A is higher than in the true negative population. So, both false positives and false negatives have a negative influence on LR+.
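The worked example can be reproduced with a small mixing model. This is a sketch under the stated numbers, with hypothetical class and function names: true positives and false negatives carry the real prevalence of the cured population (50%), while false positives and true negatives carry the real prevalence of the rest (10%).

```python
from dataclasses import dataclass


@dataclass
class GoldStandardCounts:
    true_pos: float   # cured by the medicine and assessed as cured
    false_pos: float  # not cured by the medicine but assessed as cured
    true_neg: float   # not cured by the medicine and assessed as not cured
    false_neg: float  # cured by the medicine but assessed as not cured


def observed_lr_positive(gs: GoldStandardCounts,
                         prev_cured: float, prev_rest: float) -> float:
    """LR+ as it appears after the population has been filtered by the GS."""
    symptom_in_observed_cured = gs.true_pos * prev_cured + gs.false_pos * prev_rest
    symptom_in_observed_rest = gs.true_neg * prev_rest + gs.false_neg * prev_cured
    prev_observed_cured = symptom_in_observed_cured / (gs.true_pos + gs.false_pos)
    prev_observed_rest = symptom_in_observed_rest / (gs.true_neg + gs.false_neg)
    return prev_observed_cured / prev_observed_rest


counts = GoldStandardCounts(true_pos=8, false_pos=10, true_neg=990, false_neg=2)
real_lr = 0.50 / 0.10
observed_lr = observed_lr_positive(counts, prev_cured=0.50, prev_rest=0.10)

print(f"real LR+ = {real_lr:.2f}, observed LR+ = {observed_lr:.2f}")
```

Without rounding, the observed rest prevalence is 100/992 rather than exactly 10%, so the sketch gives an observed LR+ of about 2.76 instead of the 2.78 obtained with the rounded figure.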

Figure 3 The absolute occurrence of symptom A in reality and in the observed population after assessment of cure with 1% false positives and 10% false negatives

The real LR+ (5) is roughly 80% higher than the LR+ in the observed population (about 2.8). The influence of different GS on LR+ is demonstrated in Figure 4. We must also realise that 3% false positives will lead to a large majority of false positives in the assessment, which renders our assessment useless.
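The effect of a weaker GS can be explored by varying the false positive rate in the same example. This sketch is not a reconstruction of the figure. It assumes 10 truly cured patients, 1000 others, two false negatives throughout, and real prevalences of 50% and 10%; the function name and the chosen rates are illustrative.

```python
def sweep_false_positive_rates(fp_rates, n_cured=10, n_rest=1000, false_neg=2,
                               prev_cured=0.50, prev_rest=0.10):
    """For each false positive rate return (rate, share of false positives
    among observed 'cured' cases, observed LR+)."""
    rows = []
    true_pos = n_cured - false_neg
    for fp_rate in fp_rates:
        false_pos = n_rest * fp_rate
        true_neg = n_rest - false_pos
        prev_obs_cured = ((true_pos * prev_cured + false_pos * prev_rest)
                          / (true_pos + false_pos))
        prev_obs_rest = ((true_neg * prev_rest + false_neg * prev_cured)
                         / (true_neg + false_neg))
        fp_share = false_pos / (true_pos + false_pos)
        rows.append((fp_rate, fp_share, prev_obs_cured / prev_obs_rest))
    return rows


results = sweep_false_positive_rates([0.0, 0.01, 0.02, 0.03])
for fp_rate, fp_share, lr in results:
    print(f"false positive rate {fp_rate:.0%}: "
          f"{fp_share:.0%} of 'cured' are false positives, observed LR+ = {lr:.2f}")
```

Under these assumptions each additional percent of false positives lowers the observed LR+ further, and at 3% more than three quarters of the 'cured' group are false positives.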


Figure 4 Difference between observed and real LR+ with different gold standards

Discussion

Cure as GS is hard to define because clinical judgement plays an important role. We must strive for a sound consensus procedure to define our GS, but even then the hazards of this standard must be clarified. This clarification helps us to make the right choices, although some uncertainty will remain. In a previous paper we showed that clinical judgement can be used and controlled by using multiple raters and assessing inter-rater variance.(17) We cannot know the real accuracy of our GS because we cannot measure the number of false positives and false negatives. We also do not know the relative increment of the false positive population if our criteria are weakened. We made some assumptions based on the consensus of a group of practitioners in Holland, so we invite discussion. In presenting successful cases we all implicitly use some GS of our own; these should be made more explicit. In this paper we show the consequences of choices made in the Dutch materia medica validation programme.

We assumed that the prior chance that any homeopathic medicine will work is 1% in our treated population. We cannot know this for sure and we do not know if this prior probability is the same for all homeopathic medicines.

We do not know if our GS is valid. If it is valid, it should distinguish cure by the investigated medicine from cure by other causes and therefore generate few false positives. As false positives are the main cause of systematic bias, we expect the best results (and the highest LR+) from strict criteria. If the GHHOS scale is valid, the highest LR+ should be obtained in category +4 patients and lower LR+ in lower categories. Therefore, assessment of LR could be used to validate the GHHOS scale. We must try to enhance consensus about our outcome assessments (like GHHOS) in order to make comparisons between different grades of 'cure'.

The GS procedure weakens the real LR because the prevalence of the symptom in the false positive population is less than in the population truly cured by the medicine. At the moment we cannot know how much deviation from reality our GS causes for reasons mentioned above. Eventually we might get indications of this deviation when practice shows that we need fewer symptoms to be certain of a cure than our expectation based on LR. We explain more about numbers of symptoms needed in a separate paper.(18)

There is an important difference between a GS and confidence intervals with respect to population size. If the GS is inaccurate there will be systematic bias of results, even in a large research population, while confidence intervals become smaller when the research population is enlarged. So, in a large research population the reliability of research could be mainly determined by the GS. We may even be tempted to use a less strict GS to increase our research population. This has a positive effect on the confidence interval, but enlarges the systematic bias caused by a weak gold standard. In a former paper we discussed the influence of qualitative vagueness and expectation bias on LR results.(13) This kind of bias makes LR stronger. As LR research will be performed by different groups of raters to investigate a large number of symptoms, it is essential that both sources of bias are dealt with in the same way by every group. After all, our method consists of comparing the strength of different symptoms, so these symptoms must be comparable.
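The contrast between a shrinking confidence interval and a persisting bias can be illustrated with the observed counts of the worked example (5 of 18 'cured' and 100 of 992 others with symptom A, real LR+ of 5). The sketch uses the usual log-method approximation for the confidence interval of a ratio of two proportions; the function name and the scaling factors are assumptions made for illustration.

```python
import math


def lr_positive_with_ci(sym_cured: float, n_cured: float,
                        sym_rest: float, n_rest: float, z: float = 1.96):
    """Observed LR+ with an approximate 95% confidence interval (log method)."""
    lr = (sym_cured / n_cured) / (sym_rest / n_rest)
    se_log = math.sqrt(1 / sym_cured - 1 / n_cured + 1 / sym_rest - 1 / n_rest)
    lower = math.exp(math.log(lr) - z * se_log)
    upper = math.exp(math.log(lr) + z * se_log)
    return lr, lower, upper


# Same imperfect GS, research population enlarged 1x, 10x and 100x.
intervals = {}
for scale in (1, 10, 100):
    intervals[scale] = lr_positive_with_ci(5 * scale, 18 * scale,
                                           100 * scale, 992 * scale)
    lr, lower, upper = intervals[scale]
    print(f"x{scale:<3d} observed LR+ = {lr:.2f}  (95% CI {lower:.2f} to {upper:.2f})")
```

Enlarging the population narrows the interval around the biased estimate of about 2.76; in this sketch the real value of 5 already falls outside the interval at ten times the original population.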

Conclusion

The nature of the GS adopted is one of the most important potential sources of bias in diagnostic research of homeopathic clinical symptoms. Assessment of a symptom or diagnostic instrument should consist of two parts: first the assessment of the GS, then the assessment of the diagnostic instrument itself. An imperfect GS implies that an important part of the population regarded as cured by a medicine is in fact not cured by it (false positive), so results cannot be ascribed to the medicine. An imperfect GS causes negative bias in the strength of LR, the more so if the standard is weakened. In a previous paper we demonstrated that expectation bias based on vagueness causes positive bias. These factors distort our picture of reality and, unlike the confidence interval, cannot be reduced by enlarging the investigated population. We must take care not to weaken our GS in order to get larger research populations because this is misleading; our confidence interval becomes smaller but LR becomes much weaker. Our main concern is to standardise the protocol for LR research in order to make results of different symptoms comparable. As yet the influence of GS on results can only be estimated, but we can use it to validate our consensus about cure and our scales to express this consensus.

Acknowledgements

We thank Guido A. Wolman, Actuary (UBA) (Investigator University of Bologna, Argentina), for his comments and help.

References

1. Stolper CF, Rutten ALB, Lugten RF, Barthels RJ. Improving homeopathic prescribing by applying epidemiological techniques: the role of likelihood ratio. Homeopathy. 2002;91:230-8.
2. Riber C, Tonnesen H, Aru A, Bjerregaard B. Observer variation in the assessment of the histopathologic diagnosis of acute appendicitis. Scand J Gastroenterol. 1999;34:46-9.
3. Ozge A, Bugdayci R, Sasmaz T, Kaleagasi H, Kurt O, Karakelle A et al. The sensitivity and specificity of the case definition criteria in diagnosis of headache: a school-based epidemiological study of 5562 children in Mersin. Cephalalgia 2003;23:138-45.
4. Castaneda R, Lechuga D, Ramos RI, Magos C, Orozco M, Martinez H. Endemic goiter in pregnant women: utility of the simplified classification of thyroid size by palpation and urinary iodine as screening tests. BJOG. 2002;109:1366-72.
5. Herren KR, Mackway-Jones K, Richards CR, Seneviratne CJ, France MW, Cotter L. Is it possible to exclude a diagnosis of myocardial damage within six hours of admission to an emergency department? Diagnostic cohort study. BMJ 2001;323:372.
6. Buntinx F, Truyen J, Embrechts P, Moreel G, Peeters R. Chest pain: an evaluation of the initial diagnosis made by 25 Flemish general practitioners. Family Practice 1991;8:121-4.
7. Bouter LM, Dongen MCJM van. Bedreigingen van de interne validiteit. Epidemiologisch onderzoek; opzet en interpretatie, pp 186-221. Houten/Diegem: Bohn Stafleu Van Loghum, 1995.
8. Kent JT. XXXV. Prognosis after observing the action of the remedy. Lectures on homeopathic philosophy, pp 224-34. Wellingborough, Northamptonshire: Thorsons Publishers Ltd, 1900.
9. Swayne J. The response to the prescription. Homeopathic method, pp 169-85. New York: Churchill Livingstone, 1998.
10. European Committee for Homeopathy. Data Collection in Homeopathic Practice: a proposal for an international standard. 1999.
11. ADHOM: Academic Departments of the Glasgow Homeopathic Hospital. The development of the GHHOS, the IDCCIM action research, & the PC-HICOM project. Interim report februari 2003. 2003.
12. Stolper CF, Kipp RP, Lugten RFG. A proposed consensus on evaluating a case: results of the VHAN-conference. Homeopathic Links 1998;11:51-3.
13. Rutten ALB, Stolper CF, Lugten RF, Barthels RJ. Assessing likelihood ratio of clinical symptoms: handling vagueness. Homeopathy. 2003;92:182-6.
14. Payer L. Medicine & Culture. Varieties of treatment in the United States, England, West Germany and France. New York: Holt and Company, 1998.
15. Committee on Clinical Practice Guidelines IoM. Guidelines for clinical practice: from development to use. Washington DC: National Academic Press, 1992.
16. Weller SC, Mann NC. Assessing rater performance without a "gold standard" using consensus theory. Med Decis Making 1997;17:71-9.
17. Rutten ALB, Stolper CF, Lugten RF, Barthels RJ. Is assessment of likelihood ratio of homeopathic symptoms possible? A pilot study. Homeopathy. 2003;92:213-6.
18. Rutten ALB, Stolper CF, Lugten RF, Barthels RJ. Repertory and likelihood ratio: time for structural changes. Homeopathy. 2004;93:120-4.

 


Lex Rutten, MD

Aard 10 - 4813 NN Breda, Netherlands