STUDY SIZE DETERMINATION USING THE PROGRAM 'SSD'.
EXPLANATION AND REFERENCES.
#
Contents:
1. Introduction
2. Experimentation and statistical tests
3. Relationship between experiment and reality
4. Level of significance, beta and size of study
5. Conservatism in computing
6. Messages and input orders
7. References and further comments
8. Checking the program output
9. Program versions
Addendum: Other statistical PC programs for medicine that
may be obtained free of charge.
#
1. Introduction
---------------
It is today considered unethical to perform medical
experiments without giving due consideration to certain
statistical concepts, namely the Power and the Confidence
of the design.
The program 'SSD' is an inexpensive and efficient
tool for experiment planning in medicine. A study should be
large enough to allow the chance of 'finding what you are
after' to be as good as possible. If you can define the
Confidence you want to achieve (corresponding to the
'alpha' value), as well as the Power (corresponding to the
'beta' value), the program calculates the necessary size of
your study.
The present text outlines the theory behind the
calculations, and gives an overview of the program SSD and
references for the methods.
If you are familiar with the theory concerning alpha,
beta, Error of type I and II, you may safely skip to
section 6, 'Messages and input orders'.
2. Experimentation and statistical tests
----------------------------------------
One of the foundations of modern research is hypothesis
testing by experiment. The experimenter tries to select one
out of two conclusions: Either the hypothesis is accepted
(for the time being) as true, or discarded (for the time
being) as false.
Under suitable circumstances, a statistical test may be
used to reach the conclusion in a safe way. The procedure
is named: The "Neyman-Pearson scheme". This method takes
as its origin a re-formulation of the original research
hypothesis, called a "null hypothesis", often written H0.
In its simplest version, a null hypothesis implies that
the groups to be compared by means of the experiment are
basically similar. In statistical terms, this is expressed
by saying that the two data sets to be compared could
very well have been drawn at random from the same set
of data. This common set, which is purely theoretical,
is often thought of as being very large, i.e. infinitely
large.
All statistical tests consist of the same basic steps:
1) From the experiment data, the value of a test variable
is computed.
2) The value of the test variable is compared with a
pre-arranged table, to find the so-called significance
(P value).
3) The P value expresses how frequently the test variable
would reach the computed value, IF THE NULL HYPOTHESIS
WERE TRUE.
The 'alpha' value is set beforehand. It is in fact a
critical P value. If computed P is equal to or smaller than
'alpha', the P value is said to be 'significant'.
Let us suppose that a priori, alpha is set to 0.05.
Then, if the computed P value turns out to be equal to or
smaller than 0.05, that is taken to mean that were the null
hypothesis true, then something very rare has happened.
(Something so rare that its probability is 1 in 20 or
smaller.) As very rare occurrences are rare indeed, it
seems easier to conclude that the null hypothesis might be
wrong altogether.
When a certain alpha (critical P value) has been defined,
then such a test will always lead to one out of two
conclusions:
1) The null hypothesis must be accepted (for the time being),
or
2) The null hypothesis must be rejected (for the time being).
Note: There is a current trend towards preferring confidence
intervals instead of statistical tests for data presentation.
However, the approach is the same in principle, and testing
takes place as before.
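The three basic steps above can be sketched in a few lines
of Python. This is only an illustration, not part of SSD:
it assumes a test variable that follows the standard normal
distribution under H0, and the value 2.10 is hypothetical.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided P value for a test variable z that is standard normal under H0."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def conclusion(p_value, p_crit=0.05):
    """Compare the computed P with the critical P value set beforehand."""
    if p_value <= p_crit:
        return "reject H0 (for the time being)"
    return "accept H0 (for the time being)"

z = 2.10  # hypothetical value of the test variable
p = two_sided_p(z)
print(f"P = {p:.4f}: {conclusion(p)}")
```

In place of the pre-arranged table of step 2, the sketch
computes the P value directly from the normal distribution.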
3. Relationship between experiment and reality
----------------------------------------------
So far, we have spoken only about hypothesis and experiment.
However, in the background looms harsh reality, the reality
that we are trying to master through our research.
One might say that "in reality", the null hypothesis is
"really" right or "really" wrong. We do not know which is
the case. This lack of knowledge is precisely the reason
for performing the experiment.
Clearly, the experiment together with the test may lead
us into one out of four situations:
1) Conclusion: H0 is true. Reality: H0 is true.
2) Conclusion: H0 is false. Reality: H0 is true.
3) Conclusion: H0 is true. Reality: H0 is false.
4) Conclusion: H0 is false. Reality: H0 is false.
Situations 1) and 4) are beneficial, because they help us
to master reality. Situations 2) and 3) are harmful, and
we want to avoid them at all reasonable cost.
Situation 2) is named: Error of type I (Erroneous rejection
of H0). - Clinical analogy: False positive (acceptance of)
diagnosis.
Remember that H0 implies no difference between the groups
- 'no diagnosis'.
Situation 3) is named: Error of type II (Erroneous acceptance
of H0). - Clinical analogy: False negative (rejection of)
diagnosis.
Statistical tests are arranged in such a way that the P
value defined as critical expresses the risk we are willing
to take of making an Error of type I. This risk is also
named "alpha" (or sometimes 2 * alpha).
In this text and in the program, we shall use the
convention of setting P = alpha for one-sided
(one-tailed) tests, and P = 2 * alpha for two-sided
tests. Some authors use P = alpha/2 and P = alpha,
respectively. The choice of notation is unimportant,
as long as the meaning is clear.
The risk we are willing to take of making an Error of type
II is named "beta".
In current research, critical P is usually set (in
advance) equal to 0.01 (1 per cent) or 0.05 (5 per cent).
The inverse value, (1 minus Pcrit) or (1 minus alpha), is
termed "Confidence", and is set, correspondingly, to 99 or
95 per cent.
Beta is usually set (in advance) equal to 0.01, 0.05,
0.1 or 0.2.
In practice, the inverse value, (1 minus beta), termed
"Power", is often used, set correspondingly to 99, 95,
90 or 80 per cent.
Under ordinary circumstances, setting P = 0.05 and
aiming at a Power of 90 per cent is a good choice.
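The conventions of this section may be illustrated by a
short Python sketch. It assumes that the tests rest on the
standard normal distribution; the function names are
invented for the illustration and do not come from SSD.

```python
from statistics import NormalDist

def tail_alpha(p_crit, two_sided=True):
    """One-tail area: alpha = P for one-sided tests, alpha = P / 2 for two-sided."""
    return p_crit / 2.0 if two_sided else p_crit

def normal_deviate(upper_tail):
    """Standard normal value cutting off the given upper-tail probability."""
    return NormalDist().inv_cdf(1.0 - upper_tail)

p_crit, beta = 0.05, 0.10
print("Confidence:", 100 * (1 - p_crit), "per cent")
print("Power:", 100 * (1 - beta), "per cent")
print("z(alpha), two-sided:", round(normal_deviate(tail_alpha(p_crit)), 3))
print("z(alpha), one-sided:", round(normal_deviate(tail_alpha(p_crit, False)), 3))
print("z(beta):", round(normal_deviate(beta), 3))
```

The two normal deviates, one for alpha and one for beta,
are the quantities that enter ordinary study size formulas.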
4. Level of significance, beta and size of study
------------------------------------------------
The magnitudes of alpha and beta are closely connected
to study size, N. The larger our study, the smaller will be
alpha (for a set beta) or beta (for a set alpha).
In some cases, Error of type I may be very harmful, and
must be avoided even if the cost is high. The P value
is then set low, e.g. equal to 0.01. In other cases, one
is very afraid of Error of type II. Beta is then set low,
e.g. 0.05 or 0.01. - A very low alpha or beta must always
be paid for with a large N. -
The statistical part of today's experiment planning may
consist of the following steps:
* Definition of H0. Choice of statistical test.
* Discussion of consequences of making Error of type I and
Error of type II.
* Based on this discussion, significance level alpha and
Power (beta) are set.
* From alpha and beta values, as well as statistical numbers
expected to emerge, necessary study size is computed.
* If the resulting study size turns out to be impractical,
the sequence is repeated.
The program 'SSD' performs the necessary computations,
thus leading to clear advice concerning study size.
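As an illustration of the computation step, the following
Python sketch uses the ordinary normal-approximation formula
for comparing two averages. It is an assumption made for
this example and not necessarily the algorithm used by SSD;
the planning values (difference 5, S.D. 10) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_means(delta, sd, p_crit=0.05, beta=0.10, two_sided=True):
    """Normal-approximation group size for comparing two averages.

    delta: difference between the two expected averages
    sd:    'best guess' of the common standard deviation
    """
    tail = p_crit / 2.0 if two_sided else p_crit
    z_a = NormalDist().inv_cdf(1.0 - tail)
    z_b = NormalDist().inv_cdf(1.0 - beta)
    return ceil(2.0 * ((z_a + z_b) * sd / delta) ** 2)

n = n_per_group_means(delta=5.0, sd=10.0)   # hypothetical planning values
print("n per group:", n, " N total:", 2 * n)
```

With these values the sketch gives n = 85 per group. Trying
a one-sided test or a larger beta shows directly how a less
strict alpha or beta is paid back as a smaller N.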
However, there are instances when our guess of the
statistical numbers expected to emerge is very fuzzy.
Then, study size cannot be calculated accurately. It is
recommended in such cases first to set the alpha value and
then aim at a set minimal Power (maximal beta).
It is common practice during such calculations to assume
a 'classical' study design: a study having only two
experiment groups, each group consisting of the same
number of individuals (experiment animals or patients),
n = N/2.
If more than two groups are planned, one must identify
the two groups which are most interesting to compare, and
base the calculations on those groups. The computed study
size then, of course, comprises these two groups only, and
additional groups must be added afterwards.
In most experimental research, the experiment groups are
made equally large. However, there are situations where
unequal group sizes may be feasible. For example, one of
the groups may cost far more per unit than the other. In
epidemiological research, one 'experiment group' may in
fact be so large as to constitute a 'population'. The
computations necessary in these cases may also be performed
by the program SSD.
5. Conservatism in computing
----------------------------
This section pertains mostly to the simple comparative
two-group experiment, where measurement data are used for
Student's t-test, alternatively the Wilcoxon two-group
rank test. In case of enumeration data, Chi square
tests with or without Yates's correction are used,
alternatively Fisher's 'exact' test.
The word 'conservative' in statistical texts usually
means 'taking care of significance level and Power'
when the statistical calculations are performed.
Computations of the present kind are based on
approximations. Rounding errors as well as 'continuity'
errors (stemming from the discreteness of the
distributions) must be taken into account. The randomness
of the realization must also be kept in mind. And, last
but not least, the uncertainty nearly always present when
we try to set down the alternative or alternatives to
the null hypothesis. (Such an alternative hypothesis
is often called 'H1'.)
The word 'conservative' implies that deviations caused by
these factors will be steered towards the safe side, so
that realised alpha (P) and beta will be equal to, or
smaller than the theoretical values (hence, Confidence and
Power larger).
However, designing studies that are larger than necessary
is uneconomical, and may also be ethically indefensible.
The wish for a design having high a priori Confidence and
Power must be balanced against economy and ethics. This
is the very reason for the development of the present
program.
The program SSD is conservative, but not overly so.
The sample sizes calculated are robust against non-normality.
The advice given is based, when possible, on simulations.
When using the Wilcoxon rank test in preference to
Student's t-test, Power may be increased or reduced,
depending on the form of the distributions. Relatively
large gains of Power (or saving with regard to study size)
can be achieved by taking the expected distribution of data
into consideration.
If minimal study size is essential, computer simulation
can clarify the situation. By running a few simulated
experiments, one may study the behaviour of the data,
choose the statistical test leading to maximal Power, and
adjust study size so that the Power requirement is
accurately satisfied. This procedure is the same for
enumeration data as for measurement data.
The simulation programs SEQX, SEQY and SEQZ, written to
simulate group sequential experiments, may also be used to
simulate fixed size experiments.
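The idea of checking a study size by simulation may be
sketched as follows. This is a minimal Python illustration,
not the SEQX, SEQY or SEQZ programs: it assumes normally
distributed data and, for simplicity, compares the test
variable with a normal critical value instead of Student's
t. All planning values are hypothetical.

```python
import random
from statistics import NormalDist

def mean_and_variance(x):
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / (len(x) - 1)

def simulated_power(n, delta, sd, p_crit=0.05, runs=2000, seed=1):
    """Share of simulated two-group experiments reaching two-sided significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - p_crit / 2.0)  # normal approximation to t
    significant = 0
    for _ in range(runs):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(delta, sd) for _ in range(n)]
        mean_a, var_a = mean_and_variance(a)
        mean_b, var_b = mean_and_variance(b)
        test_variable = (mean_b - mean_a) / ((var_a + var_b) / n) ** 0.5
        if abs(test_variable) >= z_crit:
            significant += 1
    return significant / runs

power_h1 = simulated_power(n=85, delta=5.0, sd=10.0)   # H1 true
rate_h0 = simulated_power(n=85, delta=0.0, sd=10.0)    # H0 true
print("Power with H1 true:", power_h1)
print("Rejections with H0 true:", rate_h0)
```

Replacing rng.gauss by another random generator shows how
the same design behaves when the data are not normal.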
6. Messages and input orders
----------------------------
Some of the messages and input ordes appearing during
program execution are listed below, with comments.
(Due to minor differences between the versions of SSD,
the below messages may be slightly different from the
SSD version that you are using.)
"- The program computes study size necessary to attain a"
"specified alpha and beta. Three simple designs are covered:"
"1) Two independent samples, collecting measurement data and"
" analysing e.g. averages. Example: Clinical trials."
"2) Two independent samples, collecting enumeration data and
" analysing proportions: Survival studies, case-control studies."
"3) 'Paired data' design - two observations per unit: Observation"
" at time 1 and time 2, or of a case and its control."
" "
"Statistical tests based on 'normal' distribution theory are"
"assumed, where a null hypothesis of 'no difference' is tested"
"against the hypothesis specified by user input data."
"For 1) and 3) the result is applicable also for rank tests."
"1) is applicable for correlation coefficients etc."
"- Comparison of one sample with a population is covered by 1)"
"and 2), regarding the population as a very large sample."
>>>Comment: In most cases, when comparing two samples, the
comparison is performed using the two mean values or the
two proportions emerging from the calculations. However,
other statistics may be used as well. One may compare
two standard deviations, two correlation coefficients etc.
A few such instances have been included.
" *** CHOICE OF DESIGN ***
"
"WHICH IS YOUR DESIGN? PLEASE NOTE CODES:
"
" *** TWO-SAMPLE DESIGN - comparing..
" ..AVERAGES (t- and Wilcoxon's two-sample test): 1
" ..PROPORTIONS (Chi2-test, *NO* Yates's corr.): 2
" .. PEARSON (and rank) CORRELATIONS: 3
" .. TAU CORRELATIONS: 4
" .. STANDARD DEVIATIONS: 5
" ..PROPORTIONS (Chi2-test, with Yates's corr.): 6
" *** PAIRED DESIGN - observing..
" ..DIFFERENCES (t- and matched-pairs test): 7
" .. ENUMERATION (McNEMAR's TEST): 8
" .. ENUMERATION (Sign test): 9
" *** SURVIVAL STUDY, TWO-SAMPLE DESIGN.........: 10
" *** GROUP SEQUENTIAL DESIGN, POCOCK TYPE......: 11
" *** MULTIPLE COMPARISONS......................: 12
" *** UNEQUAL SAMPLE SIZES......................: 13
>>>Comment: Design code is critical. If wrongly chosen, the
result will be positively and totally wrong.
>>>Note: The use of a continuity correction (Yates's
correction) when testing 2 x 2 tables has been strongly
debated. See the reference by Fleiss, below.
" *** CHOICE OF PARAMETERS ***
"CHOOSING 'ORDINARY START', YOU WILL SET PROGRAMME PARAMETERS.
"- IF 'STANDARD' IS CHOSEN, THESE PARAMETERS ARE USED:..
"..P=alpha=0.05 (TWO-sided test), beta=0.1 (Power=90%).
>>>Comment: A set of standard values is suggested here. There
are situations where one-sided should be used - see below.
" *** ONE OR TWO-SIDED TEST? ***
"NOW PLEASE SELECT 'ALPHA CLASS': ONE- OR TWO-SIDED TEST...
"NOTE: ONE-SIDED TEST may be used if CHANGE IN ONE DIRECTION only
"of measurement or enumeration variable is of interest.
>>>Comment: The choice is presented explicitly so that
the user can make this decision at an early stage. There is
a tendency to use two-sided tests when one-sided tests
would have been appropriate. Then, actual P will be only
half of the supposed value, which diminishes the risk of
making a Type I error but augments the risk of making a Type
II error (compared to a priori set values). It is a chilling
thought that important results may have been overlooked for
this reason!
" *** SELECTION OF BETA (POWER) ***
"Beta is the risk of making an error of type II - failing to
"reject a null hypothesis which should have been rejected.
"Power is the 'inverse' - the chance of avoiding such an error,
"hence, of discovering a significant result.
"E.g. Beta = 0.1 means Power = 90 %
" *** CHOICE OF ALPHA (CONFIDENCE) ***
"Alpha value, Pcrit, is the risk of making an Error of type I,
"i.e., of rejecting a null hypothesis which should have been accepted.
"'Confidence' is the inverse, namely the chance of revealing a correct
"null hypothesis.
"E.g.: Pcrit = 0.05 corresponds to Confidence = 95 %
>>>Comment: Note that Pcrit values should be entered, not Confidence.
"? ENTER SMALLER STATISTIC (n.n) or (0.n)>
"? ENTER LARGER STATISTIC (n.n) or(0.n)>
>>>Comment: These input messages are the same for most
designs. Only positive numbers are allowed, a
restriction without practical significance. - Values
inconsistent with the design or algorithm will also be
rejected.
"You will now be asked to enter a 'best guess' of the St. Deviation.
>>>Comment: The guess of standard deviation must be based on
experience. When in doubt, try the largest realistic value,
e.g. expected maximal range divided by 3 to 4. - The case
where the two populations in question must be supposed to
have unequal S.D.'s is often said to need special treatment
(The 'Behrens-Fisher problem'). However, this problem
is hardly ever of more than academic interest; the ordinary
Student's t-test may safely be performed even if the
S.D.'s are very different indeed, e.g. 1 to 2 or 1 to 3.
"TWO-GROUP DESIGN.- MEASUREMENTS - AVERAGES
"Method is robust to non-normality, but 'S.D.sensitive'.
>>>Comment: This design may be used not only for averages, but
also for slopes and intercepts of the regression line, and
in general, for any two statistics that may be compared using
Student's t-test.
"TWO-GROUP DESIGN - ENUMERATION DATA - PROPORTIONS
>>>Comment: Problems regarding rates, relative risks and odds
ratios may be solved by referring the data 'back' to proportions.
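Referring an odds ratio 'back' to proportions can be
sketched in Python as below. The study size formula is the
usual normal approximation for two proportions, with an
optional continuity correction of the kind described in
the literature on Yates's correction. Both are assumptions
made for this illustration and are not claimed to be the
exact computations of SSD. The input values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def proportion_from_odds_ratio(p1, odds_ratio):
    """Proportion in the second group implied by p1 and an odds ratio."""
    return odds_ratio * p1 / (1.0 - p1 + odds_ratio * p1)

def n_per_group_proportions(p1, p2, p_crit=0.05, beta=0.10, yates=False):
    """Normal-approximation group size for comparing two proportions (two-sided)."""
    z_a = NormalDist().inv_cdf(1.0 - p_crit / 2.0)
    z_b = NormalDist().inv_cdf(1.0 - beta)
    p_bar = (p1 + p2) / 2.0
    n = (z_a * sqrt(2.0 * p_bar * (1.0 - p_bar))
         + z_b * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2 / (p1 - p2) ** 2
    if yates:
        # Continuity-corrected size; always larger than the uncorrected one.
        n = n / 4.0 * (1.0 + sqrt(1.0 + 4.0 / (n * abs(p1 - p2)))) ** 2
    return ceil(n)

p1 = 0.20                                   # hypothetical baseline proportion
p2 = proportion_from_odds_ratio(p1, 2.0)    # hypothetical odds ratio of 2
n_plain = n_per_group_proportions(p1, p2)
n_yates = n_per_group_proportions(p1, p2, yates=True)
print(round(p2, 3), n_plain, n_yates)
```

A relative risk is handled in the same way, with p2 set to
the relative risk multiplied by p1.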
"TWO-GROUP DESIGN - MEASUREMENTS - PEARSON CORR. COEFF.
"The method is not robust against non-normality.
>>>Comment: The sample size calculation is also valid for the
Spearman correlation coefficient. The Spearman coefficient
is nothing but a Pearson coefficient performed on ranks.
However, when using Spearman's coefficient, as a
conservative measure, 10 per cent may be added to the
recommended sample size. (Regrettably, I am not at present
able to document this advice.)
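The role of the Z-transformation in comparing two
correlation coefficients may be illustrated in Python. The
sketch assumes the usual large-sample standard error
1/sqrt(n - 3) of a transformed coefficient and a closed
formula without iteration, so it need not agree exactly
with SSD. The coefficients 0.3 and 0.6 are hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_per_group_correlations(r1, r2, p_crit=0.05, beta=0.10):
    """Group size for comparing two Pearson coefficients via Fisher's z."""
    z_a = NormalDist().inv_cdf(1.0 - p_crit / 2.0)
    z_b = NormalDist().inv_cdf(1.0 - beta)
    difference = atanh(r1) - atanh(r2)      # Fisher z-transformed coefficients
    # Each transformed coefficient has standard error 1 / sqrt(n - 3).
    return ceil(3.0 + 2.0 * ((z_a + z_b) / difference) ** 2)

n = n_per_group_correlations(0.3, 0.6)      # hypothetical coefficients
print("n per group:", n, " with 10 per cent added for Spearman:", ceil(1.1 * n))
```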
"TWO-GROUP DESIGN - MEASUREMENTS - TAU CORR. COEFF.
"Test is 'assumption free' and 'conservative'.
>>>Comment: Tau is a correlation coefficient different from
those of Pearson and Spearman. It is very 'robust' in nearly
every way, but in general, larger materials are needed.
" *** STUDY SIZE DETERMINATION - TWO SAMPLES OF DIFFERENT SIZES ***
"SIZE OF LARGER SAMPLE IS SET = k TIMES THAT OF SMALLER SAMPLE...
>>>Comment: The factor k expresses the relationship between sample
sizes in the simplest possible way. The design where a single
sample is compared with a population is covered by making one
of the two samples very large.
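The effect of the factor k can be sketched in Python. The
sketch assumes equal standard deviations in the two groups
and requires the standard error of the difference to be
the same as in the equal-sized design. The formula is an
assumption of this illustration, not a quotation from SSD.

```python
from math import ceil

def unequal_group_sizes(n_equal, k):
    """Split a design needing n_equal per group into sizes n and k * n.

    Keeps the standard error of the difference unchanged:
    1/n + 1/(k*n) = 2/n_equal, hence n = n_equal * (k + 1) / (2 * k).
    """
    smaller = ceil(n_equal * (k + 1) / (2.0 * k))
    return smaller, ceil(k * smaller)

for k in (1, 2, 5, 1000):
    print(k, unequal_group_sizes(85, k))
```

As k grows, the smaller sample approaches half the size
needed in the equal-sized design, but never falls below it.
This is the 'sample against population' case.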
7. References and further comments
----------------------------------
1) General theory
Altman, DG: How large a sample. In: Statistics in
practice. London: The British Medical Association, 1982.
Bourke, GJ et al: Interpretation and Uses of Medical Statistics.
Oxford: Blackwell Scientific Publications, 1985.
Fleiss, JL: Statistical methods for rates and proportions.
New York: Wiley, 1981. (A standard reference book).
Lentner, C (ed.): Geigy Scientific Tables, Volume 2.
Basle: Geigy, 1982.
See also the reference given under 12)
2) Study size - two-group comparison using mean values and
proportions: See references under 1)
3) Study size - two-group comparison of standard
deviations: The computation is based on a large-sample
estimate of the standard error of the standard deviation.
See, for example, Moroney MJ: Facts from Figures. London:
Penguin Books, 1977
4) Study size - two-group comparison of correlation
coefficients: See reference under 3). Z-transformation
and iteration are used.
5) Study size - two-group comparison of tau values:
A conservative standard error given by Kendall is used
for this computation. See Kendall MG: Rank Correlation
Methods. London: Griffin, 1970. The option is a
non-parametric and very robust alternative to 4).
6) Study size - paired data, proportions (McNemar's
test). Assessment of Power as well as sample size
calculation for this test is not straightforward. The
solution chosen here is traditional, and should be
perfectly safe. The standard error used for the
computations may be found in the extremely useful book
by Gardner MJ and Altman DG: Statistics with confidence -
Confidence intervals and statistical guidelines. The
British Medical Journal, London 1989. - See the chapter
"Calculating confidence intervals for proportions and their
differences", subchapter "Two samples: paired case". The
nomenclature of that text has been used below.
Before running the program, one must try to guess number
of cases with factor PRESENT both at time 1 and time 2.
(Tied pairs with factor present).
These cases should be EXCLUDED and ignored altogether
from the present computations. They must, however, be
added AFTERWARDS (on paper) to the calculated study size,
to obtain a realistic result.
For program input must be used PROPORTION WITH FACTOR
PRESENT at each time (AFTER exclusion as mentioned). And
NOT ACTUAL NUMBERS, as in McNemar's test itself.
Note that a 'pair' may consist of two observations in the
same individual. The economy of such a design is well
brought out by trial calculations using the program SSD.
However, when the pair consists of observations from two
different individuals, there is often more to lose than to
gain by using paired design and McNemar's test, instead
of ordinary two-group design.
7) Study size - paired data, measurements: See last
reference under 1), Appendix D.
8) Adjustment of study size, two samples of different
size - see Smith PG and Morrow RH (eds): Methods for field
trials of interventions against tropical diseases - A
toolbox. Oxford: Oxford University Press 1991. Page 61.
9) Sign test - see Hoel PG: Introduction to mathematical
statistics. New York: John Wiley & Sons, 1964. Third
edition, printed 1964, pp. 330 - 333.
10) Survival studies - see Lachin JM: Introduction to
sample size determination and Power analysis for clinical
trials. Controlled Clinical Trials, 1981; 2: 93 - 113.
11) Group sequential design - see Pocock SJ: Clinical
trials. A Practical Approach. Chichester: John Wiley and
Sons 1983 (or later editions). - The present author has
performed a validation study of the sample size advice
given by SSD. Reference: Lehmann EH: Repeated
significance test (RST) plans a.m. Pocock: An adaptation
for experimenters in medicine and biology, with a validity
(Power) trial using Monte Carlo stochastic simulation. -
The latter study contains numerous tables and hence has not
been made available on Internet. A paper copy will be
mailed to users on request.
12) Multiple comparisons - a reference is given with the
program. See also the excellent textbook by Douglas G. Altman:
Practical statistics for medical research - London: Chapman
and Hall. 1991 (or later editions). Keyword: Bonferroni's
method.
13) Unequal sample sizes - see above reference.
The following comprehensive handbook is recommended:
Hartung J, Elpelt B, Klösener K-H: Statistik. Lehr- und Handbuch
der angewandten Statistik. München: R Oldenbourg Verlag.
1995 (or later editions).
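Bonferroni's method, the keyword given under 12) above, can
be illustrated by a small Python sketch. The study size
formula is the ordinary normal approximation for two
averages and is an assumption of this example, not the
algorithm of SSD. The planning values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def bonferroni_p_crit(p_crit, comparisons):
    """Critical P per comparison so that the overall risk stays at p_crit or below."""
    return p_crit / comparisons

def n_per_group_means(delta, sd, p_crit, beta=0.10):
    """Two-sided normal-approximation group size for two averages."""
    z_a = NormalDist().inv_cdf(1.0 - p_crit / 2.0)
    z_b = NormalDist().inv_cdf(1.0 - beta)
    return ceil(2.0 * ((z_a + z_b) * sd / delta) ** 2)

for m in (1, 3, 6):                          # hypothetical numbers of comparisons
    p_m = bonferroni_p_crit(0.05, m)
    print(m, round(p_m, 4), n_per_group_means(5.0, 10.0, p_m))
```

The stricter critical P per comparison is paid for with a
larger number of individuals in each group.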
8. Checking the program output
--------------------------------
Every opportunity should be seized to control the output
of the program 'SSD' by comparison with other methods.
The formulas and tables given in references 1) are
most instructive and helpful. These texts may, of course,
be used as basis for study size determination by hand and
calculator.
It is instructive to compare the output from the present
program with study size calculations from the program
PRAKSTAT, mentioned in the Addendum.
A final control of all study size computations consists
of computer simulations. However, this is necessary only
in special cases, because the study sizes suggested by SSD
have themselves been controlled in this way, and may be
considered robust against non-normality.
9. Program versions
---------------------
It has been said that no software is worth bothering with
until version 3 has been reached. The program SSD is
now well past this stage.
* The present V.7.3 was released 1. January 2001.
No computing errors have been found or reported in
version 2.1 or later versions.
To all users who have reported their experiences with the
program 'SSD', my sincere gratitude. I am also indebted to
professors Knut Westlund, Oslo; Tor Bjerkedal, Oslo; Hogne
Sandvik, Bergen; Sean Lavelle, Galway; and assistant
professor Torunn Moksheim Lehmann, Haugesund.
P.O.Box 1346 GARD, N-5501 Haugesund, Norway, 1. January 2001.
Egil Henrik Lehmann egilhl@online.no
ADDENDUM: Other statistical PC programs for medicine that
may be obtained free of charge.
----------------------------------------------------------
Net searches for public domain software suitable for
experimenters in medicine and biology have not been very
fruitful. The author would appreciate information regarding
such software.
'EpiInfo' is a public domain, American program package
sponsored by the WHO. EpiInfo has an epidemiological
orientation and has thousands of users all over the world.
The package is not overly transparent; most health workers
need to read and practice a bit before mastering it.
But - it is gratis, in contrast to the commercially available
packages, costing from $1000 upwards!
EpiInfo is well suited for general practice research and is
hereby recommended. A DOS version can be downloaded from the
following site: http://www.cdc.gov/epiinfo
PRAKSTAT is a public domain, Danish collection of DOS based
programs aimed at general practice research. Data must be
entered anew each time the program is run. Sample size
calculation for clinical trials using enumeration data is
included. PRAKSTAT is available in English. Write to:
Dr. Frede Olesen
Institut for Almen medicin
Aarhus Universitet
DK-8000 Århus C.
Denmark
My own programs SSD, SEQX, SEQY, SEQZ, STTT and TABCHI may be
downloaded from the site where you found the present text:
http://www.uib.no/isf/meeting.htm
About SSD, see the above text, SSD.DOC. - The programs
SEQX, SEQY, SEQZ are 'canned' simulation programs, written
to assist colleagues in medicine and related fields with
designing and analysing simple group sequential
experiments. The programs realise the design advocated by
Pocock, where group sizes and nominal levels of
significance are kept constant. - The use of only one
interim test at a strict LOS can also be simulated. - User
provided distributions may be entered.
Regarding sequential designs, professional statisticians
today in general prefer designs more sophisticated than the
Pocock type. However, if no statistician is available, I
believe that the Pocock type may be safely used by workers
in medicine and biology with little statistical training.
A design implying only one interim test at a strict
Pcrit is today used almost routinely in Stage III
pharmacological research. This simple and resource saving
design is in fact a group sequential design having only two
groups.
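Why a group sequential design needs a strict nominal level
of significance can be shown by a small simulation. The
Python sketch below is not SEQX, SEQY or SEQZ. It assumes a
normally distributed test variable with known variance,
groups of equal size, and the nominal level 0.0158 commonly
tabulated for five tests of the Pocock type at an overall
level of 0.05. That figure is an assumption of the sketch
and is not taken from the present text.

```python
import random
from statistics import NormalDist

def overall_rejection_rate(looks, nominal_p, drift=0.0, runs=20000, seed=1):
    """Share of simulated trials stopped with a significant test at any look.

    drift is the expected standardised group difference per look (0 under H0).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - nominal_p / 2.0)
    rejected = 0
    for _ in range(runs):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(drift, 1.0)      # contribution of group k
            if abs(total / k ** 0.5) >= z_crit:  # test on accumulated data
                rejected += 1
                break
    return rejected / runs

pocock = overall_rejection_rate(looks=5, nominal_p=0.0158)
naive = overall_rejection_rate(looks=5, nominal_p=0.05)
print("Overall risk, nominal P = 0.0158:", pocock)
print("Overall risk, nominal P = 0.05:  ", naive)
```

Testing five times at an unadjusted P = 0.05 lets the
overall risk of an Error of type I rise well above 5 per
cent, whereas the constant strict level keeps it near 5 per
cent.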
The programs STTT and TABCHI perform elementary
statistical tests. STTT performs the ordinary Student's
t-test, accepting different types of input.
TABCHI tests for 'independence' in two-dimensional tables.
TABCHI can also test whether data tabulated as a one-way
table could stem from a given (user provided) distribution.
Data must be entered anew for each run. Comprehensive
advice included.
ADVERTISEMENT: Medical colleagues wanting assistance with
statistical problems are welcome to contact me, without
obligation. I would, if necessary, write small statistical
data programs to order, and in principle without a fee, if
the job is not all too large. Drop me an e-mail NOW and
describe your problem! - Qualifications: I am an MD and a
GP holding the 'Postgraduate Certificate of Statistics' of
Sheffield Hallam University, 'distinctive level'. The
latter, of course, does not make a statistician; however, I
have solid experience in consultation work: No 'customer'
has ever, to my knowledge, had a paper rejected because of
poor statistics. - My own publication list comprises about
30 titles.
P.O.Box 1346 GARD, N-5501 Haugesund, Norway, 17. January 2001.
Egil Henrik Lehmann egilhl@online.no
***END OF FILE 'SSD.DOC'***
Department of Public Health and Primary Health Care, last updated 31.05.01
Hogne.Sandvik@isf.uib.no