Lecture 0

Short Resume of Statistical Terms. Estimation of Parameters.

A *random event* is an event which has a chance of happening, and *probability* is a numerical measure of that chance. Probability is a number lying between 0 and 1, both inclusive; higher values indicate greater chances. An event with zero probability (effectively) never occurs; one with unit probability surely does. We write $P(A)$ for the probability that an event $A$ occurs; $P(A+B+\dots)$ for the probability that at least one of the events $A, B, \dots$ occurs; $P(AB)$ for the probability that both $A$ and $B$ occur; and $P(A|B)$ for the probability that the event $A$ occurs when it is known that the event $B$ occurs. $P(A|B)$ is called the *conditional probability of A given B*. The two most important axioms which govern probability are

$$P(A+B+\dots) \le P(A) + P(B) + \dots \tag{0.1}$$

and

$$P(AB) = P(A|B)\,P(B). \tag{0.2}$$

If only one of the events $A, B, \dots$ can occur, they are called *exclusive*, and equality holds in (0.1). If at least one of the events $A, B, \dots$ must occur, they are called *exhaustive*, and the left-hand side of (0.1) is 1. If $P(A|B) = P(A)$, we say that $A$ and $B$ are *independent*: effectively, the chance of $A$ occurring is uninfluenced by the occurrence of $B$.
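As a concrete check of (0.1), (0.2) and the definition of independence, the following minimal sketch enumerates the 36 equally likely outcomes of two fair dice. The dice, the events `A`, `B`, `C` and the helper `prob` are illustrative assumptions and are not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair dice (an illustrative assumption).
OUTCOMES = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    favourable = sum(1 for o in OUTCOMES if event(o))
    return Fraction(favourable, len(OUTCOMES))

A = lambda o: o[0] == 6            # first die shows 6
B = lambda o: o[0] + o[1] >= 10    # total is at least 10
C = lambda o: o[1] % 2 == 0        # second die is even

p_a, p_b = prob(A), prob(B)
p_a_or_b = prob(lambda o: A(o) or B(o))    # P(A+B)
p_ab = prob(lambda o: A(o) and B(o))       # P(AB)
p_a_given_b = p_ab / p_b                   # P(A|B), rearranged from (0.2)

# (0.1): A and B are not exclusive, so the inequality is strict here.
print(p_a_or_b, "<=", p_a + p_b)
# A and B are dependent: P(A|B) differs from P(A).
print(p_a_given_b, "vs", p_a)
# A and C are independent: P(A|C) equals P(A).
p_a_given_c = prob(lambda o: A(o) and C(o)) / prob(C)
print(p_a_given_c, "vs", p_a)
```

Exact fractions are used so that the equalities and inequalities are tested without rounding error.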

Consider a set of exhaustive and exclusive events, each characterized by a number $\xi$. The number $\xi$ is called a *random variable*, and with it is associated a *cumulative distribution function* $F(y)$, defined to be the probability that the event which occurs has a value $\xi$ not exceeding a prescribed $y$. This may be written

$$F(y) = P(\xi \le y). \tag{0.3}$$

The adjective ‘cumulative’ is often omitted. Clearly $F(-\infty) = 0$, $F(+\infty) = 1$, and $F(y)$ is a non-decreasing function of $y$.

If $g(\xi)$ is a function of $\xi$, the *expectation* (or *mean value*) of $g$ is defined by

$$E\,g(\xi) = \int g(y)\,dF(y). \tag{0.4}$$

The integral is taken over all values of $y$. For full generality (0.4) should be interpreted as a Stieltjes integral. Those who care little for Stieltjes integrals will not lose much by interpreting (0.4) in one of the two following senses: 1) if $F(y)$ has a derivative $f(y)$, take (0.4) to be

$$E\,g(\xi) = \int g(y)\,f(y)\,dy; \tag{0.5}$$

2) if $F(y)$ is a step-function with steps of height $f_i$ at the points $y_i$, take (0.4) to be

$$E\,g(\xi) = \sum_i g(y_i)\,f_i. \tag{0.6}$$

The point of the Stieltjes integral is to combine (0.5) and (0.6) in a single formula, and also to include certain possibilities not covered by 1) and 2). In other words, the expectation of $g(\xi)$ is the weighted average of $g(\xi)$, the weights being the respective probabilities of the different possible values of $\xi$.
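The two elementary readings of the expectation can be sketched numerically. The example below is a minimal illustration under assumed inputs: a fair die for the step-function case (a sum of $g$ times step height) and a uniform density on $[0,1]$ for the differentiable case (an integral of $g$ times density, approximated by the midpoint rule). The function names and the choice $g(y) = y^2$ are hypothetical.

```python
from fractions import Fraction

def expectation_discrete(g, points, heights):
    """Step-function case: sum of g(y_i) * f_i over the steps of F."""
    assert sum(heights) == 1, "step heights must sum to 1"
    return sum(g(y) * f for y, f in zip(points, heights))

def expectation_density(g, f, lo, hi, n=100_000):
    """Density case: midpoint-rule approximation of the integral of g(y) f(y) dy on [lo, hi]."""
    h = (hi - lo) / n
    return sum(g(lo + (k + 0.5) * h) * f(lo + (k + 0.5) * h) for k in range(n)) * h

# Discrete case: a fair die, g(y) = y^2.
die_points = range(1, 7)
die_heights = [Fraction(1, 6)] * 6
e_discrete = expectation_discrete(lambda y: y * y, die_points, die_heights)

# Continuous case: uniform density on [0, 1], g(y) = y^2 (exact value 1/3).
e_continuous = expectation_density(lambda y: y * y, lambda y: 1.0, 0.0, 1.0)

print(e_discrete)      # 91/6
print(e_continuous)
```

The midpoint rule is only one of many possible quadrature choices; it is used here because it needs no external library.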

Sometimes it is convenient to characterize exhaustive and exclusive events with a vector $\boldsymbol{\xi}$ (i.e. a set of numbers, called the coordinates of $\boldsymbol{\xi}$). The distribution function of such a vector random variable is

$$F(\mathbf{y}) = P(\boldsymbol{\xi} \le \mathbf{y}), \tag{0.7}$$

where $\boldsymbol{\xi} \le \mathbf{y}$ means that each coordinate of $\boldsymbol{\xi}$ is not greater than the corresponding coordinate of $\mathbf{y}$. Now the expectation (mean value) of $g$ is defined by

$$E\,g(\boldsymbol{\xi}) = \int g(\mathbf{y})\,dF(\mathbf{y}). \tag{0.8}$$

The interpretation of (0.8) in the sense of (0.5) is

$$E\,g(\boldsymbol{\xi}) = \int g(\mathbf{y})\,f(\mathbf{y})\,d\mathbf{y}, \tag{0.9}$$

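where, for $\mathbf{y} = (y_1, y_2, \dots, y_k)$, we write

$$f(\mathbf{y}) = \frac{\partial^k F(\mathbf{y})}{\partial y_1\,\partial y_2 \cdots \partial y_k} \tag{0.10}$$

and $d\mathbf{y} = dy_1\,dy_2 \cdots dy_k$.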

The quantities $f(y)$ and $f_i$, appearing in (0.5) and (0.6), are called the *frequency functions* or *probability density functions* of the random variable $\xi$.

Consider a set of exhaustive and exclusive events, each characterized by a pair of numbers $\xi$ and $\eta$, for which $F(y,z)$ is the distribution function. From this given set of events, we can form a new set, each event of the new set being the aggregate of all events in the old set which have a prescribed value of $\xi$. The new set will have a distribution function $G(y)$. Similarly we have a distribution function $H(z)$ for the random variable $\eta$. Symbolically

$$F(y,z) = P(\xi \le y,\ \eta \le z); \qquad G(y) = P(\xi \le y); \qquad H(z) = P(\eta \le z).$$

If it happens that

F(y,z) = G(y)H(z) for all y and z,

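the random variables $\xi$ and $\eta$ are called *independent*. Effectively, knowledge of one is no help in predicting the behavior of the other. The idea of independence can be extended to several random variables, which may be vectors:

$$F(\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_n) = F_1(\mathbf{y}_1)\,F_2(\mathbf{y}_2) \cdots F_n(\mathbf{y}_n).$$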

Random variables which are independent in pairs are not necessarily mutually independent as in the last equation.

The definition of expectation implies that

$$E(\xi_1 + \xi_2 + \dots + \xi_n) = E\xi_1 + E\xi_2 + \dots + E\xi_n \tag{0.11}$$

*whether or not* the random variables are independent. On the other hand the equation

$$E(\xi_1 \xi_2 \cdots \xi_n) = (E\xi_1)(E\xi_2) \cdots (E\xi_n) \tag{0.12}$$

is true for independent $\xi_i$, though generally false for dependent $\xi_i$. Equation (0.11) is of great importance in Monte Carlo work. As a caution, it should be added that the relation $E\,g(\xi) = g(E\xi)$ is rarely true. Equations (0.11) and (0.12) also hold for vector random variables.
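The difference between the sum rule (valid always) and the product rule (valid under independence) can be checked by exact enumeration. The sketch below assumes two toy cases chosen only for illustration: a pair of separate fair dice, and a "pair" in which the second variable is a copy of the first. The helper `expect` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def expect(g, outcomes):
    """Expectation of g over equally likely outcomes."""
    return sum(Fraction(g(o)) for o in outcomes) / len(outcomes)

# Independent pair: two separate fair dice.
independent = list(product(range(1, 7), repeat=2))
# Dependent pair: the second variable is a copy of the first.
dependent = [(x, x) for x in range(1, 7)]

results = {}
for name, outcomes in (("independent", independent), ("dependent", dependent)):
    e_xi = expect(lambda o: o[0], outcomes)
    e_eta = expect(lambda o: o[1], outcomes)
    e_sum = expect(lambda o: o[0] + o[1], outcomes)
    e_prod = expect(lambda o: o[0] * o[1], outcomes)
    results[name] = (e_sum == e_xi + e_eta, e_prod == e_xi * e_eta)
    print(name, "sum rule:", results[name][0], "product rule:", results[name][1])

# The caution that E g(xi) differs from g(E xi), with g(x) = x^2 for one die.
one_die = [(x,) for x in range(1, 7)]
e_g = expect(lambda o: o[0] ** 2, one_die)       # 91/6
g_e = expect(lambda o: o[0], one_die) ** 2       # 49/4
print(e_g, "!=", g_e)
```

The sum rule holds in both cases, while the product rule fails for the dependent pair.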

The quantity $E(\xi^r)$ is called the $r$th *moment* of $\xi$. Similarly the quantity $\mu_r = E\{(\xi - \mu)^r\}$, where $\mu = E\xi$, is called the $r$th *central moment* of $\xi$. The most important moments are $\mu$, known as the *mean* of $\xi$, and $\mu_2$, known as the *variance* of $\xi$. The mean is a measure of location of a random variable, whereas the variance is a measure of dispersion about that mean. The *standard deviation* is defined by $\sigma = \sqrt{\mu_2}$. The *coefficient of variation* is $\sigma$ divided by $\mu$. It is sometimes expressed as a percentage.

If $\xi$ and $\eta$ are random variables with means $\mu$ and $\nu$ respectively, the quantity $E\{(\xi - \mu)(\eta - \nu)\}$ is called the *covariance* of $\xi$ and $\eta$. Notice that, by (0.12), the covariance vanishes if $\xi$ and $\eta$ are independent, though the converse is generally false. The abbreviations *var* and *cov* are used for variance and covariance. It is true that $\operatorname{cov}(\xi, \xi) = \operatorname{var}\xi$. The *correlation coefficient* between $\xi$ and $\eta$ is defined as $\rho = \operatorname{cov}(\xi, \eta)/\sqrt{\operatorname{var}\xi\,\operatorname{var}\eta}$. It always lies between $-1$ and $+1$. If $\rho = 0$, then $\xi$ and $\eta$ are said to be *uncorrelated*; they are *positively correlated* if $\rho > 0$ and *negatively correlated* if $\rho < 0$.

The above definitions yield the important formula

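$$\operatorname{var}\Bigl(\sum_{i=1}^{n} \xi_i\Bigr) = \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{cov}(\xi_i, \xi_j). \tag{0.13}$$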

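The following approximate formula is useful in applications:

$$\operatorname{var} g(\xi_1, \dots, \xi_n) \approx \sum_{i=1}^{n} \sum_{j=1}^{n} g_i\, g_j \operatorname{cov}(\xi_i, \xi_j),$$

where $g_i$ denotes $\partial g/\partial \xi_i$ evaluated for $\xi_1, \dots, \xi_n$ equal to their mean values. For this formula to be valid, the quantities $\operatorname{var}\xi_i$ should be small in comparison with $\{Eg\}^2$. To obtain it, expand $g$ as a Taylor series in the $\xi_i$ about their mean values, neglecting the terms of the second and higher degree, and then use (0.13).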
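As a worked illustration of the approximate variance formula (a sketch under stated assumptions; the symbols $a_i$ and $s_i^2$ are introduced here only for the example), take $g = \xi_1 \xi_2$ with $\xi_1$ and $\xi_2$ independent, $E\xi_i = a_i$ and $\operatorname{var}\xi_i = s_i^2$. The first derivatives of $g$ at the mean values are $a_2$ and $a_1$, and the covariance term vanishes, so the first-order approximation is

$$\operatorname{var}(\xi_1 \xi_2) \approx a_2^2 s_1^2 + a_1^2 s_2^2 .$$

The exact value follows from the product rule for expectations of independent variables:

$$\operatorname{var}(\xi_1 \xi_2) = E\xi_1^2\, E\xi_2^2 - a_1^2 a_2^2 = (s_1^2 + a_1^2)(s_2^2 + a_2^2) - a_1^2 a_2^2 = a_2^2 s_1^2 + a_1^2 s_2^2 + s_1^2 s_2^2 .$$

The neglected term $s_1^2 s_2^2$ is negligible when the variances are small compared with the squared means, in line with the validity condition stated for the approximation.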

The so-called **central limit theorem** asserts that (under suitable mild conditions specified in the standard textbooks) the sum of $n$ independent random variables has an approximately normal distribution when $n$ is large.

**A remark**: as $n \to \infty$, the distribution function of the sum tends to the normal distribution more rapidly in the region of the mean than in the tails of the distribution.
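The behavior described in the remark can be explored with a small simulation. The sketch below assumes sums of $n = 12$ uniform random variables (an arbitrary illustrative choice) and compares simulated tail probabilities of the standardized sum with those of the standard normal distribution. With a modest number of trials the far tail is estimated from very few events, so the output should be read as indicative only.

```python
import random
import statistics

def standardized_sums(n, trials, rng):
    """Draw `trials` sums of n uniform(0,1) variables, shifted and scaled to mean 0, variance 1."""
    mean, sd = n * 0.5, (n / 12.0) ** 0.5   # exact mean and s.d. of the sum
    return [(sum(rng.random() for _ in range(n)) - mean) / sd for _ in range(trials)]

rng = random.Random(12345)                  # fixed seed, for reproducibility only
z = standardized_sums(n=12, trials=20_000, rng=rng)

normal = statistics.NormalDist()            # standard normal, mean 0 and s.d. 1
for c in (1.0, 2.0, 3.0):
    observed = sum(1 for v in z if abs(v) > c) / len(z)
    expected = 2.0 * (1.0 - normal.cdf(c))
    print(f"P(|Z| > {c}): simulated {observed:.4f}, normal {expected:.4f}")
```

Increasing `trials` sharpens the comparison in the tails, at a cost that grows linearly with the number of sums drawn.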

Estimation

Most Monte Carlo work is concerned with estimating the unknown numerical value of some parameter of some distribution. The parameter is called the *estimand*. The available data consist of a set of observed random variables, constituting the *sample*. The number of observations in the sample is called the *sample size*. The sample and the estimand are connected because the latter is a parameter of the distribution of the random variables constituting the former. For example, the estimand might be the mean value of the distribution (the first moment) and the sample (of size $n$) might consist of independent random variables $\xi_1, \xi_2, \dots, \xi_n$, each distributed normally. A reasonable way to estimate the mean value of the normal distribution is to average over the observations

$$\bar{\xi} = \frac{\xi_1 + \xi_2 + \dots + \xi_n}{n}. \tag{0.14}$$

On the other hand, a weighted average of the observations is another estimator

$$\sum_{i=1}^{n} w_i\, \xi_i, \tag{0.15}$$

of which (0.14) is a special case. The question is: can one choose the weights $\{w_i\}$ so as to obtain a better estimator of the mean value than (0.14)? The answer depends on the meaning of a “better” estimator.
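One way to make the comparison concrete is to compute the variance of a weighted average directly. The sketch below assumes independent observations with a common variance and weights that sum to 1 (so that the weighted average has the same mean as a single observation); under those assumptions the variance is the common variance times the sum of squared weights. The function name and the particular weights are hypothetical.

```python
from fractions import Fraction

def weighted_mean_variance(weights, sigma2=1):
    """Sampling variance of sum(w_i * xi_i) for independent xi_i with common variance sigma2."""
    assert sum(weights) == 1, "weights must sum to 1 for an unbiased estimate of the mean"
    return sigma2 * sum(w * w for w in weights)

n = 4
equal = [Fraction(1, n)] * n
skewed = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]

var_equal = weighted_mean_variance(equal)    # sigma^2 / n
var_skewed = weighted_mean_variance(skewed)
print(var_equal, var_skewed)                 # 1/4 11/32
```

In this setting the equal weights give the smaller variance; with unequal variances of the observations the conclusion would change.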

Let us study this issue in general. The sample can be denoted by a vector $\boldsymbol{\xi} = (\xi_1, \dots, \xi_n)$, and the *estimand* $\theta$ is a parameter of the distribution of $\boldsymbol{\xi}$. This distribution is called the *parent distribution* to distinguish it from the sampling distribution (defined below). To estimate $\theta$ we use a function of the observations, $t(\boldsymbol{\xi})$. The function $t$ can be regarded either as a mathematical function of some unspecified variables $\mathbf{y}$ (i.e. $t = t(\mathbf{y})$), in which case it is called the *estimator* $t$; or as the numerical value of $t$ when the variables $\mathbf{y}$ take the observed values $\boldsymbol{\xi}$ (i.e. $t = t(\boldsymbol{\xi})$), in which case we speak of the *estimate* $t$. The problem is to find an estimator which provides good estimates of $\theta$, i.e. to choose $t(\mathbf{y})$ such that $t(\boldsymbol{\xi})$ is close to $\theta$.

It is important to understand that $t(\boldsymbol{\xi})$ is a random variable. The reason is that $\boldsymbol{\xi}$ is a particular observation obtained in a specific experiment and is random to the extent that, if we repeat the experiment, we get different values of $\boldsymbol{\xi}$. The parent distribution describes the distribution of these values. Since $\boldsymbol{\xi}$ varies from experiment to experiment, so does $t(\boldsymbol{\xi})$, and consequently $t(\boldsymbol{\xi})$ has a distribution, called the *sampling distribution*. If $t(\boldsymbol{\xi})$ is close to $\theta$, then the sampling distribution is concentrated around $\theta$. In practice, we can determine the sampling distribution mathematically instead of repeating the experiment. The sampling distribution is expressed in terms of the estimator $t(\mathbf{y})$ and the parent distribution $F(\mathbf{y})$ by

$$T(u) = P\bigl(t(\boldsymbol{\xi}) \le u\bigr) = \int_{t(\mathbf{y}) \le u} dF(\mathbf{y}), \tag{0.16}$$

where the integral is taken over all values of $\mathbf{y}$ such that $t(\mathbf{y}) \le u$. Thus, given $F$, we have to find $t(\mathbf{y})$ such that (0.16) clusters about $\theta$. The difference between $\theta$ and the average value of $t(\boldsymbol{\xi})$ (average over hypothetically repeated experiments) is

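$$\beta = E\{t(\boldsymbol{\xi}) - \theta\} = \int \bigl(t(\mathbf{y}) - \theta\bigr)\,dF(\mathbf{y}), \tag{0.17}$$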

and similarly the dispersion of *t*(x
) can be measured by

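$$\sigma_t^2 = E\bigl\{[t(\boldsymbol{\xi}) - E\,t(\boldsymbol{\xi})]^2\bigr\} = \int \bigl(t(\mathbf{y}) - \theta - \beta\bigr)^2\,dF(\mathbf{y}). \tag{0.18}$$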

Indeed, $\theta + \beta$ and $\sigma_t^2$ are the mean and the variance of the sampling distribution. We call $\beta$ the *bias* of $t$, and $\sigma_t^2$ the *sampling variance* of $t$. Note that (0.17) and (0.18) are special cases of (0.8).

We can now stipulate what is meant by a good estimator. We say that $t(\mathbf{y})$ is a good estimator if $\beta$ and $\sigma_t^2$ are small. This is not the only possible definition, but it is the most convenient and simple one. If $t(\mathbf{y})$ is such that $\beta = 0$, we speak of an *unbiased estimator*. If $\sigma_t^2$ is smaller than the sampling variance of any other estimator, we speak of a *minimum-variance estimator*. We prefer to use an unbiased minimum-variance estimator. The problem is how to find it. In some cases this is easy, but in most cases it is a difficult task in the calculus of variations. It turns out that (0.14) is an unbiased minimum-variance estimator of the mean of a normal distribution.
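For a small discrete parent distribution the sampling distribution can be enumerated exactly, which gives the bias and the sampling variance without any simulation. The sketch below assumes a fair 0/1 coin as the parent distribution, takes its variance (1/4) as the estimand, and compares two familiar variance estimators for samples of size 3. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def sampling_moments(estimator, values, probs, n, theta):
    """Bias and sampling variance of an estimator, by enumerating every sample of size n."""
    mean_t = Fraction(0)
    mean_t2 = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        weight = Fraction(1)
        for i in idx:
            weight *= probs[i]                 # independent observations
        t = Fraction(estimator([values[i] for i in idx]))
        mean_t += weight * t
        mean_t2 += weight * t * t
    return mean_t - theta, mean_t2 - mean_t ** 2

def var_divide_by_n(sample):
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / len(sample)

def var_divide_by_n_minus_1(sample):
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

# Parent distribution: a fair 0/1 coin, whose variance (the estimand) is 1/4.
values, probs, theta = [0, 1], [Fraction(1, 2)] * 2, Fraction(1, 4)
bias_n, var_n = sampling_moments(var_divide_by_n, values, probs, 3, theta)
bias_n1, var_n1 = sampling_moments(var_divide_by_n_minus_1, values, probs, 3, theta)
print("divide by n:    bias", bias_n, "sampling variance", var_n)
print("divide by n-1:  bias", bias_n1, "sampling variance", var_n1)
```

In this toy case the unbiased estimator has the larger sampling variance, which shows that the two criteria for a good estimator can pull in different directions.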

Efficiency

The main concern in Monte Carlo work is to obtain a respectably small standard error in the final result. It is always possible to reduce standard errors by taking the average of $n$ independent values of an estimator. This is rarely a rewarding procedure, because the standard error is inversely proportional to the square root of the sample size $n$: to reduce the standard error by a factor of $k$, the sample size needs to be increased $k^2$-fold. This is impracticable when $k$ is large, say 100. The remedy lies in careful design of the way in which the data are collected and analyzed. The *efficiency* of a Monte Carlo process may be taken as inversely proportional to the product of the sampling variance and the amount of labor expended in obtaining the estimate. It pays handsome dividends to allow some increase in labor (for example, the use of sophisticated rather than crude estimators) if that produces an overwhelming decrease in the sampling variance.
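The arithmetic behind these statements can be sketched in a few lines. The figures used below (a refined estimator costing 5 times the labor with 1/200 of the variance) are hypothetical, and efficiency is defined only up to a constant factor.

```python
def required_sample_size(n, k):
    """Sample size needed to cut the standard error by a factor k, since s.e. ~ 1/sqrt(n)."""
    return n * k * k

def efficiency(sampling_variance, labour):
    """Efficiency up to a constant factor: inversely proportional to variance times labour."""
    return 1.0 / (sampling_variance * labour)

print(required_sample_size(1_000, 100))      # 10000000

# Hypothetical figures: a refined estimator costing 5 times the labour per estimate
# but with 1/200 of the sampling variance of the crude one.
crude = efficiency(sampling_variance=1.0, labour=1.0)
refined = efficiency(sampling_variance=1.0 / 200.0, labour=5.0)
print(refined / crude)
```

With these assumed figures the refined estimator is about 40 times as efficient, whereas the same gain by brute force would need about 40 times as many crude observations.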

Regression

Sometimes the variation in raw experimental data can be broken into two parts: the first part consists of an entirely random variation that we can perhaps do little about; the second part arises because the observations are influenced by certain accompanying (concomitant) conditions, which we can record in order to determine how they influence the raw observations. When this is so, we can calculate (or at least estimate) this second part and subtract it out from the reckoning, thus leaving only those variations in the observations which are not due to the concomitant conditions. A typical example of such a procedure is the analysis of a gamma-spectrum: the ‘useful’ (information-carrying) events are recorded on a background produced by scattering (mainly Compton scattering). In order to reveal the internal structure of the transitions, it is necessary to subtract the background from the measured spectrum and to estimate the peak positions, widths, and intensities, which are described by a set of parameters.

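The basic model for this is to suppose that the random observations $\eta_i$ $(i = 1, 2, \dots, n)$ are associated with a set of concomitant numbers $x_{ij}$ $(j = 1, 2, \dots, p)$ which describe the experimental conditions under which the observation $\eta_i$ was taken. It is then assumed that $\eta_i$ is the sum of a purely random component $\delta_i$ with zero expectation and a linear combination $\sum_{j=1}^{p} \beta_j x_{ij}$ of the concomitant numbers. Here the coefficients $\beta_j$ are the unknown parameters to be estimated from the data itself. Let $\mathbf{X}$ denote the $n \times p$ matrix $(x_{ij})$, and let $\mathbf{V}$ be the $n \times n$ variance-covariance matrix of the $\delta_i$'s. Then the minimum-variance unbiased linear estimator of the vector $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)^{\mathsf T}$ is

$$\mathbf{b} = (\mathbf{X}^{\mathsf T} \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\mathsf T} \mathbf{V}^{-1} \boldsymbol{\eta} \tag{0.19}$$

and its sampling variance-covariance matrix is $\operatorname{var}\mathbf{b} = (\mathbf{X}^{\mathsf T} \mathbf{V}^{-1} \mathbf{X})^{-1}$. In passing, it may be noticed that the last formula can be used to find an *unbiased minimum-variance linear estimator* of a single estimand by letting $\mathbf{X} = \mathbf{m}$, where $\mathbf{m}$ is the column vector $(m_1, \dots, m_n)^{\mathsf T}$. The known components $m_i$ represent the mean of the parent distribution, $E\eta_i = m_i \theta$, and $\theta$ is the unknown estimand. Very often in practice we put $m_i = 1$. The numbers $\beta_j$ are called *regression coefficients*.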
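The single-parameter special case can be written out without any matrix library. The sketch below assumes that the estimator has the generalized least-squares form $\mathbf{b} = (\mathbf{X}^{\mathsf T}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{V}^{-1}\boldsymbol{\eta}$ with $\mathbf{X} = \mathbf{m}$, and additionally assumes uncorrelated observations, so that $\mathbf{V}$ is diagonal with entries `v`. The data and the function name are hypothetical.

```python
from fractions import Fraction

def gls_single_parameter(eta, m, v):
    """GLS with p = 1, X = m and diagonal V = diag(v): returns (estimate, sampling variance)."""
    information = sum(Fraction(mi) ** 2 / vi for mi, vi in zip(m, v))      # m' V^-1 m
    score = sum(Fraction(mi) * yi / vi for mi, yi, vi in zip(m, eta, v))   # m' V^-1 eta
    return score / information, 1 / information

# Hypothetical data: three measurements of the same quantity with unequal variances.
eta = [Fraction(10), Fraction(12), Fraction(11)]
m = [1, 1, 1]                      # m_i = 1: every observation has mean theta
v = [Fraction(1), Fraction(4), Fraction(4)]

estimate, variance = gls_single_parameter(eta, m, v)
print(estimate, variance)          # 21/2 2/3
```

With $m_i = 1$ the estimate reduces to a weighted average of the observations with weights proportional to the reciprocals of their variances.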

At first sight it may seem that the assumption that $\eta_i$ depends linearly upon the $x_{ij}$ is unnecessarily restrictive. But, in fact, this apparent linear dependence is more a matter of notational convenience than anything else: there is no reason why the $x_{ij}$ should not be functionally related. An example is when the $x_{ij}$ are powers of a single number $x_i$: $x_{ij} = x_i^{\,j}$, which is the special case of *polynomial regression*. Another important special case arises when the data can be classified into *strata* or *categories*, and $x_{ij}$ is given the value 1 or 0 according as $\eta_i$ does or does not belong to the $j$th category. In this case the formula (0.19) is known as the *analysis of variance*. In the mixed case, when some but not all $x_{ij}$ are restricted to the values 0 and 1, one deals with the *analysis of covariance*.
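The two special cases differ only in how the matrix of concomitant numbers is filled in. The following minimal sketch builds both kinds of matrix as plain lists of rows; the function names are hypothetical, and the powers run from 1 to $p$ to match the index range $j = 1, \dots, p$ used in the text (a constant column could be added if an intercept is wanted).

```python
def polynomial_design(x, p):
    """Rows (x_i^1, ..., x_i^p): the concomitant numbers of polynomial regression."""
    return [[xi ** j for j in range(1, p + 1)] for xi in x]

def indicator_design(labels, categories):
    """Rows of 0/1 entries: x_ij = 1 when observation i belongs to category j."""
    return [[1 if lab == cat else 0 for cat in categories] for lab in labels]

print(polynomial_design([2, 3], 3))                    # [[2, 4, 8], [3, 9, 27]]
print(indicator_design(["a", "b", "a"], ["a", "b"]))   # [[1, 0], [0, 1], [1, 0]]
```

A mixed matrix, with some indicator columns and some numerical columns, corresponds to the analysis of covariance.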

Sampling Methods

There are two basic types of sampling: *fixed* and *sequential*. In fixed sampling, one collects a predetermined amount of data without reference to the actual numerical values of the resulting observations. In sequential sampling, one allows the method of collecting and the amount collected to depend upon the numerical values observed during the collection. For example, with fixed sampling from a binomial distribution one first selects a number $n$, then carries out $n$ trials and observes how many of them are successful. A special type of sequential sampling is *inverse sampling*, in which one does not fix $n$ but instead carries out trials until a prescribed number of successful events is reached. In this case, the observation is the total number of trials needed for this purpose. Another type of sampling is *stratified sampling*, in which the data are collected to occupy prearranged categories or strata.
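The contrast between fixed and inverse sampling from a binomial distribution can be sketched as two short routines. The success probability, the sample sizes and the seed below are arbitrary illustrative choices, and the function names are hypothetical.

```python
import random

def fixed_sampling(p, n, rng):
    """Carry out exactly n Bernoulli(p) trials; the observation is the number of successes."""
    return sum(1 for _ in range(n) if rng.random() < p)

def inverse_sampling(p, successes_needed, rng):
    """Carry out trials until the prescribed number of successes; the observation is the number of trials."""
    trials = successes = 0
    while successes < successes_needed:
        trials += 1
        if rng.random() < p:
            successes += 1
    return trials

rng = random.Random(2024)          # fixed seed, for reproducibility only
print(fixed_sampling(0.3, 50, rng))
print(inverse_sampling(0.3, 15, rng))
```

In the first routine the amount of data is decided in advance; in the second it depends on the values observed, which is what makes the scheme sequential.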