Probability Applied to Landing Page Testing

So how does probability apply to landing page optimization?
The random variables are the visits to your site from the traffic sources that you have selected for the test. The audience itself may be subject to sampling bias. You are counting whether or not the conversion happened as a result of the visit. You are assuming that there is some underlying and fixed probability of the conversion happening, and that the only other possible outcome is that the conversion does not happen (that is, a visit is a Bernoulli random variable that can result in conversion, or not).
As an example, let's assume that the actual conversion rate for a landing page is 2%. Hence there is a larger chance that it will not convert (98%) for any particular visitor. As you can see, the sum of the two possible outcome probabilities exactly equals 1 (2% + 98% = 100%) as required.
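To make this concrete, here is a minimal Python sketch that treats each visit as a Bernoulli trial; the 2% rate and the number of visits are illustrative assumptions, not data from any real site:

```python
import random

random.seed(42)            # for reproducibility of this illustration
TRUE_RATE = 0.02           # assumed underlying probability of conversion
VISITS = 100_000           # hypothetical number of visits

# Each visit is a Bernoulli trial: it either converts (1) or it does not (0).
conversions = sum(1 for _ in range(VISITS) if random.random() < TRUE_RATE)

print(f"Observed conversion rate: {conversions / VISITS:.4f}")   # close to 0.02
```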
The stochastic process is the flow of visitors from the traffic sources used for the test. Key assumptions about the process are that the behavior of the visitors does not change over time, and that the population from which visitors are drawn remains the same. Unfortunately, both of these are routinely violated to a greater or lesser extent in the real world. The behavior of visitors changes due to seasonal factors, or with changing sophistication and knowledge levels about your products or industry. The population itself changes based on your current marketing mix. Most businesses are constantly adjusting and tweaking their traffic sources (e.g., by changing PPC bid prices and the resulting keyword mix that their audience arrives from). The result is that your time series, which is supposed to return a steady stream of yes or no answers (based on a fixed probability of a conversion), actually has a changing probability of conversion. In mathematical terms, your time series is nonstationary and changes its behavior over time.
The independence of the random variables in the stochastic process is also a critical theoretical requirement. However, the behavior on each visit is not necessarily independent. A person may come back to your landing page a number of times, and their current behavior would obviously be influenced by their previous visits. You might also have a bug or an overload condition where the actions of some users influence the actions that other users can take. For this reason it is best to use a fresh stream of visitors (with a minimal percentage of repeat visitors if possible) for your landing page test audience. Repeat visitors are by definition biased because they have voluntarily chosen to return to your site, and are not seeing it for the first time at random. This is also a reason to avoid using landing page testing with an audience consisting of your in-house e-mail list. The people on the list are biased because they have self-selected to receive ongoing messages from you, and because they have already been exposed to previous communications.
The event itself can also be more complicated than the simple did-the-visitor-convert determination. In an e-commerce catalog, it is important to know not only whether a sale happened, but also its value. If you were to tune only for higher conversion rate, you could achieve that by pushing low-margin and low-cost products that people are more likely to buy. But this would not necessarily result in the highest profits.
Statistical Methods
Landing page testing is a form of experimental study. The environment that you are changing is the design of your landing page. The outcome that you are measuring is typically the conversion rate. Landing page testing and tuning is usually done in parallel, and not sequentially. This means that you should split your available traffic and randomly alternate the version of your landing page shown to each new visitor. A portion of your test traffic should always see the original version of the page. This will eliminate many of the problems with sequential testing.
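As an illustration only, random assignment of visitors can be as simple as the following sketch; the variant names are hypothetical:

```python
import random

VARIANTS = ["original", "challenger"]   # hypothetical names; the control page is always in the mix

def assign_variant() -> str:
    """Randomly assign each new visitor to one of the page versions.

    Because assignment is random and the original is always included,
    all versions are measured in parallel rather than sequentially.
    """
    return random.choice(VARIANTS)

print(assign_variant())    # e.g. "original" or "challenger"
```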
Observational studies, by contrast, do not involve any manipulation or changes to the environment in question. You simply gather the data and then analyze it for any interesting correlations between your independent and dependent variables.
For example, you may be running PPC marketing programs on two different search engines. You collect data for a month on the total number of clicks from each campaign and the resulting number of conversions. You can then see if the conversion rate between the two traffic sources is truly different or possibly due to chance.
Descriptive statistics only summarize or describe the data that you have observed. They do not tell you anything about the meaning or implications of your observations. Proper hypothesis testing must be done to see if differences in your data are likely to be due to random chance or are truly significant.

Have I Found Something Better?
Landing page optimization is based on statistics, and statistics is based in turn on probability theory. And probability theory is concerned with the study of random events. But a lot of people might object that the behavior of your landing page visitors is not "random." Your visitors are not as simple as the roll of a die. They visit your landing page for a reason, and act (or fail to act) based on their own internal motivations.
So what does probability mean in this context? Let's conduct a little thought experiment.
Suppose I flip a fair coin and cover up the result after catching it in my hand. Before either of us looks, you would put the probability of heads at 50%. Now imagine that I peek at the coin without letting you see it. What would you estimate the probability of it coming up heads to be? Still 50%, right? How about me? I would no longer agree with you. Having seen the outcome of the flip, I would declare that the probability of heads is either zero or 100% (depending on what I have seen).

How can we experience the same event and come to two different conclusions? Who is correct? The answer is—both of us. We are basing our answers on different available information. Let's look at this in the context of the simplest type of landing page optimization. Let's assume that you have a constant flow of visitors to your landing page from a steady and unchanging traffic source. You decide to test two versions of your page design, and split your traffic evenly and randomly between them.
In statistical terminology, you have two stochastic processes (experiences with your landing pages), with their own random variables (visitors drawn from the same population), and their own measurable binary events (either visitors convert or they do not). The true probability of conversion for each page is not known, but must be between zero and one. This true probability of conversion is what we call the conversion rate and we assume that it is fixed.
From the law of large numbers you know that as you sample a very large number of visitors, the measured conversion rate will approach the true probability of conversion. From the Central Limit Theorem you also know that the chances of the actual value falling within three standard deviations of your observed mean are very high (99.7%), and that the width of the normal distribution will continue to narrow (depending only on the amount of data that you have collected). Basically, measured conversion rates will wander within ever narrower ranges as they get closer and closer to their true respective conversion rates. By seeing the amount of overlap between the two bell curves representing the normal distributions of the conversion rate, you can determine the likelihood of one version of the page being better than the other.
One of the most common questions in inferential statistics is to see if two samples are really different or if they could have been drawn from the same underlying population as a result of random chance alone. You can compare the average performance between two groups by using a t-test computation. In landing page testing, this kind of analysis would allow you to compare the difference in conversion rate between two versions of your site design. Let's suppose that your new version had a higher conversion rate than the original. The t-test would tell you if this difference was likely due to random chance or if the two were actually different.
There is a whole family of related t-test formulas based on the circumstances. The appropriate one for head-to-head landing page optimization tests is the unpaired one-tailed equal-variance t-test. The test produces a single number as its output. The higher this number is, the higher the statistical certainty that the two outcomes being measured are truly different.
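The following sketch shows how such a test could be run in Python, assuming SciPy 1.6 or later is available; the per-visitor outcomes are simulated with made-up conversion rates, so the numbers are purely illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical per-visitor outcomes: 1 = converted, 0 = did not convert.
rng = np.random.default_rng(0)
original   = rng.binomial(1, 0.020, size=10_000)   # assumed 2.0% true conversion rate
challenger = rng.binomial(1, 0.024, size=10_000)   # assumed 2.4% true conversion rate

# Unpaired, one-tailed, equal-variance t-test: is the challenger's mean
# conversion rate higher than the original's?
t_stat, p_value = stats.ttest_ind(challenger, original,
                                  equal_var=True, alternative="greater")

print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```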

Collecting Insufficient Data
Early in an experiment when you have only collected a relatively small amount of data, the measured conversion rates may fluctuate wildly. If the first visitor for one of the page designs happens to convert, for instance, your measured conversion rate is 100%. It is tempting to draw conclusions during this early period, but doing so commonly leads to error. Just as you would not conclude a coin could never come up tails after seeing it come up heads just three times, you should not pick a page design before collecting enough data.
The laws of probability only guarantee the accuracy and stability of results for very large sample sizes. For smaller sample sizes, a lot of slop and uncertainty remain.
The way to deal with this is to decide on your desired confidence level ahead of time. How sure do you want to be in your answer—90%, 95%, 99%, even higher? This completely depends on your business goals and the consequences of being wrong. If a lot of money is involved, you should probably insist on higher confidence levels.
Let's consider the simplest example. You are trying to decide whether version A or B is best. You have split your traffic equally to test both options and have gotten 90 conversions on A, and 100 conversions on B. Is B really better than A? Many people would answer yes since 100 is obviously higher than 90. But the statistical reality is not so clear-cut.
Confidence in your answer can be expressed by means of a Z-score, which is easy to calculate in cases like this. The Z-score tells you how many standard deviations away from the observed mean a value lies. Z=1 means that you are roughly 68% sure of your answer, Z=2 means about 95% sure, and Z=3 means about 99.7% sure.
Pick an appropriate confidence level, and then wait to collect enough data to reach it.
Let's pick a 95% confidence level for our earlier example. This means that you want to be right 19 out of 20 times. So you will need to collect enough data to get a Z-score of 2 or more.
The calculation of the Z-score depends on the standard deviation (σ). For conversion rates that are less than 30%, this formula is fairly accurate:

σ ≈ √(number of observed conversions)

In our example for B, the standard deviation would be √100 = 10.
So we are roughly 68% sure (Z=1) that the real value of B is between 90 and 110 (100 plus or minus 10). In other words, there is about a one-in-three chance that B's true value lies outside this range, so it could easily be no higher than A's observed 90; we may just be seeing a lucky streak for B.
Similarly at our current data amounts we are 95% sure (Z=2) that the real value of B is between 80 and 120 (100 plus or minus 20). So there is a good chance that the 90 conversions on A are actually better than the bottom end estimate of 80 for B.
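You can reproduce this arithmetic with a few lines of Python; the square-root rule below is the same rough approximation used above:

```python
import math

conversions_a = 90
conversions_b = 100

# Rough standard deviation for low conversion rates: sigma = sqrt(count).
sigma_b = math.sqrt(conversions_b)    # 10.0

for z in (1, 2):
    low = conversions_b - z * sigma_b
    high = conversions_b + z * sigma_b
    print(f"Z = {z}: B's true value is estimated to lie between {low:.0f} and {high:.0f}")
# Z = 1 gives 90 to 110, Z = 2 gives 80 to 120 -- A's 90 falls inside both ranges.
```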
Confidence levels are often illustrated with a graph. The error bars on the quantity being measured represent the range of values (the confidence interval) within which the true result is expected to fall at the selected confidence level. Figure 1 shows 95% confidence error bars (represented by the dashed lines) for our example. As you can see, the bottom of B's error bars is lower than the top of A's error bars, so the two ranges overlap. This implies that A might actually be higher than B, despite B's better showing in the current sample.

 Figure 1: Confidence error bars (little data)

If we wanted to be 95% sure that B is better than A, we would need to collect much more data. In our example, this level of confidence would be reached when A had 1,350 conversions and B had 1,500 conversions. Note that even though the ratio between A and B remains the same, the standard deviations are now much smaller relative to the measured values, thus raising the Z-score. As you can see from Figure 2, the confidence error bars have now "uncrossed," so you can be 95% confident that B actually is better than A.

Figure 2: Confidence error bars (more data)
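A similar quick calculation with the same rough square-root approximation shows how the 95% ranges behave at these larger counts; the code is a sketch, not a statement of the exact figures behind Figure 2:

```python
import math

def range_95(conversions):
    """Approximate 95% range (Z = 2) using sigma = sqrt(count)."""
    sigma = math.sqrt(conversions)
    return conversions - 2 * sigma, conversions + 2 * sigma

a_low, a_high = range_95(1350)    # roughly 1277 to 1423
b_low, b_high = range_95(1500)    # roughly 1423 to 1577

print(f"A: {a_low:.0f} to {a_high:.0f}")
print(f"B: {b_low:.0f} to {b_high:.0f}")
# Compared with the 80-to-120 versus 90 situation earlier, the two ranges now
# barely meet, so the overlap has essentially disappeared.
```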

Testing of Statistical Hypotheses

Understanding the Results
The null hypothesis in probability and statistics is the starting assumption that nothing other than random chance is operating to create the observed effect that you see in a particular set of data. Basically it assumes that the measured effects are the same across the independent conditions being tested. There are no differences or relationships between these independent variables and the dependent outcomes—equal until proven otherwise.
The null hypothesis is rejected if your data set is unlikely to have been produced by chance. The significance of the results is described by the confidence level that was defined by the test (as described by the acceptable error "alpha-level"). For example, it is harder to reject the null hypothesis at 99% confidence (alpha 0.01) than at 95% confidence (alpha 0.05).
Even if the null hypothesis is rejected at a certain confidence level, no alternative hypothesis is proven thereby. The only conclusion you can draw is that some effect is going on. But you do not know its cause. If the experiment was designed properly, the only things that changed were the experimental conditions. So it is logical to attribute a causal effect to them.
What if the null hypothesis is not rejected? This simply means that you did not find any statistically significant differences. That is not the same as stating that there was no difference. Remember, accepting the null hypothesis merely means that the observed differences might have been due simply to random chance, not that they must have been.
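To make the decision rule concrete, here is a minimal sketch, assuming SciPy is available, that runs a one-tailed two-proportion z-test on hypothetical conversion counts and compares the resulting p-value against alpha = 0.05 and alpha = 0.01:

```python
import math
from scipy.stats import norm

# Hypothetical results: conversions and visitors for two page versions.
conv_a, n_a = 90, 5000
conv_b, n_b = 100, 5000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = norm.sf(z)                                     # one-tailed p-value

for alpha in (0.05, 0.01):                               # 95% and 99% confidence
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: p = {p_value:.3f} -> {decision}")
```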
Some key concepts involved in the testing of hypotheses are described below.
In applied investigations or in experimental research, one may wish to estimate the yield of a new hybrid line of corn, but the ultimate purpose will involve some use of this estimate. One may wish, for example, to compare the yield of the new line with that of a standard line, and perhaps recommend that the new line replace the standard line if it appears superior. This is the common situation in research. One may wish to determine whether a new method of sealing light bulbs will increase the life of the bulbs, whether a new germicide is more effective in treating a certain infection than a standard germicide, whether one method of preserving foods is better than another so far as the retention of vitamins is concerned, or which one among six available varieties of a crop is best in terms of yield per hectare.
Using the light bulb example as an illustration, let us suppose that the average life of bulbs made under a standard manufacturing procedure is about 1400 hours. It is desired to test a new procedure for manufacturing the bulbs. Here, we are dealing with two populations of light bulbs: those made by the standard process and those made by the proposed process. From past investigations, based on sample tests, it is known that the mean of the first population is 1400 hours. The question is whether the mean of the second population is greater than or less than 1400 hours. This we have to decide on the basis of observations taken from a sample of bulbs made by the second process.
In making comparisons of the above type, one cannot rely on the mere numerical magnitudes of the index of comparison, such as the mean or variance. This is because each group is represented only by a sample of observations, and if another sample were drawn, the numerical value would change. This variation between samples from the same population can at best be reduced in a well-designed experiment, but it can never be eliminated. One is forced to draw inferences in the presence of the sampling fluctuations which affect the observed differences between the groups, clouding the real differences. Hence, we have to devise some statistical procedure which can test whether the differences are due to chance factors or really due to the treatment.
Tests of hypotheses are the statistical procedures that enable us to decide whether the observed differences are attributable merely to chance fluctuations of sampling or to real effects.
Sample space: The set of all possible outcomes of an experiment is called the sample space. It is denoted by S. For example, in an experiment of tossing two coins simultaneously, the sample space is S = {HH, HT, TH, TT}, where 'H' denotes the head and 'T' denotes the tail outcome. In testing of hypothesis, we are concerned with drawing inferences about the population based on a random sample. Let there be N units in a population from which we draw a sample of size n. Then the set of all possible samples of size n is the sample space, and any sample x = (x1, x2, …, xn) is a point of the sample space.
Parameter: A function of population values is known as a parameter. For example, the population mean (μ) and the population variance (σ²).
Statistic: A function of the sample values (x1, x2, …, xn) is called a statistic. For example, the sample mean x̄ = (x1 + x2 + … + xn)/n and the sample variance s² = Σ(xi − x̄)²/(n − 1).
A statistic does not involve any unknown parameter.
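For instance, Python's standard library can compute both of these statistics directly; the observations below are made up:

```python
import statistics

sample = [12.1, 9.8, 11.4, 10.6, 10.9]     # hypothetical observations

x_bar = statistics.mean(sample)             # sample mean
s_sq = statistics.variance(sample)          # sample variance (n - 1 divisor)

print(f"sample mean = {x_bar:.2f}, sample variance = {s_sq:.2f}")
```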
Statistical Hypothesis: A statistical hypothesis is an assertion or conjecture (tentative conclusion) either about the form of a distribution or about its parameters. For example:
i) The normal distribution has mean 20.
ii) The distribution of the process is Poisson.
iii) The effective life of a bulb is 1400 hours.
iv) A given detergent cleans better than any washing soap.
In a statistical hypothesis, all the parameters of a distribution may be specified completely or only partly. A statistical hypothesis in which all the parameters of a distribution are completely specified is called a simple hypothesis; otherwise, it is known as a composite hypothesis. For example, in the case of a normal population:
i) Mean (μ) = 20, variance (σ²) = 5 (simple hypothesis)
ii) μ = 20, σ² > 1 (composite hypothesis)
iii) μ = 20 (composite hypothesis)
Null Hypothesis:
The statistical hypothesis under test is called the null hypothesis. It usually states that the observations are the result purely of chance. It is denoted by H0.
Alternative Hypothesis: In respect of every null hypothesis, it is desirable to state what is called an alternative hypothesis, which is complementary to the null hypothesis. It can also be described as the hypothesis that the investigator hopes to establish. It usually states that the observations are the result of a real effect plus chance variation. It is denoted by H1. For example, if one wishes to compare the yield per hectare of a new line with that of a standard line, then the null hypothesis is:
H0: Yield per hectare of the new line (μ1) = yield per hectare of the standard line (μ2)
The alternative hypothesis corresponding to H0 can be one of the following:
i) H1: μ1 > μ2 (right-tailed alternative)
ii) H1: μ1 < μ2 (left-tailed alternative)
iii) H1: μ1 ≠ μ2 (two-tailed alternative)
(i) and (ii) are called one-tailed tests, and (iii) is a two-tailed test. Whether one sets up a one-tailed or a two-tailed test depends upon the conclusion to be drawn if H0 is rejected. The location of the critical region can be decided only after H1 has been stated. For example, in testing a new drug, one sets up the hypothesis that it is no better than similar drugs now on the market and tests this against the alternative hypothesis that the new drug is superior. Such an alternative hypothesis results in a one-tailed test (right-tailed alternative).
If we wish to compare a new teaching technique with the conventional classroom procedure, the alternative hypothesis should allow for the new approach to be either inferior or superior to the conventional procedure. Hence the test is two-tailed.
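The sketch below, which assumes SciPy is available and uses a hypothetical value of the test statistic, shows how the choice of alternative hypothesis changes the p-value computed from the same observation:

```python
from scipy.stats import norm

z = 1.8   # hypothetical observed value of a standardized test statistic

p_right = norm.sf(z)             # H1: mu1 > mu2  (right-tailed alternative)
p_left = norm.cdf(z)             # H1: mu1 < mu2  (left-tailed alternative)
p_two = 2 * norm.sf(abs(z))      # H1: mu1 != mu2 (two-tailed alternative)

print(f"right-tailed p = {p_right:.3f}")
print(f"left-tailed  p = {p_left:.3f}")
print(f"two-tailed   p = {p_two:.3f}")
```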
Critical Region: The critical region is the subset of the sample space such that the null hypothesis is rejected whenever the sample point falls in it. It is usually denoted by C.
Suppose that the test is based on a sample of size 2. Then the outcome set, or sample space, is the first quadrant of a two-dimensional space, and a test criterion enables us to separate this outcome set into two complementary subsets, C and C̄. If the sample point falls in the subset C, H0 is rejected; otherwise, H0 is accepted.
The terms acceptance and rejection are, however, not to be taken in their literal senses. Acceptance of H0 does not mean that H0 has been proved true. It means only that, so far as the given observations are concerned, there is no evidence to believe otherwise. Similarly, rejection of H0 does not disprove the hypothesis; it merely means that H0 does not look plausible in the light of the given observations.
It should now be clear that, in order to test the null hypothesis, we have to study a sample instead of the entire population. Hence, whatever decision rule we may employ, there is always a chance of committing an error in deciding to reject or accept the hypothesis. The four possible situations that can arise in any test procedure are given in the following table.

                 H0 is true            H0 is false
Accept H0        Correct decision      Type II error
Reject H0        Type I error          Correct decision

From the table, it is clear that the errors committed in making decisions are of two types.
Type I error:  Reject H0 when H0 is true.
Type II error: Accept (fail to reject) H0 when H0 is false.

For example, consider a judge who has to decide whether a person has committed a crime. The statistical hypotheses in this case are:
H0: Person is innocent;  H1: Person is criminal.

In this situation, two types of errors which the judge may commit are:
Type I error: Innocent person is found guilty and punished.
Type II error: A guilty person is set free.
Since it is more serious to punish an innocent person than to set a criminal free, the Type I error is considered more serious than the Type II error.
Probabilities of the errors:
Probability of Type I error = P(Reject H0 | H0 is true) = α
Probability of Type II error = P(Accept H0 | H1 is true) = β
In quality control terminology, Type I error amounts to rejecting a lot when it is good and Type II error may be regarded as accepting a lot when it is bad.
P (Reject a lot when it is good) = α (producer’s risk)
P (Accept a lot when it is bad) = β (consumer’s risk)
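These two error probabilities can be estimated by simulation. The sketch below is a generic illustration rather than the quality control setting: it assumes a right-tailed Z test of H0: mean = 0 with a 5% cutoff, unit variance, and a hypothetical true mean of 0.2 under H1:

```python
import random

random.seed(1)
ALPHA_CUTOFF = 1.645      # right-tailed 5% critical value for a standard normal Z
N, TRIALS = 100, 10_000   # sample size per test and number of simulated tests

def z_statistic(true_mean: float) -> float:
    """Z statistic for testing H0: mean = 0 against H1: mean > 0,
    computed from a sample of N observations with unit variance."""
    sample = [random.gauss(true_mean, 1) for _ in range(N)]
    return (sum(sample) / N) / (1 / N ** 0.5)

# Type I error: H0 is actually true (mean = 0) but the test rejects it.
alpha_hat = sum(z_statistic(0.0) > ALPHA_CUTOFF for _ in range(TRIALS)) / TRIALS

# Type II error: H1 is true (hypothetical mean 0.2) but the test accepts H0.
beta_hat = sum(z_statistic(0.2) <= ALPHA_CUTOFF for _ in range(TRIALS)) / TRIALS

print(f"estimated alpha = {alpha_hat:.3f}, estimated beta = {beta_hat:.3f}")
```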
Level of significance: The probability of Type I error (α) is called the level of significance. It is also known as the size of the critical region.
A 5% level of significance is commonly taken as a rough line of demarcation; at this level, deviations due to sampling fluctuations alone will be interpreted as real ones in 5% of cases. Inferences about the population based on samples are therefore always subject to some degree of uncertainty. It is not possible to remove this uncertainty completely, but it can be reduced by choosing a smaller level of significance, such as 1%, at which the chance of interpreting a deviation due to sampling fluctuations as a real one is only one in 100.
Power function of a Test: The probability of rejecting H0 when H1 is true is called the power function of the test.
Power function = P(Reject H0 | H1 is true)
               = 1 − P(Accept H0 | H1 is true)
               = 1 − β.
The value of this function plays the same role in hypothesis testing as the mean square error plays in estimation. It is usually used as the standard for assessing the goodness of a test or for comparing two tests of the same size. The value of this function at a particular parameter value is called the power of the test at that point.
In testing of hypothesis, the ideal procedure would be to minimize the probabilities of both types of errors. Unfortunately, for a fixed sample size n, both probabilities cannot be controlled simultaneously: a test that minimizes one type of error tends to maximize the other. For example, a critical region that makes the probability of Type I error zero would be of the form "always accept H0," and the probability of Type II error would then be one. It is therefore desirable to fix the probability of one of the errors and choose a critical region that minimizes the probability of the other. Since the Type I error is considered more serious than the Type II error, we fix the probability of Type I error (α) and minimize the probability of Type II error (β), thereby maximizing the power function of the test.
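As an illustration of fixing α and then studying power, the following sketch (assuming SciPy, a right-tailed one-sample Z test with unit variance, and a hypothetical effect size of 0.2) shows how power grows with sample size at a fixed 5% level:

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha)        # critical value of a right-tailed Z test

def power(effect_size: float, n: int) -> float:
    """Power of a right-tailed one-sample Z test with unit variance:
    the probability of rejecting H0 when the true mean shift is effect_size."""
    shift = effect_size * n ** 0.5  # how far the Z statistic is pushed under H1
    return norm.sf(z_crit - shift)

for n in (25, 100, 400):
    print(f"n = {n:4d}: power = {power(0.2, n):.2f}")   # power rises with n at fixed alpha
```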
Steps in solving a testing of hypothesis problem:
i) Acquire explicit knowledge of the nature of the population distribution and of the parameter(s) of interest, i.e., the parameter(s) about which the hypothesis is set up.
ii) Set up the null hypothesis H0 and the alternative hypothesis H1 in terms of the range of parameter values each one embodies.
iii) Choose a suitable test statistic, say t = t(x1, x2, …, xn), which will best discriminate between H0 and H1.
iv) Partition the sample space (the set of possible values of the test statistic t) into two disjoint and complementary subsets C and C̄ = A (say), and frame the test as follows:
(a) Reject H0 if the value of t falls in C.
(b) Accept H0 if the value of t falls in A.
After framing the above test, obtain the experimental sample observations, compute the test statistic, and take action accordingly; a worked sketch of these steps is given below.
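Putting the steps together for the light bulb example discussed earlier, here is a hedged Python sketch; the sample of lifetimes is simulated (not real data), and SciPy 1.6 or later is assumed for the one-sided alternative:

```python
import numpy as np
from scipy import stats

# Steps (i)-(ii): the population is bulb lifetimes from the new process;
# H0: mu = 1400 hours, H1: mu > 1400 hours (right-tailed alternative).
rng = np.random.default_rng(7)
sample = rng.normal(loc=1430, scale=80, size=40)   # hypothetical sample of 40 lifetimes

# Step (iii): a one-sample t statistic is a suitable test statistic here.
# Step (iv): the critical region is the right tail at the chosen level alpha.
alpha = 0.05
t_stat, p_value = stats.ttest_1samp(sample, popmean=1400, alternative="greater")

# Final step: compute the statistic from the observed sample and act accordingly.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")
```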