Exploration of Data

"Garbage in, garbage out" is the rule of data processing: wrong input data, or data with serious flaws, will always lead to incorrect conclusions and, often, to incorrect or harmful actions. In most practical situations it is hard to get good basic data, even in simple, non-controversial settings and with the best of intentions. Having obtained the basic data, the statistician needs various computing aids to process it, such as computers, calculators and mathematical tables. Owing to the limited computing capabilities of these aids, the calculations performed are not always exact and are subject to some approximation. This means that however fine the techniques a statistician may use, if the computations are inaccurate, the conclusions drawn from an analysis of the numerical data will generally be wrong and very often misleading. It is essential, therefore, to look into the sources of inaccuracy in numerical computations and the ways to avoid them.

In addition, before the data are actually processed, it must be ensured that the underlying assumptions of the desired analysis are satisfied. It is well known that classical statistical techniques behave optimally under a predefined set of conditions and perform badly in practical situations that depart significantly from these ideal assumptions. In such situations there is a need to look at the data carefully before finalizing the appropriate analysis. This involves checking the quality of the data for errors, outliers, missing observations or other peculiarities, and checking the underlying assumptions; the question also arises whether the data need to be modified in any way.

Further, the main purpose of classifying data and of giving graphical and diagrammatic representations is to indicate the nature of the distribution, i.e. to find the pattern or type of the distribution. Besides graphical and diagrammatic representations, there are certain arithmetical measures which give a more precise description of the distribution. Such measures also enable us to compare two similar distributions and are helpful in solving some important problems of statistical inference.
Thus there is a need to look into these aspects, i.e. inaccuracies, the checking of abnormal observations, violations of the underlying assumptions, and the summarization of data, including its graphical display.
The first step of data analysis is a detailed examination of the data. There are several important reasons for examining data carefully before the actual analysis is carried out. The first is to catch the mistakes that occur at various stages, from recording the data to entering them on the computer. The next step is to explore the data. The techniques of exploratory data analysis are very useful for getting quick information about the behaviour and structure of the data. Classical statistical techniques are designed to be best when stringent assumptions hold true, but they can fail miserably in practical situations where the data deviate from the ideal conditions. Thus the need in examining data is to look for methods that are robust and resistant, instead of being best only in a narrowly defined situation; the aim of exploratory data analysis is to find procedures that perform well over a broad range of situations. Its main purpose is to isolate patterns and features of the data which in turn are useful in identifying suitable models for analysis. Another feature of the exploratory approach is flexibility, both in tailoring the analysis to the structure of the data and in responding to patterns that successive steps of analysis uncover.

Graphical Representation of Data
The most common data structure is a batch of numbers. When the number of observations is large, this simple structure is difficult to study and scan thoroughly by just looking at it. To condense the data, there are a number of ways in which they can be represented graphically. The histogram is a commonly used display. The range of observed values is subdivided into equal intervals, and the number of cases in each interval is obtained. The length of each bar is directly proportional to the number of cases in the corresponding interval. A display closely related to the histogram is the stem-and-leaf plot.
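As a minimal illustration, the following Python sketch (assuming numpy is available) builds the interval counts of a histogram for the data used in the stem-and-leaf example below; the choice of five intervals is arbitrary:

import numpy as np

# Data values from the stem-and-leaf example in the next section.
data = np.array([22.9, 26.3, 26.6, 26.8, 26.9, 26.9, 27.5, 27.6, 27.6,
                 28.0, 28.4, 28.4, 28.5, 28.8, 28.8, 29.4, 29.9, 30.0,
                 30.3, 31.2, 31.8])

# Subdivide the observed range into equal intervals and count the
# cases falling in each one.
counts, edges = np.histogram(data, bins=5)

# Draw each bar with a length proportional to its count.
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:5.2f}, {hi:5.2f})  {'#' * int(n)}  ({n})")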

Stem-and-leaf Display
The stem-and-leaf plot provides more information about the actual values than does a histogram. As in the histogram, the length of each bar corresponds to the number of cases that fall into a particular interval. However, instead of representing all cases with the same symbol, the stem-and-leaf plot represents each case with a symbol that corresponds to the actual observed value. This is done by dividing each observed value into two components: the leading digit or digits, called the stem, and the trailing digit, called the leaf. The main purpose of the stem-and-leaf display is to throw light on the following:
            (1) Whether the pattern of observations is symmetric.
            (2) The spread or variation of the observations.
            (3) Whether a few values are far away from the rest.
            (4) Points of concentration in the data.
            (5) Areas of gaps in the data.

Example: Construct a stem-and-leaf display for the data values 22.9, 26.3, 26.6, 26.8, 26.9, 26.9, 27.5, 27.6, 27.6, 28.0, 28.4, 28.4, 28.5, 28.8, 28.8, 29.4, 29.9, 30.0, 30.3, 31.2, 31.8.
For the first data value, 22.9:

Data value        Split        Stem        Leaf
   22.9           22/9          22           9
Then we allocate a separate line in the display for each possible string of leading digits (the stem); the necessary lines run from 22 to 31. Finally, we write down the first trailing digit (the leaf) of each data value on the line corresponding to its leading digits.
                        (Leaf unit = 0.1 day)
22 : 9
23 :
24 :
25 :
26 : 3 6 8 9 9
27 : 5 6 6
28 : 0 4 4 5 8 8
29 : 4 9
30 : 0 3
31 : 2 8
Sometimes there are too many leaves per line (stem); in that case it is desirable to split the lines and repeat each stem:
            0*    (leaves 0 through 4)
            0.    (leaves 5 through 9)
            1*
            1.
            2*
            2.
In such a display, the interval width is 5 times a power of 10. If even two lines per stem are crowded, there is a third form, with five lines per stem:
            0*
             t
             f
             s
            0.
with leaves 0 and 1 on the * line, 2 and 3 on the t line (two, three), 4 and 5 on the f line (four, five), 6 and 7 on the s line (six, seven), and 8 and 9 on the . line.
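A rough Python sketch of the basic display (one line per stem, leaf unit 0.1) may make the construction concrete; the routine below is illustrative, not a standard library function:

from collections import defaultdict

data = [22.9, 26.3, 26.6, 26.8, 26.9, 26.9, 27.5, 27.6, 27.6, 28.0,
        28.4, 28.4, 28.5, 28.8, 28.8, 29.4, 29.9, 30.0, 30.3, 31.2, 31.8]

# Split each value into a stem (leading digits) and a leaf (trailing
# digit), e.g. 22.9 -> stem 22, leaf 9.
leaves = defaultdict(list)
for x in data:
    stem, leaf = divmod(round(x * 10), 10)
    leaves[stem].append(leaf)

# Allocate a separate line for every possible stem, including empty ones.
for stem in range(min(leaves), max(leaves) + 1):
    print(f"{stem:3d} : {' '.join(str(leaf) for leaf in sorted(leaves[stem]))}")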

The Box-plot
Both the histogram and the stem-and-leaf plot are useful for studying the distribution of observed values. A display that further summarizes information about the distribution of the values is the box-plot. Instead of plotting the actual values, a box-plot displays summary statistics of the distribution: the median, the 25th percentile, the 75th percentile, and values that deviate markedly from the rest. Fifty percent of the cases lie within the box. The length of the box corresponds to the interquartile range, the difference between the 1st and 3rd quartiles. The box-plot identifies as extreme values those cases more than 3 box-lengths from the upper or lower edge of the box, and characterizes as outliers those more than 1.5 box-lengths from the edge. The largest and smallest observed values are also part of the box-plot, marked by the ends of the lines extending from the box. The median, a measure of location, lies within the box, and the length of the box depicts the spread or variability of the observations. If the median is not in the centre of the box, the values are skewed: if it is closer to the bottom of the box than to the top, the data are positively skewed; if it is closer to the top, the data are negatively skewed.
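The summary statistics behind the box-plot can be sketched in a few lines of Python (numpy assumed; the fences at 1.5 and 3 box-lengths follow the description above, and the data are those of the earlier example):

import numpy as np

data = np.array([22.9, 26.3, 26.6, 26.8, 26.9, 26.9, 27.5, 27.6, 27.6,
                 28.0, 28.4, 28.4, 28.5, 28.8, 28.8, 29.4, 29.9, 30.0,
                 30.3, 31.2, 31.8])

# Median, quartiles and box length (interquartile range).
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Cases more than 1.5 box-lengths beyond the box edges are outliers;
# cases more than 3 box-lengths away are extreme values.
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
extremes = data[(data < q1 - 3.0 * iqr) | (data > q3 + 3.0 * iqr)]

print(f"median = {median:.1f}, Q1 = {q1:.1f}, Q3 = {q3:.1f}, IQR = {iqr:.1f}")
print("outliers:", outliers, " extreme values:", extremes)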

Spread-versus-level plot
When a comparison of batches shows a systematic relationship between the average value, or level, of a variable and the variability, or spread, associated with it, it is of interest to search for a re-expression, or transformation, of the raw data that reduces or eliminates this dependency. If such a transformation can be found, the re-expressed data will be better suited both for visual exploration and for analysis. It will also make analysis of variance techniques valid and more effective, since these require exactly or approximately equal variance across groups. The spread-versus-level plot is useful for finding an appropriate power transformation, i.e. a power (or exponent) p such that the transformation replaces x by x^p. The power can be estimated from the slope of the line in the plot of the log of the interquartile range against the log of the median: assuming IQR = c * Md^b, we have log IQR = log c + b log Md. The power is then obtained by subtracting the slope from 1 (power p = 1 - b). This is based on the fact that the transformation Z = x^(1-b) gives re-expressed values Z whose interquartile range, or spread, does not depend, at least approximately, on the level.
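A minimal sketch of this calculation, assuming numpy and three illustrative batches (any grouped data would do):

import numpy as np

batches = [np.array([1.2, 1.9, 2.3, 2.8, 3.1, 3.9]),
           np.array([4.1, 5.6, 7.2, 8.8, 10.3, 12.9]),
           np.array([15.0, 21.4, 27.9, 35.2, 44.8, 55.1])]

# For each batch, compute log(median) and log(interquartile range).
log_md = [np.log(np.median(b)) for b in batches]
log_iqr = [np.log(np.percentile(b, 75) - np.percentile(b, 25))
           for b in batches]

# Fit the line log IQR = log c + b log Md; the suggested power is 1 - b.
slope, intercept = np.polyfit(log_md, log_iqr, 1)
power = 1 - slope
print(f"slope b = {slope:.2f}, suggested power p = {power:.2f}")
# The data would then be re-expressed as x**power (a power near 0 is
# conventionally read as a log transformation).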
In addition to this graphical method of judging the independence of spread and level, there is a test known as the Levene test for the homogeneity of variances. Although a wide variety of tests is available for testing the equality of variances, many of them depend heavily on the data being samples from normal populations, whereas analysis of variance procedures are reasonably robust to departures from normality. The Levene test is a homogeneity-of-variance test that is less dependent on the assumption of normality than most such tests, which makes it all the more valuable alongside analysis of variance. It is obtained by computing, for each case, the absolute difference from its cell mean, and then performing a one-way analysis of variance on these differences.
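In practice the test is available ready-made; for instance, scipy provides it. The groups below are illustrative, and center='mean' gives the form described above, based on absolute deviations from the cell means:

import numpy as np
from scipy.stats import levene

group1 = np.array([22.9, 26.3, 26.6, 26.8, 26.9, 27.5])
group2 = np.array([27.6, 28.0, 28.4, 28.5, 28.8, 29.4])
group3 = np.array([29.9, 30.0, 30.3, 31.2, 31.8, 32.4])

# One-way ANOVA on the absolute deviations from each group mean.
stat, p = levene(group1, group2, group3, center='mean')
print(f"Levene W = {stat:.3f}, p-value = {p:.3f}")
# A small p-value suggests the group variances are not homogeneous.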