TEACHING STATISTICS USING MATLAB

This paper describes the use of Matlab as a scientific tool for teaching statistics in undergraduate courses. The application of this software to the teaching of difficult topics such as probability concepts, probability distributions, statistical significance, and significance tests is demonstrated. Matlab has proved to be a very effective tool in the educational process because it offers a simple and powerful means of analyzing and visualizing the results of numerical simulations and measurements.


Introduction
In the information era, there is a pressing need for statistically educated students. Conventional statistical education has focused on developing knowledge and on methodological skills, procedures, and computations. It was assumed that students would come to value the subject in the process of learning. But this approach has not worked and does not lead students to reason or think statistically [1].
Over the past few decades, increasing thought has been given to the teaching aspects of statistics education [2]. It is broadly recognised that statistics is one of the most important quantitative subjects in an undergraduate curriculum [3]. It is also recognised that teaching statistical courses is challenging because they serve students with varied backgrounds and abilities, many of whom have had negative experiences with statistics.
Statistics should be taught as a laboratory science, explored through a model for developing case labs for use in the undergraduate statistics class [4]. In this approach, students are exposed to the concepts of statistics and probability through case labs. The theory is first taught in a lecture format, and the relevant lab is then performed. This helps students to learn all aspects and extensions of the statistics and prepares them with the tools needed for the lab portion of the class. In this lab-based lecture class, students learn the fundamental ideas of statistics in the context of contemporary real-world situations.
Technology tools are increasingly becoming available to enhance and promote statistical understanding, but most of them have one thing in common today: the computer [5]. Computer-based learning has found its way into the learning process in undergraduate courses [6]. Computers also play a significant role in the teaching of sciences that serve technology, such as statistics. Many software applications, such as Matlab, are available for teaching. Matlab is considered a standard in technical computing and science.
Matlab is a very powerful software package that has many built-in tools for solving problems and developing graphical illustrations [7]. The simplest way to use Matlab is interactively: the user enters an expression and Matlab immediately responds with a result. It is also possible to write scripts and programs in Matlab, which are essentially groups of commands that are executed sequentially.
The aim of the study is to demonstrate different ways of applying Matlab software in the teaching of statistics. Specifically, the study intends to demonstrate how to test hypotheses and analyse errors, how to perform significance tests, and how to find the correlation between two or more variables.
This paper is organised as follows: Section 2 describes the research method. The research and analysis are presented in Section 3, while Section 4 gives the conclusion, followed by the references.

Research Method
Matlab is a high-performance language for technical computing. Matlab stands for matrix laboratory, which reflects its original application to matrix computations [8]. When Matlab is launched from the start menu, a window opens whose main part is the command window (Figure 1). In the command window one should see >>, which is called the prompt. In the command window, Matlab can be used interactively: at the prompt, any Matlab command or expression can be entered, and Matlab will immediately respond with the result. Some commands serve as an introduction to Matlab and allow one to get help: info displays contact information for the product; demo shows demos of some of the features of Matlab; help explains any command (help help explains how help works); helpbrowser opens a help window; and lookfor searches through the help for a specific word or phrase (note: this can take a long time). To exit Matlab, either type quit at the prompt or choose File, then Exit Matlab from the menu. Figure 2 shows the command window with some basic mathematical tasks performed. Figure 3 shows an empty script file. Scripts in Matlab are used to write basic code that implements some mathematical tasks, so that the code can be saved and edited.
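As a brief illustration of the interactive use and of scripts described above, the following lines could be typed at the prompt one at a time or saved together in a script file. The variable names and values are illustrative assumptions and are not taken from Figures 2 or 3.

```matlab
% Lines entered at the >> prompt (or saved together in a script file).
a = 3 + 4 * 2        % no semicolon: Matlab displays a = 11
b = sqrt(16);        % the semicolon suppresses the display; b is 4
c = a / b;           % c is 2.75

% Built-in help is available for any command, for example: help sqrt
fprintf('a = %d, b = %d, c = %.2f\n', a, b, c);
```

Ending a line with a semicolon is the usual choice inside scripts, where only selected results need to be shown.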
Vectors and matrices are used to store sets of values that are all of the same type. A vector can be either a row vector or a column vector. A matrix can be pictured as a table of values. The dimensions of a matrix are r x c, where r is the number of rows and c is the number of columns. This is read "r by c". If a vector has n elements, a row vector has the dimensions 1 x n, and a column vector has the dimensions n x 1. Because Matlab is written to work with matrices (the name is a short form of "matrix laboratory"), it is very easy to create vector and matrix variables, and there are many operations and functions that can be used on vectors and matrices.
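The dimensions described above can be checked directly in Matlab. The following is a minimal sketch; the values stored in the variables are arbitrary assumptions chosen for illustration.

```matlab
r = [1 2 3 4];          % row vector, dimensions 1 x 4
c = [1; 2; 3; 4];       % column vector, dimensions 4 x 1
M = [1 2 3; 4 5 6];     % matrix with 2 rows and 3 columns

[nr, nc] = size(M);     % nr is the number of rows, nc the number of columns
ct = r';                % transposing a row vector gives a column vector
s = sum(r);             % many functions operate on a whole vector at once
M2 = 2 * M;             % scalar multiplication applies to every element
```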
Turning engineering data into knowledge requires an ability to test hypotheses and to analyse errors, which in turn requires some understanding of probability and certain basic statistics. An underlying concept in statistics is that of a random variable. Random variables may be thought of as physical quantities whose values are not yet known. Since their values cannot be predicted, one may say that they depend on "chance".
After a series of data has been collected, the data are regarded as a set in the sense of probability theory, that is, a collection of objects about which it is possible to determine whether any particular object is a member of the set. In particular, the possible results of a series of measurements (or experiments) represent a set of points called the sample space. These points may be grouped together in various ways called events, and under suitable conditions probability functions may be assigned to each. The probabilities always lie between zero and one: an impossible event has a probability of zero, and a certain event has a probability of one.
When a sample space of points is considered to represent the possible outcomes of a particular series of measurements, a random variable x(j) is a set function defined for points j from the sample space. A random variable X can assume values x(j), which can be real numbers between $-\infty$ and $+\infty$, associated with each sample point that might occur. In other words, the random outcome of an experiment, indexed by j, can be represented by a discrete distribution of real numbers x(j), which are the possible values of X. A random variable is described by a function called the probability density function (PDF). The PDF is a measure of the density of probability of the random variable, plotted against a horizontal axis that represents the domain of possible values of the random variable. Thus, if X is a random variable, the PDF f(x) has a graph whose area is 1, since it is certain that x will have some value within its domain.
Each x(j) has a probability p(j). The discrete distribution function f(x) assigns the probability p(j) to each value x(j):

$$f(x(j)) = p(j), \qquad \sum_j p(j) = 1$$

The probability distribution function is then obtained by taking sums:

$$F(x) = \sum_{x(j) \le x} p(j)$$

The first moment of a probability distribution is the mean, and the first central moment about the mean is zero. The second moment about the mean is called the variance, $\sigma_x^2$, and its square root, $\sigma_x$, is called the standard deviation. The third moment about the mean is called the skewness, and it is zero for PDFs that are symmetric about the mean. The fourth moment about the mean is the kurtosis, which measures the "peakedness" of the distribution. The median divides the probability density distribution into two halves, such that there is a 50% chance for x to be less than the median and a 50% chance for it to be greater than the median. Figure 4 shows measures of central tendency of an arbitrary probability density distribution.

The statistical significance (p-level) of a result is an estimated measure of the degree to which it is true, in the sense of being representative of the population. The value of the p-level represents a decreasing index of the reliability of a result. The higher the p-level, the less one can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-level represents the probability of error that is involved in accepting our observed result as valid, that is, as representative of the population.
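As an illustration of the moments and the median defined earlier in this section, the following sketch computes them for a small discrete distribution. The values x(j) and probabilities p(j) are assumptions chosen for illustration, and the skewness and kurtosis are given in their commonly used standardised form (divided by powers of the standard deviation), which is also an assumption about the intended definition.

```matlab
x = [1 2 3 4 5];                       % possible values x(j) (assumed)
p = [0.1 0.2 0.4 0.2 0.1];             % probabilities p(j), summing to 1

F = cumsum(p);                         % distribution function by taking sums
mx = sum(x .* p);                      % first moment: the mean
m1 = sum((x - mx) .* p);               % first central moment (zero)
varx = sum((x - mx).^2 .* p);          % second central moment: the variance
sx = sqrt(varx);                       % standard deviation
skew = sum((x - mx).^3 .* p) / sx^3;   % standardised third central moment
kurt = sum((x - mx).^4 .* p) / sx^4;   % standardised fourth central moment
med = x(find(F >= 0.5, 1));            % smallest x(j) with F(x) >= 0.5
```

Because the assumed probabilities are symmetric about x = 3, the mean and the median coincide and the skewness is zero up to rounding error.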
There is no way to circumvent arbitrariness in the final decision as to what level of significance will be treated as truly significant. The selection of some level of significance is arbitrary. In practice, the final decision usually depends on whether the outcome was predicted a priori. Typically, in many sciences, results that yield p ≤ 0.05 are considered borderline statistically significant, but this level of significance still involves a fairly high probability of error (5%). Results that are significant at the p ≤ 0.01 level are commonly considered statistically significant, and p ≤ 0.005 or p ≤ 0.001 levels are often called highly significant.
The Gaussian or Normal distribution is important because many natural processes result in data that are normally or log-normally distributed. The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution. A random variable is said to follow a Gaussian (or normal) distribution if its probability density function is given by [9]:

$$f(x) = \frac{1}{\sigma_x \sqrt{2\pi}} \exp\left( -\frac{(x - \mu_x)^2}{2 \sigma_x^2} \right)$$

The Gaussian PDF is completely specified by the mean $\mu_x$ and the standard deviation $\sigma_x$. The Gaussian distribution is a bell-shaped curve, symmetric about the mean, with about 68% of its area within one standard deviation of the mean and about 95% within two standard deviations, as shown in Figure 5. In a Normal distribution, observations that have a standardized value of less than -2 or more than +2 have a relative frequency of 5% or less. A standardized value is a value expressed in terms of its difference from the mean, divided by the standard deviation.
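The properties of the Gaussian PDF stated above can be verified numerically. The sketch below uses only the built-in error function erf, so that no toolbox is required; the mean and standard deviation are assumed values, and the names f and F are chosen here for illustration.

```matlab
mu = 0;                                % mean (assumed value)
sigma = 1;                             % standard deviation (assumed value)

% Gaussian PDF and its distribution function, written with erf.
f = @(x) exp(-(x - mu).^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi));
F = @(x) 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))));

x = linspace(mu - 8 * sigma, mu + 8 * sigma, 2001);
area = trapz(x, f(x));                              % total area, close to 1

within1 = F(mu + sigma) - F(mu - sigma);            % about 0.68
within2 = F(mu + 2 * sigma) - F(mu - 2 * sigma);    % about 0.95
tails = 1 - within2;                                % below 0.05
```

Changing mu and sigma leaves within1, within2, and tails unchanged, which is why standardized values can be used for any Normal distribution.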
The Gaussian distribution is quite frequently used in engineering because of a result known as the central limit theorem, which states that the sum of many independent random variables tends to behave as a Gaussian random variable. This result implies that any physical process which is the sum of many random events is approximately Gaussian in its distribution. Unfortunately, this assumption often does not hold for some distributions of real data that one has to cope with. Most computers contain a random number generator, which produces numbers between 0 and 1 with an approximately uniform distribution, shown in Figure 6. Random numbers that are approximately Gaussian with zero mean and unit variance can be derived from these. With a computer program that generates uniformly distributed random numbers on the interval (0,1), one may compute the sum of 12 of them, which by the central limit theorem is approximately Gaussian, subtract the mean value (6), and obtain approximately Gaussian numbers with zero mean and unit variance.

Figure 6. Normal probability density and distribution functions [9]

Significance tests are based on certain assumptions: the data have to be random samples from a well-defined basic population, and one has to assume that some variables follow a certain distribution. In most cases the normal distribution is assumed.
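The construction of approximately Gaussian numbers from sums of 12 uniform random numbers, described earlier in this section, can be sketched with the built-in function rand. The sample size n is an assumption, and because the numbers are random the results vary slightly from run to run.

```matlab
n = 100000;                   % number of Gaussian-like values (assumed)
u = rand(12, n);              % each column holds 12 uniform numbers on (0,1)

% A sum of 12 such numbers has mean 12 * (1/2) = 6 and variance 12 * (1/12) = 1.
z = sum(u, 1) - 6;            % subtract the mean to centre the values at zero

m = mean(z);                  % close to 0
v = var(z);                   % close to 1
frac1 = mean(abs(z) <= 1);    % fraction within one standard deviation
```

For a Gaussian variable frac1 would be about 0.68, and the simulated value should be close to this. A histogram of z, for example with hist(z, 50), shows the familiar bell shape.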
The power of a test is the probability of correctly rejecting a false null hypothesis. This probability is one minus the probability of making a Type II error ($\beta$), which is the error that occurs when a false null hypothesis is retained. A null hypothesis is a hypothesis of no difference. If a null hypothesis is rejected when it is actually true, a Type I error has occurred. Other things being equal, decreasing the probability of making a Type I error will increase the probability of making a Type II error. The probability of correctly retaining a true null hypothesis has the same relationship to Type I errors as the probability of correctly rejecting an untrue null hypothesis does to Type II errors.
Whenever one tests whether a sample differs from a population, or whether two samples come from two separate populations, there is the assumption that each of the populations has its own mean and standard deviation. The distance between the two population means will affect the power of the test.
It should be noted that what really makes the difference in the size of $\beta$ (the probability of a Type II error) is how much overlap there is between the two distributions. When the means are close together, the two distributions overlap a great deal compared to when the means are farther apart. Thus, anything that increases the extent to which the two distributions share common values will increase $\beta$.
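The effect of the distance between the two population means on the Type II error and on the power can be sketched for a one-sided test of a mean with known standard deviation. This particular test, the significance level, the standard deviation, the sample size, and the distances are all assumptions chosen for illustration; only the built-in functions erf and erfinv are used.

```matlab
Phi = @(z) 0.5 * (1 + erf(z / sqrt(2)));        % standard normal distribution function

alphaLevel = 0.05;                              % Type I error probability (assumed)
zcrit = sqrt(2) * erfinv(1 - 2 * alphaLevel);   % one-sided critical value, about 1.645

sigma = 1;                                      % population standard deviation (assumed)
n = 25;                                         % sample size (assumed)
se = sigma / sqrt(n);                           % standard error of the sample mean

delta = [0.2 0.4 0.6];                          % distances between the two means (assumed)
betaErr = Phi(zcrit - delta / se);              % Type II error probability for each distance
pw = 1 - betaErr;                               % power of the test

disp([delta; betaErr; pw]);                     % one column per distance
```

As the means move farther apart, the two sampling distributions overlap less, so betaErr decreases and pw increases. Lowering alphaLevel raises zcrit and therefore raises betaErr.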
Many statistical methods are based on the assumption that data are normally distributed. If an initial histogram plot indicates that the data to be analysed may be normally distributed, another quick test can be performed before conducting more formal tests (e.g. a chi-square test). In order to determine whether a sample of data may have come from a Normal population, the best-fit Normal distribution is computed and compared with a histogram (Figure 7). The non-standardised variable x is plotted, and the area under the curve is equal to the total frequency times the histogram class interval.

Correlation is a measure of the relation between two or more variables [10]. The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. A value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation. The most widely used type of correlation coefficient is Pearson r, also called the linear or product-moment correlation.
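For the normality check described earlier in this section, the best-fit Normal curve can be scaled so that its area equals the total frequency times the class interval. The sketch below uses simulated data in place of a real sample; the sample, its size, and the class interval are assumptions, and the class counts are computed with a simple loop so that no toolbox is needed.

```matlab
x = 12 + 1.6 * randn(500, 1);          % simulated sample (assumed data)
m = mean(x);                           % best-fit mean
s = std(x);                            % best-fit standard deviation

width = 0.5;                           % histogram class interval (assumed)
edges = floor(min(x)):width:(ceil(max(x)) + width);
nb = numel(edges) - 1;                 % number of classes
counts = zeros(1, nb);
for k = 1:nb
    % observed frequency in the class [edges(k), edges(k+1))
    counts(k) = sum(x >= edges(k) & x < edges(k + 1));
end

centers = edges(1:end-1) + width / 2;  % class midpoints
pdfFit = exp(-(centers - m).^2 / (2 * s^2)) / (s * sqrt(2 * pi));
expected = numel(x) * width * pdfFit;  % curve with area = frequency * interval
```

The observed and fitted frequencies can then be compared visually, for example with bar(centers, counts) followed by hold on and plot(centers, expected).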
Pearson correlation (hereafter called correlation) assumes that the two variables are measured on at least interval scales, and it determines the extent to which values of the two variables are "proportional" to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used. Proportional means linearly related; that is, the correlation is high if the relation can be "summarized" by a straight line (sloped upwards or downwards).
This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Note that the concept of squared distances has important functional consequences for how the value of the correlation coefficient reacts to various specific arrangements of data. The correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, the resulting value ($r^2$, the coefficient of determination) represents the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). It is important to know this magnitude or "strength" as well as the significance of the correlation.
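The correlation coefficient, the least squares line, and the coefficient of determination described above can be computed with the built-in functions corrcoef, polyfit, and polyval. The data in this sketch are assumed values chosen for illustration.

```matlab
x = [1 2 3 4 5 6 7 8]';                          % first variable (assumed data)
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.1]';      % second variable (assumed data)

R = corrcoef(x, y);               % 2 x 2 matrix of correlation coefficients
r = R(1, 2);                      % Pearson r between x and y
r2 = r^2;                         % coefficient of determination

coef = polyfit(x, y, 1);          % least squares line: coef(1) slope, coef(2) intercept
yfit = polyval(coef, x);          % fitted values on the regression line
sse = sum((y - yfit).^2);         % sum of squared distances from the line
sst = sum((y - mean(y)).^2);      % total variation of y about its mean

% Changing the measurement units of x does not change the correlation.
Rs = corrcoef(100 * x, y);
rScaled = Rs(1, 2);
```

For a straight-line fit, r2 equals 1 - sse / sst, which links the squared distances minimised by the regression line to the proportion of common variation.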

Research Results
Some mathematical examples will be worked through using Matlab to show its many varied functions and uses.

Normal Distribution
Grades of chip samples from a body of ore have a normal distribution with a mean of 12% and a standard deviation of 1.6%. Find the probability that a chip sample taken at random will have a grade of: 1) 15% or less; 2) 14% or more; 3) 8% or less; 4) between 8% and 15%.
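One possible way to evaluate these four probabilities in Matlab is sketched below. It writes the Normal distribution function with the built-in error function erf so that no toolbox is required; this is an assumption about the approach rather than the authors' own listing, and the names F and p1 to p4 are chosen here for illustration.

```matlab
mu = 12;                               % mean grade in percent
sigma = 1.6;                           % standard deviation in percent

% Normal distribution function F(x) = P(X <= x), written with erf.
F = @(x) 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))));

p1 = F(15);                            % 1) grade of 15% or less
p2 = 1 - F(14);                        % 2) grade of 14% or more
p3 = F(8);                             % 3) grade of 8% or less
p4 = F(15) - F(8);                     % 4) grade between 8% and 15%

fprintf('p1 = %.3f, p2 = %.3f, p3 = %.3f, p4 = %.3f\n', p1, p2, p3, p4);
```

Rounded to three decimals, the probabilities are approximately 0.970, 0.106, 0.006, and 0.963. Where the Statistics Toolbox is installed, normcdf(15, 12, 1.6) should give the same value as F(15).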

Conclusion
This study has shown different ways in which Matlab software can be used in the teaching of various topics such as the normal distribution, the confidence interval of the mean, and linear regression. The use of Matlab software is proposed to increase the understanding of these difficult topics among undergraduate students. It is therefore possible that, with good course design that gives lecturers some degree of control over which topics are covered, Matlab software can be effective in improving performance in statistics.