C2 Real Distributions

The distribution of random variables is fundamentally important in statistics and econometrics, yet most students never grasp the concept and remain confused. This sequence of posts for Unit 1 of Real Statistics aims to address the nature of the problem. It offers a solution that enhances the learning process through pedagogical principles and simple real-world applications. A major source of misunderstanding is the inability to differentiate among the distinct types of distributions. Only one type of distribution refers to characteristics within a real population, whereas the other three do not exist in external reality and are purely theoretical and statistical concepts. Misunderstanding this elementary part results in a weak foundation for the complex concepts of probability and statistics. The first five posts therefore follow a learning path on basic concepts, adopting a unique methodology that also highlights the problems of conventional treatments.

1. Introduction: This post discusses the basic concept of a real distribution of characteristics within a real population. It is the only type of distribution that exists in external reality, and it is easier to understand because we have a finite parent population for reference. In fact, the parent population is the target of all the statistical inferences we make, and all real populations are finite. For illustration, consider a population (P) of persons in which we are interested in the distribution of gender. This is simply the percentage of males and females in the population. Here we have a characteristic function, gender (C(i)), that assigns a characteristic (male/female) to each member i of the population. We can easily extend this concept to other populations with distinct objects (i=1,2,...,N) and multiple categories, such as the distribution of voters in a city or of grades in a class. The distribution of a characteristic function is thus the percentage (p_k) of population members in each category k. The categories are mutually exclusive, because no member of the population can belong to more than one category, and exhaustive, because every member must belong to at least one category. This means that the percentage of the population in each category is zero or more (p_k ≥ 0) and that all the percentages sum to 100%. These two laws of percentages are also the fundamental “laws of probability”, because a percentage is naturally interpreted as the probability of a random draw from the population. Note that a distribution is independent of the population size and depends only on the percentages in each category. If every member is completely different from the others, each category has only one member and the distribution assigns 1/N to each category. The choice of categories, however, is made arbitrarily by the statistician according to the purpose of the study. Each way of categorizing gives a corresponding distribution composed of the percentages of members in each category.
This easy yet vital concept is the foundational brick on which more complex statistics is built.
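
The idea above can be sketched in a few lines of Python. This is only a minimal illustration, not part of the original posts: the function name `real_distribution`, the helper names, and the ten-person population are all hypothetical, and proportions are used in place of percentages (0.6 instead of 60%).

```python
from collections import Counter
import math

def real_distribution(population, characteristic):
    """Return {category: proportion} for a finite population.

    `characteristic` plays the role of C(i): it maps each member to
    exactly one category, so the categories are mutually exclusive
    and exhaustive by construction.
    """
    if not population:
        raise ValueError("population must be non-empty")
    counts = Counter(characteristic(member) for member in population)
    n = len(population)
    return {category: count / n for category, count in counts.items()}

# Hypothetical population of (identifier, gender) records; the data are made up.
persons = [(i, "male") for i in range(6)] + [(i, "female") for i in range(6, 10)]

def gender(person):
    return person[1]

def identity(person):
    return person[0]

dist = real_distribution(persons, gender)
print(dist)  # {'male': 0.6, 'female': 0.4}

# The two laws of percentages: every proportion is >= 0 and they sum to 1.
laws_hold = all(p >= 0 for p in dist.values()) and math.isclose(sum(dist.values()), 1.0)

# The distribution does not depend on population size: duplicating every
# member leaves the proportions unchanged.
doubled = real_distribution(persons * 2, gender)
same_after_doubling = all(math.isclose(doubled[c], dist[c]) for c in dist)

# If every member is its own category, each category gets 1/N.
finest = real_distribution(persons, identity)
```

Changing the function passed as `characteristic` changes the categorization, and with it the distribution, which mirrors the point that the choice of categories is up to the statistician.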

2. PP2: Building Confidence: Students need to learn concepts that start with objects and ideas within their experience, so that they can move forward with confidence. If we start with abstract concepts and theories outside their experience, they will get lost and lose confidence in their ability to learn. Following the fundamental yet easy concept of real distributions in the previous note, these pedagogical principles discuss how to build students' confidence to understand complex ideas. Lack of confidence is a major obstacle to learning and must be addressed satisfactorily. Students must believe that every human being is naturally endowed with infinite potential, which enables everyone to excel in any type of complex knowledge with the right training. Breaking difficult ideas into small subdivisions is a key element of this training. Several common causes of learning failure reduce self-confidence. The first is attempting learning steps that are too large. Steps must be aligned with the student's present capabilities, since the same goals can be reached by small steps without creating any doubt about one's own potential. The second difficulty arises when education has no real connection to the world. For instance, almost all textbooks introduce probability using hypothetical examples from an artificial world, and students are unable to transfer those solutions to the very different context of the real world. This failure can be avoided by learning through familiar real-world illustrations. Finally, an inappropriate level of challenge also hinders learning. If tasks are too difficult, learners are discouraged, and if they are too easy, knowledge does not improve. The conclusion is that failure to acquire knowledge is really a failure to follow the principles of learning. This can be solved by customizing lessons to each student's capabilities and by avoiding haste to cover the syllabus.
Students must develop a justified level of confidence by seeking support from Allah and being thankful for every small piece of learning. In this way, the long journey of knowledge can be completed step by step.

3. Real Statistics: This post explains the philosophy behind the development of this course. The Global Financial Crisis and several similar incidents have shown that the use of mathematics has badly failed economic policy. This happens because the mathematical theories in textbooks are not appropriately linked to real-world issues, which leaves students incapable of resolving those issues. The well-known statistician David Freedman also recognized that the assumptions of heavy mathematical theories do not hold in reality. He therefore took a novel approach in his pathbreaking textbook: first studying real-world issues, and then using statistical theories to solve them without any formulas or equations. This method is very useful for avoiding the distraction that complex mathematics creates in learners' thinking. I experienced a similar intellectual transition while working on my graduate textbook on advanced theoretical econometrics, when I realized that real applications require a highly intuitive approach to data. In this connection, I am developing a new approach to statistics based on Islamic principles for seeking knowledge. Differentiating between useful and useless knowledge is at the heart of this approach. Studying statistics in this way requires a focus on applications of theory that benefit humans. In searching for such theories, it became clear to me that the majority of the concepts we teach are not applicable to reality. This divide between theory and application is like the gap between knowledge of a car engine and the skill of driving. Using an applied approach and bypassing heavy theory has given us excellent teaching and learning outcomes in terms of quality and efficiency. This situation calls for restructuring the entire discipline of statistics from its fundamentals. The first post, on real distributions, is one such example: it discusses an alternative approach to the probability distribution, a concept that confuses most students.
These notes are part of the upcoming textbook Real Statistics: An Islamic Approach (RSIA). The introductory chapter of the book discusses the value of, and need for, an Islamic approach to knowledge, and how the effort of seeking knowledge trains our hearts.

4. Histograms: Pictures of Distribution: A real population can be categorized in several ways, each generating a different corresponding distribution. Histograms visualize real distributions so that we can understand and handle them better. Learning the concept of a histogram is therefore necessary before we proceed to the complex concepts of randomness and probability. Histograms enable us to see the distribution of a finite population P with N members. The population is a collection of any objects, P_N = {1, 2, 3, …, N}, conventionally termed the sample space. To measure some characteristic of the population, we apply a function to the sample space. For illustration, consider a population of 20 students and a function G(j) that gives the gender of each student j in P_N. The function G(j) divides the population into two categories, male and female, and the distribution records the proportion of P_N in each. If there are 12 male students and 8 female students, the histogram of the gender distribution shows 60% male and 40% female. We can also use a histogram to see the distribution of age for the same population, where age is given in months. If no two students have the same age A(j), there is only one student in each category, so the age distribution has the same proportion (1/N) in every category. The age distribution changes completely if age is given in years: many students then fall into the same category, and we can see their proportions directly in the histogram. Histograms also make the technical meaning of a distribution very clear, since they provide a simple hands-on route to direct understanding. This background knowledge is enough to master the idea of probability without confusion.
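
The 20-student example can be reproduced with a small text histogram. The sketch below is illustrative only: the helper names `proportions` and `text_histogram` are hypothetical, and the ages in months are invented values chosen so that no two students share an age. Only the 12 male and 8 female counts come from the text.

```python
from collections import Counter

def proportions(values):
    """Map each category to its share of the population."""
    counts = Counter(values)
    n = len(values)
    return {category: counts[category] / n for category in sorted(counts)}

def text_histogram(dist, width=40):
    """Draw one bar per category; bar length is proportional to its share."""
    lines = []
    for category, p in dist.items():
        bar = "#" * round(p * width)
        lines.append(f"{category!s:>6} | {bar} {p:.0%}")
    return "\n".join(lines)

# G(j): gender of each of the 20 students (counts taken from the text).
genders = ["male"] * 12 + ["female"] * 8
gender_dist = proportions(genders)

# A(j): age in months. Assumed data: 20 distinct values, one per student.
ages_months = list(range(210, 230))
age_months_dist = proportions(ages_months)

# The same ages expressed in completed years give far fewer categories.
age_years_dist = proportions([m // 12 for m in ages_months])

print(text_histogram(gender_dist))
print(text_histogram(age_years_dist))
```

With age in months every category holds one student, so each bar has height 1/20. With age in years the same students collapse into three categories, and the bars show unequal proportions.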

5. Coarse and Fine Histograms: Every structure in this universe has several aspects that are revealed at different levels of examination. This is like macroscopic versus microscopic magnification: the former focuses on overall shape, and the latter examines the finer details of the structure. A histogram can likewise be used as a tool to examine the different patterns in data that appear as the magnification level varies. The post shows how the histogram gives a different picture of the same data when the data are categorized differently. The categories are termed bins, and each bin contains the data points of one type. Consider a population of 30 students, where S(j) is a function giving the exam score of each student j in the class. The finest histogram presents the students' scores directly, like a table, because each score is its own category. This histogram has the largest number of bins, and only a few bins contain two students with exactly matching scores. We get a better picture by increasing the bin size, for example by categorizing scores into the grades A, B, C, D, E and F. Increasing the bin size reduces the number of categories in the histogram. Here three bin sizes, of 5, 10 and 15 points, are used to observe that the histogram becomes coarser as more scores are included in one category. A histogram can identify the modes of a distribution: the score distribution with bin size 15 is unimodal, since the highest proportion of students have grade D. However, the number of modes varies with bin size, because changing the level of coarseness changes the perspective on the information. The finest level does not provide an intuitive understanding of the data, whereas a coarser level reveals graphically a broader pattern that is not observable at first. The principles of “learning by doing” and “forest and trees” are both used here: the forest (abstract) view is illustrated with reference to the tree (concrete) idea of a score distribution. This approach is contrary to that of conventional textbooks, yet it is more beneficial, particularly for understanding probabilities.
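
The effect of bin size can be tried out directly. The following is a rough sketch under stated assumptions: the 30 scores are invented for illustration and are not the data used in the post, and `binned_counts` is a hypothetical helper that labels each bin by its lower edge.

```python
from collections import Counter

def binned_counts(scores, bin_width):
    """Count scores per bin; each bin is labeled by its lower edge."""
    counts = Counter((score // bin_width) * bin_width for score in scores)
    return dict(sorted(counts.items()))

# S(j): assumed exam scores for 30 students (made-up data).
scores = [35, 38, 41, 42, 44, 47, 48, 50, 51, 52,
          53, 55, 55, 56, 58, 59, 61, 62, 64, 65,
          67, 68, 70, 72, 72, 75, 78, 81, 85, 92]

# Bin width 1 is the finest histogram: one category per distinct score.
histograms = {width: binned_counts(scores, width) for width in (1, 5, 10, 15)}

for width, hist in histograms.items():
    print(f"bin width {width}: {len(hist)} bins")
    for lower_edge, count in hist.items():
        print(f"  {lower_edge:>3} | {'#' * count}")
```

Every histogram accounts for the same 30 students. Only the grouping changes, so the number of bins shrinks and the overall shape becomes easier to see as the bin width grows.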