Syllabus
Statistical Methods: Definition and scope of Statistics, concepts of statistical population and sample. Data: quantitative and qualitative, attributes, variables, scales of measurement: nominal, ordinal, interval and ratio. Presentation: tabular and graphical, including histogram and ogives, consistency and independence of data with special reference to attributes.
Measures of Central Tendency: mathematical and positional. Measures of Dispersion: range, quartile deviation, mean deviation, standard deviation, coefficient of variation, Moments, absolute moments, factorial moments, skewness and kurtosis, Sheppard’s corrections.
Bivariate data: Definition, scatter diagram, simple, partial and multiple correlation (3 variables only), rank correlation. Simple linear regression, principle of least squares and fitting of polynomials and exponential curves.
Index Numbers: Definition, construction of index numbers and problems thereof for weighted and unweighted index numbers including Laspeyre’s, Paasche’s, Edgeworth-Marshall and Fisher’s. Chain index numbers, conversion of fixed based to chain based index numbers and vice-versa. Consumer price index numbers.
Introduction
Many people are familiar with the term statistics. It denotes recording of numerical facts and figures, for example, the daily prices of selected stocks on a stock exchange, the annual employment and unemployment of a country, the daily rainfall in the monsoon season, etc. However, statistics deals with situations in which the occurrence of some events cannot be predicted with certainty. It also provides methods for organizing and summarizing facts and for using information to draw various conclusions.
Historically, the word statistics is derived from the Latin word status, meaning state. For several decades, statistics was associated solely with the display of facts and figures pertaining to the economic, demographic, and political situations prevailing in a country. As a subject, statistics now encompasses concepts and methods that are of far-reaching importance in all enquiries that involve planning or designing an experiment, gathering data by a process of experimentation or observation, and finally drawing inferences or conclusions by analyzing such data, which eventually helps in making future decisions.
Fact finding through the collection of data is not confined to professional researchers. It is a part of the everyday life of all people who strive, consciously or unconsciously, to know matters of interest concerning society, living conditions, the environment, and the world at large. Sources of factual information range from individual experience to reports in the news media, government records, and articles published in professional journals. Weather forecasts, market reports, cost-of-living indexes, and the results of public opinion polls are some other examples. Statistical methods are employed extensively in the production of such reports. Reports that are based on sound statistical reasoning and careful interpretation of conclusions are truly informative. However, the deliberate or inadvertent misuse of statistics leads to erroneous conclusions and distortions of truth.
Basic Concepts of Data Analysis
In order to clarify the preceding generalities, a few examples are provided:
Socioeconomic surveys: In the interdisciplinary areas of sociology, economics, and political science, studies address such aspects as the economic well-being of different ethnic groups, consumer expenditure patterns at different income levels, and attitudes toward pending legislation. Such studies are typically based on data obtained by interviewing or contacting a representative sample of persons selected by a statistical process from a large population that forms the domain of study. The data are then analyzed and interpretations of the issues in question are made.
Clinical diagnosis: Early detection is of paramount importance for the successful surgical treatment of many types of fatal diseases, for example, cancer or AIDS. Because frequent in-hospital checkups are expensive or inconvenient, doctors are searching for an effective diagnostic process that patients can administer themselves. To determine the merits of a new process in terms of its rate of success in detecting true cases while avoiding false detections, the process must be field tested on a large number of persons, who must then undergo in-hospital diagnostic tests for comparison. Therefore, proper planning (designing the experiments) and data collection are required, and the data then need to be analyzed for final conclusions.
Plant breeding: Experiments involving the cross fertilization of different genetic types of plant species to produce high-yielding hybrids are of considerable interest to agricultural scientists. As a simple example, suppose that the yields of two hybrid varieties are to be compared under specific climatic conditions. The only way to learn about the relative performance of these two varieties is to grow them at a number of sites, collect data on their yields, and then analyze the data.
In recent years, attempts have been made to treat all these problems within the framework of a unified theory called decision theory. Whether or not statistical inference is viewed within the broader framework of decision theory, it depends heavily on the theory of probability. This is a mathematical theory, but the question of subjectivity versus objectivity arises in its applications and in its interpretations. We shall approach the subject of statistics as a science, developing each statistical idea as far as possible from its probabilistic foundation and applying each idea to different real-life problems as soon as it has been developed.
Statistical data obtained from surveys, experiments, or any series of measurements are often so numerous that they are virtually useless, unless they are condensed or reduced into a more suitable form. Sometimes, it may be satisfactory to present data just as they are, and let them speak for themselves; on other occasions, it may be necessary only to group the data and present results in the form of tables or in a graphical form. The summarization and exposition of the different important aspects of the data is commonly called descriptive statistics. This idea includes the condensation of the data in the form of tables, their graphical presentation, and computation of numerical indicators of the central tendency and variability.
There are two main aspects of describing a data set:
Summarization and description of the overall pattern of the data by
Presentation of tables and graphs
Examination of the overall shape of the graphical data for important features, including symmetry or departure from it
Scanning graphical data for any unusual observations, which seem to stick out from the major mass of the data
Computation of the numerical measures for
A typical or representative value that indicates the center of the data
The amount of spread or variation present in the data
Summarization and description of the data can be done in different ways. For univariate data, the most popular methods are the histogram, bar chart, frequency table, box plot, and stem-and-leaf plot. For bivariate or multivariate data, the useful methods are scatter plots or Chernoff faces. A wonderful exposition of the different exploratory data analysis techniques can be found in Tukey (1977), and for some recent developments, see Theus and Urbanek (2008).
A typical or representative value that indicates the center of the data is the average value or the mean of the data. But since the mean is not a very robust estimate and is highly susceptible to outliers, the median is often used to represent the center of the data. For a symmetric distribution, the mean and the median are the same, but in general, they are different. Other than the mean or median, the trimmed mean or the Winsorized mean can also be used to represent the central value of a data set. The amount of spread or variation present in a data set can be measured using the standard deviation or the interquartile range.
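The following is a minimal sketch of how these measures respond to an outlier. The data set is hypothetical, and the helper functions trimmed_mean and winsorized_mean are illustrative names written for this example (they drop, or replace, the k smallest and k largest observations); only the Python standard library is assumed.

```python
import statistics

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest observations
    by the nearest values that are kept."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    clipped = [min(max(v, low), high) for v in ordered]
    return statistics.mean(clipped)

# Hypothetical sample containing one outlier (95).
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

mean = statistics.mean(data)        # pulled upward by the outlier
median = statistics.median(data)    # barely affected by the outlier
tmean = trimmed_mean(data, 1)
wmean = winsorized_mean(data, 1)

sd = statistics.stdev(data)                    # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr = q3 - q1                                  # interquartile range

print(mean, median, tmean, wmean)
print(sd, iqr)
```

In this sample the mean is far above the median, while the trimmed and Winsorized means stay close to the median; similarly, the standard deviation is inflated by the outlier much more than the interquartile range.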
Qualitative data is non-statistical and is typically unstructured or semi-structured. This data isn't necessarily measured using the hard numbers that are used to develop graphs and charts. Instead, it is categorized based on properties, attributes, labels, and other identifiers.
Qualitative data can be used to ask the question, 'why'. It is investigative and asks open-ended questions to conduct the research. Generating this data from qualitative research is used for theorizations, interpretations, developing hypotheses, and initial understandings.
Real-world examples of qualitative data:
Product reviews
Interview transcripts
Texts and documents
Customer testimonials
Focus group responses
Notes and observations
Audio and video recordings
Survey and questionnaire labels and categories
Contrary to qualitative data, quantitative data is statistical and typically structured – meaning it is more rigid and defined. This data type is measured using numbers and values, making it a more suitable candidate for data analysis.
Whereas qualitative is open for exploration, quantitative data is much more concise and close-ended. It can be used to ask 'how much' or 'how many,' followed by conclusive information.
Real-world examples of quantitative data:
Calculations (annual revenue)
Measurements (height, width, and weight)
Counts (the number of people who signed up for the webinar)
Projections (predicted revenue increase as a percentage during a fiscal year)
Quantification of qualitative data (customer satisfaction score calculation based on ratings on a scale of 1 to 5)
Attributes and Variables
An attribute is a quality of an object (person, thing, etc.). Attributes are closely related to variables. A variable is a logical set of attributes. Variables can "vary" – for example, be high or low. How high, or how low, is determined by the value of the attribute (and in fact, an attribute could be just the word "low" or "high").
Age is an attribute that can be operationalized in many ways. It can be dichotomized so that only two values – "old" and "young" – are allowed for further data processing. In this case the attribute "age" is operationalized as a binary variable. If more than two values are possible and they can be ordered, the attribute is represented by an ordinal variable, such as "young", "middle age", and "old".
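As a rough illustration of operationalizing age in these two ways, consider the sketch below. The cutoff ages (40 for the binary split, 30 and 60 for the three ordered categories) and the sample ages are assumptions chosen for this example only, not values given in the text.

```python
def age_binary(age, cutoff=40):
    """Dichotomize age into two values (a binary variable)."""
    return "old" if age >= cutoff else "young"

# Ordered categories for the ordinal version of the variable.
ORDERED_LEVELS = ["young", "middle age", "old"]

def age_ordinal(age, cutoffs=(30, 60)):
    """Map age to one of three ordered categories (an ordinal variable)."""
    if age < cutoffs[0]:
        return "young"
    if age < cutoffs[1]:
        return "middle age"
    return "old"

ages = [23, 35, 47, 61, 72]  # hypothetical ages in years
binary = [age_binary(a) for a in ages]
ordinal = [age_ordinal(a) for a in ages]

# Ordinal categories can be ranked by their position in ORDERED_LEVELS.
ranks = [ORDERED_LEVELS.index(level) for level in ordinal]

print(binary)
print(ordinal)
print(ranks)
```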
Levels of measurement, also called scales of measurement, tell you how precisely variables are recorded. In scientific research, a variable is anything that can take on different values across your data set (e.g., height or test scores).
There are 4 levels of measurement:
Nominal: the data can only be categorized
Ordinal: the data can be categorized and ranked
Interval: the data can be categorized, ranked, and evenly spaced
Ratio: the data can be categorized, ranked, evenly spaced, and have a natural zero
Depending on the level of measurement of the variable, what you can do to analyze your data may be limited. There is a hierarchy in the complexity and precision of the level of measurement, from low (nominal) to high (ratio).
Going from lowest to highest, the 4 levels of measurement are cumulative. This means that they each take on the properties of lower levels and add new properties.
Nominal level: You can categorize your data by labelling them in mutually exclusive groups, but there is no order between the categories.
Examples: City of birth, Gender, Ethnicity, Car brands, Marital status, etc.
Ordinal level: You can categorize and rank your data in an order, but you cannot say anything about the intervals between the rankings. Although you can rank the top 5 Olympic medallists, this scale does not tell you how close or far apart they are in number of wins.
Examples: Top 5 Olympic medallists, Language ability (e.g., beginner, intermediate, fluent), Likert-type questions (e.g., very dissatisfied to very satisfied), etc.
Interval level: You can categorize, rank, and infer equal intervals between neighboring data points, but there is no true zero point. The difference between any two adjacent temperatures is the same: one degree. But zero degrees is defined differently depending on the scale – it doesn’t mean an absolute absence of temperature. The same is true for test scores and personality inventories. A zero on a test is arbitrary; it does not mean that the test-taker has an absolute lack of the trait being measured.
Examples: Test scores (e.g., IQ or exams), Personality inventories, Temperature in Fahrenheit or Celsius, etc.
Ratio level: You can categorize, rank, and infer equal intervals between neighboring data points, and there is a true zero point. A true zero means there is an absence of the variable of interest. In ratio scales, zero does mean an absolute lack of the variable. For example, in the Kelvin temperature scale, there are no negative degrees of temperature – zero means an absolute lack of thermal energy.
Examples: Height, Age, Weight, Temperature in Kelvin, etc.
The level at which you measure a variable determines how you can analyze your data.
The different levels limit which descriptive statistics you can use to get an overall summary of your data, and which type of inferential statistics you can perform on your data to support or refute your hypothesis.
In many cases, your variables can be measured at different levels, so you have to choose the level of measurement you will use before data collection begins.
Example of a variable at 2 levels of measurement:
You can measure the variable of income at an ordinal or ratio level.
Ordinal level: You create brackets of income ranges: $0–$19,999, $20,000–$39,999, and $40,000–$59,999. You ask participants to select the bracket that represents their annual income. The brackets are coded with numbers from 1–3.
Ratio level: You collect data on the exact annual incomes of your participants.
At a ratio level, you can see that the difference between A and B’s incomes is far greater than the difference between B and C’s incomes.
At an ordinal level, however, you only know the income bracket for each participant, not their exact income. Since you cannot say exactly how much each income differs from the others in your data set, you can only order the income levels and group the participants.
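The contrast between the two levels can be sketched as follows. The bracket boundaries and the codes 1–3 come from the example above; the exact incomes of participants A, B and C are hypothetical figures chosen so that A and B differ far more than B and C.

```python
# Income brackets from the example, coded 1-3.
BRACKETS = [(0, 19_999), (20_000, 39_999), (40_000, 59_999)]

def bracket_code(income):
    """Return the ordinal code (1-3) of the bracket containing the income."""
    for code, (low, high) in enumerate(BRACKETS, start=1):
        if low <= income <= high:
            return code
    raise ValueError("income falls outside the listed brackets")

# Hypothetical exact incomes (ratio level).
incomes = {"A": 58_000, "B": 21_000, "C": 19_500}

# Ratio level: differences between incomes are meaningful.
gap_ab = incomes["A"] - incomes["B"]
gap_bc = incomes["B"] - incomes["C"]

# Ordinal level: only the bracket code of each participant is known.
codes = {name: bracket_code(value) for name, value in incomes.items()}

print(gap_ab, gap_bc)
print(codes)
```

With the exact incomes, the gap between A and B is many times the gap between B and C. With the bracket codes alone, the three participants simply fall in consecutive brackets, and the size of the gaps cannot be recovered.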
Descriptive statistics help you get an idea of the “middle” and “spread” of your data through measures of central tendency and variability.
When measuring the central tendency or variability of your data set, your level of measurement decides which methods you can use based on the mathematical operations that are appropriate for each level.
The methods you can apply are cumulative; at higher levels, you can apply all mathematical operations and measures used at lower levels.
Arithmetic mean is the most commonly used type of mean. A geometric mean is a method used for averaging values from scales with widely varying ranges for individual subjects. You can then compare the subject level means with each other. While an arithmetic mean is based on adding values, a geometric mean multiplies values.
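A small sketch of the difference between adding and multiplying is given below. The three values are hypothetical, and statistics.geometric_mean is assumed to be available (it is part of the Python standard library from version 3.8 onward). The last line shows the equivalent calculation through logarithms: the geometric mean is the exponential of the arithmetic mean of the logs.

```python
import math
import statistics

values = [1, 10, 100]  # hypothetical values spanning very different magnitudes

# Arithmetic mean: (1 + 10 + 100) / 3
arithmetic = statistics.mean(values)

# Geometric mean: cube root of 1 * 10 * 100
geometric = statistics.geometric_mean(values)

# Same quantity computed via logarithms.
geometric_via_logs = math.exp(statistics.mean([math.log(v) for v in values]))

print(arithmetic, geometric, geometric_via_logs)
```

Here the arithmetic mean (37) is dominated by the largest value, whereas the geometric mean (about 10) sits at the middle of the values on a multiplicative scale.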
Relative standard deviation is simply the standard deviation divided by the mean. If you use it on temperature measures in Celsius, Fahrenheit and Kelvin, you’d get 3 totally different answers. The only meaningful answer is the one based on a scale with a true zero, the Kelvin scale.
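To make this concrete, the sketch below computes the relative standard deviation of the same three temperature readings expressed in Celsius, Fahrenheit and Kelvin. The readings are hypothetical, and the function name coefficient_of_variation is chosen for this illustration; the sample standard deviation is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """Relative standard deviation: sample standard deviation / mean."""
    return statistics.stdev(values) / statistics.mean(values)

celsius = [10.0, 20.0, 30.0]                      # hypothetical readings
fahrenheit = [c * 9 / 5 + 32 for c in celsius]    # 50, 68, 86
kelvin = [c + 273.15 for c in celsius]            # 283.15, 293.15, 303.15

cv_c = coefficient_of_variation(celsius)
cv_f = coefficient_of_variation(fahrenheit)
cv_k = coefficient_of_variation(kelvin)

print(cv_c, cv_f, cv_k)
```

The three results differ (roughly 0.50, 0.26 and 0.03) even though the underlying temperatures are identical, because shifting the zero point changes the mean but not the spread. Only the Kelvin figure refers to a true zero.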
Types of Data
A random variable, or simply variable, is a characteristic of a population or sample.
Examples: Student grades, which vary from student to student; and stock prices, which vary from stock to stock as well as over time.
Typically denoted by a capital letter: X, Y, Z, ...
The values of a variable are possible observations or realizations of that variable. The possible values of a variable usually land in a specified range. Examples: Student Grades: the interval [0, 100]. Stock Prices: nonnegative real numbers.
Data are the observed values of a variable. Examples: Grades of a sample of students: {34, 78, 64, 90, 76} Prices of stocks in a portfolio: {$54.25, $42.50, $48.75}
Data fall into three main groups:
Interval Data:
Real numbers, e.g., heights, weights, prices, etc. Also referred to as quantitative or numerical data.
Arithmetic operations can be performed on interval data.
Nominal Data:
Names or categories, e.g., {Male, Female} and {Single, Married, Divorced, Widowed}. Also referred to as qualitative or categorical data.
Ordinal Data:
Ordinal Data are also categorical in nature, but their values have an order.
Example: Course Ratings: Poor, Fair, Good, Very Good, Excellent. Student Grades: F, D, C, B, A.
Taste Preferences: First Choice, Second Choice, Last Choice.
Thus, information is lost as we move down this hierarchy.
The tabular method of data presentation is widespread in all spheres of human life. Tabular methods are used to summarize data from a sample or population in table format. Data are grouped into categories and the number (or frequency) of observations in each category is obtained.
Frequency distribution is a type of tabular method. A frequency distribution is a tabular summary of data showing the frequency of items in each of several non-overlapping classes. The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.
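A frequency distribution of this kind can be sketched as below. The first five grades are the sample from the earlier example; the remaining five grades, the starting point of 30 and the class width of 10 are assumptions made for illustration. Each class is taken to include its lower limit and exclude its upper limit, so the classes do not overlap, and all values are assumed to be at or above the starting point.

```python
from collections import Counter

def frequency_distribution(values, start, width):
    """Count observations in non-overlapping classes [lower, lower + width).

    Assumes every value is greater than or equal to start.
    """
    counts = Counter(int((v - start) // width) for v in values)
    table = []
    for index in range(max(counts) + 1):
        lower = start + index * width
        table.append((lower, lower + width, counts.get(index, 0)))
    return table

# Grades: the first five appear earlier in the text, the rest are hypothetical.
grades = [34, 78, 64, 90, 76, 55, 61, 72, 88, 47]

table = frequency_distribution(grades, start=30, width=10)
for lower, upper, frequency in table:
    print(f"{lower}-{upper}: {frequency}")
```

The frequencies add up to the number of observations, which is a useful check when a table is built by hand.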
Graphical methods are applied to visually describe data from a sample or population. The shape of sample data can indicate the shape of the population from which it is taken. Graphs provide visual summaries of data which describe essential information more quickly and completely than tables of numbers.
Graphs are essential as these provide insight for the analyst into the data under scrutiny, and illustrate important concepts when presenting the results to others. A graphical method is developed which signifies the accuracy of test results. The graphs can be constructed from Producer's scores and Consumer's scores on each of the scales of test score, antigen dose and probability of protection against disease.
General Rules for Drawing Graphs, Diagrams and Maps
Selection of a Suitable Method:
Data represent various themes such as temperature, rainfall, growth and distribution of the population, production, distribution and trade of different commodities, etc. These characteristics of the data need to be suitably represented by an appropriate graphical method. For example, data related to the temperature or growth of population between different periods in time and for different countries/states may best be represented using line graphs. Similarly, bar diagrams are suited best for showing rainfall or the production of commodities. The population distribution, both human and livestock, or the distribution of the crop producing areas may suitably be represented on dot maps and the population density using choropleth maps.
Selection of Suitable Scale:
The scale is used as a measure of the data for representation on diagrams and maps. Hence, the selection of a suitable scale for the given data sets should be made carefully and must take into consideration the entire data that is to be represented. The scale should neither be too large nor too small.
Design:
We know that design is an important cartographic task. The following components of cartographic design are important and should be carefully shown on the final diagram/map.
Title: The title of the diagram/map indicates the name of the area, reference year of the data used and the caption of the diagram. These components are represented using letters and numbers of different font sizes and thickness. Besides, their placing also matters. Normally, title, subtitle and the corresponding year are shown in the centre at the top of the map/diagram.
Legend: A legend or index is an important component of any diagram/map. It explains the colours, shades, symbols and signs used in the map and diagram. It should also be carefully drawn and must correspond to the contents of the map/diagram. Besides, it also needs to be properly positioned. Normally, a legend is shown either at the lower left or lower right side of the map sheet.
Direction: The maps, being a representation of a part of the earth’s surface, need to be oriented to the directions. Hence, the direction symbol, i.e. North, should also be drawn and properly placed on the final map.
Construction of Diagrams
The data possess measurable characteristics such as length, width and volume. The diagrams and the maps that are drawn to represent these data related characteristics may be grouped into the following types:
One-dimensional diagrams such as line graph, poly graph, bar diagram, histogram, age-sex pyramid, etc.;
Two-dimensional diagrams such as pie diagram and rectangular diagram;
Three-dimensional diagrams such as cube and spherical diagrams.
It would not be possible to discuss the methods of construction of all these types of diagrams and maps, primarily due to the time constraint. We will, therefore, describe the most commonly drawn diagrams and maps and the way they are constructed. These are:
Line graphs
The line graphs are usually drawn to represent the time series data related to the temperature, rainfall, population growth, birth rates and the death rates. Table 3.1 provides the data used for the construction of Fig 3.2.
Construction of a Line Graph:
Simplify the data by converting it into round numbers; for example, the growth rates of population shown in Table 3.1 for the years 1961 and 1981 may be rounded to 2.0 and 2.2 respectively.
Draw the X and Y axes. Mark the time series variables (years/months) on the X-axis and the data quantity/value to be plotted (growth of population in per cent or the temperature in °C) on the Y-axis.
Choose an appropriate scale and label it on the Y-axis. If the data involve a negative figure, then the selected scale should also show it, as in Fig. 3.1.
Plot the data to depict year/month-wise values according to the selected scale on the Y-axis, mark the location of the plotted values by dots and join these dots by a freehand line.
1. Bar diagrams:
The bar diagrams are drawn through columns of equal width. A bar diagram is also called a columnar diagram. The following rules should be observed while constructing a bar diagram:
(a) The width of all the bars or columns should be the same.
(b) All the bars should be placed at equal intervals/distances.
(c) Bars may be shaded with colours or patterns to make them distinct and attractive.
The simple, compound or polybar diagram may be constructed to suit the data characteristics.
Simple Bar Diagram:
A simple bar diagram is constructed for an immediate comparison. It is advisable to arrange the given data set in an ascending or descending order and plot the data variables accordingly. However, time series data are represented according to the sequencing of the time period.
Example: Construct a simple bar diagram to represent the rainfall data of Thiruvananthapuram as given in Table 3.3:
Line and Bar Graph:
The line and bar graphs as drawn separately may also be combined to depict the data related to some of the closely associated characteristics such as the climatic data of mean monthly temperatures and rainfall. In doing so, a single diagram is drawn in which months are represented on X-axis while temperature and rainfall data are shown on Y-axis at both sides of the diagram.
Example: Construct a line graph and bar diagram to represent the average monthly rainfall and temperature data of Delhi as given in Table 3.4.
Construction:
(a) Draw X and Y-axes of a suitable length and divide X-axis into 12 parts to show months in a year.
(b) Select a suitable scale with equal intervals of 5° C or 10° C for temperature data on the Y-axis and label it at its right side.
(c) Similarly, select a suitable scale with equal intervals of 5 cm or 10 cm for rainfall data on the Y-axis and label at its left side.
(d) Plot temperature data using line graph and the rainfall by bar diagram as shown in Fig. 3.5.
Multiple Bar Diagram:
Multiple bar diagrams are constructed to represent two or more than two variables for the purpose of comparison. For example, a multiple bar diagram may be constructed to show proportion of males and females in the total, rural and urban population or the share of canal, tube well and well irrigation in the total irrigated area in different states.
Example: Construct a suitable bar diagram to show decadal literacy rate in India during 1951 – 2001 as given in Table 3.5:
Construction:
(a) Multiple bar diagram may be chosen to represent the above data.
(b) Mark time series data on X-axis and literacy rates on Y-axis as per the selected scale.
(c) Plot the per cent of total population, male and female in closed columns (Fig 3.6).
Compound Bar Diagram:
When different components are grouped in one set of variables or different variables of one component are put together, their representation is made by a compound bar diagram. In this method, different variables are shown in a single bar with different rectangles.
Example: Construct a compound bar diagram to depict the data as shown in Table 3.6.
Construction:
(a) Arrange the data in ascending or descending order.
(b) A single bar will depict the gross electricity generation in the given year, and the generation of thermal, hydro and nuclear electricity is shown by dividing the total length of the bar as in Fig 3.7.
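The division of a single bar into its components can be sketched as a running total, where each rectangle starts where the previous one ends. The generation figures below are hypothetical placeholders and are not the values of Table 3.6; the function name stacked_segments is likewise chosen for this illustration.

```python
def stacked_segments(components):
    """Return (name, start, end) for each rectangle of one compound bar.

    Each segment begins at the running total of the segments before it,
    so the end of the last segment equals the total length of the bar.
    """
    segments = []
    bottom = 0.0
    for name, value in components:
        segments.append((name, bottom, bottom + value))
        bottom += value
    return segments

# Hypothetical electricity generation for one year (arbitrary units).
year_data = [("thermal", 300.0), ("hydro", 80.0), ("nuclear", 20.0)]

segments = stacked_segments(year_data)
total_length = segments[-1][2]   # gross generation = full length of the bar

for name, start, end in segments:
    print(f"{name}: from {start} to {end}")
print("total:", total_length)
```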
Advantages: Bar diagrams are used in various industries as they are easily accessible and ideal for visual data representation. Some other benefits are as follows:
A bar diagram is easy to design both on paper and in computer software. All you need is the required data for comparison before selecting the type of bar diagram. For a few categories, the vertical bar diagram works perfectly, while for many categories, the horizontal bar diagram is more suitable. Furthermore, you can select a stacked bar diagram to break up categories into segments or a grouped bar diagram for representing data over time. The x-axis and y-axis can be easily labeled, and the bars can be drawn to corresponding values. For this reason, the bar diagram is a popular choice for visual data presentation.
The bar diagram is used in various fields across the globe. For instance, it has applications in epidemiology for understanding the spread of diseases and controlling them. Businesses widely use bar diagrams for analyzing sales and finances. Furthermore, it can also be employed for tracking personal finances.
The bar graph has multiple types suitable for accurate visual data representation. This streamlines comparison. For instance, to compare the sales of products for an online and an offline store across different months, the stacked bar diagram can be used. The bar itself conveys the sales of the online as well as the offline store each month. This helps businesses compare how well each of their stores is performing and where they need to focus their attention.
A large data set can be summarized using a bar diagram, which makes the analysis simpler. The graphical representation of the data set gives a better sense of the data to be analyzed. In addition, the spaces between bars emphasize that individual bars represent discrete values. For instance, suppose a business wants to understand the delivery time and status for on-peak and off-peak hours. The grouped bar diagram can be used to summarize the data in visual form and identify the areas that need improvement.
The bar diagram visually represents data that changes over time, which helps in pattern and trend identification. Compare this with a table of numerical data, and you will realize employing a bar diagram gets the job done more easily. While you would need an expert to recognize patterns from a table of numerical data, with a bar diagram even the beginner can read the label and recognize patterns and trends. That’s how easy visual representation of data with a bar diagram makes studying patterns and highlighting trends.
Disadvantages: Bar diagrams are widely utilized in presentations for representing data visually. But oftentimes, using only the bar diagram is not enough. Even though labels along both the axes explain the represented data, a further illustration is needed for clarity. That is a drawback of the bar diagram, especially when you are representing complex data sets. Relying only on the bar diagram to explain the data set is not sufficient.
In the digital times we live in, more and more people prefer a visual representation of content. That’s when the bar diagram comes into the picture. But sadly, considering how easily accessible it is and how quickly it can be shared across social media, its misuse is bound to happen. In recent times, bar diagrams are being manipulated in several ways. A few of these ways are omitting the baseline, manipulating the y-axis, and messing with the standard. All these bad practices are carried out to mislead and manipulate readers.
While a bar diagram can be used to calculate the work and time required for various activities in the project, unfortunately, it doesn’t showcase the interrelationship between these activities. Therefore, it cannot be used as a controlling tool. Since the relationship between these actions cannot be represented, using the bar diagram for managing projects becomes difficult.
A bar graph facilitates a systematic representation and arrangement of different activities in a project. But when it comes to monitoring the progress of these activities, the bar diagram cannot depict them. This makes it an undesirable tool as we live in a dynamic world where we need to take timely actions. Additionally, it makes detecting delays in activities difficult.
As we have mentioned earlier, bar diagrams require additional explanation to allow readers to easily understand them. Similarly, the bar diagram lacks fundamental causes, impact, and assumptions essential for analyzing data. Although you can represent the large data set in a visual and interpretable form, the bar diagram is suitable for small projects.
2. Pie diagram:
The pie diagram is another graphical method of representing data. It is drawn to depict the total value of the given attribute using a circle. The circle is divided into sectors whose angles represent the sub-sets of the data. Hence, it is also called a Divided Circle Diagram.
The angle for each category is calculated using the following formula:
Angle = (Value of given State/Region × 360) / (Total Value of All States/Regions)
If the data are given in percentage form, the angles are calculated using the following formula:
Angle = (Percentage of x × 360) / 100
For example, a pie diagram may be drawn to show total population of India along with the proportion of the rural and urban population. In this case the circle of an appropriate radius is drawn to represent the total population and its sub-divisions into rural and urban population are shown by corresponding degrees of angle.
Example: Represent the data as given in Table 3.7 (a) with a suitable diagram.
Calculation of Angles:
(a) Arrange the data on percentages of Indian exports in an ascending order.
(b) Calculate the degrees of angles for showing the given values of India’s export to major regions/countries of the world, Table 3.7 (b). This can be done by multiplying each percentage by the constant 3.6, obtained by dividing the total number of degrees in a circle by 100, i.e. 360/100.
(c) Plot the data by dividing the circle into the required number of divisions to show the share of India’s export to different regions/countries (Fig. 3.8).
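The angle calculation described above can be sketched as follows. The region names and percentage shares are hypothetical and are not the figures of Table 3.7; the function name pie_angles is chosen for this illustration. The same function works for raw values and for percentages, since a percentage share out of a total of 100 reduces to multiplying by 3.6.

```python
def pie_angles(values):
    """Return the central angle (in degrees) for each category.

    Angle = value * 360 / total of all values.
    """
    total = sum(values.values())
    return {name: value * 360 / total for name, value in values.items()}

# Hypothetical percentage shares, arranged in ascending order as in step (a).
shares = {"Region C": 20, "Region B": 30, "Region A": 50}

angles = pie_angles(shares)
for name, angle in angles.items():
    print(f"{name}: {angle} degrees")

# The angles of all sectors should add up to a full circle.
print(sum(angles.values()))
```

Checking that the angles sum to 360 degrees is a simple way to catch arithmetic slips before the circle is divided.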
Construction:
(a) Select a suitable radius for the circle to be drawn. A radius of 3, 4 or 5 cm may be chosen for the given data set.
(b) Draw a line from the centre of the circle to the arc as a radius.
(c) Measure the angles from the arc of the circle for each category in ascending order, clockwise, starting with the smallest angle.
(d) Complete the diagram by adding the title, sub-title, and the legend. A legend mark should be chosen for each variable/category and highlighted by distinct shades/colours.
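The calculation and construction steps can be sketched as follows. The function name sector_boundaries and the regional shares are hypothetical and are not the values of Table 3.7. The sketch only shows how the angles accumulate clockwise when the smallest sector is drawn first.

```python
def sector_boundaries(percentages):
    """Return (name, start_angle, end_angle) tuples, smallest sector first.

    Each percentage is multiplied by 3.6 (= 360 / 100) to get its angle,
    and the angles are accumulated from the starting radius.
    """
    ordered = sorted(percentages.items(), key=lambda item: item[1])
    boundaries = []
    start = 0.0
    for name, pct in ordered:
        end = start + pct * 3.6
        boundaries.append((name, round(start, 1), round(end, 1)))
        start = end
    return boundaries


# Hypothetical shares in percent; they are NOT the values of Table 3.7.
shares = {"Region A": 50, "Region B": 20, "Region C": 30}
sectors = sector_boundaries(shares)
for name, start, end in sectors:
    print(f"{name}: {start} to {end} degrees")
```

If the percentages add up to 100, the last sector ends at 360 degrees, closing the circle.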
Precautions:
(a) The circle should neither be too big to fit in the space nor too small to be illegible.
(b) Starting with the biggest angle leads to an accumulation of error, which makes the smaller angles difficult to plot.
Advantages:
Display relative proportions of multiple classes of data.
Require minimal additional explanation.
Summarize a large data set in visual form.
Pie charts are easily understood because of their widespread use in business and the media.
Pie charts permit a visual check of the reasonableness or accuracy of the calculation.
Pie charts are visually simpler than other types of graphs.
Size of the circle can be made proportional to the quantity it represents.
Disadvantages:
It does not easily reveal exact values. Values are expressed as percentages or ratios, so it is not easy to know the exact value represented.
The pie chart does not easily show changes over time
Pie charts fail to reveal key assumptions, causes, effects, or patterns
Pie charts can easily be manipulated to yield false impressions
Pie charts are less useful than bar graphs for accurate reading and interpretation when the series is divided into a large number of components or the difference among the components is very small.
If the given data has more than six categories the pie chart becomes very crowded and ugly. In such cases it is not advisable to use pie charts to represent the data.
If most of the sectors of the data are of roughly equal size then we cannot make visual comparisons between categories by simply looking at the pie chart.
Before drawing the pie chart, we need to do calculations of central angles for each category. These calculations are boring and tedious. On the other hand, no calculations are needed in order to draw simple bar graphs, line graphs, etc.
Pie charts cannot be used to represent time series data.
We cannot make comparisons between two sets of data with the help of a single pie chart. On the other hand we can draw two bars for each category to visually represent two sets of data in a single bar graph.
Circles are difficult to compare and thus pie charts are not very popular among professional statisticians.
3. Line Graph:
A line graph is a type of graph which is used to show information that changes over time. We can plot the line graph by joining several points with straight lines. A line graph is also called a line chart. A line graph contains two axes i.e., the x-axis and the y-axis.
Figure: The difference in median household income between California and the rest of the United States.
Advantages:
Easy to construct
Easy to interpret
Easy to read/estimate exact values.
Shows trend or movement over time.
Disadvantages:
Doesn’t give a clear impression of the quantity of data.
May give a false impression of the quantity, especially when there was no production.
Poor choice of vertical scale may exaggerate fluctuations in values.
Difficult to find exact values by interpolation.
4. Pictograph:
A pictograph is a way of graphing categorical data by using pictures to represent data items. In other words, pictographs represent statistical data using symbolic figures that match the frequencies of different kinds of data.
Advantages:
Easy to read: Since images, objects or symbols are used to represent numbers, pictographs are very easy to read. Reading a pictograph is intuitive and requires little effort.
Universally understandable: Pictographs are used globally, crossing language and cultural barriers, as they do not require exhaustive explanation to make sense of them.
Efficient Teaching Tool: Since pictographs are visually heavy, they serve as a good introduction when teaching children how to read and make sense of data.
Compression Capability: If the data set is big and cannot be represented by assigning a single value to each object, the scale of each object can be increased. For example, a pictograph can use a scale of eight, so that each symbol stands for eight items, to display more information in less space.
Disadvantages of Pictographs:
Lack of fractional representation: We cannot represent fractional values using pictographs. Only positive integers can be represented using pictographs.
Not ideal for large datasets: When the data to be represented are huge or complex, a pictograph might not be the best way to display them, as the pictorial representation could be confusing for some people.
Not useful for business settings: Since pictographs are very basic, they might not be ideal to represent and communicate data in formal business settings.
Contextual Challenges: If pictographs are used to display larger datasets but the scale isn’t mentioned, the understanding of presented data could be wrong.
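The scale idea, and the difficulty with values that are not whole multiples of the scale, can be sketched as a small text pictograph. The function name pictograph_rows, the symbol, and the counts are hypothetical and serve only as an illustration.

```python
def pictograph_rows(counts, scale, symbol="*"):
    """Return one (name, symbols, leftover) row per category.

    Each symbol stands for `scale` items. Counts that are not exact
    multiples of the scale cannot be drawn with whole symbols, so the
    leftover is reported instead of being hidden.
    """
    rows = []
    for name, count in counts.items():
        whole, leftover = divmod(count, scale)
        rows.append((name, symbol * whole, leftover))
    return rows


# Hypothetical data: items counted per category (illustration only).
fruit_counts = {"Apples": 24, "Bananas": 16, "Cherries": 20}
rows = pictograph_rows(fruit_counts, scale=8)
for name, symbols, leftover in rows:
    note = f" (+{leftover} not shown)" if leftover else ""
    print(f"{name:<9}{symbols}{note}")
print("Key: each * stands for 8 items")
```

Printing the key line matters: without the stated scale, a reader would take each symbol to mean one item.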
Histogram
A histogram is a graphical representation of a grouped frequency distribution with continuous classes. It is an area diagram: a set of rectangles whose bases lie along the intervals between class boundaries and whose areas are proportional to the frequencies in the corresponding classes. All the rectangles are adjacent, since the bases cover the intervals between class boundaries. When the classes are of equal width, the heights of the rectangles are proportional to the corresponding frequencies; when the classes are of unequal width, the heights are proportional to the corresponding frequency densities.
In other words, a histogram is a diagram of rectangles whose areas are proportional to the frequencies of a variable and whose widths are equal to the class intervals.
Follow the steps below to construct a histogram.
Begin by marking the class intervals on the X-axis and frequencies on the Y-axis.
Use a uniform scale along each axis; the two axes need not share the same scale.
Class intervals need to be exclusive.
Draw rectangles with bases as class intervals and corresponding frequencies as heights.
A rectangle is built on each class interval since the class limits are marked on the horizontal axis, and the frequencies are indicated on the vertical axis.
The height of each rectangle is proportional to the corresponding class frequency if the intervals are equal.
The area of every individual rectangle is proportional to the corresponding class frequency if the intervals are unequal.
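For unequal class intervals, the height of each rectangle is the frequency density, that is, the frequency divided by the class width. A minimal sketch follows; the function name frequency_densities and the grouped data are hypothetical.

```python
def frequency_densities(classes):
    """Return (lower, upper, frequency, density) for each class.

    density = frequency / class width, so that
    rectangle area = width * density = frequency.
    """
    result = []
    for lower, upper, freq in classes:
        width = upper - lower
        if width <= 0:
            raise ValueError("class width must be positive")
        result.append((lower, upper, freq, freq / width))
    return result


# Hypothetical grouped data with unequal, exclusive class intervals.
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 80, 8)]
table = frequency_densities(classes)
for lower, upper, freq, density in table:
    print(f"{lower}-{upper}: frequency={freq}, height={density}")
```

Here the class 20-40 has the largest frequency but not the tallest rectangle, because its width is twice that of the class 10-20.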
When to use a histogram: A histogram is used under the following conditions:
The data should be numerical.
A histogram is used to check the shape of the data distribution.
Used to check whether the process changes from one period to another.
Used to determine whether the output is different when it involves two or more processes.
Used to analyse whether the given process meets the customer requirements.
Difference Between Bar Graph and Histogram:
A histogram is one of the most commonly used graphs for showing a frequency distribution, which describes how often each different value occurs in the data set. A histogram looks similar to a bar graph, but the two differ: in a bar graph each bar represents a group defined by a categorical variable, whereas in a histogram each bar represents a class of a continuous, quantitative variable.
Histograms can be classified into different types based on the frequency distribution of the data. There are many types of distribution, such as the normal, skewed, bimodal, multimodal, comb, edge peak, dog food, and heart cut distributions, and a histogram can be used to represent each of them. Common types of histogram are:
Uniform histogram
Symmetric histogram
Bimodal histogram
Probability histogram
A probability histogram is a pictorial representation of a discrete probability distribution. It consists of a rectangle centered on every value of x, with the area of each rectangle proportional to the probability of the corresponding value. Construction begins by selecting the classes. The heights of the bars of the histogram are the probabilities of the outcomes.
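As an illustration, the sketch below builds the probability histogram for the sum of two fair six-sided dice. The dice example and the function name two_dice_distribution are assumptions chosen for the illustration; they do not come from the text above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Return {sum: probability} for the sum of two fair six-sided dice."""
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


dist = two_dice_distribution()
for total, prob in dist.items():
    # Each bar has width 1 and is centered on `total`, so its height
    # (the probability) equals its area.
    print(f"{total:>2} {'#' * int(prob * 36)} {prob}")
```

Because every bar has width 1, the heights add up to 1, as the probabilities of a discrete distribution must.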
The applications of histograms can be seen when we learn about different distributions.
Normal distribution: The usual bell-shaped pattern is termed the normal distribution. In a normal distribution, data points are as likely to occur on one side of the average as on the other. Note that other distributions can look similar to the normal distribution, so statistical calculations must be used to establish that a distribution is normal. Note also that the term “normal” names a specific distribution and does not mean what is typical for a process. For instance, many processes have a natural limit on one side and produce skewed distributions. This is normal, meaning typical, for those processes, even though the distribution is not a normal distribution.
Skewed distribution: A skewed distribution is asymmetrical because a natural limit prevents outcomes on one side. The peak of the distribution is off-center towards the limit, and a tail extends away from it. For instance, a distribution of purity analyses of a very pure product would be skewed, because the product cannot be more than 100 per cent pure. Other examples of natural limits are holes that cannot be smaller than the diameter of the drill, or call-handling times that cannot be less than zero. These distributions are termed right-skewed or left-skewed according to the direction of the tail.
Multimodal (plateau) distribution: The multimodal distribution is also known as the plateau distribution. It arises when several processes with normal distributions are combined. Because there are many peaks close together, the top of the distribution has the shape of a plateau.
Edge peak distribution: This distribution resembles the normal distribution except that it has a large peak at one tail. It is usually caused by faulty construction of the histogram, with data lumped together into a group labelled “greater than”.
Comb distribution: In this distribution, tall and short bars alternate. It usually results from rounded-off data and/or an incorrectly constructed histogram. For instance, temperature data rounded off to the nearest 0.2° would show a comb shape if the bar width of the histogram were 0.1°.
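The comb shape caused by rounding can be reproduced with a short sketch. To avoid floating-point binning problems, readings are stored as whole tenths of a degree. The function name comb_histogram and the readings are hypothetical.

```python
from collections import Counter


def comb_histogram(readings_tenths):
    """Bin readings (in tenths of a degree) into bars 0.1 degree wide,
    after rounding each reading to the nearest 0.2 degree.

    Ties (odd tenths) are resolved by Python's round(), which rounds a
    half to the nearest even integer.
    """
    rounded = [2 * round(r / 2) for r in readings_tenths]  # even tenths only
    counts = Counter(rounded)
    low, high = min(rounded), max(rounded)
    return [(t, counts.get(t, 0)) for t in range(low, high + 1)]


# Hypothetical readings from 20.0 to 21.0 degrees, one per 0.1 degree.
readings = list(range(200, 211))
bars = comb_histogram(readings)
for tenths, n in bars:
    print(f"{tenths / 10:.1f} {'#' * n}")
```

Every second bar is empty because no rounded value can fall in it, which produces the alternating comb pattern. Widening the bars to 0.2° would remove it.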
Heart cut distribution: This distribution resembles a normal distribution with the tails cut off. The producer may be manufacturing a normally distributed product and then relying on inspection to separate what lies within the specification limits from what lies outside. The resulting shipment to the end-user, taken from within the specifications, is the heart cut.
Dog food distribution: This distribution is missing something, namely the results near the average. If an end-user receives this distribution, someone else is receiving a heart cut distribution and this end-user is left with the “dog food”, the odds and ends left over after the master’s meal. Even though what the end-user receives is within the specification limits, the items fall into two clusters, one near the upper specification limit and one near the lower specification limit. This variation often causes problems in the end-user’s process.
Q1. Are histogram and bar chart the same?
No, histograms and bar charts are different. In the bar chart, each column represents the group which is defined by a categorical variable, whereas in the histogram each column is defined by the continuous and quantitative variable.
Q2. Which histogram represents the consistent data?
The uniform-shaped histogram shows consistent data. In a uniform histogram, the frequency of each class is similar to that of the others. In most cases, the data values in a uniform-shaped histogram may be multimodal.
Q3. Can a histogram be drawn for the normally distributed data?
Yes, a histogram can be drawn for normally distributed data. A normal distribution should be perfectly symmetrical around its center. This means that the right side should be the mirror image of the left side about the center, and vice versa.
Q4. When is a histogram skewed to the right?
A histogram is skewed to the right if most of the data values lie on the left side of the histogram and the tail extends to the right. When the data are skewed to the right, the mean is typically larger than the median of the data set.
Q5. When is a histogram skewed to the left?
A histogram is skewed to the left if most of the data values lie on the right side of the histogram and the tail extends to the left. In this case, the mean is typically smaller than the median of the data set.
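The relationship between mean and median described in Q4 and Q5 can be checked numerically. The sketch below is a rule of thumb only, not a formal skewness test. The function name skew_direction and both samples are hypothetical.

```python
from statistics import mean, median


def skew_direction(data):
    """Guess the skew direction by comparing the mean with the median.

    Rule of thumb only: mean > median suggests a right skew,
    mean < median suggests a left skew.
    """
    m, med = mean(data), median(data)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "none"


# Hypothetical samples (illustration only).
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
left_tailed = [1, 17, 18, 18, 19, 19, 19, 20]
print(skew_direction(right_tailed), mean(right_tailed), median(right_tailed))
print(skew_direction(left_tailed), mean(left_tailed), median(left_tailed))
```

In the first sample the single large value 20 pulls the mean above the median. In the second, the single small value 1 pulls the mean below it.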
Scatter Plot
What are Scatter Plots?
A graph that shows the relationship between two data sets is called a scatter plot. The two data sets are graphed as ordered pairs in a coordinate plane, so that trends in the data become visible on the graph. A relationship between two data sets is called a correlation, and scatter plots are used to show the correlation between the data.
The relationship between data sets depicted by a scatter plot can be:
Positive Linear:
All points are located near a straight line such that ‘y’ increases as ‘x’ increases.
Negative Linear :
All points are located near a straight line such that ‘y’ decreases as ‘x’ increases.
Non-linear :
The data points form the shape of a curve.
None (or No relationship):
There is no pattern or shape formed by the data points.
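The linear cases can be told apart numerically with Pearson's correlation coefficient r, which is close to +1 for a positive linear pattern and close to -1 for a negative one. The sketch below is a minimal illustration. The function names, the data, and the cut-off of 0.8 are all assumptions, and r alone cannot separate a non-linear relationship from no relationship.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)


def describe(xs, ys, threshold=0.8):
    """Label the pattern; the threshold of 0.8 is an arbitrary assumption."""
    r = pearson_r(xs, ys)
    if r >= threshold:
        return "positive linear"
    if r <= -threshold:
        return "negative linear"
    return "no clear linear relationship"


# Hypothetical data sets (illustration only).
xs = [1, 2, 3, 4, 5]
rising = [2.1, 3.9, 6.2, 8.0, 9.8]
falling = [9.8, 8.0, 6.2, 3.9, 2.1]
curve_x = [-2, -1, 0, 1, 2]
curve_y = [4, 1, 0, 1, 4]  # y = x squared: a clear curve, yet r is 0

print(describe(xs, rising))
print(describe(xs, falling))
print(describe(curve_x, curve_y))
```

The last case shows why the scatter plot itself should always be inspected: the points lie exactly on a curve, but the coefficient reports no linear relationship.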