Problem-Solving and Data Analysis

Scale Drawing

Scaled representations allow large measurements to be shown on paper or in models. Housing developers present their proposed projects as scaled 3-dimensional models because doing so is practical and effective. Engineers use scale drawings in their blueprints.

In preparing for the SAT exam, be familiar with scale drawings. They are useful in solving single-step or multi-step problems. Not every question will involve calculation; some ask you to interpret a scale drawing. Use scales in the same manner that you use ratios and proportions.

A scale of 1:10 used for a building model measuring 5 ft x 6 ft at the base means that the actual building will have a base of 50 ft x 60 ft.
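The 1:10 example can be sketched in a few lines of Python. The function and variable names are illustrative only. The last two lines show a point that scale questions often test: lengths scale by the factor, but areas scale by its square.

```python
SCALE_FACTOR = 10  # a 1:10 scale: 1 unit on the model stands for 10 real units

def actual_length(model_length, scale_factor=SCALE_FACTOR):
    """Convert a model measurement to the real-world measurement."""
    return model_length * scale_factor

model_base = (5, 6)  # feet, from the example above
actual_base = tuple(actual_length(side) for side in model_base)
print(actual_base)  # (50, 60)

# Areas scale by the square of the scale factor.
model_area = model_base[0] * model_base[1]     # 30 square feet
actual_area = actual_base[0] * actual_base[1]  # 3000 square feet
print(actual_area // model_area)  # 100
```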

Rate and Unit Rate

The concept of rate appears in questions on distance traveled over time (kilometers per hour, meters per second, etc.), work done per unit of time, cost per unit area, density, and similar topics. Almost all of them require the ability to manipulate units and convert them if necessary.

A rate is a special kind of ratio that compares a quantity measured in one unit with a quantity measured in another unit. For instance, speed (a rate) is a measure of distance over a measure of time, that is:

Speed = Distance / Time

Distance = Speed * Time

This is usually written in a formula most students are familiar with:

Distance = Rate * Time

For example, a car travels 45 miles in 1.5 hours. This is a rate showing the quantity measured as 45 miles over another quantity measured, 1.5 hours.

Speeds, however, are generally not expressed that way. Instead, we use unit rates, even if we don’t always recognize the term. A unit rate expresses the number of units of the first quantity per 1 unit of the second quantity.

Thus, the unit rate for the above example is 30 miles per 1 hour or 30 mph, which is how we commonly refer to speeds instead of 45 miles per 1.5 hours.
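As a small sketch of the same arithmetic, the snippet below turns the 45 miles in 1.5 hours into a unit rate and then reuses Distance = Rate * Time. The 2.5-hour trip is a made-up value for illustration.

```python
def unit_rate(quantity, per):
    """Units of the first quantity per 1 unit of the second quantity."""
    return quantity / per

speed_mph = unit_rate(45, 1.5)  # 45 miles in 1.5 hours
print(speed_mph)  # 30.0

# Distance = Rate * Time, using a hypothetical 2.5-hour trip at the same speed.
distance_miles = speed_mph * 2.5
print(distance_miles)  # 75.0
```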

Mean, Median, Mode, Range, and Standard Deviation

In statistics, data sets are described using measures of central tendency and measures of spread. The measures of central tendency represent the typical value of data in a set, while the measures of spread show how much the values in the set vary.

Measure of Center

There are three basic measures of central tendency that a student taking the SAT exam must know. These are the mean, median, and mode. Let’s illustrate these central values for the following data set.

Twelve children have the following heights measured in centimeters:

100.5, 98.0, 98.5, 98.4, 98.7, 100.0, 100.4, 100.7, 104.0, 98.8, 98.0, 98.5

For this data set, the mean is approximately 99.5. It is determined by adding all the values and dividing the sum by the number of values.

Mean = (100.5 + 98.0 + 98.5 + 98.4 + 98.7 + 100.0 + 100.4 + 100.7 + 104.0 + 98.8 + 98.0 + 98.5)/12 = 1194.5/12 ≈ 99.5

The median is 98.75. To determine the median, the values must first be arranged in ascending order. The value in the middle is the median. In this sample set, there happen to be two middle numbers — 98.7 and 98.8. We take their sum and divide it by two to get the median.

98.0, 98.0, 98.4, 98.5, 98.5, 98.7, 98.8, 100.0, 100.4, 100.5, 100.7, 104.0

Median = (98.7 + 98.8)/2 = 98.75

The value that appears most often is the mode. A data set can have more than one mode, as is the case in this bimodal set. The modes are 98.0 and 98.5, each of which appears twice.

98.0, 98.0, 98.4, 98.5, 98.5, 98.7, 98.8, 100.0, 100.4, 100.5, 100.7, 104.0
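The three measures of center for the height data can be checked with Python's standard statistics module. This is only a verification sketch; the SAT expects these to be worked by hand or with a calculator.

```python
import statistics

# Heights of the twelve children, in centimeters (from the data set above).
heights = [100.5, 98.0, 98.5, 98.4, 98.7, 100.0,
           100.4, 100.7, 104.0, 98.8, 98.0, 98.5]

mean = statistics.mean(heights)        # sum of values / number of values
median = statistics.median(heights)    # average of the two middle values
modes = statistics.multimode(heights)  # every value tied for most frequent

print(round(mean, 1))    # 99.5
print(round(median, 2))  # 98.75
print(modes)             # [98.0, 98.5]
```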

Shape, Center, and Spread

How the values vary in a data set is determined by measures of spread. Two of the most common measures of spread are the range and standard deviation. It is important to know what these measures are and what they imply in the data set.

The range of a data set is the difference between the largest and the smallest value. It shows the spread or span of all the data.

In the data set above, the range of children’s height is 6 cm (104 cm − 98 cm). What does that imply if the range for another group of children is 2 cm? In the SAT exam, questions similar to this will be asked.

The height measurements of the children in the group with the range of 6 cm show greater variation compared to those in the second group. A smaller range (2 cm) means that the height measurements of the children in this group are closer, and there are no two children with a height variation of more than 2 cm.

Standard deviation is another measure of spread. It measures how far away from the mean the values are in a set. Standard deviation (SD) is computed by taking the square root of the variance of a data set. The variance is the average of the squared differences of each value from the mean.

Fortunately, the SAT will not ask you to compute the SD. It will be enough for you to understand what it is, and what it means for a data set. In the given example on height measurements, the standard deviation is 1.6 cm. A question may provide this information and instead ask this question: “How many students have heights within one standard deviation of the mean?”

Within one standard deviation of the mean refers to the measurement or value 1.6 cm above or below the mean. So you will need to check the data set for values falling within 99.5 ± 1.6 cm, and count how many there are.
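The counting step can be sketched as follows. The sketch uses the rounded mean (99.5) and standard deviation (1.6) quoted above. It also computes the standard deviation with `statistics.pstdev`, which divides by the number of values as in the definition given earlier; whether the guide used that version or the sample version is an assumption here.

```python
import statistics

heights = [100.5, 98.0, 98.5, 98.4, 98.7, 100.0,
           100.4, 100.7, 104.0, 98.8, 98.0, 98.5]

# Population standard deviation: square root of the average squared deviation.
sd = statistics.pstdev(heights)
print(round(sd, 2))  # about 1.65

# Count values within one standard deviation of the mean, using the
# rounded figures from the text (mean 99.5 cm, SD 1.6 cm).
low, high = 99.5 - 1.6, 99.5 + 1.6
within_one_sd = [h for h in heights if low <= h <= high]
print(len(within_one_sd))  # 11 (every height except 104.0)
```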

The shape of data can be symmetrical or asymmetrical. When the values in a data set are evenly spread out and the mean is close in value to the median, the data is said to have a symmetric shape. When the values cluster in one area, they appear in the graph as a head. The values taper off toward zero either to the left or to the right of the head and form the tail. We call these data sets asymmetric or skewed because the center is shifted to either the right or the left. When the mean is greater than the median, the graph of the data is skewed to the right (the tail is to the right of the head). When the mean is less than the median, the graph is skewed to the left (the tail is to the left of the head).
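The mean-versus-median comparison can be applied to the height data from earlier in this section. This is a rule-of-thumb sketch, not a formal test of skewness.

```python
import statistics

heights = [100.5, 98.0, 98.5, 98.4, 98.7, 100.0,
           100.4, 100.7, 104.0, 98.8, 98.0, 98.5]

mean = statistics.mean(heights)      # about 99.5
median = statistics.median(heights)  # 98.75

# Rule of thumb from the text: compare the mean with the median.
if mean > median:
    shape = "skewed right"
elif mean < median:
    shape = "skewed left"
else:
    shape = "roughly symmetric"

print(shape)  # skewed right (the 104.0 value pulls the mean upward)
```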

In the SAT exam, it is more important for you to understand the meaning of a measure of spread to a given data set than to know how to compute it.

Outliers

In a given data set, values that lie far from the main group (much smaller or much larger than most of the values) are called outliers. Outliers affect the mean, although not so much the median or mode.

Ten students were being given special coaching by their teacher to improve their performance in class. Prior to the coaching, the mean raw score of these students in every test has never exceeded 76. It has been a month, and the teacher wants to know if the coaching is making progress.

In the latest test, these were the raw scores:

82, 82, 83, 15, 84, 83, 80, 80, 81 and 82

The raw score of 15 is an extreme value and is clearly an outlier.

The mean raw score for the latest test is 75.2. Does it mean that the teacher’s coaching failed? It must be noted that both the median and the mode are equal to 82.

Before drawing conclusions, the reasons for the outliers must be examined. Outlying values are sometimes removed if there is a justifiable reason. Another solution is to use the median or mode instead of the mean because they are less affected by the outlier.
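The effect of the outlier on each measure can be checked with a short sketch. Dropping the score of 15 is shown only to illustrate the comparison; whether removal is justified depends on the reason for the outlier, as noted above.

```python
import statistics

scores = [82, 82, 83, 15, 84, 83, 80, 80, 81, 82]

print(statistics.mean(scores))    # 75.2
print(statistics.median(scores))  # 82.0
print(statistics.mode(scores))    # 82

# The same measures with the outlier (15) left out, for comparison only.
without_outlier = [s for s in scores if s != 15]
mean_without = statistics.mean(without_outlier)
print(round(mean_without, 1))              # 81.9
print(statistics.median(without_outlier))  # 82
```

The mean moves by more than six points, while the median stays at 82.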

Concept of Density

Density is the amount of matter (or mass) in a unit volume. For example, gold has a density of about 19.3 g/cm³, aluminum has a density of about 2.70 g/cm³, and water has an approximate density of 1 kg/L.

If the densities of certain objects and substances are known, their mass can be computed given a certain volume, or the other way around.
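A minimal sketch of both directions of the calculation follows, using the approximate density of water given above. The volumes and masses are made-up values for illustration.

```python
def mass_from_volume(density, volume):
    """Mass = density * volume (units must be consistent)."""
    return density * volume

def volume_from_mass(density, mass):
    """Volume = mass / density (units must be consistent)."""
    return mass / density

WATER_KG_PER_L = 1.0  # approximate density of water, from the text

print(mass_from_volume(WATER_KG_PER_L, 2.5))   # 2.5 kg in 2.5 L of water
print(volume_from_mass(WATER_KG_PER_L, 0.75))  # 0.75 L holds 0.75 kg of water
```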

Scatterplot, Box-and-Whisker Plot, and Histogram

It is important to know how to read graphical representations of data. Three of the types of graphs commonly seen in SAT exams are the scatterplot, box-and-whisker plot, and histogram.

Scatterplot

A scatterplot is also referred to as an XY plot.

A scatterplot is usually the graph type of choice for showing the relationship between bivariate data. The data or values are plotted on the graph as x, y coordinates with x as the independent variable and y as the dependent variable.

A point in the graph represents two values. For instance, point P represents 1.5 hours of tutorial (the x value) and a score of 68 (the y value). Viewing the whole scatterplot, we see that as the number of hours spent on the tutorial increases, the student’s score also increases. The related topics of the best-fit line or curve and correlation are taken up under a separate heading below.

Box and Whisker Plot

A box-and-whisker plot, also called a boxplot, is made up of a rectangular box with a horizontal line (a whisker) extending from each end. It looks like this:

A box-and-whisker plot breaks the data into quartiles. In the graph, the first vertical line represents the first quartile (Q1), the vertical line within the box marks the second quartile (Q2) or the median of the data, and the third vertical line represents the third quartile (Q3). Points A, B, C, and D are only marked for our purposes. The tip of the horizontal line marked as A is the smallest value in the data set, while B on the other tip is the largest value in the data set. There are cases, however, when there are outliers in the data set. These values are represented as dots disconnected from the plot, such as points C and D.

The median of the data set is 35. Without the outliers, the range is 28 (49 − 21). The range describes the spread of all the data. With the outliers, the range will be quite large, around 50. You may also determine the interquartile range (IQR), or the range of the middle half of the data. From the plot, the IQR is about 14 (Q3 − Q1).

A box-and-whisker plot can be skewed to the right, meaning most of the observations are on the left side, pulling the box to the left, and the longer whisker is stretched to the right. Or it can be skewed to the left, with most of the observations to the right.

Histogram

A histogram is a graph that uses columns or bars on an x-y plane to show how the values in a data set are distributed across intervals. The labels on both the x and y axes represent quantitative data, such as the number of athletes in a high school counted according to different height ranges.

The histogram shows the frequency with which each height range occurs in the data set. It is skewed to the right, which means that most of the athletes are on the shorter end of the scale (52 athletes have height measurements between 165 cm and 180 cm), with fewer athletes on the taller end.

Please note that on the SAT exam, the range of values in a histogram follows this convention: each bar in the histogram includes the end value on the left and excludes the end value on the right of the range. So a range of 165 − 170 cm includes all values within the range, including 165 but excluding 170.
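The left-inclusive, right-exclusive convention can be sketched as below. The heights are hypothetical, and the first bar is assumed to start at 165 cm with a width of 5 cm, matching the 165 − 170 cm example.

```python
from collections import Counter

BIN_START = 165  # cm, left edge of the first bar (assumed)
BIN_WIDTH = 5    # cm, matching the 165-170 example above

def bin_for(height_cm):
    """Return the (left, right) edges of the bar that contains height_cm.

    The left edge is included and the right edge is excluded.
    """
    index = int((height_cm - BIN_START) // BIN_WIDTH)
    left = BIN_START + index * BIN_WIDTH
    return (left, left + BIN_WIDTH)

heights = [165, 169.9, 170, 172.5, 175]  # hypothetical measurements
bars = Counter(bin_for(h) for h in heights)

print(bars[(165, 170)])  # 2 (165 and 169.9)
print(bars[(170, 175)])  # 2 (170 and 172.5; 170 is not in the first bar)
print(bars[(175, 180)])  # 1
```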

By contrast, a graph showing the number of a school’s athletes (label on the y-axis) in its basketball, volleyball, soccer, and swimming teams (label on the x-axis) is a bar graph because it represents categorical data, not quantitative data. Since the labels are categorical, we cannot appropriately refer to a bar graph’s skewness, or to its low or high end.

Two-Way Table

Two-way tables are usually used to present survey results in tabular form. The columns show the count or number for one category, while the rows indicate another category. The two-way table below shows the result of a survey conducted on 125 college students on the brand of beverage they prefer to drink during lunch break.

Many questions can be answered by directly finding the correct cell, such as, “Which brand is the least preferred by male students?” The table can also be used to compute answers which may not be readily provided.

Be careful when answering questions that may initially look too simple. You may be asked, “How many students prefer Brand C or are female?” Since there are 28 students who prefer Brand C and 60 students who are female, it is tempting to answer 88 right away. However, doing so counts the ten females who prefer Brand C twice. So subtract that number from the sum, and we have the correct answer of 78.
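The double-counting correction is a one-line calculation. The variable names are illustrative; the three counts are the ones quoted in the paragraph above.

```python
prefer_c = 28      # students who prefer Brand C
female = 60        # students who are female
female_and_c = 10  # females who prefer Brand C (counted in both groups)

# Add the two groups, then subtract the overlap once.
c_or_female = prefer_c + female - female_and_c
print(c_or_female)  # 78
```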

Conditional and Relative Frequency; Conditional Probability

The two-way table given above as an example is presented as a frequency table, so called because it shows the frequency, or count, with which an event occurs. In the context of the example, the frequency is the number of times that a particular brand of beverage was chosen by the participating students. The numbers in the inner cells are called the frequencies or counts.

Table 1: Frequency Table

All the numbers in the table are called frequencies: those in the Total column and Total row are called marginal frequencies, and those in the inner cells are called joint frequencies. Looking at the marginal frequencies alone, it would seem that Brand B is the least preferred beverage. Looking at the joint frequencies, however, it is apparent that Brand C is the least favored beverage among females.

This data set can also be shown as a relative frequency table, such as the one below. It shows the frequency of an event relative to the total number of events, hence the term relative frequency. The decimal numbers in the inner cells are called conditional frequencies here, although frequencies computed against the grand total, as in Table 2, are more precisely joint relative frequencies.

Table 2: Relative Frequency Table

Note: We are showing the division of terms to illustrate the procedure, although relative frequency tables are normally shown with just the resulting decimal numbers.

Relative frequencies can be shown for the whole table, such as the one just illustrated. Relative frequencies may also be presented for rows and columns.

Table 3: Relative Frequencies for Rows

Table 4: Relative Frequencies for Columns

Your understanding of these concepts will usually be tested in the SAT exam, along with the concept of probability.

Here are some probability questions in the format that usually appears in SAT exam questions. The solutions are usually simple, but the questions need a little getting used to.

Question 1:

Referring to Table 1, what is the probability of randomly selecting a male participant who prefers Brand B?

P = 5/125 = 0.04

(Take note that 0.04 is a conditional frequency in Table 2.)

The concept of probability and formulas related to it will be discussed further in its appropriate heading below.

Question 2:

What is the probability of randomly selecting a male participant, given that the participant prefers Brand B?

P = 5/20 = 0.25

(Again, take note that 0.25 is a conditional frequency in Table 4)

Question 3:

What is the probability of randomly selecting a participant who prefers Brand B, given that the participant is male?

P = 5/65 = 0.08

(And isn’t this the same conditional frequency in Table 3?)

The conditional frequencies in Table 2 show the probability that a randomly selected participant is of a particular gender and prefers a particular brand of beverage.

The conditional frequencies in Table 3 show the probability of each gender preferring a particular brand of beverage, e.g., the probability that male students will prefer Brand A is 0.65, while the probability that female students will prefer Brand B is 0.25.

The conditional frequencies in Table 4 show the probability of a brand being preferred by a particular gender, e.g., the probability that those who prefer Brand A will be male is 0.55, while the probability that those who prefer Brand B will be female is 0.75.

Tip: In the test, always be aware that the term “given” in this type of question gives the question a whole new meaning. Also, it will not always be necessary to prepare the relative frequency tables in order to answer questions like the three provided examples. We only wanted to show how the concepts are related.
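The three questions can be worked through in code. Tables 1 to 4 are not reproduced in this text, so the counts below are reconstructed from the figures quoted above (125 students, 65 males, 60 females, 20 for Brand B, 28 for Brand C, and the cells 5 and 10). The Brand A cells (42 and 35) are inferred by subtraction and should be treated as an assumption.

```python
# Reconstructed joint frequencies (see the assumptions stated above).
counts = {
    "male":   {"A": 42, "B": 5,  "C": 18},
    "female": {"A": 35, "B": 15, "C": 10},
}

grand_total = sum(sum(row.values()) for row in counts.values())
row_total = {gender: sum(row.values()) for gender, row in counts.items()}
col_total = {brand: sum(counts[g][brand] for g in counts) for brand in "ABC"}

# Question 1: male AND prefers Brand B, out of all participants.
p_male_and_b = counts["male"]["B"] / grand_total
# Question 2: male, GIVEN that the participant prefers Brand B.
p_male_given_b = counts["male"]["B"] / col_total["B"]
# Question 3: prefers Brand B, GIVEN that the participant is male.
p_b_given_male = counts["male"]["B"] / row_total["male"]

print(p_male_and_b)              # 0.04
print(p_male_given_b)            # 0.25
print(round(p_b_given_male, 2))  # 0.08
```

The numerator is the same in all three questions; only the denominator changes with the word “given.”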

Line and Curve of Best Fit

From the previous example of a scatterplot, the “line of best fit” can be drawn. It is useful when describing the trend and when making estimates or projections by interpolation or extrapolation. From this line, the best-fit equation or regression equation can then be determined using algebra (straight lines and linear equations).

The equation for the line of best fit, however, will not always be linear. It can take a quadratic or exponential model for its curve of best fit.

We say that there is a high positive correlation between the two variables because as one variable increases, the other also increases.

Questions on the SAT often show scatterplots and ask for a description of the correlation between the variables. It is therefore important to know the difference between a perfect positive (or negative) correlation, high positive (or negative) correlation, low positive (or negative) correlation, and no correlation.

Linear Growth vs Exponential Growth

For the set of variables shown in the scatterplot above, their relationship is best modeled by a linear function. Using the points and solving algebraically, we find the slope to be +6 and the linear equation to be:

y = 6x+61

What is the best interpretation of the slope? What is the best interpretation of the y-intercept? Questions similar to these will be asked on the SAT exam. The slope suggests that for every additional hour in a math tutorial program, the student’s test score increases by 6 points. The y-intercept indicates that without any tutorial (x = 0), the student’s predicted score is 61.
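A short sketch of the model makes both interpretations concrete. The function name is illustrative, and the hour values are arbitrary.

```python
def predicted_score(hours):
    """Best-fit line from the text: y = 6x + 61."""
    return 6 * hours + 61

# y-intercept: the predicted score with no tutorial time (x = 0).
print(predicted_score(0))  # 61

# Slope: each additional hour adds 6 points to the predicted score.
print(predicted_score(3) - predicted_score(2))  # 6
```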

There is a linear relationship when the difference in values (increase or decrease) is constant. However, when the difference in values is not constant, but the ratio of adjacent values is constant, we refer to the relationship as exponential (growth or decay). Classic examples of this concept are the growth of bacteria, a population increase of rabbits, and compound interest.

The general formula for exponential growth or decay is:

y(t) = A * e^(kt)

where:

y(t) = value at time t,

A = initial value,

k = rate of growth (if k > 0) or rate of decay (if k < 0),

and

t = time
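The constant-ratio property of this formula can be seen in a small sketch. The values A = 100 and k = 0.5 are hypothetical and chosen only for illustration.

```python
import math

def exponential_value(initial, k, t):
    """y(t) = A * e^(kt), with A = initial value and k = growth or decay rate."""
    return initial * math.exp(k * t)

# Hypothetical model: A = 100, k = 0.5, evaluated at t = 0, 1, 2, 3, 4.
values = [exponential_value(100, 0.5, t) for t in range(5)]

# The ratio of adjacent values is constant (e^k) ...
ratios = [later / earlier for earlier, later in zip(values, values[1:])]
# ... while the difference between adjacent values keeps growing.
differences = [later - earlier for earlier, later in zip(values, values[1:])]

print([round(r, 4) for r in ratios])  # every ratio is about 1.6487
print(differences == sorted(differences))  # True: differences increase
```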

This topic is given more depth in the Heart of Algebra section. For the PSDA section, it will be enough to understand the meaning of these concepts in relation to data, such as those given graphically.

Independent and Associated Events

An event is independent of another if the probability of it happening is not affected by that other event. More generally, two events are independent if the probability of each one occurring is not affected by the occurrence of the other.

When a die is thrown, that is an event. Its result is independent of other dice thrown before or after it. In a European roulette wheel, the probability of a number appearing will always be 1/37, no matter the number of times the wheel is spun. Each spin is an independent event and not affected by the other spins.
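Independence can be checked by listing every outcome of two dice throws. The sketch below compares the probability that the second die shows a 6 with the same probability once we know that the first die showed a 6.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing a die twice.
outcomes = list(product(range(1, 7), repeat=2))

# Probability that the second throw is a 6, with no other information.
p_second_six = Fraction(
    sum(1 for first, second in outcomes if second == 6), len(outcomes)
)

# Same probability, restricted to outcomes where the first throw was a 6.
first_was_six = [(first, second) for first, second in outcomes if first == 6]
p_second_six_given_first_six = Fraction(
    sum(1 for first, second in first_was_six if second == 6), len(first_was_six)
)

print(p_second_six)                  # 1/6
print(p_second_six_given_first_six)  # 1/6, unchanged, so the throws are independent
```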

Associated events refer to variables or events that have a relationship or connection. They may also be referred to as correlated variables. Relationships can be causal (one variable causing the change in the other variable), and variables can be quantitative or categorical.

The example on the scatterplot earlier shows associated events—increase in hours for math tutorial and increase in scores. There clearly was an association or a high positive correlation between the two variables in that example.

Population Parameter

A population is a group of entities or events with a common characteristic. It often refers to a group of people, although it may refer to other entities, as well. Examples of a population are:

all the students in Ocean Springs High School
all musicians in Oregon
all subscribers of a daily paper in Maine

A parameter or population parameter is a characteristic of a population expressed using a numerical value. Examples of a population parameter:

the average height of students in Ocean Springs High School
the percentage of musicians in Oregon who are self-employed
the average income of subscribers of a daily paper in Maine

Measurement Error and Margin of Error

When estimating a population parameter based on a sample statistic, it is expected that the resulting estimate will not be the exact, or true, value. What can be expected from a completely randomized sampling, instead, is the closest estimate to the true value. A margin of error is often used to describe the precision of such an estimate.

On the SAT exam, you will not be asked to calculate margins of error. They usually appear as part of the given information and you are expected to understand their implication to the question.

If the sample mean height is computed to be 121 cm and the SAT question provides the information that there is a margin of error of 1.3 cm, it means that the population’s true mean height is estimated to fall within 121 ± 1.3 cm, that is, between 119.7 cm and 122.3 cm.

Here are things to remember about the margin of error:

A large margin of error can be decreased by increasing the sample size.
The larger the standard deviation, the larger the margin of error.
The margin of error applies to the true value of the parameter (e.g., the population mean) for the entire population.
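The first two points can be illustrated with a sketch. The formula used, z * SD / sqrt(n) with z = 1.96 for roughly 95% confidence, is the usual large-sample margin of error for a mean. It is not given in this guide and is not needed on the SAT, so treat it as an assumption. The standard deviation of 6.6 cm and the sample sizes are hypothetical.

```python
import math

def interval(estimate, margin):
    """Lower and upper bounds of estimate ± margin, rounded to 1 decimal."""
    return (round(estimate - margin, 1), round(estimate + margin, 1))

print(interval(121, 1.3))  # (119.7, 122.3)

def margin_of_error(sd, n, z=1.96):
    """Assumed large-sample formula: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)

# A larger sample gives a smaller margin (hypothetical sd = 6.6 cm).
for n in (25, 100, 400):
    print(n, round(margin_of_error(6.6, n), 2))

# A larger standard deviation gives a larger margin for the same n.
print(margin_of_error(13.2, 100) > margin_of_error(6.6, 100))  # True
```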

Confidence Interval

A confidence interval describes both the degree of accuracy and the uncertainty of an estimated value. On the SAT exam, confidence intervals will be given, usually at a 95% confidence level.

What does it mean when the sample mean height is 121 cm and the margin of error is 1.3 cm at 95% confidence level?

This statement can be interpreted as:

There is 95% confidence that the true average height for the entire population is within the interval 119.7 cm to 122.3 cm. If random samples of the same size were drawn repeatedly and an interval were computed from each using the same method, about 95% of those intervals would contain the true average height.

Important note:

The confidence interval applies to the parameter (e.g., the mean height of the entire population) and not to the value of the other variable (e.g., the number of individuals). In other words, the illustration above cannot be interpreted as: 95% of the population have a height between 119.7 cm and 122.3 cm.

Univariate vs Bivariate Data

Univariate data refers to data sets with one type of variable, such as the number of hot beverages sold by a café. The variable is the number of each type of hot beverage sold. It can be shown in this data set:

Bivariate data refers to data sets with two types of variables. If the café owner wanted to find a relationship between their sales on a particular day and the temperature on that day, bivariate data can be gathered, such as their sales of the five hot beverages versus the temperature for each day of the week. The variables are the sales and the temperature. It will look something like this:

Linear, Quadratic, and Exponential Relationships

Variables have a linear relationship when one changes at a constant rate with respect to the other. As one variable increases, the other increases (or decreases) by the same amount for each unit of change. The difference between two adjacent values is constant. When plotted, this is represented by a straight line sloping up or down.

A U-shaped graph facing either upward or downward indicates a quadratic relationship. The rate of change is variable. There’s either a maximum or minimum value which is seen in the graph as the vertex.

A graph that rises slowly at first and then ever more steeply (growth), or falls steeply at first and then levels off (decay), indicates an exponential relationship. An exponential curve does not have a vertex.

Variability

Statistics computed from a sample are estimates used to describe a population. Their numerical values are not the exact actual values but only the closest estimates. The variability of an estimate against actual values must be accounted for, and this is done by calculating measures of spread.

The spread or scatter of data in a set is measured in various ways, and the most common measures are: range, interquartile range (IQR), variance, and standard deviation. These are ways of describing spread in relation to the estimated value.

Randomization

A random sample truly represents its population if it was selected by a purely chance method, also called randomization, and no element of the population was excluded from the procedure. By this, we mean that every element of the population has a chance of being included in the sample, and the whole process is protected from bias.

Some of the methods are using random numbers (e.g., a random number table or a random number generator), flipping a coin, or throwing a die.

These are important things to remember:

Random sampling is necessary so that the result of an experiment can be generalized to the entire population.
Random assignment of the subjects to different treatments is also necessary to ensure that all subjects started under generally the same condition before they were subjected to any treatment. This makes it appropriate to draw conclusions about the cause and effect of each treatment.
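The two steps, random sampling and then random assignment, can be sketched with Python's random module. The population of 200 labeled students, the sample size of 20, and the two group names are all hypothetical. The fixed seed is there only so that the sketch gives the same result each time it is run.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility of this sketch only

# Hypothetical population of 200 students.
population = [f"student_{i}" for i in range(1, 201)]

# Random sampling: every student has the same chance of being chosen.
sample = rng.sample(population, 20)

# Random assignment: shuffle the sample, then split it into two groups.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment_group = shuffled[:10]
control_group = shuffled[10:]

print(len(sample), len(treatment_group), len(control_group))  # 20 10 10
```

Random sampling supports generalizing to the population, while random assignment supports cause-and-effect conclusions about the treatments.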

In the SAT exam, a question may describe a situation regarding the manner of selecting subjects and the manner of assigning them to treatments. The question may then ask which statements can be appropriately drawn from the experiment.
