Sampling Distributions and Data Descriptions

Sampling distributions and data descriptions are essential concepts in inferential statistics, where the goal is to make generalizations about a population based on sample data. Sampling distributions help us understand how sample statistics behave across many different samples from the same population, and data descriptions provide ways to summarize and interpret the data at hand.

1. Sampling Distributions

Sampling distributions describe the probability distribution of a statistic (such as the sample mean, sample proportion, or sample variance) computed from a sample taken from a population.

Key Concepts in Sampling Distributions

Sample Statistic: A statistic is a summary measure calculated from a sample, such as the sample mean ( $\bar{X}$ ), sample variance ( $s^2$ ), or sample proportion ( $\hat{p}$ ).
Sampling Distribution of a Statistic: The sampling distribution of a statistic (like the sample mean or sample proportion) is the probability distribution that describes how the statistic varies from sample to sample.
Standard Error: The standard error is the standard deviation of a sampling distribution. It measures the variability of a sample statistic from sample to sample.
- Standard error of the sample mean is: $SE_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ where $\sigma$ is the population standard deviation and $n$ is the sample size.
Central Limit Theorem (CLT): The Central Limit Theorem states that, for large enough sample sizes, the sampling distribution of the sample mean (and, equivalently, the sample sum) is approximately normal, regardless of the shape of the population's distribution, provided the observations are independent and the population variance is finite. It does not apply to every statistic: the sample maximum, for example, does not generally become normal. This result is crucial because it allows the normal distribution to be used for inference about the mean even when the population is not normal. A common rule of thumb is $n \geq 30$ , although strongly skewed or heavy-tailed populations may need larger samples.
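The CLT can be checked informally by simulation. The sketch below is an illustration only: the Exponential(1) population, the sample size of 30, the number of repetitions, and the function name `simulate_sample_means` are all assumptions chosen for the demonstration, not part of the course material.

```python
import random
import statistics

def simulate_sample_means(n, num_samples=5000, seed=0):
    """Return the means of num_samples samples of size n drawn from an
    Exponential(1) population (mean 1, standard deviation 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

means = simulate_sample_means(n=30)
center = statistics.fmean(means)   # should be close to the population mean, 1
spread = statistics.stdev(means)   # should be close to 1 / sqrt(30), about 0.183

# Share of sample means within one standard deviation of the center;
# for a normal distribution this share is about 0.68.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means:    {center:.3f}")
print(f"std dev of sample means: {spread:.3f}")
print(f"share within 1 std dev:  {within_one_sd:.3f}")
```

Although the exponential population is strongly right-skewed, the simulated sample means cluster symmetrically around 1 with a spread close to $\frac{\sigma}{\sqrt{n}}$ .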

Types of Sampling Distributions

Sampling Distribution of the Sample Mean:
- When repeated samples are taken from a population and the mean of each is calculated, the sampling distribution of the sample mean ( $\bar{X}$ ) is approximately normal for sufficiently large sample sizes (by the CLT). If the population itself is normal, the sampling distribution is exactly normal for any sample size.
- Key properties:
  - The mean of the sample mean distribution is equal to the population mean, i.e., $E[\bar{X}] = \mu$ .
  - The standard deviation (or standard error) of the sample mean is $SE_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ , where $\sigma$ is the population standard deviation, and $n$ is the sample size.
Example: If we have a population of test scores with a mean of 75 and a standard deviation of 10, the distribution of the sample mean for samples of size 50 will be approximately normal, with a mean of 75 and a standard error of:
$SE_{\bar{X}} = \frac{10}{\sqrt{50}} \approx 1.41$
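The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `standard_error_of_mean` and the threshold of 77 in the follow-up question are hypothetical, introduced only for illustration.

```python
import math
from statistics import NormalDist

def standard_error_of_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

mu, sigma, n = 75, 10, 50            # values from the example above
se = standard_error_of_mean(sigma, n)
print(round(se, 2))                  # 1.41

# Quadrupling the sample size halves the standard error.
se_4n = standard_error_of_mean(sigma, 4 * n)

# Hypothetical follow-up: how likely is a sample mean above 77?
# This relies on the normal approximation to the sampling distribution.
sampling_dist = NormalDist(mu=mu, sigma=se)
p_above_77 = 1 - sampling_dist.cdf(77)
print(round(p_above_77, 3))
```

Because the standard error shrinks with $\sqrt{n}$ rather than $n$ , halving it requires four times as many observations.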
Sampling Distribution of the Sample Proportion:
- The sample proportion ( $\hat{p}$ ) is the proportion of successes in a sample. When the $n$ observations are independent (sampling with replacement, or from a population much larger than the sample), the number of successes $n\hat{p}$ follows a binomial distribution with parameters $n$ and $p$ . For large samples, the sampling distribution of $\hat{p}$ is approximately normal, provided that both $np \geq 10$ and $n(1 - p) \geq 10$ , where $p$ is the population proportion and $n$ is the sample size.
- Key properties:
  - The mean of the sampling distribution of $\hat{p}$ is equal to the population proportion $p$ .
  - The standard error of $\hat{p}$ is given by: $SE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
Example: If the population proportion of people who support a new policy is 0.6, and a sample of 100 is taken, the sampling distribution of the sample proportion will have:
$SE_{\hat{p}} = \sqrt{\frac{0.6(1 - 0.6)}{100}} = \sqrt{\frac{0.24}{100}} \approx 0.049$
Thus, the standard error of the sample proportion is approximately 0.049.
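The same calculation, together with the large-sample condition for the normal approximation, can be sketched as follows. The function names `proportion_standard_error` and `normal_approx_ok` are hypothetical helpers written for this illustration.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of the sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def normal_approx_ok(p, n, threshold=10):
    """Check the rule of thumb n*p >= 10 and n*(1 - p) >= 10."""
    return n * p >= threshold and n * (1 - p) >= threshold

p, n = 0.6, 100                      # values from the example above
se = proportion_standard_error(p, n)
print(round(se, 3))                  # 0.049
print(normal_approx_ok(p, n))        # True: 60 and 40 are both at least 10

# With a small sample the condition fails and the normal
# approximation should not be trusted.
print(normal_approx_ok(p, 15))       # False: 15 * 0.4 = 6 < 10
```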
Sampling Distribution of the Sample Variance:
- If the population is normally distributed, the scaled sample variance $\frac{(n - 1)s^2}{\sigma^2}$ follows a chi-square distribution with $n - 1$ degrees of freedom, where $n$ is the sample size. The sample variance $s^2$ itself is therefore a scaled chi-square variable.
- Key properties:
  - The mean of the sampling distribution of the sample variance is equal to the population variance: $E[s^2] = \sigma^2$
  - For a normal population, the variance of the sample variance is: $\text{Var}(s^2) = \frac{2\sigma^4}{n - 1}$
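Both properties can be checked by simulation. The sketch below assumes a normal population with $\sigma = 2$ and samples of size 10; these values, the number of repetitions, and the helper names are illustrative choices, not taken from the course material.

```python
import random
import statistics

def sample_variance(values):
    """Unbiased sample variance with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def simulate_sample_variances(n, sigma, num_samples=10000, seed=1):
    """Sample variances of num_samples normal samples of size n."""
    rng = random.Random(seed)
    return [
        sample_variance([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(num_samples)
    ]

n, sigma = 10, 2.0                       # illustrative values
variances = simulate_sample_variances(n, sigma)

mean_s2 = statistics.fmean(variances)    # compare with sigma^2
var_s2 = sample_variance(variances)      # compare with 2 * sigma^4 / (n - 1)

theory_mean = sigma ** 2
theory_var = 2 * sigma ** 4 / (n - 1)

print(f"E[s^2]:   simulated {mean_s2:.3f}, theory {theory_mean:.3f}")
print(f"Var(s^2): simulated {var_s2:.3f}, theory {theory_var:.3f}")
```

The simulated values should land close to 4 and roughly 3.56, up to simulation noise.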

2. Describing Data: Measures of Central Tendency and Dispersion

Once we understand sampling distributions, we can describe the data itself. Data descriptions generally include measures of central tendency (to summarize the typical or central value of the data) and measures of dispersion (to describe how spread out the data is).

Measures of Central Tendency

Mean ( $\mu$ or $\bar{X}$ ):
- The mean is the arithmetic average of all the values in the dataset. For a population, it is denoted by $\mu$ , and for a sample, it is denoted by $\bar{X}$ .
- Formula (population mean): $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
- Formula (sample mean): $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median:
- The median is the middle value when the data is ordered from smallest to largest. If the data set has an odd number of observations, the median is the middle value. If the data set has an even number of observations, the median is the average of the two middle values.
Mode:
- The mode is the value that appears most frequently in the dataset. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (multiple modes).
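The three measures can be computed with Python's standard `statistics` module. The `scores` and `bimodal` lists below are made-up data used only to illustrate the definitions.

```python
import statistics

scores = [70, 85, 85, 90, 60, 75]        # hypothetical data

mean = statistics.mean(scores)           # 465 / 6 = 77.5
median = statistics.median(scores)       # sorted: 60 70 75 85 85 90 -> (75 + 85) / 2
modes = statistics.multimode(scores)     # every value tied for most frequent

print(mean)     # 77.5
print(median)   # 80.0
print(modes)    # [85]

# A bimodal dataset has two values tied for the highest frequency.
bimodal = [1, 1, 2, 2, 3]
print(statistics.multimode(bimodal))     # [1, 2]
```

`multimode` is used rather than `mode` so that bimodal and multimodal data return every mode instead of a single value.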

Measures of Dispersion

Range:
- The range is the difference between the maximum and minimum values in the dataset: $\text{Range} = \text{Max} - \text{Min}$
Variance ( $\sigma^2$ or $s^2$ ):
- The variance measures the average squared deviation from the mean. For a population, it is denoted as $\sigma^2$ , and for a sample, it is denoted as $s^2$ .
- Formula (population variance): $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- Formula (sample variance): $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
Standard Deviation ( $\sigma$ or $s$ ):
- The standard deviation is the square root of the variance and provides a measure of the spread of the data in the same units as the data itself. $\sigma = \sqrt{\sigma^2} \quad \text{or} \quad s = \sqrt{s^2}$
Interquartile Range (IQR):
- The IQR measures the spread of the middle 50% of the data, and it is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). $\text{IQR} = Q3 - Q1$
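The dispersion measures above can be sketched with the standard `statistics` module. The `data` list is made up for illustration. Note that quartiles can be computed under several conventions; `statistics.quantiles` defaults to its "exclusive" method, so other software may report slightly different values for Q1 and Q3.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical data, mean 5

data_range = max(data) - min(data)           # 9 - 2 = 7

pop_var = statistics.pvariance(data)         # divides by N:     32 / 8 = 4
samp_var = statistics.variance(data)         # divides by n - 1: 32 / 7
pop_sd = statistics.pstdev(data)             # square root of pop_var
samp_sd = statistics.stdev(data)             # square root of samp_var

# Quartiles using the default "exclusive" method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range)           # 7
print(pop_var, pop_sd)      # 4 2.0
print(round(samp_var, 3))   # 4.571
print(q1, q3, iqr)          # 4.0 6.5 2.5
```

The sample variance is larger than the population variance for the same data because it divides by $n - 1$ instead of $N$ .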

3. Data Distributions and Descriptive Statistics

Histogram: A histogram is a graphical representation of the distribution of data. It divides the data into bins and displays the frequency of observations in each bin.
Boxplot: A boxplot displays the distribution of the data through quartiles and highlights the median, IQR, and potential outliers.
Normal Distribution: If the data follows a normal distribution, the mean, median, and mode are all equal, and the data is symmetric around the mean.
Skewness: Skewness refers to the asymmetry in the distribution of data.
- Right-skewed (positively skewed): The right tail is longer than the left, with most data points concentrated on the left.
- Left-skewed (negatively skewed): The left tail is longer than the right, with most data points concentrated on the right.
Kurtosis: Kurtosis measures the "tailedness" of the data distribution, i.e., how much the distribution deviates from the normal distribution in terms of heavy or light tails.
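Skewness and kurtosis can be estimated from central moments. The sketch below uses the simple moment-based definitions (third and fourth central moments scaled by powers of the second) and reports excess kurtosis, so a normal distribution scores about 0. Statistical packages often apply bias corrections, so their output may differ slightly. The function names and datasets are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment, dividing by the number of observations."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution is near 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]                  # hypothetical datasets
right_skewed = [1, 1, 2, 2, 3, 10]
left_skewed = [-x for x in right_skewed]

print(skewness(symmetric))                   # zero: the data are symmetric
print(skewness(right_skewed) > 0)            # True: long right tail
print(skewness(left_skewed) < 0)             # True: long left tail
print(round(excess_kurtosis(symmetric), 2))  # -1.3: lighter tails than normal
```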

Summary

Sampling distributions describe how sample statistics behave across different samples from a population, allowing us to make inferences about the population.
Common sampling distributions include those for the sample mean, sample proportion, and sample variance.
Data descriptions involve summarizing data using measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation, interquartile range).
Visualizing data with tools like histograms and boxplots can help interpret the distribution and characteristics of the data, such as skewness and kurtosis.
