Chapter 1 Overview

 

 

Statistics: A science that deals with the methods of collecting, organizing, and summarizing data in such a way that valid conclusions can be drawn from them.

Two types of statistical investigations: Descriptive and inferential

 

Common goal of statistical investigations: Explore characteristics of a large group of items (population) based on information about a few (sample).

The investigation may be observational (information is collected on subjects without intervening, often after the fact) or experimental (information is collected from an experiment designed to answer a particular question).

Information collected in the form of variables:

  Qualitative: describes observations belonging to a set of categories

  Quantitative: describes observations that take numerical values, either discrete or continuous

 

A summary measure computed from the collected data is a statistic. A summary measure that describes a characteristic of the population is a parameter. Usually, a statistic is used to estimate a parameter.
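The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population below is simulated (hypothetical values, not real data), and the names population, sample, parameter, and statistic are chosen here for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements (made up for illustration)
population = [random.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)   # population mean: a parameter

# A sample is a small subset drawn from the population
sample = random.sample(population, 100)
statistic = statistics.mean(sample)       # sample mean: a statistic

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

In practice the population is usually too large to measure in full, so only the statistic can be computed and it serves as the estimate of the parameter.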

 

Tabular and Graphical Summaries of Data

  1. Stem-and-leaf plot: represents each data value by splitting it into a stem (the leading digit or digits) and a leaf (the final digit). Gives a visual summary of the data while retaining the actual data values
  2. Frequency table: data arranged in tabular form to show the number of times an item falls into a particular category (frequency) or the proportion of items falling into that category (relative frequency). For quantitative data, need to choose class widths, class intervals, and class boundaries. The relative frequencies give information about the chance that a particular item from the population will fall in a certain category
  3. Histogram: graphical representation of a frequency table; gives information about the shape of the data, including symmetry
  4. Pie charts and bar charts: useful for showing the percentage of total items falling into each category
  5. Scatter plot: useful for showing the relationship between two variables collected on the same items (bivariate data)
  6. Time series plot: line plot showing variation in data over time
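As a rough illustration of the first two summaries in the list above, the following Python sketch builds a text stem-and-leaf plot and a frequency table with relative frequencies. The data values, the class width of 10, and the helper names stem_and_leaf and frequency_table are assumptions made for this example only.

```python
from collections import Counter, defaultdict

# Hypothetical two-digit data values (made up for illustration)
data = [12, 15, 21, 23, 23, 28, 31, 34, 35, 35, 35, 42, 47]

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) and list the ones digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

def frequency_table(values, width=10):
    """Map each class lower limit to (frequency, relative frequency).

    Classes are the intervals [lower, lower + width).
    """
    counts = Counter((v // width) * width for v in values)
    n = len(values)
    return {lower: (counts[lower], counts[lower] / n) for lower in sorted(counts)}

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

table = frequency_table(data)
for lower, (freq, rel) in table.items():
    print(f"[{lower}, {lower + 10}): frequency = {freq}, relative frequency = {rel:.2f}")
```

Each row of the stem-and-leaf output still shows every original value, whereas the frequency table keeps only the count per class. That is the trade-off between the two summaries.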

 

Numerical Summaries of Data

  1. Measures of central tendency: mean, median, mode. The mean is the arithmetic average of the values; the median is the value such that half of the data are less than or equal to it and half are greater than or equal to it; the mode is the most frequent value. A weighted mean may also be of interest
  2. p-th Percentile: Value such that, when the data are sorted from smallest to largest, at least p percent of the observations are at or below this value and at least (100 - p) percent are at or above this value. The 25th, 50th (median), and 75th percentiles are of particular interest. IQR (interquartile range) = 75th percentile - 25th percentile
  3. Box plot: graphical display showing the middle 50% of the data as a box, with a line drawn across the box at the median. Whiskers extend from the left edge of the box (25th percentile) to the smallest observation that is not an outlier, and from the right edge of the box (75th percentile) to the largest observation that is not an outlier. Outliers are depicted by individual symbols
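  4. Measures of dispersion: range (largest value minus smallest value) and standard deviation (roughly, the typical distance of the observations from the mean)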
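  5. Chebyshev’s Rule: For any data set, at least 75% of observations will fall within 2 standard deviations of the mean, and at least 88.9% (about 89%) will fall within 3 standard deviations of the mean. In general, at least 1 - 1/k^2 of the observations fall within k standard deviations of the mean, for any k > 1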
  6. Empirical Rule: For a data set whose distribution is approximately bell-shaped (symmetric and mound-shaped), approximately 68% of observations will fall within 1 standard deviation of the mean, approximately 95% will fall within 2 standard deviations of the mean, and approximately 99.7% will fall within 3 standard deviations of the mean
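The numerical summaries above can be computed with Python's standard statistics module, as in the sketch below. The data set is hypothetical, and the helper fraction_within is a name introduced here for illustration. Textbooks and software use several slightly different percentile conventions, so the quartiles from statistics.quantiles may not match a hand calculation exactly.

```python
import statistics

# Hypothetical data set with one unusually large value (made up for illustration)
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 39]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
data_range = max(data) - min(data)
sd = statistics.stdev(data)   # sample standard deviation (divisor n - 1)

# Quartiles; the "inclusive" method is one of several percentile conventions
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

def fraction_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(abs(v - m) <= k * s for v in values) / len(values)

print(f"mean = {mean}, median = {median}, mode = {mode}")
print(f"range = {data_range}, sd = {sd:.2f}, IQR = {iqr}")

# Compare the observed fractions with Chebyshev's lower bound 1 - 1/k^2
for k in (2, 3):
    bound = 1 - 1 / k**2
    print(f"k = {k}: observed = {fraction_within(data, k):.2f}, Chebyshev bound = {bound:.3f}")
```

Because this data set is skewed by one large value, the mean is pulled above the median, and the Empirical Rule percentages should not be expected to hold. Chebyshev's bound still applies because it makes no assumption about the shape of the distribution.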