Chapter 1 Overview
Statistics: A science that deals with the methods of collecting, organizing, and summarizing data in such a way that valid conclusions can be drawn from them.
Two types of statistical investigations: descriptive and inferential.
Common goal of statistical investigations: Explore characteristics of a large group of items (population) based on information about a few (sample).
The investigation may be observational (information is collected on subjects after the fact) or experimental (information is collected from an experiment designed to answer a particular question).
Information is collected in the form of variables:
- Qualitative: describes observations belonging to a set of categories
- Quantitative: describes observations that take numerical values, either discrete or continuous
A summary measure computed from the collected data is a statistic. A summary measure that describes a characteristic of the population is a parameter. Usually, a statistic is used to estimate a parameter.
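The distinction between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population below is made up (10,000 simulated values), and the names `population`, `sample`, and `estimation_error` are hypothetical. In practice the population is usually too large to measure, which is why the statistic is needed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Parameter: a summary measure of the whole population.
population_mean = statistics.mean(population)

# Statistic: the same summary measure computed from a sample of 100 items.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

# How far the statistic (the estimate) is from the parameter.
estimation_error = abs(sample_mean - population_mean)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
print(f"estimation error:            {estimation_error:.2f}")
```

A different sample would give a different value of the statistic, while the parameter stays fixed.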
Tabular and Graphical Summaries of Data
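- Stem-and-leaf plot: represents data by plotting the leading digit(s) of each value as the stem and the last digit as the leaf (for two-digit data, the first digit is the stem and the second digit is the leaf). Gives a visual summary of the data while retaining the actual data values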
- Frequency table: data arranged in tabular form to show the number of times an item falls into a particular category (frequency) or the proportion of items that fall into it (relative frequency). Need to choose class intervals, class widths, and class boundaries. The relative frequencies give information about the chance that a particular item from the population will fall in a certain category
- Histogram: graphical representation of a frequency table; gives information about the symmetry of the data
- Pie charts and bar charts: useful to show the percentage of total items falling into a particular category
- Scatter plots: useful for showing the relationship between two variables collected on the same items (bivariate data)
- Time series plot: line plot showing variation in data over time
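The first two summaries in the list above can be sketched in a few lines of Python. The data set `scores` is made up, and the helper names `stem_and_leaf` and `frequency_table` are hypothetical. The class choices (start at 50, width 10, five classes) are one possible choice for illustration, not a rule.

```python
from collections import Counter, defaultdict

# Hypothetical two-digit data values (e.g. exam scores).
scores = [52, 67, 61, 75, 78, 83, 85, 85, 89, 91, 94, 58, 72, 66, 80]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


def frequency_table(values, start, width, n_classes):
    """Return (lower, upper, frequency, relative frequency) rows for the
    classes [start, start + width), [start + width, start + 2 * width), ...

    Values outside the listed classes are not counted in any row.
    """
    counts = Counter((v - start) // width for v in values)
    n = len(values)
    rows = []
    for k in range(n_classes):
        lower = start + k * width
        freq = counts.get(k, 0)
        rows.append((lower, lower + width, freq, freq / n))
    return rows


plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

table = frequency_table(scores, start=50, width=10, n_classes=5)
for lower, upper, freq, rel in table:
    print(f"[{lower}, {upper})  {freq:2d}  {rel:.2f}")
```

Drawing one bar per row of `table`, with height equal to the frequency, would give the corresponding histogram.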
Numerical Summaries of Data
- Measures of central tendency: mean, median, mode. The mean is the arithmetic average of the values; the median is the value such that half of the data are less than this value and half are greater than this value; the mode is the most frequent value. A weighted mean may also be of interest
- p-th percentile: Value such that, when the data are sorted from smallest to largest, at least p percent of the observations are at or below this value and at least (100 - p) percent are at or above this value. The 25th, 50th (median), and 75th percentiles are of particular interest. IQR = 75th percentile - 25th percentile
- Box plot: graphical display indicating the middle 50% of the data as a box, with a line drawn inside the box to indicate the median. Whiskers are drawn from the left edge of the box (25th percentile) to the smallest observation that is not an outlier, and from the right edge of the box (75th percentile) to the largest observation that is not an outlier. Outliers are depicted by individual symbols
- Measures of dispersion: range, standard deviation
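- Chebyshev’s Rule: For any data set and any k > 1, at least 1 - 1/k^2 of the observations will fall within k standard deviations of the mean. So at least 75% of observations will fall within 2 standard deviations of the mean, and at least 88.9% (about 89%) will fall within 3 standard deviations of the mean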
- Empirical Rule: For a data set whose distribution is approximately bell-shaped and symmetric, approximately 68% of observations will fall within 1 standard deviation of the mean, approximately 95% will fall within 2 standard deviations of the mean, and approximately 99.7% will fall within 3 standard deviations of the mean
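The numerical summaries above can be computed with Python's standard `statistics` module. This is a sketch under several assumptions: the data sets are made up, and the helper names `fraction_within` and `chebyshev_bound` are hypothetical. Software packages differ in how they interpolate percentiles, so the quartiles here follow the module's "inclusive" method. The 1.5 × IQR outlier fences are a common convention that these notes do not state. The Chebyshev bound used is 1 - 1/k^2 for k > 1.

```python
import random
import statistics

# Hypothetical sample, small enough to check by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle of the sorted data
mode = statistics.mode(data)      # most frequent value

# Weighted mean: hypothetical grades weighted by credit hours.
grades = [90, 80, 70]
credits = [3, 4, 1]
weighted_mean = sum(g * w for g, w in zip(grades, credits)) / sum(credits)

# 25th, 50th and 75th percentiles, and the interquartile range.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

# Measures of dispersion.
data_range = max(data) - min(data)
sd = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)


def fraction_within(values, k):
    """Fraction of values within k standard deviations of the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(abs(v - m) <= k * s for v in values) / len(values)


def chebyshev_bound(k):
    """Chebyshev's minimum fraction within k standard deviations (k > 1)."""
    return 1 - 1 / k**2


# Chebyshev's Rule holds for any data set, including this small one.
for k in (2, 3):
    print(k, fraction_within(data, k), ">=", round(chebyshev_bound(k), 3))

# Empirical Rule: simulated bell-shaped data for comparison.
random.seed(1)
bell = [random.gauss(0, 1) for _ in range(10_000)]
for k in (1, 2, 3):
    print(k, round(fraction_within(bell, k), 3))
```

For bell-shaped data the observed fractions are much higher than the Chebyshev minimums, because Chebyshev's Rule makes no assumption about the shape of the distribution.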