Math 333/333H
Week 1
- Discussion of syllabus and class standards
- Collection of student information
- Overview of probability and statistics
- Graphical and numerical summaries of data (using StataQuest)
Statistics: A science that deals with the methods of collecting, organizing, and summarizing data in such a way that valid conclusions can be drawn from them.
Two types of statistical investigations: descriptive and inferential.
Common goal of statistical investigations: Explore characteristics of a large group of items (population) based on information about a few (sample).
The investigation may be observational (information is collected on subjects after the fact) or experimental (information is collected from an experiment designed to answer a particular question).
Information is collected in the form of variables:
- Qualitative: describes observations belonging to a set of categories
- Quantitative: describes observations that take numerical values (discrete or continuous)
A summary measure computed from the collected data is a statistic. A summary measure that describes a characteristic of the population is a parameter. Usually, a statistic is used to estimate a parameter.
For observational studies, data are collected using random sampling methods (simple random sample, stratified random sample, cluster sample) applied to the target population. For experimental studies, data are collected by carrying out a designed experiment.
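The three sampling methods named above, and the statistic/parameter distinction, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population, its stratum and cluster labels, the seed, and all function names are made up for the example and are not part of the course materials.

```python
import random
import statistics

rng = random.Random(333)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1000 items (all fields are assumptions).
population = [
    {
        "id": i,
        "stratum": "A" if i % 4 == 0 else "B",  # 250 items in A, 750 in B
        "cluster": i // 50,                      # 20 clusters of 50 items
        "value": rng.gauss(50, 10),
    }
    for i in range(1000)
]

def simple_random_sample(pop, n):
    """Every subset of size n is equally likely to be chosen."""
    return rng.sample(pop, n)

def stratified_sample(pop, n_per_stratum):
    """Split the population into strata, then draw a simple random sample from each."""
    strata = {}
    for item in pop:
        strata.setdefault(item["stratum"], []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(pop, n_clusters):
    """Pick whole clusters at random and keep every item in the chosen clusters."""
    cluster_ids = sorted({item["cluster"] for item in pop})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [item for item in pop if item["cluster"] in chosen]

# Parameter: mean of the whole population (normally unknown in practice).
parameter = statistics.mean(item["value"] for item in population)

# Statistic: mean of a sample, used to estimate the parameter.
srs = simple_random_sample(population, 40)
statistic = statistics.mean(item["value"] for item in srs)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The sample mean will generally be close to, but not equal to, the population mean; a different seed gives a different sample and a different value of the statistic.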
Tabular and Graphical Summaries of Data
- Stem-and-leaf plot: represents data by plotting the leading digit(s) of each value as the stem and the final digit as the leaf. Gives a visual summary of the data while retaining the actual data values
- Frequency table: data arranged in tabular form to show the number of times an item falls into a particular category (frequency) or the proportion of items falling into that category (relative frequency). For quantitative data, need to choose class widths, class intervals, and class boundaries. Relative frequencies give information about the chance that a particular item from the population will fall in a certain category
- Histogram: graphical representation of a frequency table; gives information about the symmetry of the data
- Pie charts and bar charts: useful to show the percentage of total items falling into a particular category
- Scatter plots: useful for showing the relationship between two variables collected on the same items (bivariate data)
- Time series plot: line plot showing variation in data over time
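As a rough illustration of the first two summaries in the list above, the following Python sketch builds a stem-and-leaf display and a frequency table with relative frequencies. The scores, the class width of 10, and the function names are made up for the example; the stem-and-leaf function assumes non-negative integers below 100, and the frequency table assumes every value is at least the starting boundary.

```python
from collections import Counter, defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits).
    Assumes non-negative integers below 100."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

def frequency_table(values, width, start):
    """Return (lower, upper, frequency, relative frequency) for each class
    [start, start + width), [start + width, start + 2 * width), ...
    Assumes every value is >= start."""
    counts = Counter((v - start) // width for v in values)
    n = len(values)
    table = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        freq = counts.get(k, 0)
        table.append((lower, lower + width, freq, freq / n))
    return table

scores = [62, 67, 71, 74, 75, 78, 80, 83, 85, 88, 91, 95]  # made-up data

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

for lower, upper, freq, rel in frequency_table(scores, width=10, start=60):
    print(f"[{lower}, {upper}): frequency={freq}, relative frequency={rel:.2f}")
```

A histogram is a bar chart of the same table, with one bar per class and bar height given by the frequency (or relative frequency).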
Numerical Summaries of Data
- Measures of central tendency: mean, median, mode. The mean is the arithmetic average of the values; the median is the value such that half of the data are less than this value and half are greater than this value; the mode is the most frequent value. May also be interested in a weighted mean
- p-th percentile: value such that, when the data are sorted from smallest to largest, at least p percent of the observations are at or below this value and at least (100 - p) percent are at or above this value. Particularly interested in the 25th, 50th (median), and 75th percentiles. IQR = 75th percentile - 25th percentile
- Box plot: graphical display indicating the middle 50% of the data as a box, with a horizontal line drawn in the box to indicate the median and “whiskers” drawn from the ends of the box to the farthest observations still within 1.5 x (fourth spread, i.e. the IQR) of the ends of the box; outliers beyond the whiskers are depicted by individual symbols
- Measures of dispersion: range, mean absolute deviation, standard deviation
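- Chebyshev’s Rule: for any data set, at least 75% of observations will fall within 2 standard deviations of the mean, and at least 89% will fall within 3 standard deviations of the mean. In general, at least 1 - 1/k^2 of the observations fall within k standard deviations of the mean for any k > 1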
- Empirical Rule: for a data set whose distribution is somewhat bell-shaped (mound-shaped and roughly symmetric), approximately 68% of observations will fall within 1 standard deviation of the mean, approximately 95% will fall within 2 standard deviations of the mean, and approximately 99.7% will fall within 3 standard deviations of the mean
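The numerical summaries in the list above can be computed with Python's standard library, as in the sketch below. The data values are made up, and the percentile function uses linear interpolation between sorted values, which is only one of several conventions in use; StataQuest or a textbook may report slightly different quartiles. The data set is deliberately skewed, so only Chebyshev's Rule (which holds for any data) is checked, not the Empirical Rule.

```python
import statistics

def percentile(data, p):
    """p-th percentile (0 <= p <= 100) by linear interpolation between
    sorted values; other conventions give slightly different answers."""
    xs = sorted(data)
    pos = (len(xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def fraction_within(data, center, sd, k):
    """Fraction of observations within k standard deviations of center."""
    return sum(abs(x - center) <= k * sd for x in data) / len(data)

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]  # made-up values with one outlier

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Quartiles, IQR, and the box-plot cutoffs at 1.5 x IQR beyond the box
q1, q3 = percentile(data, 25), percentile(data, 75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
whiskers = (min(x for x in data if x >= lower_fence),
            max(x for x in data if x <= upper_fence))

# Measures of dispersion
data_range = max(data) - min(data)
mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation
sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, whiskers={whiskers}, outliers={outliers}")
print(f"range={data_range}, MAD={mad:.2f}, sd={sd:.2f}")

# Chebyshev's Rule: at least 1 - 1/k^2 of the data lie within k sd of the mean
for k in (2, 3):
    observed = fraction_within(data, mean, sd, k)
    print(f"k={k}: observed {observed:.2f}, Chebyshev bound {1 - 1 / k**2:.2f}")
```

Note how the single large value pulls the mean above the median, while the median and IQR are barely affected by it.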