This document provides an introduction and overview of key concepts in statistics. It discusses descriptive versus inferential statistics, levels of measurement, measures of central tendency and variability, hypothesis testing, and common statistical tests like t-tests, ANOVA, correlation, and chi-square. The goal is to summarize different numerical representations of data and how statistics depend on sampling methods and can be used for descriptive or comparative objectives.
Data Analysis
An Introduction and Overview
Numerical representations of our data
These can be descriptive or inferential:
Descriptive statistics summarize data.
Inferential statistics are tools that indicate how much confidence we can have when we generalize from a sample to a population.

Statistics depend on our sampling methods: probability or non-probability (i.e., random or not)?
Even with probability samples, there is a possibility that the statistics we obtain do not accurately reflect the population.
Sampling error: inadequate sampling frame, low response rate, coverage (some people in the population are not given a chance of selection)
Non-sampling error: problems with transcribing and coding data; observer/instrument error; misrepresentation as error

Levels of measurement: the relationship among the values that are assigned to a variable and the attributes of that variable.
Nominal: naming
Ordinal: rank order (high to low, but no indication of how much higher or lower one subject is than another)
Interval: equal intervals between values
Ratio: equal intervals AND an absolute zero (e.g., a ruler)

Examples of variables to classify:
Age: under 30, 30-39, 40-49, 50-59
Gender: Male, Female
Level of Agreement: Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree
Percentage of the library budget spent on staff salaries

Objectives can be descriptive or comparative. Research objectives/questions: hypotheses
Descriptive statistics Inferential Statistics
Descriptive statistics can be applied to any measurements (quantitative or qualitative). They offer a summary/overview/description of data. They do not explain or interpret.

Number: frequency count, percentage, deciles and quartiles
Measures of central tendency: mean, median, mode
Variability: variance and standard deviation
Graphs
Normal curve

Averages
Mode: the most frequently occurring value in a distribution (any scale, most unstable)
Median: the midpoint in the distribution, below which half of the cases reside (ordinal and above)
Mean: the arithmetic average, i.e., the sum of all values in a distribution divided by the number of cases (interval or ratio)

Example (11 test scores): 61, 61, 72, 77, 80, 81, 82, 85, 89, 90, 92
The median is 81 (half of the scores fall above 81, and half below).

Example (6 scores): 3, 3, 7, 10, 12, 15
With an even number of scores, the median is halfway between the two middle scores: sum the middle scores (7 + 10 = 17) and divide by 2 (17/2 = 8.5).

The median is insensitive to extremes. Adding one extreme score moves it only to 10:
3, 3, 7, 10, 12, 15, 200
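The three median examples above can be checked with a short sketch. It uses only Python's standard `statistics` module; the variable names are chosen here for illustration and are not part of the original slides.

```python
from statistics import median

# Score lists copied from the examples above.
odd_scores = [61, 61, 72, 77, 80, 81, 82, 85, 89, 90, 92]   # 11 scores
even_scores = [3, 3, 7, 10, 12, 15]                         # 6 scores
with_outlier = even_scores + [200]                          # one extreme score added

median_odd = median(odd_scores)           # middle (6th) score
median_even = median(even_scores)         # average of the two middle scores
median_outlier = median(with_outlier)     # middle (4th) of 7 scores

print(median_odd)      # 81
print(median_even)     # 8.5
print(median_outlier)  # 10
```

The extreme score of 200 shifts the median by only one position in the ordered list, which is what "insensitive to extremes" means in practice.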
The mean is the sum of a set of values divided by the number of values:
Scores: 5, 6, 7, 10, 12, 15
Sum: 55
Number of scores: 6
Computation of mean: 55/6 = 9.17

The mode is the most frequently occurring value in a set. It is best used for nominal data.

A distribution can be skewed to the right (positive) or to the left (negative).
An extremely hard test that results in a lot of low grades will be skewed to the right: the mode is smaller than the median, which is smaller than the mean. This relationship exists because the mode is the point on the x-axis corresponding to the highest point of the curve, that is, the score with the greatest frequency. The median is the point on the x-axis that cuts the distribution in half, such that 50% of the area falls on each side. The mean is pulled furthest toward the long tail by the few extreme scores.
An extremely easy test will result in a lot of high grades and will skew to the left (negative). The order of the measures of central tendency is the opposite of the positively skewed distribution: the mean is smaller than the median, which is smaller than the mode.

Variability is the differences among scores; it shows how subjects vary.
Dispersion: the extent of scatter around the “average”
Range: the distance between the highest and lowest scores in a distribution
Variance and standard deviation: the spread of scores in a distribution. The greater the scatter, the larger the variance. These require interval or ratio level data.
Standard deviation: how much subjects differ from the mean of their group. The more spread out the subjects are around the mean, the larger the standard deviation. It is sensitive to extremes or “outliers”.

Inferential statistics allow for comparisons across variables, e.g., is there a relation between one’s occupation and their reason for using the public library?

Hypothesis Testing
The level of significance is the predetermined level at which a null hypothesis is rejected (not supported).
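As a brief aside on the descriptive measures covered above (mean, skew, range, variance, standard deviation), the following sketch computes them with Python's standard `statistics` module. Two points are assumptions made here and not taken from the slides: the population formulas (`pvariance`, `pstdev`, dividing by N) are used because the slides do not say whether N or N - 1 is intended, and the `hard_test` scores are invented purely to illustrate a right-skewed distribution.

```python
from statistics import mean, median, mode, pvariance, pstdev

# Scores from the mean example above.
scores = [5, 6, 7, 10, 12, 15]

avg = mean(scores)                        # sum / number of scores = 55 / 6
score_range = max(scores) - min(scores)   # highest minus lowest
var = pvariance(scores)                   # population variance (divides by N); an assumption
sd = pstdev(scores)                       # population standard deviation = sqrt(variance)

print(round(avg, 2))    # 9.17
print(score_range)      # 10
print(round(var, 2))    # 12.47
print(round(sd, 2))     # 3.53

# Hypothetical scores for an extremely hard test: many low grades, a few high ones.
hard_test = [40, 40, 40, 45, 50, 55, 60, 75, 95]

# In a right-skewed (positive) distribution: mode < median < mean.
print(mode(hard_test), median(hard_test), round(mean(hard_test), 2))  # 40 50 55.56
```

Using `variance` and `stdev` from the same module instead would give the sample versions (dividing by N - 1), which are slightly larger.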
The most common level is p < .05 (p = probability; < = less than; > = more than).

Type I error: reject the null hypothesis when it is really true.
Type II error: fail to reject the null hypothesis when it is really false.

By using inferential statistics to make decisions, we can report the p value: the probability of obtaining results at least as extreme as ours if the null hypothesis were really true. By reporting the p value, we alert readers to the risk that we were incorrect when we decided to reject the null hypothesis (a Type I error).

Chi-square test of independence: two variables (nominal and nominal, nominal and ordinal, or ordinal and ordinal). It is affected by the number of cells and the number of cases.
Two-tailed test = non-directional hypothesis
One-tailed test = directional hypothesis

Correlation: the extent to which two variables are related across a group of subjects
Pearson r
It can range from -1.00 to 1.00.
-1.00 is a perfect inverse relationship, the strongest possible inverse relationship.
0.00 indicates the complete absence of a relationship.
1.00 is a perfect positive relationship, the strongest possible direct relationship.
The closer a value is to 0.00, the weaker the relationship.
The closer a value is to -1.00 or +1.00, the stronger it is.
Spearman rho

t-test
Tests the difference between two sample means for significance
Pretest to posttest
Relates to research design
Perhaps used for information literacy instruction
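The chi-square, Pearson r, and pretest-to-posttest t-test ideas above can be sketched by hand with the standard library. Everything in this block is illustrative: the function names (`chi_square`, `pearson_r`, `paired_t`) and all data values are hypothetical and are not taken from the slides. The sketch computes test statistics only; turning them into p values requires a statistical table or a library.

```python
from math import sqrt
from statistics import mean, stdev

def chi_square(table):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for row, rt in zip(table, row_totals):
        for observed, ct in zip(row, col_totals):
            expected = rt * ct / n          # count expected if the variables were independent
            total += (observed - expected) ** 2 / expected
    return total

def pearson_r(xs, ys):
    """Pearson correlation coefficient, from -1.00 to 1.00."""
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

def paired_t(pre, post):
    """t statistic for pretest-to-posttest differences (df = n - 1)."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical 2x2 table, e.g. two groups of patrons by yes/no on a survey item.
chi2 = chi_square([[30, 10], [20, 40]])
print(round(chi2, 2))   # 16.67

# Hypothetical data with a perfect direct and a perfect inverse relationship.
r_pos = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_neg = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(round(r_pos, 2), round(r_neg, 2))   # 1.0 -1.0

# Hypothetical quiz scores before and after an instruction session.
t = paired_t([60, 65, 70, 72, 80], [68, 70, 75, 80, 84])
print(round(t, 2))      # 7.17
```

For reference, the two-tailed critical values at the .05 level are 3.841 for chi-square with 1 degree of freedom and 2.776 for t with 4 degrees of freedom, so both hypothetical results here would lead to rejecting the null hypothesis.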
Other tests: analysis of variance; regression analysis (including step-wise regression)

Analysis of variance (ANOVA) tests the difference(s) among two or more means. It can be used to test the difference between two means.
So use a t-test or ANOVA? KEY: ANOVA can also be used to test the differences among more than two means in a single test, which cannot be done with a t-test.

Parametric statistical tests generally require interval or ratio level data and assume that the scores were drawn from a normally distributed population or that both sets of scores were drawn from populations with the same variance or spread of scores.
Nonparametric methods do not make assumptions about the shape of the population distribution. These are typically less powerful and often need large samples.
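The t-test versus ANOVA point above can be illustrated with a minimal sketch. It assumes a one-way ANOVA and an independent-samples t-test with pooled variance (the slides do not specify which variants are meant). The function names and the three groups of scores are hypothetical. With two groups, the ANOVA F statistic equals the square of the t statistic; with three groups, only ANOVA handles the comparison in a single test.

```python
from math import sqrt
from statistics import mean

def one_way_f(*groups):
    """F statistic for a one-way ANOVA across two or more groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-groups variability: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups variability: how far each score is from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def pooled_t(a, b):
    """Independent-samples t statistic, assuming equal population variances."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

# Hypothetical scores for three groups.
group_a = [4, 5, 6, 7, 8]
group_b = [6, 7, 8, 9, 10]
group_c = [9, 10, 11, 12, 13]

f_two = one_way_f(group_a, group_b)
t_two = pooled_t(group_a, group_b)
print(round(f_two, 2), round(t_two ** 2, 2))   # 4.0 4.0

f_three = one_way_f(group_a, group_b, group_c)
print(round(f_three, 2))                       # 12.67
```

Because F equals t squared in the two-group case, the two tests lead to the same decision there; the practical difference appears only with three or more means.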