Statistics I - Introduction To ANOVA, Regression, and Logistic Regression
Topic Summaries
Basic Statistical Concepts
Descriptive statistics organizes, describes, and summarizes data using numbers and
graphical techniques. Inferential statistics is concerned with drawing conclusions about
a population from the analysis of a random sample drawn from that population.
Inferential statistics is also concerned with the precision and reliability of those
inferences.
A population is the complete set of observations or the entire group of objects that you
are researching. A sample is a subset of the population. The sample should be
representative of the population. You can obtain a representative sample by collecting
a simple random sample.
Parameters are numerical values that summarize characteristics of a population.
Parameter values are typically unknown and are represented by Greek letters.
Statistics summarize characteristics of a sample. You use letters from the English
alphabet to represent sample statistics. You can measure characteristics of your
sample and provide numerical values that summarize those characteristics. You use
statistics to estimate parameters.
Variables are characteristics or properties of data that take on different values or
amounts. A variable can be independent or dependent. In some contexts, you select
the value of an independent variable, in order to determine its relationship to the
dependent variable. In other contexts, the independent variable's values are simply
taken as given.
Variables are also classified according to their characteristics. They can
be quantitative or categorical. Data that consists of counts or measurements is
quantitative. Quantitative data can be further distinguished by two types: discrete and
continuous. Discrete data takes on only a finite, or countable, number of values.
Continuous data has an infinite number of values and no breaks or jumps.
Categorical or attribute data consists of variables that denote groupings or labels.
There are two main types: nominal and ordinal. A nominal categorical variable exhibits
no ordering within its groups or categories. With ordinal categorical variables, the
observed levels of the variable can be ordered in a meaningful way that implies
differences due to magnitude.
A variable's classification is its scale of measurement. There are two scales of
measurement for categorical variables: nominal and ordinal. There are two scales of
measurement for continuous variables: interval and ratio. Data from an interval scale
can be rank-ordered and has a sensible spacing of observations such that differences
between measurements are meaningful. However, you cannot calculate ratios
between numbers on an interval scale because it has no true zero point. Data on a
ratio scale includes a true zero point, so the ratio between two values on the scale is
meaningful. For example, temperature in degrees Celsius is interval data (0°C does
not mean an absence of temperature), whereas weight is ratio data (10 kg is twice
as heavy as 5 kg).
The appropriate statistical method for your data also depends on the number of
variables involved. Univariate analysis provides techniques for analyzing and
describing a single variable at a time. Bivariate analysis describes and explains the
relationship between two variables and how they change, or covary, together.
Multivariate analysis examines two or more variables at the same time, in order to
understand the relationships among them.
Descriptive Statistics
Your data's distribution tells you what values your data takes and how often it takes
those values.
You can calculate descriptive statistics that measure locations in your data. Statistics
that locate the center of the data are measures of central tendency. These include
mean, median, and mode.
Percentiles are descriptive statistics that give you reference points in your data. A
percentile is the value of a variable below which a certain percentage of observations
fall. The most commonly reported percentiles are quartiles, which break the data into
quarters.
There are several descriptive statistics that measure the variability of your data: range,
interquartile range (IQR), variance, standard deviation, and coefficient of variation
(C.V.).
To summarize and generate descriptive statistics, you use the MEANS procedure.
PROC MEANS calculates a standard set of statistics, including the minimum,
maximum, and mean data values, as well as the standard deviation and n. The
PRINTALLTYPES option displays statistics for all requested combinations of class
variables.
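For example, a minimal PROC MEANS step might look like this (the data set and variable names here are hypothetical):

proc means data=work.study min max mean std n printalltypes;
   class Group;   /* hypothetical categorical classification variable */
   var Score;     /* hypothetical continuous analysis variable */
run;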
Skewness measures the tendency of your data to be more spread out on one side of
the mean than on the other. It measures the asymmetry of the distribution. The
direction of skewness is the direction to which the data is trailing off. The closer the
skewness is to 0, the more normal or symmetric the data.
Kurtosis measures the tendency of data to be concentrated toward the center or
toward the tails of the distribution. The closer kurtosis is to 0, the closer the tails of the
data resemble the tail thickness of the normal distribution. Kurtosis can be difficult to
assess visually.
A negative kurtosis statistic means that the data has lighter tails than in a normal
distribution and is less heavily concentrated about the mean. This is a platykurtic
distribution.
A positive kurtosis statistic means that the data has heavier tails and is more
concentrated about the mean than a normal distribution. This is a leptokurtic
distribution, which is often referred to as heavy-tailed and also as an outlier-prone
distribution.
A normal probability plot is another way to visualize and assess the distribution of your
data. The vertical axis represents the actual data values. The horizontal axis displays
the expected percentiles from a standard normal distribution. The normal reference
line along the diagonal indicates where the data would fall if it were perfectly normal.
A box plot makes it easy to see how spread out the data is and if there are any
outliers.
You can use PROC UNIVARIATE to generate descriptive statistics, histograms, and
normal probability plots.
In the ID statement, you list the variable or variables that SAS should label in the table
of extreme observations and identify as outliers in the graphs.
You can add additional options to the HISTOGRAM and PROBPLOT statements. The
NORMAL option uses estimates of the population mean and standard deviation to add
a normal curve overlay to the histogram and a diagonal reference line to the normal
probability plot.
You can use the INSET statement to create a box of summary statistics directly on the
graphs.
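A minimal sketch of such a PROC UNIVARIATE step, with hypothetical data set and variable names:

proc univariate data=work.study;
   var Score;
   id SubjectID;                                /* labels extreme observations */
   histogram Score / normal(mu=est sigma=est);  /* normal curve overlay */
   probplot Score / normal(mu=est sigma=est);   /* diagonal reference line */
   inset skewness kurtosis / position=ne;       /* summary statistics on the graph */
run;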
In addition to the statistical graphics available to you with PROC UNIVARIATE, you
might want to use PROC SGSCATTER, PROC SGPLOT, PROC SGPANEL, and
PROC SGRENDER to produce a wide variety of additional plot types.
You can use PROC SGPLOT to generate dot plots, horizontal and vertical bar charts,
histograms, box plots, density curves, scatter plots, series plots, band plots, needle
plots, and vector plots. The REG statement generates a fitted regression line or curve.
You use a REFLINE statement to create a horizontal or vertical reference line on the
plot.
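For example (the data set, variables, and reference value here are hypothetical):

proc sgplot data=work.study;
   reg x=Age y=Score;     /* scatter plot with a fitted regression line */
   refline 50 / axis=y;   /* horizontal reference line at Y=50 */
run;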
ODS Graphics is an extension of the SAS Output Delivery System. With ODS
Graphics, statistical procedures produce graphs as automatically as they produce
tables, and graphs are integrated with tables in the ODS output. You can find a list of
the graphs available for each SAS procedure in the SAS documentation.
Hypothesis Testing
A hypothesis test uses sample data to evaluate a question about a population. It
provides a way to make inferences about a population, based on sample data.
There are four steps in conducting a hypothesis test. The first step is to identify the
population of interest and determine the null and alternative hypotheses. The null
hypothesis, H0, is what you assume to be true, unless proven otherwise. It is usually a
hypothesis of equality. The alternative hypothesis, Ha or H1, is typically what you
suspect, or are attempting to demonstrate. It is usually a hypothesis of inequality.
The second step in hypothesis testing is to select the significance level. This is the
amount of evidence needed to reject the null hypothesis. A common significance level
is 0.05 (1 chance in 20).
The third step is to collect the data. The fourth step is to use a decision rule to
evaluate the data. You decide whether or not there is enough evidence to reject the
null hypothesis.
If you reject the null hypothesis when it's actually true, you've made a Type I error. The
probability of committing a Type I error is alpha (α), the significance level of the test. If
you fail to reject the null hypothesis and it's actually false, you've made a Type II error.
The probability of committing a Type II error is beta (β). Type I and Type II errors are
inversely related. The power of a statistical test, the probability of correctly rejecting a
false null hypothesis, is equal to 1 minus beta (1 − β).
The difference between the observed statistic and the hypothesized value is the effect
size. A p-value measures the probability of observing a value as extreme as the one
observed or more extreme. A p-value is not only affected by the effect size, but also by
the sample size.
The t statistic measures how far X-bar, the sample mean, is from the hypothesized
mean, μ0. If the t statistic is much higher or lower than 0 and has a small
corresponding p-value, this indicates that the sample mean is quite different from the
hypothesized mean, and you would reject the null hypothesis.
You can use PROC UNIVARIATE to perform a statistical hypothesis test. You use the
MU0= option to specify the value of the hypothesized mean, μ0. You can use the
ALPHA= option to change the significance level.
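A minimal sketch of such a test (the data set, variable, and hypothesized mean are hypothetical):

proc univariate data=work.study mu0=50 alpha=0.05;
   var Score;   /* tests H0: the population mean of Score equals 50 */
run;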
Syntax
PROC MEANS DATA=SAS-data-set <options>;
CLASS variables;
VAR variables;
RUN;
PROC UNIVARIATE DATA=SAS-data-set <options>;
VAR variables;
ID variables;
HISTOGRAM variables </options>;
PROBPLOT variables </options>;
INSET keywords </options>;
RUN;
PROC SGPLOT DATA=SAS-data-set <options>;
DOT category-variable </option(s)>;
HBAR category-variable </option(s)>;
VBAR category-variable </option(s)>;
HBOX response-variable </option(s)>;
VBOX response-variable </option(s)>;
HISTOGRAM response-variable </option(s)>;
SCATTER X=variable Y=variable </option(s)>;
NEEDLE X=variable Y=numeric-variable </option(s)>;
REG X=numeric-variable Y=numeric-variable </option(s)>;
REFLINE variable | value-1 <... value-n> </option(s)>;
RUN;
ODS GRAPHICS ON <options>;
statistical procedure code
ODS GRAPHICS OFF;
Topic Summaries
Two-Sample t-Tests
The two-sample t-test is a hypothesis test for answering questions about the means of
two populations. You can examine the differences between populations for one or
more continuous variables and assess whether the means of the two populations are
statistically different from each other.
The null hypothesis for the two-sample t-test is that the means for the two groups are
equal. The alternative hypothesis is the logical opposite of the null and is typically what
you suspect or are trying to show. It is usually a hypothesis of inequality. The
alternative hypothesis for the two-sample t-test is that the means for the two groups
are not equal.
The three assumptions for the two-sample t-test are independence, normality, and
equal variances.
You use the F-test for equality of variances to evaluate the assumption of equal
variances in the two populations. You calculate the F statistic, which is the ratio of the
maximum sample variance of the two groups to the minimum sample variance of the
two groups. If the p-value of the F-test is greater than your alpha, you fail to reject the
null hypothesis and can proceed as if the variances are equal between the groups. If
the p-value of the F-test is less than your alpha, you reject the null hypothesis and can
proceed as if the variances are not equal.
With one-sided tests, you look for a difference in one direction. For instance, you can
test to determine whether the mean of one population is greater than or less than the
mean of another population. An advantage of one-sided tests is that they can increase
the power of a statistical test.
To perform the two-sample t-test and the one-sided test, you can use PROC TTEST.
You add the PLOTS option to the PROC TTEST statement to control the plots that
ODS produces. You add the SIDES=U or SIDES=L option to specify an upper or lower
one-sided test.
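A minimal sketch of an upper one-sided two-sample t-test (the data set and variable names are hypothetical):

proc ttest data=work.study plots(shownull)=interval sides=u;
   class Group;   /* hypothetical two-level grouping variable */
   var Score;
run;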
One-Way ANOVA
You can use ANOVA to determine whether there are significant differences between
the means of two or more populations. In this model, you have a continuous
dependent, or response, variable and a categorical independent, or predictor, variable.
With ANOVA, the null hypothesis is that all of the population means are equal.
The alternative hypothesis is that not all of the population means are equal. In other
words, at least one mean is different from the rest.
One way to represent the relationship between the response and predictor variables in
ANOVA is with a mathematical ANOVA model.
ANOVA analyzes the variances of the data to determine whether there is a difference
between the group means. You can determine whether the variation of the means is
large enough relative to the variation of observations within the group. To do this,
you calculate three types of sums of squares: between group variation (SSM), within
group variation (SSE), and total variation (SST). The SSM and SSE represent pieces
of the total variability. If the SSM is large relative to the SSE, you reject the null hypothesis
that all of the group means are equal.
Before you perform the hypothesis test, you need to verify the three ANOVA
assumptions: the observations are independent, the error terms are
normally distributed, and the error terms have equal variances across groups.
The residuals that come from your data are estimates of the error term in the model.
You calculate the residuals from ANOVA by taking each observation and subtracting its
group mean. Then you verify the two assumptions regarding normality and equal
variances of the errors.
To verify the ANOVA assumptions and perform the ANOVA test, you use PROC GLM.
In the MODEL statement, you specify the dependent and independent variables for the
analysis. The MEANS statement computes unadjusted means of the dependent
variable for each value of the specified effect. You can add the HOVTEST option to the
MEANS statement to perform Levene's test for homogeneity of variances. If the
resulting p-value of Levene's test is greater than 0.05 (typically), then you fail to reject
the null hypothesis of equal variances.
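A minimal sketch of a one-way ANOVA with Levene's test (the data set and variable names are hypothetical):

proc glm data=work.study plots=diagnostics;
   class Group;
   model Score=Group;
   means Group / hovtest=levene;   /* test for homogeneity of variances */
run;
quit;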
In a controlled experiment, you can design the analysis prospectively and control for
other factors, nuisance factors, that affect the outcome you're measuring. Nuisance
factors can affect the outcome of your experiment, but are not of interest in the
experiment. In a randomized block design, you can use a blocking variable to control
for the nuisance factors and reduce or eliminate their contribution to the experimental
error.
One way to represent the relationship between the response and predictor variables in
ANOVA is with a mathematical ANOVA model. You can also include a blocking
variable in the model.
A pairwise comparison examines the difference between two treatment means. If your
ANOVA results suggest that you reject the null hypothesis that the means are equal
across groups, you can conduct multiple pairwise comparisons in a post hoc analysis
to learn which means differ.
The chance that you make a Type I error increases each time you conduct a statistical
test. The comparisonwise error rate, or CER, is the probability of a Type I error on a
single pairwise test. The experimentwise error rate, or EER, is the probability of making
at least one Type I error when performing all of the pairwise comparisons. The EER
increases as the number of pairwise comparisons increases.
You can use the Tukey method to control the EER. This test compares all possible
pairs of means, so it can only be used when you make pairwise
comparisons. Dunnett's method is a specialized multiple comparison test that enables
you to compare a single control group to all other groups.
You request all of the multiple comparison methods with options in the LSMEANS
statement in PROC GLM. You use the PDIFF=ALL option to request p-values for the
differences between all of the means. With this option, SAS produces a diffogram. You
use the ADJUST= option to specify the adjustment method for multiple comparisons.
When you specify the ADJUST=Dunnett option, SAS produces multiple comparisons
using Dunnett's method and a control plot.
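A sketch of both methods (the data set, variables, and control level 'Placebo' are hypothetical):

proc glm data=work.study;
   class Group;
   model Score=Group;
   lsmeans Group / pdiff=all adjust=tukey;                  /* all pairwise comparisons */
   lsmeans Group / pdiff=control('Placebo') adjust=dunnett; /* each group versus the control */
run;
quit;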
When you have two categorical predictor variables and a continuous response
variable, you can analyze your data using two-way ANOVA. With two-way ANOVA, you
can examine the effects of the two predictor variables concurrently. You can also
determine whether they interact with respect to their effect on the response variable.
An interaction means that the effect of one variable depends on the value of another
variable. If there is no interaction, you can interpret the test for the individual factor
effects to determine their significance. If an interaction exists between any factors, the
test for the individual factor effects might be misleading due to the masking of these
effects by the interaction.
You can include more than one predictor variable and interactions in the ANOVA
model.
You can graphically explore the relationship between the response variable and the
effect of the interaction between the two predictor variables using PROC SGPLOT.
You can use PROC GLM to determine whether the effects of the predictor variables
and the interaction between them are statistically significant.
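A sketch of a two-way ANOVA with an interaction term (the data set and factor names are hypothetical):

proc glm data=work.study;
   class Drug Dose;
   model Score=Drug Dose Drug*Dose;   /* main effects plus their interaction */
run;
quit;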
Syntax
PROC TTEST DATA=SAS-data-set <options>;
CLASS variable;
VAR variable(s);
RUN;
Selected options in PROC TTEST:
PROC TTEST statement: PLOTS(SHOWNULL)=INTERVAL, SIDES=U, SIDES=L

Selected options in PROC GLM:
PROC GLM statement: PLOTS(ONLY), DIAGNOSTICS(UNPACK)
MEANS statement: HOVTEST
LSMEANS statement: PDIFF=ALL, ADJUST=
Topic Summaries
To analyze continuous variables, you can use linear regression. To investigate your
data before performing linear regression, you can use techniques for exploratory data
analysis, including scatter plots and correlation analysis. In exploratory data analysis,
you're simply trying to explore the relationships between variables and to screen for
outliers.
Scatter plots are an important tool for describing the relationship between continuous
variables. Plot your data! You can use scatter plots to examine the relationship
between two continuous variables, to detect outliers, to identify trends in your data, to
identify the range of X and Y values, and to communicate the results of a data
analysis.
You can also use correlation analysis to quantify the relationship between two
variables. Correlation statistics measure the strength of the linear relationship between
two continuous variables. Two variables are correlated if there is a linear association
between them. A common correlation statistic used for continuous variables is
the Pearson correlation coefficient, which ranges from −1 to +1.
The population parameter that represents a correlation is ρ (rho). The null hypothesis for a
test of a correlation coefficient is that ρ equals 0, and the alternative hypothesis is that
ρ is not 0. Rejecting the null hypothesis means only that you can be confident that the
true population correlation is not exactly 0. You need to avoid common mistakes when
interpreting the correlation between variables.
To produce correlation statistics and scatter plots for your data, you use PROC CORR.
To rank-order the absolute value of the correlations from highest to lowest, you add
the RANK option to the PROC CORR statement. To produce scatter plots, you add
the PLOTS= option in the PROC CORR statement. You can also add context-specific
options in parentheses following the main option keyword, such as PLOTS or
SCATTER.
To examine the correlations between the potential predictor variables, you produce
a correlation matrix and scatter plot matrix by using the NOSIMPLE, PLOTS=MATRIX,
and HISTOGRAM options. To specify tooltips for hovering over data points and seeing
detailed information about the observations, you use the IMAGEMAP=ON option in the
ODS GRAPHICS statement and an ID statement in the PROC CORR step.
In correlation analysis, you determine the strength of the linear relationships between
continuous response variables. In simple linear regression, you use the simple linear
regression model to determine the equation for the straight line that defines the linear
relationship between the response variable and the predictor variable.
To determine how much better the model that takes the predictor variable into account
is than a model that ignores the predictor variable, you can compare the simple linear
regression model to a baseline model. For your comparison, you calculate the
explained, unexplained, and total variability in the simple linear regression model.
The null hypothesis for linear regression is that the regression model does not fit the
data better than the baseline model. The alternative hypothesis is that the regression
model does fit the data better than the baseline model. In other words, the slope of the
regression line is not equal to 0, or the parameter estimate of the predictor variable is
not equal to 0.
Before performing simple linear regression, you need to verify the four assumptions for
linear regression: that the mean of the response variable is linearly related to the value
of the predictor variable, that the error terms are normally distributed, that the error
terms have equal variances, and that the error terms are independent at each value of
the predictor variable.
To fit regression models to your data, you use PROC REG. The MODEL
statement specifies the response variable and the predictor variable. To evaluate your
model, you typically examine the p-value for the overall model, the R-square value,
and the parameter estimates.
To assess the level of precision around the mean estimates of the response variable,
you can produce confidence intervals around the means and construct prediction
intervals for a single observation. To display confidence and prediction intervals, you
can specify the CLM and CLI options in the MODEL statement.
To produce predicted values for small data sets using PROC REG, you create a new
data set containing the values of the independent variable for which you want to make
predictions, concatenate the new data set with the original data set, and fit a simple
linear regression model to the new data set.
To produce predicted values for large data sets, using PROC REG and PROC
SCORE is more efficient. You can use the NOPRINT and OUTEST= options in a
PROC REG statement to write the parameter estimates from PROC REG to an output
data set. Then you score the new observations using PROC SCORE, with
the SCORE= option specifying the data set containing the parameter estimates,
the OUT= option specifying the data set that PROC SCORE creates, and
the TYPE= option specifying what type of data the SCORE= data set contains.
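A sketch of this workflow, reusing the fitness data and the need_predictions data set from the sample programs below (the output data set names estimates and scored are hypothetical):

proc reg data=statdata.fitness noprint outest=estimates;
   model Oxygen_Consumption=RunTime;
run;
quit;

proc score data=need_predictions score=estimates
           out=scored type=parms;
   var RunTime;   /* variable used to compute the predicted values */
run;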
In multiple regression, you can model the relationship between the response variable
and more than one predictor variable. In a model with two predictor variables, you can
model the relationship of the three variables (three dimensions) with a two-dimensional plane.
Multiple linear regression has advantages and disadvantages. Its biggest advantage is
that it's more powerful than simple linear regression, that is, you can determine
whether a relationship exists between the response variable and several predictor
variables at the same time. The disadvantages of multiple linear regression are that
you have to decide which model to use, and that when you have more predictors,
interpreting the model becomes more complicated.
You can use multiple regression in two ways: for analytical or explanatory analysis and
for prediction. If you specify many terms, the model for multiple regression can
become very complex.
The hypotheses for multiple regression are similar to those for simple linear
regression. The null hypothesis is that the multiple regression model does not fit the
data better than the baseline model. (All the slopes or parameter estimates are equal
to 0.) The alternative hypothesis is that the regression model does fit the data better
than the baseline model.
For multiple linear regression, the same four assumptions as for simple linear
regression apply: that the mean of the response variable is linearly related to the value
of the predictor variables, that the error terms are normally distributed, that the error
terms have equal variances, and that the error terms are independent at each value of
the predictor variables.
To compare multiple linear regression models, you typically examine the p-value for
the overall models, the adjusted R-square values, and the parameter estimates. The
adjusted R-square value takes into account the number of terms in the model and
increases only if new terms significantly improve the model.
Syntax
PROC CORR DATA=SAS-data-set <options>;
VAR variable(s);
WITH variable(s);
RUN;
Selected options in PROC CORR:
PROC CORR statement: RANK, PLOTS=<(context-specific options)>, NOSIMPLE

Selected option in the ODS GRAPHICS statement:
IMAGEMAP=ON

Selected options in PROC REG:
PROC REG statement: NOPRINT, OUTEST=
MODEL statement: CLM, CLI, P
Sample Programs
Producing Correlation Statistics and Scatter Plots
proc corr data=statdata.fitness rank
plots(only)=scatter(nvar=all ellipse=none);
var RunTime Age Weight Run_Pulse
Rest_Pulse Maximum_Pulse Performance;
with Oxygen_Consumption;
title "Correlations and Scatter Plots with
Oxygen_Consumption";
run;
title;
Examining Correlations between Predictor Variables
ods graphics on / imagemap=on;
proc corr data=statdata.fitness nosimple
plots=matrix(nvar=all histogram);
var RunTime Age Weight Run_Pulse
Rest_Pulse Maximum_Pulse Performance;
id name;
title "Correlations with Oxygen_Consumption";
run;
title;
Performing Simple Linear Regression
proc reg data=statdata.fitness;
model Oxygen_Consumption=RunTime;
title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;
title;
Viewing and Printing Confidence Intervals and Prediction Intervals
proc reg data=statdata.fitness;
model Oxygen_Consumption=RunTime / clm cli;
id name runtime;
title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;
title;
Producing Predicted Values of the Response Variable
data need_predictions;
input RunTime @@;
datalines;
9 10 11 12 13
;
run;
data predoxy;
set need_predictions
statdata.fitness;
run;
proc reg data=predoxy;
model Oxygen_Consumption=RunTime / p;
id RunTime;
title 'Oxygen_Consumption=RunTime with Predicted Values';
run;
quit;
title;
Performing Stepwise Regression
proc reg data=statdata.fitness;
FORWARD: model Oxygen_Consumption=
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse
/ selection=forward;
BACKWARD: model Oxygen_Consumption=
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse
/ selection=backward;
STEPWISE: model Oxygen_Consumption=
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse
/ selection=stepwise;
title 'Best Models Using Stepwise Selection';
run;
quit;
title;
Topic Summaries
Verifying the first assumption of linear regression, that the linear model fits the data
adequately, is critical. You should always plot your data before producing a model.
The remaining three assumptions of linear regression relate to error terms, so you
check these assumptions in terms of errors, not in terms of the values of the response
variable. To verify these assumptions, you can use several different residual plots to
check your regression assumptions. You can plot the residuals versus the predicted
values, plot the residuals versus the values of the independent variables, and produce
a histogram or a normal probability plot of the residuals. To verify that the model
assumptions are valid, check that the residuals display a random scatter above and
below the reference line at 0. If you see patterns or trends in the residual values, the assumptions might
not be valid and the models might have problems. You can also use residual plots
to detect outliers.
To create residual plots and other diagnostic plots, you use PROC REG, which creates
a number of default plots. Specifying an identifier variable in the ID statement shows
you that information when you hover your cursor over the data points in the graph. You
can also request specific plots with the PLOTS= option in the PROC REG statement.
You should also identify any influential observations that strongly affect the linear
model's fit to the data. To identify outliers and influential observations in your data, you
can use several diagnostic statistics in PROC REG. To detect outliers, you can
use STUDENT residuals. To detect influential observations, you can
use Cook's D statistics, RSTUDENT residuals, and DFFITS statistics.
Cook's D statistic is most useful for explanatory or analytic models, and DFFITS is
most useful for predictive models. If you detect an influential observation, you can
identify which parameter the observation is influencing most by using DFBETAS.
To detect influential observations in your model using PROC REG, you can produce
diagnostic statistics as well as diagnostic plots. To control which plots are produced,
you can use the PLOTS= option in the PROC REG statement. To request
the diagnostic statistics used in creating the plots without producing the plots
themselves, you can use the R and INFLUENCE options in the MODEL statement.
When you use these options, PROC REG creates an ODS output object
called OutputStatistics, which contains the residuals and influential statistics from the
R and INFLUENCE model options. To add variables in the model to the
OutputStatistics data object, you specify them in the ID statement. To save the
statistics in an output data set, you use the ODS OUTPUT statement.
For very large data sets, viewing or printing all residuals and influence statistics quickly
becomes unwieldy. To reduce the amount of output, you can use the cutoff values for
each of the diagnostic criteria to detect influential observations. To do so, you can use
macro variables and the DATA step to create a program that you can reuse.
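A minimal sketch of this approach, assuming the influence statistics were saved via the ODS OUTPUT statement (the data set name rk, the sample size, and the common 4/n Cook's D cutoff are assumptions):

%let numobs = 31;           /* assumed number of observations in the data set */
data influential;
   set rk;                  /* assumed ODS OUTPUT OutputStatistics data set */
   if CooksD > 4/&numobs;   /* keep only observations that exceed the cutoff */
run;

proc print data=influential;
run;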
You can handle influential observations in several ways. You can recheck for data
entry errors, determine whether you have an adequate model, and determine whether
the observation is valid but unusual. In your analysis, you should report the results of
your model with and without the influential observation.
Syntax
LIBNAME libref 'SAS-library';
ODS OUTPUT output-object-specification=data-set;
PROC REG DATA=SAS-data-set <options>;
MODEL dependent=regressor(s) </options>;
ID variable(s);
RUN;
Selected options in PROC REG:
PROC REG statement: PLOTS=
MODEL statement: R, INFLUENCE, VIF
%LET variable=value;
DATA SAS-data-set;
SET SAS-data-set;
variable=value;
IF expression;
RUN;
PROC PRINT DATA=SAS-data-set;
VAR variable(s);
RUN;
Sample Programs
Producing Default Diagnostic Plots
ods graphics / imagemap=on;
proc reg data=statdata.fitness;
PREDICT: model Oxygen_Consumption=
RunTime Age Run_Pulse Maximum_Pulse;
id Name;
title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;
Requesting Specific Diagnostic Plots
ods graphics / imagemap=on;
proc reg data=statdata.fitness
plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
PREDICT: model Oxygen_Consumption=
RunTime Age Run_Pulse Maximum_Pulse;
id Name;
title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;
Using Diagnostic Plots to Identify Influential Observations
ods graphics / imagemap=on;
proc reg data=statdata.fitness plots(only)=
(RSTUDENTBYPREDICTED(LABEL)
COOKSD(LABEL)
DFFITS(LABEL)
DFBETAS(LABEL));
PREDICT: model Oxygen_Consumption =
RunTime Age Run_Pulse Maximum_Pulse;
id Name;
title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;
Topic Summaries
A one-way frequency table displays frequency statistics for a categorical variable.
An association exists between two variables if the distribution of one variable changes
when the value of the other variable changes. If there's no association, the distribution
of the first variable is the same regardless of the level of the other variable.
To look for a possible association between two or more categorical variables, you can
create a crosstabulation table. A crosstabulation table shows frequency statistics for
each combination of values (or levels) of two or more variables.
To create frequency and crosstabulation tables in SAS, and request associated
statistics and plots, you use the TABLES statement in the FREQ procedure.
You can use the PLOTS= option in the TABLES statement to request specific plots for
your data.
To perform a formal test of association between two categorical variables, you use the
chi-square test. The Pearson chi-square test is the most commonly used of several
chi-square tests. The chi-square statistic indicates the difference between observed
frequencies and expected frequencies. Neither the chi-square statistic nor its p-value
indicates the magnitude of an association.
Cramer's V statistic is one measure of the strength of an association between two
categorical variables. Cramer's V statistic is derived from the Pearson chi-square
statistic.
To measure the strength of the association between a binary predictor variable and a
binary outcome variable, you can use an odds ratio. An odds ratio indicates how much
more likely it is, with respect to odds, that a certain event, or outcome, occurs in one
group relative to its occurrence in another group.
To perform a Pearson chi-square test of association and generate related measures of
association, you specify theCHISQ option and other options in the TABLES statement
in PROC FREQ.
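For example, a sketch of a Pearson chi-square test, reusing the sales data and the purfmt format from the sample programs below:

proc freq data=statdata.sales;
   tables Gender*Purchase / chisq expected cellchi2
                            nocol nopercent relrisk;
   format Purchase purfmt.;
   title 'Association between Gender and Purchase?';
run;
title;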
For ordinal associations, the Mantel-Haenszel chi-square test is a more powerful test
than the Pearson chi-square test. The Mantel-Haenszel chi-square statistic and its
p-value indicate whether an association exists but not the magnitude of the association.
To measure the strength of the linear association between two ordinal variables, you
can use the Spearman correlation statistic. The Spearman correlation is considered to
be a rank correlation because it measures the degree of linear relationship between
the ranked values of the variables.
To perform a Mantel-Haenszel chi-square test of association and generate related
measures of association, you specify the CHISQ option and other options in the
TABLES statement in PROC FREQ.
Logistic regression is a type of statistical model that you can use to predict a
categorical response, or outcome, on the basis of one or more continuous or
categorical predictor variables. You select one of three types of logistic regression
(binary, nominal, or ordinal) based on your response variable.
Although linear and logistic regression models have the same structure, you can't use
linear regression with a binary response variable. Binary logistic regression uses a
predictor variable to estimate the probability of a specific outcome. To directly model
the relationship between a continuous predictor and the probability of an event or
outcome, you must use a nonlinear function: the inverse logit function.
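As a sketch of the underlying math (the notation here is assumed, not from the course): with a single continuous predictor x and parameters β0 and β1, the model is linear on the logit scale, logit(p) = ln(p / (1 − p)) = β0 + β1x. The inverse logit maps this linear predictor back to a probability, p = 1 / (1 + e^(−(β0 + β1x))), which always lies between 0 and 1.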
To model categorical data, you use the LOGISTIC procedure. The two required
statements are the PROC LOGISTIC statement and the MODEL statement.
Depending on the complexity of your analysis, you can use additional statements in
PROC LOGISTIC. If your model has one or more categorical predictor variables, you
must specify them in the CLASS statement. The MODEL statement specifies the
response variable and can specify other information as well, such as the response
variable. In the MODEL statement, the EVENT= option specifies the event category for
a binary response model. To specify the type of confidence intervals you want to use,
you add the CLODDS= option to the MODEL statement. PROC LOGISTIC computes
Wald confidence intervals by default. You can use the PLOTS= option in the PROC
LOGISTIC statement to request specific plots.
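A minimal sketch, reusing the sales data from the sample programs below (the event level '1' and the reference level 'Male' are assumptions):

proc logistic data=statdata.sales plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male');   /* reference cell coding; level is assumed */
   model Purchase(event='1')=Gender / clodds=pl;   /* event level is assumed */
run;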
Instead of working directly with the categorical predictor variables in the CLASS
statement, PROC LOGISTIC first parameterizes each predictor variable. The CLASS
statement creates a set of one or more design variables that represent the information
in each specified classification variable. PROC LOGISTIC uses the design variables,
and not the original variables, in model calculations. Two common parameterization
methods are effect coding (the method that PROC LOGISTIC uses by default)
and reference cell coding. To specify a parameterization method other than the default,
you use the PARAM= option in the CLASS statement. If you want to specify a
reference level other than the default for a classification variable, you use the REF=
variable option in the CLASS statement.
Akaike's information criterion (AIC) and the Schwarz criterion (SC) are goodness-of-fit
measures that you can use to compare models. -2 Log L is a goodness-of-fit measure
that is not commonly used to compare models. Comparing pairs is another
goodness-of-fit measure that you can use to compare models.
PROC LOGISTIC uses a 0.05 significance level and a 95% confidence interval by
default. If you want to specify a different significance level for the confidence interval,
you can use the ALPHA= option in the MODEL statement.
For a continuous predictor variable, the odds ratio measures the increase or decrease
in odds associated with a one-unit difference of the predictor variable by default.
Syntax
PROC FREQ DATA=SAS-data-set <options>;
TABLES table-request(s) </options>;
RUN;

Selected options in PROC FREQ:
PROC FREQ statement: ORDER=
TABLES statement: CELLCHI2, CHISQ (Pearson and Mantel-Haenszel), CL, EXPECTED, MEASURES, NOCOL, NOPERCENT, PLOTS=, RELRISK
Selected Options in PROC LOGISTIC
PROC LOGISTIC statement: PLOTS=
CLASS statement: PARAM=, REF= (general usage and usage with a formatted variable)
MODEL statement: ALPHA=, CLODDS=, EVENT=, SELECTION=, SLSTAY= | SLS=
ODDSRATIO statement: AT, CL=, DIFF=
Sample Programs
Examining the Distribution of Variables
proc freq data=statdata.sales;
tables Purchase Gender Income
Gender*Purchase
Income*Purchase /
plots=(freqplot);
format Purchase purfmt.;
title1 'Frequency Tables for Sales Data';
run;
ods select histogram probplot;
proc univariate data=statdata.sales;
var Age;
histogram Age / normal (mu=est
sigma=est);
probplot Age / normal (mu=est
sigma=est);
title1 'Distribution of Age';
run;
title;
Performing a Mantel-Haenszel Chi-Square Test
proc freq data=statdata.sales_inc;
tables IncLevel*Purchase / chisq measures cl;
format IncLevel incfmt. Purchase purfmt.;
title 'Ordinal Association between IncLevel and Purchase?';
run;
title;