Richard Williams, Notre Dame Sociology

Sociology 73994

Categorical Data Analysis

Richard Williams, Instructor

Fall 2011

If the class is currently being taught, newer notes may be available here.

NOTE: The following special types of files are used on this web page. Some materials are available only to nd.edu users.

PDF  PDF files. These require Adobe Acrobat Reader. Get Acrobat Reader

  Stata 9 files.

Useful sites for learning about Stata and SPSS

Rich Williams' Stata Highlights Page

UCLA's Statistical Computing Resources 
RW Suggestions for Using Stata at Notre Dame 

UCLA's Stata Starter Kit

RW's Suggested downloads

UCLA's SPSS Starter Kit
Resources for learning Stata

UCLA - How does Stata compare with SAS and SPSS?

The Stata User Support Page

Ben Jann's estout/esttab support page (esttab & estout are great for formatting output from Stata)

Overview.  This course discusses methods and models for the analysis of categorical dependent variables and their applications in social science research. Researchers are often interested in the determinants of categorical outcomes. For example, such outcomes might be binary (lives/dies), ordinal (very likely/somewhat likely/not likely), nominal (taking the bus, car, or train to work), or count (the number of times something has happened, such as the number of articles written). When dependent variables are categorical rather than continuous, conventional OLS regression techniques are not appropriate. This course therefore discusses the wide array of methods that are available for examining categorical outcomes.

Syllabus

Regression Models for Categorical Dependent Variables Using Stata (required text; as far as I know it is cheapest to order directly from the Stata Bookstore)

Book Review of Regression Models for Categorical Dependent Variables Using Stata, Second Edition, by Long and Freese. This will provide an overview of the text we are using.

Long (1997) Stata Files

Long and Freese (2006) Stata Files

RW Stata Files

Recommended Reading (ND.edu NetID is required for access)

Dropbox. I strongly encourage you to set up a Dropbox account if you do not already have one. Dropbox gives you a minimum of 2GB of free online storage. More critically, with Dropbox you can set up shared folders. This makes it much easier when you want me or others to help you with your research. You can create a folder, put your data and programs in it, and then share the folder with me. If you set up an account, use your .edu email address, because you can get more bonus storage that way. For more, click on this link.

Brief Review of Models for Continuous Outcomes (Review this on your own and come to me with questions as needed)

PDF  Review of Multiple Regression

reg01.dta - Data file used in the Stata Regression handout

PDF  Using Stata for OLS Regression  (If you are interested, click here for a similar handout using SPSS)

Overview of Generalized Linear Models, Maximum Likelihood Estimation

    Introduction to Generalized Linear Models

    Maximum Likelihood Estimation

Models for Binomial Outcomes I: Basics of Logistic Regression

The following 4 handouts are "repeats" from Soc 639993 (Grad Stats II), and even if you didn't have Stats II with me you may have had similar material in other classes. Rather than go through these in detail, I want you to prepare answers to these discussion questions before class.  We'll spend added time as necessary on any problem areas.

    Logistic Regression I: Problems with the Linear Probability Model (LPM)

    Logistic Regression II: The Logistic Regression Model (LRM)

    Logistic Regression III: Hypothesis Testing, Comparisons with OLS

    PDF  Using Stata 11 & 12 for Logistic Regression 

Models for Binomial Outcomes II: Advanced Topics.

Among the topics discussed will be the latent variable model in binary regression; measures of fit, including Pseudo R^2, BIC, and AIC; standardized coefficients; problems with comparing coefficients across nested models and across groups; and alternatives to logistic regression, including probit. To the surprise of many, techniques used for group comparisons in OLS regression (e.g. adding interaction effects) can be highly problematic in logistic and ordinal regression. As Hoetker notes, "in the presence of even fairly small differences in residual variation, naive comparisons of coefficients [across groups] can indicate differences where none exist, hide differences that do exist, and even show differences in the opposite direction of what actually exists." We will discuss how heterogeneous choice models, and possibly other methods, offer potential solutions.

    The Latent Variable Model for Binary Regression

    Measures of fit - Pseudo R^2, BIC, AIC

    Prelude to discussion of standardized coefficients & between-model comparisons

    Standardized Coefficients in Logistic Regression

    Comparing Logit & Probit Coefficients Between Models

    Comparing Logit & Probit Coefficients Between Models and Across Groups (click here for Powerpoint version)

    Handout for Comparing Logit & Probit Coefficients Between Models and Across Groups

    Alternatives to logistic regression
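
Hoetker's point about residual variation can be illustrated with a small sketch. It is written in Python purely for illustration (the course itself uses Stata), and it assumes the latent variable model y* = alpha + beta*x + sigma*e, where e follows a standard logistic distribution. The function names and numeric values are hypothetical. Both groups below share the same underlying beta, yet the slope a logit model would recover differs, because the logit coefficient is identified only up to scale as beta/sigma.

```python
import math

def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))

def recovered_logit_slope(alpha, beta, sigma, x0=0.0, x1=1.0):
    """Slope of the log odds in x implied by y* = alpha + beta*x + sigma*e.

    P(y = 1 | x) = logistic_cdf((alpha + beta*x) / sigma), so the log odds
    are linear in x with slope beta / sigma.
    """
    p0 = logistic_cdf((alpha + beta * x0) / sigma)
    p1 = logistic_cdf((alpha + beta * x1) / sigma)
    return (logit(p1) - logit(p0)) / (x1 - x0)

# Hypothetical groups: identical latent effect (beta = 1), different
# residual standard deviations.
slope_group_a = recovered_logit_slope(alpha=0.0, beta=1.0, sigma=1.0)
slope_group_b = recovered_logit_slope(alpha=0.0, beta=1.0, sigma=2.0)

print(round(slope_group_a, 6), round(slope_group_b, 6))
```

Under these assumptions the second group's slope comes out at about half the first's, even though x has the same effect on the latent variable in both groups. A naive comparison of the two coefficients would therefore suggest a group difference that is really a difference in residual variation.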

Interpreting results: Adjusted Predictions and Marginal effects

The results from binomial and ordinal models can often be difficult to interpret. All too often, researchers discuss the sign and statistical significance of results but say little about their substantive significance. I will expect every student paper to use the methods described in this section and/or one of the advanced methods we discuss later in the course.

    Using Stata's Margins command to Estimate and Interpret Adjusted Predictions and Marginal Effects (click here for Powerpoint version)

    Margins Part 2: Marginal Effects. This overlaps a bit with the Powerpoint. In class we will focus on the points unique to this handout, including marginal effects for continuous variables.
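
To make the ideas behind adjusted predictions and marginal effects concrete, here is a minimal sketch in Python (for illustration only; it is not the Stata margins command and does not reproduce any handout). The coefficients, variable names, and five-case sample are all hypothetical. It computes an adjusted prediction at a representative value, the average marginal effect of a continuous variable, and the average discrete change for a binary variable in a logit model.

```python
import math

# Hypothetical logit coefficients (not estimates from any course data set).
B0, B_AGE, B_FEMALE = -3.0, 0.05, 0.8

# Hypothetical sample of (age, female) pairs.
SAMPLE = [(20, 0), (30, 1), (40, 0), (50, 1), (60, 1)]

def prob(age, female):
    """Predicted probability from the logit model."""
    z = B0 + B_AGE * age + B_FEMALE * female
    return 1.0 / (1.0 + math.exp(-z))

def adjusted_prediction_at(age):
    """Average prediction with age fixed and female left as observed."""
    return sum(prob(age, f) for _, f in SAMPLE) / len(SAMPLE)

def ame_age():
    """Average marginal effect of age: mean of b * p * (1 - p)."""
    effects = [B_AGE * prob(a, f) * (1.0 - prob(a, f)) for a, f in SAMPLE]
    return sum(effects) / len(effects)

def ame_female():
    """Average discrete change: each case predicted as female=1 vs female=0."""
    changes = [prob(a, 1) - prob(a, 0) for a, _ in SAMPLE]
    return sum(changes) / len(changes)

print(round(adjusted_prediction_at(40), 4))
print(round(ame_age(), 4))
print(round(ame_female(), 4))
```

The marginal effect for a continuous variable is an instantaneous rate of change, so it depends on where each case sits on the probability curve. The effect of the binary variable is computed as a difference between two predictions rather than a derivative.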

Models for Ordinal Outcomes I: The ordered logit and interval regression models

    Ordinal Logit Models: Overview

    In-Class Problems on Ordered Logit Models

        Ordinal Logit Models: Hypothesis Testing & Interpreting Results (Don't read this until AFTER we have gone over the in-class problems)

    Interval Regression (from the Stata 11 Manual; pay particular attention to pp. 711-716)

        Supplemental Notes on the intreg Command

Models for Count Outcomes

Variables that count the number of times something happens are common in the social sciences. For example, Long examined the number of publications by scientists. Count variables are often treated as though they are continuous and the linear regression model is applied, but this can result in inefficient, inconsistent, and biased estimates. In this section we will examine some of the many models that deal explicitly with count outcomes.

    Count Outcomes, Part I

    Count Outcomes, Part II
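
As a rough illustration of what a model that deals explicitly with counts looks like, the sketch below uses the Poisson regression model, one of the standard count models. It is written in Python for illustration only, and the coefficients and the list of counts are hypothetical rather than taken from Long's publication data. It shows the expected count exp(b0 + b1*x), the incidence rate ratio exp(b1), the predicted probability of a zero count, and a crude variance-to-mean check for overdispersion.

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing count k when the expected count is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def expected_count(b0, b1, x):
    """Poisson regression mean: always positive, unlike a linear prediction."""
    return math.exp(b0 + b1 * x)

def dispersion_ratio(counts):
    """Sample variance divided by sample mean (about 1 under a Poisson)."""
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return variance / mean

# Hypothetical coefficients for a single predictor x.
B0, B1 = 0.2, 0.3

mu_x0 = expected_count(B0, B1, 0)
mu_x1 = expected_count(B0, B1, 1)
incidence_rate_ratio = mu_x1 / mu_x0      # equals exp(B1)
prob_zero_at_x0 = poisson_pmf(0, mu_x0)

# Hypothetical counts with a long right tail.
counts = [0, 0, 0, 1, 1, 2, 3, 7, 0, 12]

print(round(incidence_rate_ratio, 4))
print(round(prob_zero_at_x0, 4))
print(round(dispersion_ratio(counts), 2))
```

A ratio well above 1 in this unconditional check hints that the counts are more spread out than a Poisson distribution allows, which is the usual motivation for alternatives such as the negative binomial model. A proper assessment would be made conditional on the predictors.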

Models for Multinomial Outcomes

When categories are unordered, multinomial logistic regression is one often-used strategy. We will discuss several ways to aid in the interpretation and testing of these models.

    Multinomial Logit - Overview

    Post-Estimation Commands for mlogit
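
One aid to interpretation is to turn multinomial logit coefficients into predicted probabilities. The sketch below is in Python, for illustration only, and reuses the commuting example from the overview (bus, car, or train). The coefficients and the income predictor are hypothetical. Bus is treated as the base category, so its coefficients are fixed at zero, and each probability is exp(xb) for that outcome divided by the sum over all outcomes.

```python
import math

# Hypothetical (intercept, income slope) pairs; "bus" is the base category.
COEFS = {
    "bus": (0.0, 0.0),
    "car": (-1.0, 0.04),
    "train": (-0.5, 0.02),
}

def mlogit_probs(income):
    """Predicted probability of each commuting mode at a given income."""
    scores = {mode: math.exp(a + b * income) for mode, (a, b) in COEFS.items()}
    total = sum(scores.values())
    return {mode: score / total for mode, score in scores.items()}

def relative_risk_ratio(mode):
    """Multiplicative change in P(mode)/P(base) per one-unit rise in income."""
    return math.exp(COEFS[mode][1])

probs = mlogit_probs(income=30)
print({mode: round(p, 3) for mode, p in probs.items()})
print(round(relative_risk_ratio("car"), 4))
```

The coefficients describe each outcome relative to the base category only, which is why the sign of a coefficient need not match the direction in which the corresponding probability moves. Predicted probabilities avoid that ambiguity.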


Advanced Topics (Subject to Change)

Categorical Data Analysis with Complicated Survey Designs

By default, most statistical techniques assume that data were collected via simple random sampling. This is often not true for large national data sets. Fortunately, Stata makes it easy to analyze such data, but there are some important differences in how you go about testing hypotheses and assessing model fit. 

    Introduction to Survey Data Analysis (From the Stata 11 documentation; read the first few pages carefully and skim the rest)

    Analyzing Survey Data: Some Key Issues to be Aware of

    UCLA's (see lower third of page) and StataCorp's FAQs on Survey Data Analysis (Optional; you may want to refer to these if you use the SVY commands)

Models for Ordinal Outcomes II: Generalized ordered logit models

The assumptions of the ordered logit model are often violated. The generalized ordered logit model (estimated by gologit2) sometimes provides a viable but still parsimonious alternative.

    GOLOGIT Part 1: Understanding and Interpreting Generalized Ordered Logit Models (also available in Powerpoint). Here is the accompanying handout.

    GOLOGIT Part 2: Using the gologit2 program (also available in Powerpoint). Here is the accompanying handout. For more detail, you should read the program documentation and/or The Stata Journal article that introduced the program.
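
The difference between the ordered logit and generalized ordered logit models can be sketched in a few lines. The example below is in Python, for illustration only, and is not the gologit2 program. It assumes the parameterization P(Y > j) = g(alpha_j + beta_j * x), with g the logistic function, and a single predictor. All numeric values are hypothetical. The ordered logit model is the special case in which every beta_j is the same.

```python
import math

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(x, alphas, betas):
    """Probabilities of the M = len(alphas) + 1 ordered categories.

    P(Y > j) = g(alpha_j + beta_j * x) for j = 1..M-1, and each category
    probability is the difference between adjacent exceedance probabilities.
    """
    exceed = [g(a + b * x) for a, b in zip(alphas, betas)]
    bounds = [1.0] + exceed + [0.0]
    return [bounds[j] - bounds[j + 1] for j in range(len(bounds) - 1)]

ALPHAS = [1.0, -0.5]            # hypothetical cutpoint terms

# Ordered logit: one slope shared by both equations.
ologit = category_probs(1.0, ALPHAS, betas=[0.7, 0.7])

# Generalized ordered logit: the slope is allowed to differ by equation.
gologit = category_probs(1.0, ALPHAS, betas=[0.7, 0.2])

print([round(p, 3) for p in ologit])
print([round(p, 3) for p in gologit])
```

Because the slopes are free to differ, the exceedance curves can cross at extreme values of x, and a category probability computed this way can then turn negative. With the hypothetical values above this happens only for x below -3.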

Panel Data

Sometimes the same individuals (or nations, or companies) are measured at multiple points in time. The statistical technique used needs to reflect the fact that the different measurements are not independent of each other. This is a big topic and goes well beyond Categorical Data Analysis, but a few basic commands, e.g. xtlogit, will be discussed. (I've actually never gotten to this topic, but students who need these methods have covered them on their own.)

  Panel Data 1: Discrete-Time Methods for the Analysis of Event Histories. Often we are interested not only in whether an event occurs, but also in how quickly it happens. Drawing on work from Allison, this handout shows how panel data and basic logistic regression techniques can sometimes be used for such purposes.

  Panel Data 2: Setting up Panel data

  Panel Data 3: Conditional Logit/ Fixed Effects Logit Models

  Panel Data 4: Fixed Effects vs Random Effects Models
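
The discrete-time approach mentioned under Panel Data 1 rests on a data setup step that is easy to sketch: each person contributes one record for every period in which they were at risk, with an event indicator that equals 1 only in the period the event occurs. The Python example below is illustrative only. The function names and the three hypothetical cases are assumptions, not material from the handout. A logistic regression of the event indicator on the period and covariates could then be fit to records like these.

```python
def to_person_periods(person_id, last_period, event_occurred):
    """Expand one person into one record per period at risk."""
    return [
        {
            "id": person_id,
            "period": t,
            # The event is recorded only in the final observed period.
            "event": int(event_occurred and t == last_period),
        }
        for t in range(1, last_period + 1)
    ]

def hazard_by_period(records):
    """Share of those at risk in each period who experience the event."""
    at_risk, events = {}, {}
    for r in records:
        at_risk[r["period"]] = at_risk.get(r["period"], 0) + 1
        events[r["period"]] = events.get(r["period"], 0) + r["event"]
    return {t: events[t] / at_risk[t] for t in sorted(at_risk)}

# Hypothetical cases: (id, last period observed, whether the event occurred).
# Person 2 is censored: observed for 4 periods without the event.
people = [(1, 3, True), (2, 4, False), (3, 1, True)]

records = [row for p in people for row in to_person_periods(*p)]
hazards = hazard_by_period(records)

print(len(records))
print(hazards)
```

Censored cases need no special handling in this layout. They simply contribute records with event equal to 0 for every period they were observed.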

Fractional Response Models

Sometimes the dependent variable is a proportion, e.g. the percentage of a firm's employees who participate in the company pension plan. Logit and probit models can easily be adapted to deal with such situations.
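
One common way to adapt the logit model to a proportion is to keep the logit link and maximize the Bernoulli quasi-log-likelihood, sum of y*log(p) + (1 - y)*log(1 - p), with y allowed to take any value between 0 and 1. Whether this is the approach taken in the course is an assumption; the page does not say. The Python sketch below is illustrative only, uses hypothetical participation rates, and fits just an intercept by Newton's method, which is enough to show that the fitted proportion matches the sample mean.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def quasi_loglik(b, ys):
    """Bernoulli quasi-log-likelihood for proportions ys at intercept b."""
    p = logistic(b)
    return sum(y * math.log(p) + (1.0 - y) * math.log(1.0 - p) for y in ys)

def fit_intercept(ys, iterations=25):
    """Newton's method for an intercept-only fractional logit."""
    b = 0.0
    n = len(ys)
    for _ in range(iterations):
        p = logistic(b)
        score = sum(y - p for y in ys)          # first derivative
        information = n * p * (1.0 - p)         # minus the second derivative
        b += score / information
    return b

# Hypothetical pension plan participation rates for five firms.
rates = [0.05, 0.10, 0.20, 0.35, 0.80]

b_hat = fit_intercept(rates)
p_hat = logistic(b_hat)

print(round(b_hat, 4), round(p_hat, 4))
```

Adding predictors works the same way as in an ordinary logit model. The difference lies in the outcome, which is a proportion rather than a 0/1 indicator, and in the standard errors, which are usually computed robustly in this setting.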