Date: Version of May 8, 2009.
Matt Holden and Haley Beaupre made important points that immensely improved the quality of this analysis. The most current version of my dataset is available at http://www.nd.edu/~dhicks1/writing/PhilGourmet.csv.
This note summarizes the methods and conclusions of my statistical analysis of The philosophical gourmet report 2009 (henceforth PhG).
My analysis was motivated by several prior critiques of PhG. The two most influential on me have been Julie Van Camp’s ‘Female-friendly departments: A modest proposal for picking graduate programs in philosophy’ 1 and Richard Heck’s ‘About PGR’ 2 .
One of Van Camp’s primary concerns is that PhG exhibits gender bias:
‘If there was a gender bias in judging the work of female researchers, then we might expect that departments which have a higher percentage of women on tenured/tenure-track appointments suffer in the rankings overall. Departments with higher percentages of women might be lower on the list than they should otherwise be, and departments with lower percentages of women might be higher on the list than they should otherwise be. If we saw a clear and consistent correlation (say, the lower the percentage of women on a faculty, the higher the ranking of the department on the list), that might raise reasonable suspicions about gender bias, but certainly not settle the matter. It would be even more suspicious if a department moved up or down on the list over the years in concert with increases and decreases in its proportion of faculty women.’
Heck’s concerns are more classically methodological: it is not clear that the survey methodology is actually measuring anything, much less anything that is an important discriminant between ‘graduate programs in philosophy’.
My aim in this analysis was to determine, first, whether the PhG rankings were correlated in a statistically significant way with the percentage of women faculty, and second, what (other) factors were correlated in a statistically significant way with PhG rankings. Note that this means I am looking only at the outcomes of the ranking method; my analysis completely brackets the issues dealing with the ranking method itself.
2.1. Raw data. My full dataset included all 98 departments listed in Van Camp’s table of faculty women.4 The initial or raw data included the PhG 2006 ranking, PhG 2009 ranking, and the percent of faculty women for each department. Because many departments on Van Camp’s table were not ranked by PhG, and rankings stop at 54 in 2006, I assigned these departments a ranking of 55. I will refer to departments that were included in the PhG 2009 as ‘ranked’, and all others from Van Camp’s table as ‘unranked’.
The other variables used to build my dataset were PhG 2009 specialization rankings – for example, the ranking of departments in metaphysics. Specialization rankings are by ‘groups’, with group 1 being the highest and group 5 being the lowest. Since only a subset of ranked departments are ranked in any given specialization, I assigned all departments not ranked in that specialization (including those not ranked at all) to a group 6. I included all specializations ranking at least 37 departments, and a few ‘more specialized’ areas of my own personal interest. The specializations, and their variable names, are listed in table 1.
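The placeholder assignments described above can be sketched in a few lines. This is only an illustration: the department names, the column names `PhG2006`, `PhG2009`, `Metaphysics` and `Mind`, and the helper `fill_unranked` are all hypothetical, and the sketch assumes the same placeholder rank of 55 is used for both years.

```python
# Hypothetical records: department names and values are invented for illustration.
UNRANKED_RANK = 55   # placeholder rank for departments absent from the overall ranking
UNRANKED_GROUP = 6   # placeholder group for departments absent from a specialization ranking


def fill_unranked(record, specializations):
    """Return a copy of record with missing rankings replaced by the placeholders."""
    filled = dict(record)
    for key in ("PhG2006", "PhG2009"):
        if filled.get(key) is None:
            filled[key] = UNRANKED_RANK
    for spec in specializations:
        if filled.get(spec) is None:
            filled[spec] = UNRANKED_GROUP
    return filled


specs = ["Metaphysics", "Mind"]
raw = [
    {"dept": "Dept A", "PhG2006": 3, "PhG2009": 2, "Metaphysics": 1, "Mind": None},
    {"dept": "Dept B", "PhG2006": None, "PhG2009": None, "Metaphysics": None, "Mind": None},
]
filled = [fill_unranked(r, specs) for r in raw]
```

Using a single out-of-range value for every unranked department keeps those departments in the dataset, at the cost of treating them all as tied.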
2.2. Derived variables. Because of the way PhG rankings are calculated, a given department can move up or down in rankings purely as a result of changes in other departments. To control for this, I divided the rankings into ‘decades’. The top 10 departments (ranked 1-10), for example, are decade 1, while the departments ranked 21-30 are decade 3. The resulting variable, PhG2009dec, was the primary dependent variable of my investigation.
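As a minimal sketch of the decade construction, assuming ranks are positive integers (the function name `decade` is mine and is not part of the dataset):

```python
def decade(rank):
    """Map an overall rank to its decade: 1-10 -> 1, 11-20 -> 2, 21-30 -> 3, ..."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return (rank - 1) // 10 + 1


# Ranks 1 and 10 share decade 1; ranks 21 and 30 share decade 3.
examples = {rank: decade(rank) for rank in (1, 10, 21, 30)}
```

Under this mapping, a department given the placeholder rank of 55 would fall in decade 6.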
For my analysis, I used a standard statistical technique called linear regression or ordinary least squares. This technique takes one dependent (or, in philosophical jargon, explanandum) variable and a number of independent (or, we might say, explanans) variables, and calculates an optimal linear function relating all these. For example, suppose we have $N$ observations $(\hat{z}_i, \hat{x}_i, \hat{y}_i)$, $i = 1, \ldots, N$, with dependent variable $\hat{Z} = (\hat{z}_i)^T$ and independent variables $\hat{X} = (\hat{x}_i)^T$ and $\hat{Y} = (\hat{y}_i)^T$ (by convention, the ‘hat’ is used to distinguish observed or measured values from true or actual values). A linear regression would return a function of the form
$$\hat{Z} = \beta_0 + \beta_1 \hat{X} + \beta_2 \hat{Y} + (e_i)^T$$
where the $\beta_j$ are called the regression coefficients and the $e_i$ are error terms or residuals. The linear regression technique determines this function by minimizing the sum of the squares of the residuals,
$$\sum_{i=1}^{N} e_i^2,$$
which is the square of the magnitude of the error vector $(e_i)^T$.
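The least-squares fit can be illustrated with a small self-contained sketch. It solves the normal equations by Gaussian elimination, which is one standard way to minimize the sum of squared residuals; it is not necessarily how the results in this note were computed. The helper names `solve` and `ols` and the data are made up for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(columns, z):
    """Fit z = b0 + b1*col1 + b2*col2 + ... ; return (coefficients, residuals)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(z))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T z
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xtz = [sum(r[p] * zi for r, zi in zip(rows, z)) for p in range(k)]
    beta = solve(xtx, xtz)
    resid = [zi - sum(bj * rj for bj, rj in zip(beta, r)) for r, zi in zip(rows, z)]
    return beta, resid


# Made-up data generated exactly by z = 1 + 2x + 3y, so the residuals should vanish.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
z = [1 + 2 * xi + 3 * yi for xi, yi in zip(x, y)]
beta, resid = ols([x, y], z)
ssr = sum(e * e for e in resid)  # the quantity that least squares minimizes
```

With noisy data the residuals would not vanish, and the fitted coefficients would be the values that make `ssr` as small as possible.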
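This approach assumes that the independent variables are statistically independent – that is, that there is no significant linear relation between any two independent variables. For example, suppose that, corresponding to the example above, the true relation between the variables $Z, X, Y$ is given by
$$Z = \beta_0 + \beta_1 X + \beta_2 Y + (\varepsilon_i)^T$$
but also that
$$Y = \alpha X + (\gamma_i)^T$$
where $\alpha \neq 0$ and $(\gamma_i)^T$ is statistically independent of $X$. Then $Y$ can be ‘reduced’ to $X$ in the equation for $Z$:
$$\begin{aligned}
Z &= \beta_0 + \beta_1 X + \beta_2 Y + (\varepsilon_i)^T \\
  &= \beta_0 + \beta_1 X + \beta_2 \left( \alpha X + (\gamma_i)^T \right) + (\varepsilon_i)^T \\
  &= \beta_0 + (\beta_1 + \beta_2 \alpha) X + \beta_2 (\gamma_i)^T + (\varepsilon_i)^T
\end{aligned}$$
The coefficient on $X$ in the last line thus mixes the direct contribution of $X$ with the part of the contribution of $Y$ that is linearly related to $X$.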
The problem can be nicely illustrated by contrasting the results below with the preliminary results I announced on Facebook. In those results, I reported that most of the specialization rankings had dropped out of the regression, and the best model included only three independent variables: Metaphysics, Mind, and PhG 2006 decade. I later discovered that PhG 2006 decade was significantly correlated with a number of other specializations; hence the contribution of PhG 2006 decade in the preliminary results actually reflected the contribution of a number of other specializations.
Fortunately, the problem is fairly easy to deal with. We define a new variable
$$Y' = Y - \alpha X$$
which is then independent of X, by construction.5 Call Y ′ the reduction of Y by X, and say that Y is reduced as Y ′. (This terminology is mine; I don’t know of a standard term for this in statistics.)
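A minimal sketch of this reduction for a single pair of variables follows. It assumes the coefficient α is estimated by a simple regression of Y on X with an intercept, so that the reduced variable is the residual of that regression; the function names and data are invented. Note that what the construction guarantees is that the sample covariance between X and the reduced variable is zero, that is, that no linear relation remains.

```python
def mean(v):
    return sum(v) / len(v)


def covariance(u, v):
    """Sample covariance (dividing by n) of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)


def reduce_by(y, x):
    """Return (alpha, y_reduced), where y_reduced holds the residuals of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    alpha = sxy / sxx  # least-squares slope of y on x
    y_reduced = [(yi - my) - alpha * (xi - mx) for xi, yi in zip(x, y)]
    return alpha, y_reduced


# Made-up values in which y rises with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 1.0, 3.5, 3.0, 5.5, 5.0]
alpha, y_red = reduce_by(y, x)
cov_before = covariance(x, y)      # clearly positive
cov_after = covariance(x, y_red)   # zero up to rounding error
```

Reducing by several variables at once works the same way, with the residuals of a multiple regression taking the place of `y_red`.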
Beyond the correlations with PhG 2006 decade mentioned above, it is prima facie plausible that many of the specialization rankings are not statistically independent. One can imagine, for example, that departments ranked highly in the philosophy of physics would, for that reason, also be ranked highly in the philosophy of science more generally. Hence, the actual specialization rankings were stored in variables of the form X_pool, where X is the variable name given in table 1. I then conducted regressions of narrower specializations against more prominent or general specializations in the same area, as determined by the classification in PhG 2009. The four areas are ‘Metaphysics and epistemology’, ‘Philosophy of the sciences and mathematics’, ‘Theory of value’, and ‘History of philosophy’. For example, Philosophy of mind was regressed against Metaphysics and Epistemology. Certain ‘cross-cutting’ specializations were regressed against a wider selection of other specializations; Feminist philosophy, for example, was found to be correlated in a statistically significant way with Metaphysics and 17th century. These regressions were then used to define reduced specialization variables; it is these reduced variables that were used in all the regressions below. Finally, PhG 2006 decade was regressed against all the specialization variables, and reduced in the same way; the resulting variable was stored as r2006.
In the regressions, five test statistics were examined: $R^2$, p-value, and the Akaike, Schwarz, and Hannan-Quinn criteria. $R^2$ is standardly interpreted as a measure of how much of the variance of a dependent variable can be attributed to variance of the independent variables; hence $R^2 = .95$ means that 95% of the variance in the dependent variable can be attributed to variance in the independent variables. The p-value of an estimated quantity is the probability of obtaining a value at least as extreme as the one observed, given a null hypothesis (often that the quantity is actually equal to 0). Hence a lower p-value indicates stronger evidence against the null hypothesis. Conventionally, p < 0.05 is considered the threshold for statistical significance, and hence the threshold for rejecting the null hypothesis.
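For concreteness, here is a sketch of how $R^2$ and the three information criteria can be computed from the residuals of a fit. It assumes the common convention in which the criteria are based on the Gaussian log-likelihood and lower values indicate a better model; statistical packages differ in scaling, so the numbers need not match any particular program's output. The function name and the data are made up.

```python
import math


def fit_statistics(z, resid, k):
    """R^2 and information criteria for an OLS fit with k estimated coefficients."""
    n = len(z)
    zbar = sum(z) / n
    ssr = sum(e * e for e in resid)             # sum of squared residuals
    sst = sum((zi - zbar) ** 2 for zi in z)     # total sum of squares
    # Gaussian log-likelihood evaluated at the least-squares estimates (an assumption)
    loglik = -0.5 * n * (1 + math.log(2 * math.pi) + math.log(ssr / n))
    return {
        "R2": 1 - ssr / sst,
        "AIC": -2 * loglik + 2 * k,                          # Akaike
        "BIC": -2 * loglik + k * math.log(n),                # Schwarz
        "HQC": -2 * loglik + 2 * k * math.log(math.log(n)),  # Hannan-Quinn
    }


# Made-up values for illustration only; these residuals do not come from an actual fit.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
resid = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.0, 0.0]
stats = fit_statistics(z, resid, k=2)
```

All three criteria reward a small residual sum of squares and penalize additional coefficients; they differ only in how heavily the number of coefficients is penalized.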
Calculations of p-values assume that the residuals are normally distributed. If they are not, then a p-value is not a reliable way to measure the quality of an estimate of the value of the quantity. The last three test statistics, by contrast, can be used to compare two models for quality of fit. In most cases, residuals of regressions involving the full dataset were not normally distributed. I therefore conducted all of the regressions below with the subpopulation of ranked departments.
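The text does not say which normality test was applied to the residuals. As one possible illustration, the sketch below uses the Jarque-Bera statistic, which compares the skewness and kurtosis of the residuals with those of a normal distribution. This is a swapped-in technique chosen for simplicity and should not be read as the test actually used; its p-value is an asymptotic approximation that is rough for small samples. The data are made up.

```python
import math


def jarque_bera(resid):
    """Return (JB statistic, asymptotic p-value) for a list of residuals."""
    n = len(resid)
    m = sum(resid) / n
    m2 = sum((e - m) ** 2 for e in resid) / n
    m3 = sum((e - m) ** 3 for e in resid) / n
    m4 = sum((e - m) ** 4 for e in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Under normality JB is approximately chi-square with 2 degrees of freedom,
    # whose upper-tail probability is exp(-jb / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Made-up symmetric residuals: the skewness term is zero, so only kurtosis contributes.
resid = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb, p_value = jarque_bera(resid)
```

A small p-value here would count as evidence against normally distributed residuals, and hence as a reason to distrust the p-values reported for the regression coefficients.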
I first tested for gender bias by regressing the PhG 2009 decades against the percent faculty women. A normal distribution of the residuals required limiting my attention to the 53 ranked departments. This regression returned the results in model 1.
Model 1: OLS estimates using the 53 observations 1–53
They indicate a small, statistically insignificant correlation between PhG 2009 decade and percent faculty women that accounts for almost none of the variance in PhG 2009 decade. Hence, it is reasonable to conclude that the data show no gender bias in the outcomes of the survey methodology: departments with a higher percentage of women faculty are not thereby penalized.
Since gender bias does not appear to account for PhG ranking, I turned my attention to the other independent variables. Regressing against all the specialization rankings in my dataset and PhG 2006 decade returned the results in model 2.
Model 2: OLS estimates using the 53 observations 1–53
In this initial regression, most of the coefficients are small and not statistically significant. After testing the effects of removing and adding back in different combinations of variables, I found that the best overall model is model 3.
Model 3: OLS estimates using the 53 observations 1–53
In particular, this model is substantially better than the overall best model with just specialization rankings, model 4.
Model 4: OLS estimates using the 53 observations 1–53
Since r2006 is, by construction, statistically independent of the specialization variables, this strongly suggests that PhG 2006 ranking makes an important independent contribution to PhG 2009 ranking.
The relative magnitude of the contribution of each independent variable can be determined by comparing their regression coefficients in model 3, with the caveat that the differences between coefficients for most pairs of specialization variables are not statistically significant. However, the difference between the coefficient for r2006 and that of any of the specialization variables is statistically significant. That is, it is rather likely that the order in table 2 is not quite right, but the contribution of PhG 2006 rank is still much greater than the contribution of any of the specializations. Table 2 summarizes the contribution of each independent variable as a percentage of the overall variance in PhG2009dec.
As this table indicates, the contribution of PhG 2006 rank is enormous, even after controlling for the effects of every specialization – more than twice the contribution of every specialization except Philosophical logic, and then only barely.
In a posting on Facebook, I announced as preliminary results that PhG ranking does not appear to be biased against women, but does appear to be biased in two other ways. The first was that it was biased in favor of certain specializations – namely, Metaphysics and Ethics – and against others. The second was the contribution of past ranking, which (in the spirit of Philip Kitcher) I attributed to ‘unearned authority’.
On closer inspection, and after turning to reduced (and hence properly statistically independent) variables, the first claim of bias must be highly qualified. A wide array of specializations – including all the major areas, viz., Metaphysics, Epistemology, Ethics, Philosophy of science, Logic, and the most prominent historical specializations – contribute to PhG rankings, and the differences of their contributions are (mostly) not statistically significant. Two criticisms can still be made, however. First, Continental philosophy does not appear to make a significant contribution. This is partly due to the reduction process (approximately 50% of the variance in Cont was eliminated by reduction), but Continental philosophers may still legitimately object that their contribution, independent of their work as historians, should be non-trivial. Second, certain relatively narrow specializations make contributions, but not others. For example, there seems no good reason why Philosophy of mind should have a significant contribution, and not Philosophy of biology.
The second claim of bias, on the other hand, is essentially unchanged. If I am right in calling the contribution of prior ranking unearned authority, then this represents a contribution that is not based on actual academic achievement or quality. Since r2006 has been reduced by every specialization variable, it is certainly not clear what kind of actual academic achievement or quality it could represent. And this contribution is not just significant but significantly larger than the contribution of any variable that is at least nominally based on actual academic achievement or quality. Nearly 17% of a department’s PhG ranking appears to be based solely on its unearned authority.
As this analysis has looked solely at the results of the survey and ranking process, it cannot serve as the basis for making direct methodological recommendations. In particular, it cannot identify the causes of the problems I have identified here. It does, however, suggest that a methodological critique should look at the kind of contribution made by past ranking, including, first, determining whether or not my interpretation of this contribution as ‘unearned authority’ is accurate and, if so, second, how unearned authority plays such a large role in the survey methodology, and third, how this role can be diminished or eliminated.